Node Installation Documentation: Difference between revisions

From MRC Centre for Outbreak Analysis and Modelling

Latest revision as of 12:31, 28 September 2020

This document is my log of installing the Microsoft Linux Cluster...

HeadNode

  • Install Windows 2012 R2, and HPC Pack 2012 R2 U3 Head Node onto a domain server - I called it fi--didelxhn.
  • Create a folder C:\HPCLinux, and create a network share called hpclinux that allows everyone access to it.
  • copy "%CCP_DATA%InstallShare\LinuxNodeAgent\*.*" in that folder. (setup.py and hpcnodeagent.tar.gz arrive)
  • Run powershell as admin.
  • Export-HpcLinuxCertificate -FilePath C:\HPCLinux\cert.pfx and give it a magic password.
  • (To make a certificate manually, a script something like the below might do it, but I couldn't make it work.)
New-SelfsignedCertificateEx -Subject "CN=Microsoft HPC Linux Communication" -EKU "Server Authentication","Client Authentication" -KeySpec "Signature" -KeyUsage "DigitalSignature,DataEncipherment,KeyEncipherment,NonRepudiation,KeyCertSign" -SAN "fi--didemrchnb","fi--didemrchnb.dide.local","fi--didemrchnb.dide.ic.ac.uk" -NotAfter 2039/01/01 -StoreLocation "LocalMachine" -exportable

Nodes

Install linux and enable SSH

  • I'm now using the normal Ubuntu 16.04.2 server, written to USB with Rufus.
  • Use entire disk with LVM
  • Select OpenSSH-server when offered.
  • sudo apt-get update
  • sudo apt-get upgrade
  • sudo apt-get install gcc g++ openjdk-9-jdk-headless subversion

Sort out infiniband support

  • The cards I used were the old Voltaire ones, so a bit of hacking was needed:-
  • sudo nano /etc/modules - and add ib_mthca rdma_ucm ib_umad ib_uverbs ib_ipoib ib_srp ib_sdp
  • sudo modprobe ib_ipoib
  • sudo nano /etc/network/interfaces and add the below, where x is the node number+1. (eg, fi--didelx15 should be 12.0.0.16).
auto eth0
iface eth0 inet dhcp
    metric 100

auto eth1
iface eth1 inet dhcp
    metric 101

auto ib0
iface ib0 inet static
    address 12.0.0.x
    netmask 255.255.255.0
    broadcast 12.0.0.255
    metric 102
  • This assumes that eth0 is the enterprise network (129.31.26.x) and eth1 is the private network (11.0.0.x).
  • We may need to disable IPv6.
  • sudo nano /etc/sysctl.conf, and add the following somewhere:
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
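
The address rule above (node number + 1) is easy to get wrong by hand, so here is a minimal sketch of it as a shell helper. The function name ib_address_for_host is made up for illustration, and it assumes the node number is simply the trailing digits of the hostname, as in fi--didelx15.

```shell
# Hypothetical helper: print the ib0 address for a node name.
# Assumes the node number is the run of digits at the end of the name.
ib_address_for_host() {
  host=$1
  digits=${host##*[!0-9]}        # keep only the trailing digits
  if [ -z "$digits" ]; then
    echo "no node number in '$host'" >&2
    return 1
  fi
  # Strip leading zeros so that e.g. 08 is not read as an octal number.
  number=$(printf '%s' "$digits" | sed 's/^0*//')
  last_octet=$(( ${number:-0} + 1 ))
  if [ "$last_octet" -gt 254 ]; then
    echo "node number too large for 12.0.0.x" >&2
    return 1
  fi
  echo "12.0.0.$last_octet"
}

ib_address_for_host fi--didelx15   # prints 12.0.0.16
```

The result is what goes on the address line of the ib0 stanza; a name with no trailing number (such as fi--didelxhn) is rejected rather than guessed.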

Add the HPC mount for some useful bits

  • sudo mkdir -p /hpclinux
  • sudo apt-get install cifs-utils
  • sudo mount -t cifs //fi--didelxhn/HPCLinux /hpclinux -o user=adminuser,dom=dide.local
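
If these steps get re-run, it may help to check whether /hpclinux is already mounted before mounting again. The following is only a sketch: is_mounted is a made-up helper name, and it assumes a /proc/mounts style table where the second field is the mount point.

```shell
# Hypothetical helper: succeed if the given path is listed as a mount point.
# The second argument is the mount table to read (defaults to /proc/mounts).
is_mounted() {
  target=$1
  table=${2:-/proc/mounts}
  awk -v t="$target" '$2 == t { found = 1 } END { exit !found }' "$table"
}

if is_mounted /hpclinux; then
  echo "/hpclinux is already mounted"
else
  echo "/hpclinux is not mounted yet; run the mount command above"
fi
```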

Adding to the domain

Install NTP support

  • sudo apt-get install ntp
  • sudo cp /hpclinux/linux_inst/ntp.conf /etc/ntp.conf (This essentially removes the pools as the main source of time, and replaces them with `server time.imperial.ac.uk`.)
  • sudo /etc/init.d/ntp stop
  • sudo apt install ntpdate
  • sudo ntpdate time.imperial.ac.uk
  • sudo /etc/init.d/ntp start
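
The ntp.conf on the share is not reproduced here, so the following is only a guess at an equivalent edit, for the case where the share is not available: comment out the pool lines of a stock ntp.conf and append the single server. The function name use_single_ntp_server is made up for illustration.

```shell
# Hypothetical helper: read an ntp.conf on stdin, comment out every
# "pool ..." line, and append one "server" line for the given host.
use_single_ntp_server() {
  server=$1
  sed -e 's/^[[:space:]]*pool[[:space:]]/#&/'
  echo "server $server"
}

# Example: preview the result without touching /etc/ntp.conf.
if [ -r /etc/ntp.conf ]; then
  use_single_ntp_server time.imperial.ac.uk < /etc/ntp.conf
fi
```

This only prints the edited file; writing it back to /etc/ntp.conf is left as a manual step so that the original can be checked first.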

Domain things

  • sudo apt-get install winbind libpam-winbind libnss-winbind krb5-user krb5-config libpam-krb5
  • The domain, when asked, is DIDE.local - case sensitive.
  • sudo cp /hpclinux/linux_inst/nsswitch.conf /etc/nsswitch.conf - adds winbind to the passwd and group lines, and removes [NOTFOUND=return] from hosts.
  • sudo cp /hpclinux/linux_inst/smb.conf /etc/samba/smb.conf - lots of config for DIDE.
  • sudo cp /hpclinux/linux_inst/krb5.conf /etc/krb5.conf - lots more config for DIDE.
  • ifconfig -a and make note of the IP address if you haven't already.
  • sudo nano /etc/hosts and replace with:-
127.0.0.1     localhost
129.31.x.y    fi--didelx99.dide.local fi--didelx99.dide.ic.ac.uk fi--didelx99
129.31.26.137 fi--didelxhn.dide.local fi--didelxhn.dide.ic.ac.uk fi--didelxhn
129.31.26.21  fi--didedc1.dide.local fi--didedc1.dide.ic.ac.uk fi--didedc1
129.31.26.171 fi--didedc6.dide.local fi--didedc6.dide.ic.ac.uk fi--didedc6
129.31.26.172 fi--didedc7.dide.local fi--didedc7.dide.ic.ac.uk fi--didedc7

I'm not sure exactly why some of these (eg, fi--didelxhn) need adding, since Ubuntu can already ping fi--didelxhn in all of those forms. However, without them, an HttpException occurs when adding the node to the cluster, so this is a not-entirely-understood workaround.

  • sudo net cache flush
  • sudo service smbd restart
  • sudo service nmbd restart
  • sudo service winbind restart
  • sudo kinit adminuser@DIDE.LOCAL
  • sudo net ads join -U adminuser
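
Since the /etc/hosts listing above is the same on every node apart from the node's own line, it could be generated rather than typed. This is a sketch only: hosts_line and make_hosts are made-up names, and the head node and domain controller addresses are copied from the listing above.

```shell
# Hypothetical helper: print one hosts line in the three-name form used above.
hosts_line() {
  printf '%-13s %s.dide.local %s.dide.ic.ac.uk %s\n' "$1" "$2" "$2" "$2"
}

# Hypothetical helper: print a complete /etc/hosts for a node,
# given its IP address and its short name.
make_hosts() {
  node_ip=$1
  node_name=$2
  printf '%-13s %s\n' 127.0.0.1 localhost
  hosts_line "$node_ip" "$node_name"
  # Head node and domain controllers, as in the listing above.
  hosts_line 129.31.26.137 fi--didelxhn
  hosts_line 129.31.26.21 fi--didedc1
  hosts_line 129.31.26.171 fi--didedc6
  hosts_line 129.31.26.172 fi--didedc7
}
```

Calling make_hosts with the address noted from ifconfig and the node's short name (for example fi--didelx99) prints the file to the terminal, so it can be checked before being copied into /etc/hosts.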

Preparing drive mounting

  • sudo apt-get install libpam-mount
  • sudo cp /hpclinux/linux_inst/pam_mount.conf.xml /etc/security/pam_mount.conf.xml - this enables looking for .pam_mount.conf.xml in the home folder, and automatically sets up a mount point (on fi--san02) to that folder beforehand.
  • sudo cp /hpclinux/linux_inst/.pam_mount.conf.xml /etc/skel - for convenience really. Suggest that users copy all the "." files from /etc/skel to their home folder, to get a nice experience when ssh-ing.

Installing HPC

  • cd /hpclinux
  • sudo ./install_filters.sh
  • sudo python setup.py -install -clusname:fi--didelxhn -certfile:cert.pfx (you'll need the magic password).
  • If you need to reinstall or re-add the node, run sudo python setup.py -uninstall and redo the line above.
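
Before running the installer, a quick check that everything it needs is actually in /hpclinux can save a failed run. A minimal sketch, assuming the four files named in the steps above are all that is required; check_install_files is a made-up name.

```shell
# Hypothetical helper: report any installer file missing from a directory.
# Returns 0 only if all of them are present.
check_install_files() {
  dir=${1:-/hpclinux}
  status=0
  for f in setup.py hpcnodeagent.tar.gz install_filters.sh cert.pfx; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      status=1
    fi
  done
  return $status
}

check_install_files /hpclinux || echo "fix the above before running setup.py"
```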

Securing SSH

  • sudo usermod -aG sudo user if you need to add any sudo-ers.
  • sudo nano /etc/ssh/sshd_config if you need to set ssh users.
    • Add a line AllowGroups ssh
    • Also, be good and add DenyUsers root and DenyGroups root once you've set up sudoers.
    • sudo usermod -aG ssh user to add each user to ssh.
    • sudo service ssh restart to apply changes. Don't lock yourself out, muppet-brain.
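
To avoid the lock-out, one option is to check, before restarting ssh, that your own account is in a group that AllowGroups permits. The following is a rough sketch: group_may_login is a made-up name, and it only compares literal group names, whereas sshd also accepts patterns in AllowGroups.

```shell
# Hypothetical helper: succeed if any of the given groups is named on an
# AllowGroups line of the given sshd_config (or if there is no such line).
group_may_login() {
  config=$1
  shift
  allowed=$(awk 'tolower($1) == "allowgroups" { for (i = 2; i <= NF; i++) print $i }' "$config")
  if [ -z "$allowed" ]; then
    return 0    # no AllowGroups line, so no group restriction applies
  fi
  for group in "$@"; do
    for a in $allowed; do
      if [ "$group" = "$a" ]; then
        return 0
      fi
    done
  done
  return 1
}

# Example: check the current user's groups before restarting ssh.
if [ -r /etc/ssh/sshd_config ]; then
  group_may_login /etc/ssh/sshd_config $(id -nG) \
    || echo "warning: you are not in an allowed group; do not restart ssh yet"
fi
```

Note that group membership added with usermod only shows up in id -nG after logging in again, so the check is best run from a fresh session while the old one is still open.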