Cluster Documentation

From MRC Centre for Outbreak Analysis and Modelling
Revision as of 15:42, 5 August 2015 by Admin

We have two high performance clusters, both running Microsoft HPC 2012 R2. The smaller, older cluster is fi--dideclusthn and the larger is fi--didemrchnb. The HPC 2012 R2 upgrade was done in April 2014; if you have an older HPC client, I recommend uninstalling it through Control Panel and following the instructions below to install the most recent client tools. If you're not a Windows user, see the HPC Web Portal.

Windows Users

The cluster nodes are based on Windows Server 2012 R2. This is good news if you're a Windows user. You have the option of using the MS client for launching jobs (see below), or you can try the new HPC Web Portal. If you're preparing to use the cluster and have doubts about whether it can run what you want, and whether it will be straightforward to develop and run on it, it's best to talk to me and/or the IT guys first!

Linux and MAC Users

There are two problems, but both can be at least partially overcome. The first is that there is no client for Linux. The second is that the cluster can only run Windows binaries.

Launching jobs on Linux/Mac

The best way might be the new HPC Web Portal, which lets you submit jobs through a webpage. It's very new, but quite neat. Give it a try, and send me feedback on any issues!

Alternatively, you can either (1) go down the VM route, installing a Windows virtual machine (e.g. in VMware or Parallels) and following the instructions below, or (2) use Remote Desktop to connect to your friend's Windows machine. These tend not to be very convenient, but some users have submitted thousands of jobs this way.

What Linux/Mac jobs can I run?

Many jobs will be platform independent (R, Java, Perl, Python). You may need to check you have the right dependencies: packages in R, for example, can vary between platforms, so make sure you've got a Windows binary folder in your package repository. Also, for many Linux applications there is a Windows port. Mileage can vary, but it's worth a try before you give up hope. Some of these might be installed on the cluster already (see below); others you may be able to download and call from your run script.

For C code (including R with Rcpp, I think), we'd have to cross-compile. So either you compile on a Windows virtual machine, or you work out how to cross-compile and produce a Windows executable from Linux. (I believe you do that by installing MinGW for Linux. Yes, I know MinGW is a port of GCC to Windows, which produces Windows binaries; but it is also available as a cross-compiler that runs on Linux, e.g. the mingw-w64 packages.) Other C compilers may have options for this. We'll solve these problems if we get to them.

Lastly, compiled Matlab is a tricky one: I don't think Matlab on Windows, Linux or Mac can produce an executable for a different operating system from the one it's running on. The only way for that, I'm afraid, is the VM or borrowing a Windows machine.

Still feeling optimistic?

Well done! All of this is a work in progress, so if you want something the cluster can't do yet, let me know and we'll solve it together. Most of the rest of this page will be relevant; you'll still need to create a file called "run.bat" (or "run.cmd"). Sorry if this makes you feel unclean, but Windows won't run a shell script! What's in that file will often be similar to what's in a shell script.

Getting Started

Getting access to the cluster

Send a mail to Wes (w.hinsley@imperial.ac.uk) requesting access to the cluster. He will add you to the necessary groups, and work out which cluster you should be running on. Unless you are told otherwise, this will be fi--dideclusthn. But if you have been told otherwise, then whenever you see fi--dideclusthn, replace it with the cluster name you've been given, either mentally or (less effectively) on your screen with Tipp-Ex.

Windows: Install the HPC client software

  • Run \\fi--dideclusthn.dide.ic.ac.uk\REMINST\setup.exe.
  • Confirm Yes, I really want to run this.
  • There's an introductory screen, and you click Next
  • You have to tick the Accept, then click Next
  • If it gives you any choice, select Install only the client utilities
  • Next, Next, Next
  • Select Use Microsoft Update and then Next
  • And then Install
  • And possibly Install again – if it had to install any pre-requisites first.
  • And Finish, and the client installation is done.
  • To check everything is alright so far, open a new command prompt.
  • (Find it under Accessories, or you can use Start, Run, cmd).
  • Type job list /scheduler:fi--dideclusthn.dide.ic.ac.uk

And it will list the jobs you have running, which will be none so far!

Id         Owner         Name              State        Priority    Resource Request
---------- ------------- ----------------- ------------ ----------- -------- -------

0 matching jobs for DIDE\user

Windows: Using a non-domain machine

If you're using a machine that isn't logged into the DIDE domain, the client software will have a problem working out who you are. In this case, there are a few things you need to do.

  • Check you've really got a DIDE username. If you don't know what it is, talk to your lovely IT team.
  • Connect to the DIDE VPN (see http://www1.imperial.ac.uk/publichealth/departments/ide/it/remote/) and log in to that with your DIDE account.
  • As an alternative to the DIDE VPN, you can install ICT's Juniper client from http://secureaccess.imperial.ac.uk, and log in to that with your IC account; but you'll still need your DIDE account to access DIDE servers once the Juniper connection is up and running.
  • Now we'll open a command window using "runas" - which lets you choose which identity the system thinks you are within that command window:-
runas /netonly /user:DIDE\user cmd
  • (Change user to your own login obviously). It will ask for your password, then open a new command window. In that window, you'll be able to do cluster-related things as your DIDE username.

Windows: Launching and cancelling jobs

Windows: Command Line

Suppose you have a file called "run.bat" in your home directory, which does what you want to run on the cluster. Let's say it's a single-core, very simple job. We’ll discuss what should be inside "run.bat" later. To submit your job on the cluster, at the command prompt, type this (all on one line) - or put it in a file called "launch.bat", and run it:-

job submit /scheduler:fi--dideclusthn.dide.ic.ac.uk /jobtemplate:GeneralNodes /numcores:1 \\fi--san02.dide.ic.ac.uk\homes\user\run.bat

If it's the first time you’ve run a job - or if you've recently changed your password, then it might ask you for your DIDE password and offer to remember it. Otherwise, it will just tell you the ID number of your job.

Enter the password for ‘DIDE\user’ to connect to ‘FI--DIDECLUSTHN’:
Remember this password? (Y/N)
job has been submitted. ID: 123

If you want to remove the job, then:- job cancel 123 /scheduler:fi--dideclusthn.dide.ic.ac.uk

Or view its details with job view 123 /scheduler:fi--dideclusthn.dide.ic.ac.uk

Windows: Job Manager GUI

Instead of the command line, you can use the HPC Job Manager. The advantage is that it's a GUI. The disadvantage, as with all GUIs, is that you may not feel totally sure what it's up to, when most of the time the things you want to do are not very complex (as above).

The HPC Job Manager will be on your Start menu once the client tools are installed. All the features are under the “Actions” menu, and hopefully it will be self-explanatory. Read the details below about launching, and you'll find the interface bits that do it in the Job Manager. However, you may find over time, especially if you run lots of jobs, that learning to do it the scripted way with the command line is quicker.


All platforms: The Web Portal

Information for running any job

Visibility

First rule: the executable (or batch) file that the cluster will run must be somewhere on the network that the cluster has access to, when it logs in as you. This amounts to any network accessible drive that you have access to when you login – including network shares, such as the temp drive, your home directory, and any specific network space set up for your project.

Do not assume that the program will run in any specific directory, even though there are ways that are meant to make that happen. Use full paths to specify where files should be read from or written to. You may want to write code that takes the paths from a parameter file or as a command-line parameter, to give as much flexibility as possible; in the long run, this will help you. REMEMBER THAT your home directory is backed up every day, and it's not generally very big. So please avoid filling it with enormous sets of results that you don't actually want to keep; it will make lots of people happy if you write your files somewhere that doesn't get backed up instead. Even a network share on your desktop will do…
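A minimal C sketch of the "use full paths" advice, assuming your program receives its output directory as a command-line argument (for example a project share path passed in from run.bat). The helper name make_output_path is hypothetical: it joins the directory and a file name, adds a backslash only if one is missing, and refuses to truncate rather than silently writing somewhere unexpected.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "<dir>\<name>" into buf (of size n).
   Returns 0 on success, -1 if the full path would not fit. */
int make_output_path(char *buf, size_t n, const char *dir, const char *name)
{
    size_t len = strlen(dir);
    /* Only add a separator if dir doesn't already end with one. */
    const char *sep =
        (len > 0 && (dir[len - 1] == '\\' || dir[len - 1] == '/')) ? "" : "\\";
    int written = snprintf(buf, n, "%s%s%s", dir, sep, name);
    if (written < 0 || (size_t) written >= n)
        return -1; /* truncated: don't write to a half-formed path */
    return 0;
}
```

In main you might call make_output_path(path, sizeof path, argv[1], "results.csv") and exit with a non-zero code if it returns -1.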

If you would like to create a network share on your desktop: right-click on the folder you'd like to share and choose “Share”. The next page shows who has rights to the folder (by default, you!). Click on Share. A share is created called \\your-computer-name\the-share-name, and you'll be able to access this from the cluster.

BUT, there are limits on how many connections can be made to your desktop, so a desktop share may be useful for testing, but not for running lots of jobs.

If you really need to map a network drive letter, then at the top of your “run.bat” file:- net use X: \\your-computer-name\the-share-name

Summary Comment: Try and use a project share on one of the proper servers.

Interactivity

The job must run in an entirely scripted, unattended way. It must not require any key presses or mouse clicks, or other live interactivity while it runs. So jobs generally will read some sort of input (from a file, or from the way you run the job), do some processing, and write the results somewhere for you - all without intervention.
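One way to keep a job non-interactive is to read its inputs from a parameter file rather than prompting for them. The sketch below assumes a hypothetical "name=value" format, one parameter per line (e.g. R0=2.5); parse_param_line is an illustrative helper, not something provided on the cluster.

```c
#include <stdio.h>

/* Hypothetical parameter-file format: one "name=value" per line,
   e.g. "R0=2.5" or "steps = 100". Parses one line into key/value.
   key must have room for 64 chars (63 + terminator).
   Returns 1 on success, 0 for anything else (blank lines, comments). */
int parse_param_line(const char *line, char key[64], double *value)
{
    /* %63[^= ] reads the name up to '=' or a space; spaces around
       '=' are allowed; %lf reads the numeric value. */
    return sscanf(line, " %63[^= ] = %lf", key, value) == 2;
}
```

You would read the file with fgets in a loop and call parse_param_line on each line, skipping lines that return 0.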

Launching jobs

Jobs can be launched either through the Job Manager interface, or through the command-line tools, which offer greater flexibility. We'll describe the command-line method here; if you want to use the GUI, then it'll be a case of finding the matching boxes... Below are the specifics for our clusters. For more information, simply type job on the command line, or job submit for the list of submission-related commands.

FI--DIDECLUSTHN vs FI--DIDEMRCHNB

Job submissions, as shown below, must specify a "job template", which sets up a load of default things to make the job work. On fi--dideclusthn, the job templates are called 4Core, 8Core and GeneralNodes, which will respectively force jobs to run on the 4-core nodes, the 8-core nodes, or on any machine available.

On fi--didemrchnb, you can set the job template to be... 8Core, 12Core, 16Core, 12and16Core, and GeneralNodes - which hopefully are fairly self-explanatory. There are a couple of other job templates, (24Core and Phi), but those are a bit special purpose for now, so don't use them!

Job Submission

Job submissions from the command line can take the following form (all on one line):-

job submit /scheduler:fi--dideclusthn.dide.ic.ac.uk /stdout:\\path\to\stdout.txt
   /stderr:\\path\to\stderr.txt /numcores:1-1 /jobtemplate:4Core \\path\to\run.bat

The /singlenode argument

In MS HPC (2012), Microsoft finally added a tag to allow you to say that the 'n' cores you requested must be on the same computer. Therefore, if you know precisely how many cores you want, then use the following (on one line):-

job submit /scheduler:fi--dideclusthn.dide.ic.ac.uk /singlenode:true 
   /jobtemplate:8Core \\path\to\run.bat

However, there is one bug with this, for the specific case where you request a whole node, regardless of how many cores it has. In this case, oddly, you have to disable single node:-

job submit /scheduler:fi--dideclusthn.dide.ic.ac.uk /singlenode:false 
   /numnodes:1 /jobtemplate:8Core \\path\to\run.bat

Languages and libraries supported

A number of languages, and different versions of languages, are available on the cluster. The sections below refer to your "run.bat" file: a batch file you create, which a cluster node will run when a suitable one is available. The commands described below go in this "run.bat" file; they add the relevant directories to the path so that the software you want can be found.

C/C++

A number of Microsoft C++ and Intel C++ runtimes are installed, but it's usually better to avoid relying on them and make your executable as stand-alone as possible. If it requires any external libraries that you've had to download, put the .dll file in the same directory as the .exe file. If you use Microsoft Visual Studio, in Project Properties, C/C++, Code Generation, make sure the Runtime Library is Multi-threaded (/MT); the DLL variants (/MD) won't work. Even so, on recent versions of the Intel and Microsoft C compilers, "static linking" doesn't really mean static when it comes to OpenMP, and you'll have to copy some DLLs and put them next to your EXE file. See the OpenMP section below.

The cluster nodes are all 64-bit, but they will run 32-bit compiled code. Just make sure you provide the right DLLs!
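If you're unsure whether a given executable was built as 32-bit or 64-bit (and therefore which DLLs it needs), a tiny check like the hypothetical build_bits function below reports it from the pointer size. It's only a sketch; printing the result from a short test job is one way to use it.

```c
#include <limits.h>

/* Returns 32 or 64 depending on how this executable was compiled.
   The pointer width tells you whether it needs x86 or x64 DLLs. */
int build_bits(void)
{
    return (int) (sizeof(void *) * CHAR_BIT);
}
```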

Java

Java doesn't need installing really, you can just put whichever version of the JRE you want somewhere that the cluster can see, and run java.exe directly. However, for convenience, you can write call setJava64 and subsequent lines mentioning java will run Oracle's 64-bit Java 1.8.0.

Perl

Strawberry Perl 32-bit portable, v5.12.3.0. Put call setPerl at the top of your script.

R

You can run R jobs on all the clusters. This is the latest wisdom on how to do so, thanks to James, Jeff, Hannah and others.

First, if you are wanting to use packages, then set up a package repository on your home directory by running this in R.

install.packages("<package>",lib="Q:/R")

Now, write your run script. Suppose you have an R script in your home directory: Q:\R-scripts\test.R. And suppose you’ve set up your repository as above. Your run.bat should then be:-

	call <script for the R version you want – see below>
	net use Q: \\fi--san02.dide.ic.ac.uk\homes\user
	set R_LIBS=Q:\R
	set R_LIBS_USER=Q:\R
	Rscript Q:\R-scripts\test.R

Various packages require various R versions, and the cluster supports a few of them. To choose one, change the first line of the script above to one of these: 32-bit or 64-bit versions of R releases. Purely for amusement, R's codenames are shown here too. If you really need support for the "Great Pumpkin" (2.14.0), "December Snowflakes" (2.14.1), "Easter Beagle" (2.15.0), "Roasted Marshmallow" (2.15.1), "Security Blanket" (2.15.3), "Masked Marvel" (3.0.0), the "Warm Puppy" (3.0.3), or the mildly disconcerting "Sock it to me" (3.1.1), do let me know... General advice: keep up with the most recent version if you can.

call setr32_2_13_1.bat    call setr64_2_13_1.bat    (anyone know the codename?)
call setr32_2_14_2.bat    call setr64_2_14_2.bat    (Gift-getting season)
call setr32_2_15_2.bat    call setr64_2_15_2.bat    (Trick or treat)
call setr32_3_0_1.bat     call setr64_3_0_1.bat     (Good Sport)
call setr32_3_0_2.bat     call setr64_3_0_2.bat     (Frisbee Sailing)
call setr32_3_1_0.bat     call setr64_3_1_0.bat     (Spring Dance)
call setr32_3_1_2.bat     call setr64_3_1_2.bat     (Pumpkin Helmet)

It also seems that different R versions put their packages in different structures – sometimes adding "win-library" into there for fun. Basically, R_LIBS and R_LIBS_USER should be paths to a folder that contains a list of other folders, one for each package you've installed.

IMPORTANT: R_LIBS and R_LIBS_USER paths must NOT contain quotes, nor spaces. If the path to your library contains spaces, you need to use old-fashioned 8-character names. If in your command window, you type dir /x, you’ll see the names – Program Files tends to become PROGRA~1 for instance.

Passing parameters to R scripts

Passing parameters to R scripts means you can have fewer versions of your R and bat files and easily run whole sets of jobs. You can get parameters into R using Rscript (but not Rcmd BATCH, I think) as follows. In the run.bat example above, the Rscript statement becomes:

Rscript Q:\R-scripts\test.R arg1 arg2 arg3

Within the R code, the arguments can be recovered using

args <- commandArgs(trailingOnly = TRUE)

outFileName <- args[1]     ## name of output file.
dataFileName <- args[2]    ## name of local data file. 
currentR0 <- as.numeric(args[3]) ## convert this explicitly to number. 

If you want your arguments to have a particular type, best to explicitly convert (see R0 above). Better still, you can pass parameters directly into the batch file that runs the R script. Command line arguments can be referenced within the batch file using %1, %2, etc. For example, if you have a batch file, runArgs.bat:

call <script for the R version you want – see above>
net use Q: \\fi--san02.dide.ic.ac.uk\homes\user
set R_LIBS=Q:\R
set R_LIBS_USER=Q:\R
Rscript Q:\R-scripts\%1.R %2 %3 %4

then

runArgs.bat myRScript arg1 arg2 arg3

will run the R script myRScript.R and pass it the parameters arg1, arg2 and arg3. The batch file runArgs.bat is now almost a generic R script runner.
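The same advice about converting arguments explicitly applies when run.bat passes %1, %2 and so on to a compiled program instead of Rscript. A minimal C sketch, assuming a hypothetical helper parse_double_arg: it uses strtod and rejects empty strings, trailing junk such as "2.5x" and out-of-range values, instead of silently turning them into 0 the way atof would.

```c
#include <errno.h>
#include <stdlib.h>

/* Convert a command-line argument such as argv[3] to a double.
   Returns 1 on success (and stores the value in *out),
   0 if the text isn't entirely a valid, in-range number. */
int parse_double_arg(const char *s, double *out)
{
    char *end;
    errno = 0;
    double v = strtod(s, &end);
    if (end == s || *end != '\0' || errno == ERANGE)
        return 0; /* nothing parsed, trailing junk, or out of range */
    *out = v;
    return 1;
}
```

For example, currentR0 could be read with parse_double_arg(argv[3], &currentR0), stopping with a non-zero exit code if it returns 0.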

OpenMP

This has been tested with C/C++, but the same should apply to other languages, such as Fortran, that achieve multi-threading with a DLL file.

Microsoft's C/C++ Compiler

You will need to copy vcomp100.dll (for Visual Studio 2010; it may be vcomp90.dll for 2008) into the same directory as your executable. You can usually find the DLL in a directory similar to:- C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\redist\x64\Microsoft.VC100.OpenMP. Also make sure that you've enabled OpenMP in the project Properties (C/C++, Language).

GCC or MinGW

Usually, this applies to Eclipse use. In the project properties, set the C++ compiler command to “g++ -fopenmp” and the C compiler command to “gcc -fopenmp”; in the Linker options, under Miscellaneous, Linker flags, add “-fopenmp” too. Copy the OpenMP DLLs from MinGW into the same directory as your final executable. You may find the DLLs are in C:\MinGW\Bin\libgomp-1.dll and, in the same place, libpthread-2.dll, libstdc++-6.dll and libgcc_s_dw2-1.dll.

Intel C++ Compiler

Copy the libiomp5md.dll file from somewhere like C:\Program Files (x86)\Intel\ComposerXE-2011\redist\intel64\compiler\libiomp5md.dll to the same place as your executable. And in Visual Studio, make sure you've enabled OpenMP in the project properties.
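To check that OpenMP is really enabled (and that the DLLs above are being picked up), a small test like the one below may help. It is only a sketch: parallel_sum and openmp_threads are hypothetical names. Built with OpenMP enabled (-fopenmp for GCC/MinGW, /openmp for Microsoft's compiler), the loop is shared between threads; built without it, the pragma is ignored and the code still runs, serially.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Sum 1..n; with OpenMP each thread sums part of the range and the
   reduction clause combines the partial sums safely. */
long long parallel_sum(int n)
{
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += i;
    return sum;
}

/* How many threads OpenMP would use (1 if not compiled with OpenMP). */
int openmp_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

If the job fails to start when built with OpenMP, a missing DLL next to the EXE is the usual suspect.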

How many threads?

The OpenMP function omp_get_max_threads() typically returns the number of logical processors on the whole machine, not the number of cores allocated to your job. To find out how many cores the scheduler actually allocated to you, use code like the following, which reads the environment variable CCP_NUMCPUS set by the cluster management software:-

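	#include <stdlib.h>
	#include <omp.h>

	/* Use the number of cores the scheduler allocated (CCP_NUMCPUS);
	   fall back to OpenMP's default if it isn't set or isn't valid. */
	int set_NCores(void) {
	   char * endchar;
	   char * val = getenv("CCP_NUMCPUS");
	   int cores = omp_get_max_threads();
	   if (val != NULL) {
	      long n = strtol(val, &endchar, 10);
	      if (endchar != val && n > 0) cores = (int) n;
	   }
	   omp_set_num_threads(cores);
	   return cores;
	}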

WinBugs (or OpenBugs)

They are similar, but OpenBugs 3.2.2 is the one we've gone for. Something like this as your run.bat script will work:-

call setOpenBugs
openbugs \\path\to\script.txt /HEADLESS

MatLab

There are various ways of producing a non-interactive executable from Matlab. Perhaps the simplest (not necessarily the best performance) way is to use “mcc.exe” – supplied with most full versions of Matlab, including the Imperial site licence version that you've probably got.

Use mcc.exe to compile your code

Use windows explorer to navigate to the folder where the “.m” files are for the project you want to compile. Now use a good text editor to create a file called “compile.bat” in that folder. It should contain something similar to the following:-

mcc -m file1.m file2.m file3.m -o myexe

Don't copy/paste commands from a Word document, by the way: Word has a different idea of what a dash is from most other software, and will probably replace the dashes with funny characters. Note that you have to list every single .m file that your project needs after the -m. If you save this file and then double-click on it, it will think for some while and produce “myexe.exe” in this example. Copy your .exe file to a network-accessible place as usual.

Launch on the cluster.

The launch.bat file will be exactly the same as before (see the Windows: Command Line section above). The run script will then start with a line that tells the cluster which version of Matlab you used to compile your code. Below is the table of versions. The runtimes are huge and cumbersome to install on the cluster, so I haven't installed every single one. If you need one that's not listed, get in touch.


Matlab Version     First line of run script
R2009b             call useMatLab79
R2010a             call useMatLab713
R2011a             call useMatLab715
R2011b             call useMatLab716
R2012a (64-bit)    call useMatLab717_64
R2012b (64-bit)    call useMatLab80_64
R2013a (64-bit)    call useMatLab81_64
R2014a (64-bit)    call useMatLab83_64
R2015a (64-bit)    call useMatLab85_64


Python

Python Version     First line of run script
2.6.6              call setPython26
2.7.2 (64-bit)     call setPython

BOW (BioInformatics on Windows)

Start your run.bat file with call setBOW to add these to the path:-

EXE File           Version
SamTools.exe       0.1.18 (r982:295)
BCFTools.exe       0.1.17-dev (r973:277)
bgzip              ?
razip              ?
bwa                0.6.1-r104
tabix              0.2.5 (r1005)

MAFFT

For version 7.212 (Win64), write call setMAFFT.bat at the top of your run script.

Applied-Maths Open Source

Write call setAppliedMaths.bat at the top of your run script, to add these to the path:

EXE File             Description          Build                      Version
velvetg_mt_x86.exe   Graph construction   Multi-threaded, 32-bit     1.01.04
velvetg_mt_x64.exe   Graph construction   Multi-threaded, 64-bit     1.01.04
velvetg_st_x86.exe   Graph construction   Single-threaded, 32-bit    1.01.04
velvetg_st_x64.exe   Graph construction   Single-threaded, 64-bit    1.01.04
velveth_mt_x86.exe   Hashing              Multi-threaded, 32-bit     1.01.04
velveth_mt_x64.exe   Hashing              Multi-threaded, 64-bit     1.01.04
velveth_st_x86.exe   Hashing              Single-threaded, 32-bit    1.01.04
velveth_st_x64.exe   Hashing              Single-threaded, 64-bit    1.01.04
ray_x64.exe          Ray                  64-bit (for MPI)           1.6
ray_x86.exe          Ray                  32-bit (for MPI)           1.6
Mothur_x64.exe       Mothur               64-bit (for MPI)           1.25.1
Mothur_x86.exe       Mothur               32-bit (for MPI)           1.25.1


Launching a job

Submitting many jobs

Suppose you write an exe that you might run with… Mysim.exe init.txt 1 2 3, and you want to run it many times with a range of parameters. One way among many is to write a launch.bat file that runs “job submit” separately for each parameter set, for example (thanks Tini/James!):-

@echo off
set SubDetails=job submit /scheduler:fi--dideclusthn.dide.ic.ac.uk /jobtemplate:GeneralNodes /numcores:1
set initFile=\\networkshare\job\init.txt
set exeFile=\\networkshare\job\mysim.exe

%SubDetails% %exeFile% %initfile% 1 2 3
%SubDetails% %exeFile% %initfile% 4 5 6
%SubDetails% %exeFile% %initfile% 7 8 9
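With many parameter sets, writing launch.bat by hand gets tedious. As a sketch, assuming you'd rather generate it with a small C program, the hypothetical write_launch_lines below prints one job submit line per parameter set, using the same scheduler, template and example paths as the launch.bat above.

```c
#include <stdio.h>

/* Write one "job submit" line per parameter set to out, mirroring the
   launch.bat example. Scheduler and template are the example values
   from this page; exe and init should be full network paths. */
void write_launch_lines(FILE *out, const char *exe, const char *init,
                        const int params[][3], int nsets)
{
    for (int i = 0; i < nsets; i++) {
        fprintf(out,
                "job submit /scheduler:fi--dideclusthn.dide.ic.ac.uk "
                "/jobtemplate:GeneralNodes /numcores:1 %s %s %d %d %d\n",
                exe, init, params[i][0], params[i][1], params[i][2]);
    }
}
```

For example, open launch.bat with fopen("launch.bat", "w"), call write_launch_lines, fclose it, then run launch.bat as before.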

Suppose the job you want to run is an R script. To specify arguments to an R script run with Rcmd BATCH, you have to add '--args a=1 b=2', so you might make launch.bat like this:-

@echo off
set SubDetails=job submit /scheduler:fi--dideclusthn.dide.ic.ac.uk /jobtemplate:GeneralNodes /numcores:1
set rbatFile=\\networkshare\R-scripts\run.bat

%SubDetails% %rbatFile% 1 2
%SubDetails% %rbatFile% 3 4

And make the significant line of your run.bat:-

Rcmd BATCH '--args a=%1 b=%2' U:\R-scripts\test.R

%1 and %2 will map to the first and second arguments after the batch file name. You can go all the way up to %9.

IMPORTANT NOTE

Make sure you get the apostrophe character right in the above example. NEVER copy and paste from a Word document into a script; it will go hideously wrong. Type the apostrophes (and dashes, for that matter) in a good text editor; you want just the standard old-fashioned characters.

Requesting resources

The following modifiers next to the “/scheduler:” part of the job submit line (before your app.exe 1 2 3 part), will request things you might want…

/numcores:8 - number of cores you want

/numcores:8-12 - minimum and maximum cores appropriate for your job

/memorypernode:1024 - amount of memory needed, in megabytes

/workdir:\\networkshare\work - set working directory

/stdout:\\networkshare\out.txt - divert stdout to a file

/stderr: or /stdin: - similar for stderr and stdin

Troubleshooting / Miscellany / Q & A

  • My job doesn’t work.
    • Run the HPC Job Manager application. Find your job ID, possibly under “My Jobs, Failed”. Double-click on it, then on “View All Tasks”. Perhaps something in the output section will help.
  • Check that the path to your job is visible everywhere.
    • Avoid spaces in your paths - “job submit” doesn’t like them very much. If you must have them, they’ll be ok in the “run.bat” batch file that the cluster will run – in which case, put the path in standard double-quotes ("). But avoid them in your “launch.bat” file – you may have to relocate your run.bat to a simple non-space-containing directory.
    • Rather than putting the full application and parameters on the job submit line, you might want to write a batch file to do all that, and submit the batch file to the cluster (see the R section above, for example). But make sure the batch file is somewhere visible to the cluster.
  • My job seems to work, but reports as having failed.
    • The success/failure depends on the error code returned. If you’re running C code, end with “return 0;” for success.
  • job submit ..blah blah.. app.exe >out.txt doesn’t work!
    • The contents of out.txt will be the output of “job submit”, not of “app.exe”. What you meant is job submit ..blah blah.. /stdout:out.txt app.exe, with out.txt and app.exe replaced by network paths, of course.
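Since the scheduler decides success or failure from the exit code, a C job should return 0 only when it really succeeded. A minimal sketch, with a hypothetical run_job function that writes a placeholder line to its output file and returns EXIT_FAILURE if the file can't be opened or closed:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical job body: write a placeholder result to out_path.
   EXIT_SUCCESS (0) lets the scheduler mark the job as finished;
   EXIT_FAILURE (non-zero) makes it show up as failed. */
int run_job(const char *out_path)
{
    FILE *f = fopen(out_path, "w");
    if (f == NULL) {
        fprintf(stderr, "cannot open %s\n", out_path);
        return EXIT_FAILURE;
    }
    fputs("done\n", f);
    if (fclose(f) != 0) /* e.g. network share full or disconnected */
        return EXIT_FAILURE;
    return EXIT_SUCCESS;
}
```

main can then simply return run_job(argv[1]) (after checking argc), so whatever run_job returns becomes the job's exit code.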

Contributing Authors

Wes Hinsley, James Truscott, Tini Garske, Jeff Eaton, Hannah Clapham