Cluster Documentation
We have two high-performance clusters, both running Microsoft HPC 2012 R2. The smaller, older cluster is fi--dideclusthn and the larger is fi--didemrchnb. The HPC 2012 R2 upgrade was done in April 2014 - if you have an older HPC client, then I recommend uninstalling it through Control Panel and following the instructions below to install the most recent client tools.
Before you get started
The clusters are both Windows clusters, and the client for launching jobs on them is Windows-only, I'm afraid. So at the moment, in order to launch jobs on the cluster, you need to be able to run Windows. You won't be able to launch jobs "natively" from Linux or Mac OS; you either need to dual-boot, run a Windows virtual machine, or somehow get access to a Windows desktop or laptop.
And if you are writing C/C++/Matlab/Fortran/other compiled code on a Linux or Mac machine, remember that it will have to be compiled into a 64-bit Windows executable to run on the cluster. Interpreted languages such as R, Python and Perl will be OK - provided that they don't rely on packages that are operating-system specific. Similarly for Java - it'll run fine across platforms, unless you specifically do something platform-specific with it.
If you're preparing to use the cluster, and have doubts about whether it can run what you want - and whether it will be straightforward to develop and run on it, best talk to me and/or the IT guys first!
Getting Started
Getting access to the cluster
Send a mail to Wes (w.hinsley@imperial.ac.uk) requesting access to the cluster. He will add you to the necessary groups, and work out which cluster you should be running on. Unless you are told otherwise, this will be fi--dideclusthn. But if you have been told otherwise, then whenever you see fi--dideclusthn, replace it with the cluster name you've been given, either mentally or (less effectively) on your screen with Tipp-Ex.
Install the HPC client software
- Run \\fi--dideclusthn.dide.local\REMINST\setup.exe.
- Confirm Yes, I really want to run this.
- There's an introductory screen, and you click Next
- You have to tick the Accept, then click Next
- If it gives you any choice, select Install only the client utilities
- Next, Next, Next
- Select Use Microsoft Update and then Next
- And then Install
- And possibly Install again – if it had to install any pre-requisites first.
- And Finish, and the client installation is done.
Test so far
- To check everything is alright so far, open a new command prompt.
- (Find it under Accessories, or you can use Start, Run, cmd).
- Type
job list /scheduler:fi--dideclusthn.dide.local
And it will list the jobs you have running, which will be none so far!
Id         Owner         Name              State        Priority    Resource Request
---------- ------------- ----------------- ------------ ----------- -------- -------
0 matching jobs for DIDE\user
If it doesn’t say this, stop and mail for help, reporting what error it gives you.
Launching and cancelling jobs
Command Line
Suppose you have a file called "run.bat" in your home directory, which does what you want to run on the cluster. Let's say it's a single-core, very simple job. We’ll discuss what should be inside "run.bat" later. To submit your job on the cluster, at the command prompt, type this (all on one line) - or put it in a file called "launch.bat", and run it:-
job submit /scheduler:fi--dideclusthn.dide.local /jobtemplate:GeneralNodes /numcores:1 \\fi--san02.dide.local\homes\user\run.bat
If it's the first time you’ve run a job - or if you've recently changed your password, then it might ask you for your DIDE password and offer to remember it. Otherwise, it will just tell you the ID number of your job.
Enter the password for 'DIDE\user' to connect to 'FI--DIDECLUSTHN':
Remember this password? (Y/N)
job has been submitted. ID: 123
If you want to remove the job, then:-
job cancel 123 /scheduler:fi--dideclusthn.dide.local
Or view its details with:-
job view 123 /scheduler:fi--dideclusthn.dide.local
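If you end up scripting submissions, the commands above are easy to assemble programmatically. Below is a minimal Python sketch, not part of the cluster tools: the function names are made up for illustration, and the ID parser assumes the submit output looks like the sample shown above ("ID: 123").

```python
import re

SCHEDULER = "fi--dideclusthn.dide.local"  # swap in the cluster you were given


def submit_command(target, template="GeneralNodes", numcores=1, scheduler=SCHEDULER):
    """Build the 'job submit' command line as a list of arguments."""
    return [
        "job", "submit",
        f"/scheduler:{scheduler}",
        f"/jobtemplate:{template}",
        f"/numcores:{numcores}",
        target,
    ]


def cancel_command(job_id, scheduler=SCHEDULER):
    """Build the matching 'job cancel' command line."""
    return ["job", "cancel", str(job_id), f"/scheduler:{scheduler}"]


def parse_job_id(output):
    """Pull the numeric ID out of the text printed by 'job submit'.

    Assumes the output contains 'ID: <number>' as in the sample above;
    returns None if no ID is found (for example, on an error message).
    """
    match = re.search(r"ID:\s*(\d+)", output)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    cmd = submit_command(r"\\fi--san02.dide.local\homes\user\run.bat")
    print(" ".join(cmd))
```

On a Windows machine with the client installed, the list could be passed to subprocess.run(cmd, capture_output=True, text=True) and the captured stdout handed to parse_job_id, so that the ID is available for a later cancel or view.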
Job Manager GUI
As an alternative to the command line, you can use the job management software. The advantage is that it's a GUI. The disadvantage is that, as with all GUIs, you may not feel totally sure you know what it's up to, whereas most of the time the things you want to do are not very complex, as shown above.
The job management software will be on your Start menu. All the features are under the “Actions” menu, and hopefully it will be self-explanatory. Read the details below about launching, and you'll find the interface bits that do it in the Job Manager. However, you may find over time, especially if you run lots of jobs, that learning to do it the scripted way with the command line is quicker.
Information for running any job
Visibility
First rule: the executable (or batch) file that the cluster will run must be somewhere on the network that the cluster has access to, when it logs in as you. This amounts to any network accessible drive that you have access to when you login – including network shares, such as the temp drive, your home directory, and any specific network space set up for your project.
Do not assume that the program will run in any specific directory - even though there are ways that are meant to do that. Use full paths to specify where files should be read from, or written to. You may want to write code that takes the paths either from a parameter file, or as a command-line parameter, to give as much flexibility as possible. In the long run, this will help you more.
REMEMBER THAT your home directory is backed up every day - and it's not generally very big. So please avoid filling it with enormous sets of results that you don't actually want to keep - it will make lots of people happy if you can instead write your files to somewhere that doesn't get backed up. Even a network share on your desktop will do…
If you would like to create a network share on your desktop, then simply:
- Right-click on the folder you'd like to share, and choose “Share”.
- The next page shows who has rights to the folder - by default, you! Click on Share.
- A share is created called \\your-computer-name\the-share-name, and you'll be able to access this from the cluster.
BUT, there are limits on how many connections can be made to your desktop, so a desktop share may be useful for testing, but not for running lots of jobs.
If you really need to map a network drive letter, then put this at the top of your “run.bat” file:-
net use X: \\your-computer-name\the-share-name
Summary Comment: Try and use a project share on one of the proper servers.
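To illustrate the advice above about taking full paths as command-line parameters, here is a minimal Python sketch. The option names --input and --output are hypothetical, and the check only tests that a path is absolute in the Windows sense (a \\server\share path or a drive letter); it does not check that the cluster can actually see it.

```python
import argparse
import ntpath  # Windows path rules, usable from any operating system


def parse_args(argv=None):
    """Read the input and output locations from the command line."""
    parser = argparse.ArgumentParser(description="Example cluster job")
    parser.add_argument("--input", required=True,
                        help="full network path of the input file")
    parser.add_argument("--output", required=True,
                        help="full network path for the results")
    return parser.parse_args(argv)


def is_full_path(path):
    """True for absolute Windows paths such as a network share or X:\\dir\\file."""
    return ntpath.isabs(path)


def main(argv=None):
    args = parse_args(argv)
    for name, path in (("input", args.input), ("output", args.output)):
        if not is_full_path(path):
            # Refuse relative paths: the job may not start where you expect.
            raise SystemExit(f"--{name} should be a full path, got: {path}")
    return args
```

A run.bat line would then name both locations explicitly, so nothing depends on the directory the job happens to start in.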
Interactivity
The job must run in an entirely scripted, unattended way. It must not require any key presses or mouse clicks, or other live interactivity while it runs. So jobs generally will read some sort of input (from a file, or from the way you run the job), do some processing, and write the results somewhere for you - all without intervention.
Launching jobs
Jobs can be launched either through the Job Manager interface, or through the command line tools, which offer greater flexibility. We'll describe the command line method here; if you want to use the GUI, then it'll be a case of finding the matching boxes... Below are the specifics for our clusters. For more information, simply type job on the command line, or job submit for the list of submission-related commands.
FI--DIDECLUSTHN vs FI--DIDEMRCHNB
Job submissions, as shown below, must specify a "job template", which sets up a load of default things to make the job work. On fi--dideclusthn, the job templates are called 4Core, 8Core and GeneralNodes, which will respectively force jobs to run on the 4-core nodes, the 8-core nodes, or on any machine available.
On fi--didemrchnb, you can set the job template to be... 8Core, 12Core, 16Core, 12and16Core, and GeneralNodes - which hopefully are fairly self-explanatory. There are a couple of other job templates, (24Core and Phi), but those are a bit special purpose for now, so don't use them!
Job Submission
Job submissions from the command line can take the following form (all on one line):-
job submit /scheduler:fi--dideclusthn.dide.local /stdout:\\path\to\stdout.txt /stderr:\\path\to\stderr.txt /numcores:1-1 /jobtemplate:4Core \\path\to\run.bat
The /singlenode argument
In MS HPC (2012), Microsoft finally added a tag to allow you to say that the 'n' cores you requested must be on the same computer. Therefore, if you know precisely how many cores you want, then use the following (on one line):-
job submit /scheduler:fi--dideclusthn.dide.local /singlenode:true /jobtemplate:8Core \\path\to\run.bat
However, there is one bug with this, for the specific case where you request a whole node, regardless of how many cores it has. In this case, oddly, you have to disable single node:-
job submit /scheduler:fi--dideclusthn.dide.local /singlenode:false /numnodes:1 /jobtemplate:8Core \\path\to\run.bat
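The two cases above can be captured in a small helper so the whole-node workaround is not forgotten. This is a Python sketch with a made-up function name; pairing /numcores with /singlenode:true is an assumption based on the description above, since the example command does not show a core count.

```python
def core_flags(numcores=None, whole_node=False):
    """Return the job submit flags for a same-machine core request.

    whole_node=True applies the workaround described above: request one
    node and switch single-node off. Otherwise ask for 'numcores' cores
    that must all sit on one machine.
    """
    if whole_node:
        return ["/singlenode:false", "/numnodes:1"]
    if numcores is None or numcores < 1:
        raise ValueError("give a positive numcores, or set whole_node=True")
    # Assumption: the core count is given with /numcores alongside /singlenode:true.
    return ["/singlenode:true", f"/numcores:{numcores}"]


if __name__ == "__main__":
    print(" ".join(core_flags(numcores=4)))
    print(" ".join(core_flags(whole_node=True)))
```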
Languages and libraries supported
A number of languages, and different versions of those languages, are available on the cluster. The sections below refer to your "run.bat" file - a batch file that you create, which gets run by a cluster node when a suitable one is available. The commands described below go in this "run.bat" file; they add various directories to the path, so that the software you want can be found.
C/C++
A number of Microsoft C++ and Intel C++ runtimes are installed, but it's usually better to avoid relying on them, and make your executable as stand-alone as possible. If it requires any external libraries that you've had to download, then put the .dll file in the same directory as the .exe file. If you use Microsoft Visual Studio, in Project Properties, C/C++, Code Generation, make sure the Runtime Library is Multi-threaded (/MT) - the ones with DLL files won't work. Even so, on recent versions of the Intel and Microsoft C compilers, "static linking" doesn't really mean static when it comes to OpenMP, and you'll have to copy some DLLs and put them next to your EXE file. See the OpenMP section below.
The cluster nodes are all 64-bit, but they will run 32-bit compiled code. Just make sure you provide the right DLLs!
Java
Java doesn't really need installing: you can just put whichever version of the JRE you want somewhere that the cluster can see, and run java.exe directly. However, for convenience, you can write call setJava in your run.bat, and subsequent lines mentioning java will run Oracle's 64-bit Java 1.7.0u25.
Perl
Strawberry Perl 32-bit portable, v5.12.3.0. Put call setPerl at the top of your script.
R
You can run R jobs on all the clusters. This is the latest wisdom on how to do so - thanks to James, Jeff, Hannah and others.
First, if you want to use packages, then set up a package repository in your home directory by running this in R.
install.packages("<package>",lib="U:/R/",type="win.binary")
Now, write your run script. Suppose you have an R script in your home directory: U:\R-scripts\test.R. And suppose you’ve set up your repository as above. Your run.bat should then be:-
call <script for the R version you want – see below>
net use U: \\fi--san02.dide.local\homes\user
set R_LIBS=U:\R
set R_LIBS_USER=U:\R
Rscript U:\R-scripts\test.R
Various packages require various R versions, and the cluster supports a few. To choose one, change the first line of the script above to one of these - 32-bit or 64-bit versions of R releases.
call setr32_2_13_1.bat
call setr64_2_13_1.bat
call setr32_2_14_2.bat
call setr64_2_14_2.bat
call setr32_2_15_2.bat
call setr64_2_15_2.bat
call setr32_3_0_1.bat
call setr64_3_0_1.bat
call setr32_3_0_2.bat
call setr64_3_0_2.bat
call setr32_3_1_0.bat
call setr64_3_1_0.bat
It also seems that different R versions put their packages in different structures – sometimes adding "win-library" into there for fun. Basically, R_LIBS and R_LIBS_USER should be paths to a folder that contains a list of other folders, one for each package you've installed.
IMPORTANT: R_LIBS and R_LIBS_USER paths must NOT contain quotes or spaces. If the path to your library contains spaces, you need to use old-fashioned 8-character names. If you type dir /x in your command window, you'll see the names - Program Files tends to become PROGRA~1, for instance.
Passing parameters to R scripts
Passing parameters to R scripts means you can have fewer versions of your R and bat files and easily run whole sets of jobs. You can get parameters into R using Rscript (but not Rcmd BATCH, I think) as follows. In the run.bat example above, the Rscript statement becomes:
Rscript U:\R-scripts\test.R arg1 arg2 arg3
Within the R code, the arguments can be recovered using
args <- commandArgs(trailingOnly = TRUE)
outFileName <- args[1]            ## name of output file.
dataFileName <- args[2]           ## name of local data file.
currentR0 <- as.numeric(args[3])  ## convert this explicitly to number.
If you want your arguments to have a particular type, best to explicitly convert (see R0 above). Better still, you can pass parameters directly into the batch file that runs the R script. Command line arguments can be referenced within the batch file using %1, %2, etc. For example, if you have a batch file, runArgs.bat:
call <script for the R version you want – see above>
net use U: \\fi--san02.dide.local\homes\user
set R_LIBS=U:\R
set R_LIBS_USER=U:\R
Rscript U:\R-scripts\%1.R %2 %3 %4
then
runArgs.bat myRScript arg1 arg2 arg3
will run the R script myRScript.R and pass it the parameters arg1, arg2 and arg3. The batch file runArgs.bat is now almost a generic R script runner.
OpenMP
This has been tested with C/C++ - but the same should apply to other languages, such as Fortran, that achieve multi-threading with a DLL file.
Microsoft's C/C++ Compiler
You will need to copy vcomp100.dll (for Visual Studio 2010 - it may be vcomp90.dll for 2008) into the same directory as your executable. You can usually find the dll file in a directory similar to:- C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\redist\x64\Microsoft.VC100.OpenMP. Also make sure that you've enabled OpenMP in the Properties (C++/Language).
GCC or MinGW
Usually, this applies to Eclipse use. In project properties, set the C++ command to “g++ -fopenmp” and the C command to “gcc -fopenmp”, and in the Linker options, under Miscellaneous, Linker flags, put “-fopenmp” too. Copy the OpenMP DLLs from MinGW into the same directory as your final executable. You may find the DLLs in C:\MinGW\Bin: libgomp-1.dll and, in the same place, libpthread-2.dll, libstdc++-6.dll and libgcc_s_dw2-1.dll.
Intel C++ Compiler
Copy the libiomp5md.dll file from somewhere like C:\Program Files (x86)\Intel\ComposerXE-2011\redist\intel64\compiler\libiomp5md.dll to the same place as your executable. And in Visual Studio, make sure you've enabled OpenMP in the project properties.
How many threads?
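The OpenMP function omp_get_max_threads() by default reports the number of cores on the machine, not the number of cores allocated to your job. To determine how many cores the scheduler actually allocated to you, use the following code to dig up the environment variable CCP_NUMCPUS, which will be set by the cluster management software:-
#include <stdlib.h>  /* getenv, strtol */
#include <omp.h>     /* omp_get_max_threads, omp_set_num_threads */

/* Use the core count allocated by the scheduler (CCP_NUMCPUS) if it is set
   and valid; otherwise keep OpenMP's default. Returns the count used. */
int set_NCores() {
    char *val;
    char *endchar;
    val = getenv("CCP_NUMCPUS");
    int cores = omp_get_max_threads();
    if (val != NULL) {
        long n = strtol(val, &endchar, 10);
        if (n > 0) cores = (int) n;  /* ignore empty or non-numeric values */
    }
    omp_set_num_threads(cores);
    return cores;
}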
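The same lookup applies to any language that can read environment variables. As a hedged illustration, here is a Python sketch of the idea; the function name is made up, and only the CCP_NUMCPUS variable comes from the text above.

```python
import os


def allocated_cores(environ=os.environ):
    """Return the number of cores the scheduler allocated to this job.

    Reads CCP_NUMCPUS, which the cluster software sets for each job.
    Falls back to the machine's core count when the variable is missing
    or unusable, for example when testing on your own desktop.
    """
    value = environ.get("CCP_NUMCPUS")
    if value is not None:
        try:
            cores = int(value)
            if cores > 0:
                return cores
        except ValueError:
            pass  # not a number: fall through to the default
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("Using", allocated_cores(), "cores")
```

The result can then be used to size a worker pool, so the job does not start more workers than the cores it was given.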
WinBugs (or OpenBugs)
They are similar, but OpenBugs 3.2.1 is the one we've gone for. Something like this as your run.bat script will work:-
call setOpenBugs
openbugs \\path\to\script.txt /HEADLESS
MatLab
There are various ways of producing a non-interactive executable from Matlab. Perhaps the simplest (not necessarily the best performance) way is to use “mcc.exe” – supplied with most full versions of Matlab, including the Imperial site licence version that you've probably got.
Use mcc.exe to compile your code
Use Windows Explorer to navigate to the folder containing the “.m” files for the project you want to compile. Now use a good text editor to create a file called “compile.bat” in that folder. It should contain something similar to the following:-
mcc -m file1.m file2.m file3.m -o myexe
Don't copy/paste the text from this document, by the way - Word has a different idea of what a dash is from most other software, and will probably replace the two dashes with funny characters. Note that you have to list every single .m file that your project needs after the '-m'. Save this file, then double-click on it; it will think for some while, and produce “myexe.exe” in this example. Copy your .exe file into a network-accessible place as usual.
Launch on the cluster.
The launch.bat file will be exactly the same as before - see “Launching and cancelling jobs” above. The run script will then start with a line that tells the cluster which version of Matlab you used to compile the executable. Below is the table of different versions. The runtimes are huge and cumbersome to install on the cluster, so I haven't installed every single one. If you need one that's not listed, get in touch.
Matlab Version | First line of run script |
---|---|
R2009b | call useMatLab79 |
R2010a | call useMatLab713 |
R2011a | call useMatLab715 |
R2011b | call useMatLab716 |
R2012a (64-bit) | call useMatLab717_64 |
R2012b (64-bit) | call useMatLab80_64 |
R2013a (64-bit) | call useMatLab81_64 |
Python
Python Version | First line of run script |
---|---|
2.6.6 | call setPython26 |
2.7.2 (64-bit) | call setPython |
Launching a job
Submitting many jobs
Suppose you write an exe that you might run with… Mysim.exe init.txt 1 2 3, and you want to run it many times with a range of parameters. One way of many, is to write a launch.bat file that will run “job submit” separately, for example (thanks Tini/James!):-
@echo off
set SubDetails=job submit /scheduler:fi--dideclusthn.dide.local /jobtemplate:GeneralNodes /numcores:1
set initFile=\\networkshare\job\init.txt
set exeFile=\\networkshare\job\mysim.exe
%SubDetails% %exeFile% %initFile% 1 2 3
%SubDetails% %exeFile% %initFile% 4 5 6
%SubDetails% %exeFile% %initFile% 7 8 9
Suppose the job you want to run is an R script. To specify arguments to an R script, you have to add '--args a=1 b=2' - so you might make launch.bat like this:-
@echo off
set SubDetails=job submit /scheduler:fi--dideclusthn.dide.local /jobtemplate:GeneralNodes /numcores:1
set rbatFile=\\networkshare\R-scripts\run.bat
%SubDetails% %rbatFile% 1 2
%SubDetails% %rbatFile% 3 4
And make the significant line of your run.bat:-
Rcmd BATCH '--args a=%1 b=%2' U:\R-scripts\test.R
%1 and %2 will map to the first and second thing after the batch file. You can go all the way up to %9.
IMPORTANT NOTE
Make sure you get the apostrophe character right in the above example - NEVER copy and paste from a Word document into a script. It will go hideously wrong. Type the apostrophes (and dashes for that matter) in a good text editor - you want just the standard old-fashioned characters.
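For larger parameter sweeps, typing one submit line per run gets tedious, and the launch.bat pattern above can be generated instead. The following is a Python sketch under stated assumptions: the function names and the parameter values are illustrative only, and the paths are the placeholder ones used in the example above.

```python
from itertools import product

SCHEDULER = "fi--dideclusthn.dide.local"  # swap in the cluster you were given


def launch_lines(exe_file, init_file, param_sets,
                 scheduler=SCHEDULER, template="GeneralNodes"):
    """Return the lines of a launch.bat with one 'job submit' per parameter set."""
    prefix = (f"job submit /scheduler:{scheduler} "
              f"/jobtemplate:{template} /numcores:1")
    lines = ["@echo off"]
    for params in param_sets:
        args = " ".join(str(p) for p in params)
        lines.append(f"{prefix} {exe_file} {init_file} {args}")
    return lines


def to_batch_text(lines):
    """Join lines with Windows line endings, refusing non-ASCII characters."""
    text = "\r\n".join(lines) + "\r\n"
    text.encode("ascii")  # raises UnicodeEncodeError on curly quotes or long dashes
    return text


if __name__ == "__main__":
    # Hypothetical sweep: every combination of two values for two parameters.
    grid = product([1, 2], [10, 20])
    for line in launch_lines(r"\\networkshare\job\mysim.exe",
                             r"\\networkshare\job\init.txt", grid):
        print(line)
```

Writing the result of to_batch_text to launch.bat gives a plain ASCII file, which sidesteps the Word apostrophe and dash problem described above.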
Requesting resources
The following modifiers next to the “/scheduler:” part of the job submit line (before your app.exe 1 2 3 part), will request things you might want…
/numcores:8 - number of cores you want
/numcores:8-12 - minimum and maximum cores appropriate for your job
/memorypernode:1024 - amount of memory needed, in megabytes
/workdir:\\networkshare\work - set working directory
/stdout:\\networkshare\out.txt - divert stdout to a file
/stderr: or /stdin: - similar for stderr and stdin
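If you build submit lines in a script, the modifiers listed above can be generated from named options. This Python sketch uses a made-up helper name and only knows about the modifiers in the list above; it also rejects values containing spaces, since job submit handles those badly.

```python
# Only the modifiers described in the list above are accepted.
ALLOWED = ("numcores", "memorypernode", "workdir", "stdout", "stderr", "stdin")


def resource_flags(**options):
    """Turn keyword options into job submit modifiers, e.g. numcores=8 -> /numcores:8."""
    flags = []
    for name, value in options.items():
        if name not in ALLOWED:
            raise ValueError(f"unknown modifier: {name}")
        text = str(value)
        if " " in text:
            raise ValueError(f"avoid spaces in job submit values: {text!r}")
        flags.append(f"/{name}:{text}")
    return flags


if __name__ == "__main__":
    print(" ".join(resource_flags(numcores="8-12", memorypernode=1024,
                                  stdout=r"\\networkshare\out.txt")))
```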
Troubleshooting / Miscellany / Q & A
- My job doesn’t work.
- Run the HPC Job Manager application. Find your job id, possibly under “My Jobs, Failed”. Double click on it, then on “View All Tasks”. Perhaps something in the output section will help.
- Check that the path to your job is visible everywhere.
- Avoid spaces in your paths - “job submit” doesn’t like them very much. If you must have them, they’ll be ok in the “run.bat” batch file that the cluster will run – in which case, put the path in standard double-quotes ("). But avoid them in your “launch.bat” file – you may have to relocate your run.bat to a simple non-space-containing directory.
- Rather than putting the full application and parameters on the job submit line, you might want to write a batch file to do all that, and submit the batch file to the cluster. (See the R section above for an example.) But make sure the batch file is somewhere visible to the cluster.
- My job seems to work, but reports as having failed.
- The success/failure depends on the error code returned. If you’re running C code, end with “return 0;” for success.
- job submit ..blah blah.. app.exe >out.txt doesn’t work!
- The contents of out.txt will be the output of “job submit”, not the output of “app.exe”. You meant: job submit ..blah blah.. /stdout:out.txt app.exe (replacing out.txt and app.exe with network paths, of course).
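The point above about error codes applies to scripts as well as C programs. Here is a minimal Python sketch of the pattern: return 0 when the work succeeded and something non-zero otherwise. The run_analysis function is a placeholder for your own code.

```python
import sys


def run_analysis():
    """Placeholder for the real work; returns True on success."""
    return True


def main(task=run_analysis):
    """Run the task and translate the outcome into an exit code."""
    try:
        ok = task()
    except Exception as err:
        # Anything written to stderr ends up wherever /stderr: points.
        print(f"job failed: {err}", file=sys.stderr)
        return 1
    return 0 if ok else 1


# At the bottom of a real script, hand the code back to the operating system:
#     if __name__ == "__main__":
#         sys.exit(main())
```

If the Python script is the last command in run.bat, its exit code would normally become the batch file's exit code too, which is what the scheduler sees.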
Contributing Authors
Wes Hinsley, James Truscott, Tini Garske, Jeff Eaton, Hannah Clapham