Cluster Documentation

From MRC Centre for Outbreak Analysis and Modelling
Revision as of 18:59, 27 February 2014 by Admin

We have two high performance clusters. The smaller, fi--dideclusthn, runs Microsoft HPC 2008 R2, and the larger, fi--didemrchnb, runs Microsoft HPC 2012.

Getting Started

Getting access to the cluster

Send an email to Wes (w.hinsley@imperial.ac.uk) requesting access to the cluster. He will add you to the necessary groups, and work out which cluster you should be running on. Unless you are told otherwise, this will be fi--dideclusthn. But if you have been told otherwise, then whenever you see fi--dideclusthn, replace it with the cluster name you’ve been given, either mentally or (less effectively) on your screen with Tipp-Ex.

Install the HPC client software

Note A: For computers connected to the network through VPN, this should all work, but for strange reasons, you sometimes need to add "dide.local" to fi--dideclusthn on the external machine to coax it to look for the head node in the right domain. Usually after it's been told once, it remembers from then on.

  • Run \\fi--dideclusthn\REMINST\setup.exe.
  • Confirm Yes, I really want to run this.
  • There's an introductory screen, and you click Next
  • You have to tick the Accept, then click Next
  • If it gives you any choice, select Install only the client utilities
  • Next, Next, Next
  • Select Use Microsoft Update and then Next
  • And then Install
  • And possibly Install again – if it had to install any pre-requisites first.
  • And Finish, and the client installation is done.

Test so far

  • To check everything is alright so far, open a new command prompt.
  • (Find it under Accessories, or you can use Start, Run, cmd).
  • Type job list /scheduler:fi--dideclusthn

And it will list the jobs you have running, which will be none so far!

Id         Owner         Name              State        Priority    Resource Request
---------- ------------- ----------------- ------------ ----------- -------- -------

0 matching jobs for DIDE\user

If it doesn’t say this, stop and mail for help, reporting what error it gives you.

Launching and cancelling jobs

Command Line

Suppose you have a file called "run.bat" in your home directory, which does what you want to run on the cluster. Let's say it's a single-core, very simple job. We’ll discuss what should be inside "run.bat" later. To submit your job on the cluster, at the command prompt, type this (all on one line) - or put it in a file called "launch.bat", and run it:-

job submit /scheduler:fi--dideclusthn /jobtemplate:GeneralNodes /numcores:1 \\fi--san01\homes\user\run.bat

If it's the first time you’ve run a job - or if you've recently changed your password, then it might ask you for your DIDE password and offer to remember it. Otherwise, it will just tell you the ID number of your job.

Enter the password for ‘DIDE\user’ to connect to ‘FI--DIDECLUSTHN’:
Remember this password? (Y/N)
job has been submitted. ID: 123

If you want to remove the job, then:- job cancel 123 /scheduler:fi--dideclusthn

Or view its details with job view 123 /scheduler:fi--dideclusthn

Job Manager GUI

As an alternative to the command line, you can use the job management software. The advantage is that it’s a GUI. The disadvantage, as with all GUIs, is that you may not feel totally sure you know what it’s up to, and most of the time the things you want to do are not very complex, as the commands above show.

The job management software (HPC Job Manager) will be on your Start menu. All the features are under the “Actions” menu, and hopefully it will be self-explanatory. Read the details below about launching, and you'll find the interface bits that do it in the Job Manager. However, you may find over time, especially if you run lots of jobs, that learning to do it the scripted way with the command line can be quicker.

Information for running any job

Visibility

First rule: the executable (or batch) file that the cluster will run must be somewhere on the network that the cluster has access to, when it logs in as you. This amounts to any network accessible drive that you have access to when you login – including network shares, such as the temp drive, your home directory, and any specific network space set up for your project.

Do not assume that the program will run in any specific directory – even though there are ways that are meant to do that. Use full paths to specify where files should be read from, or written to. You may want to write code that takes the paths either from a parameter file, or as a command-line parameter, to give as much flexibility as possible. In the long run, this will help you more. REMEMBER THAT your home directory is backed up every day – and it’s not generally very big. So please avoid filling it with enormous sets of results that you don’t actually want to keep – it will make lots of people happy if you can rather write your files to somewhere that doesn’t get backed up. Even a network share on your desktop will do…
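As a minimal sketch of the "paths as command-line parameters" idea, the Python fragment below takes its input and output locations from the arguments it is run with, and refuses anything that is not a full path. The function names is_full_path and parse_paths, and the example paths in the usage note, are hypothetical and only for illustration.

```python
def is_full_path(path):
    """True for a network path (\\\\server\\share\\...) or a drive path (X:\\...)."""
    unc = path.startswith("\\\\")
    drive = len(path) >= 3 and path[0].isalpha() and path[1:3] == ":\\"
    return unc or drive


def parse_paths(argv):
    """Return (input_path, output_path) from an argument list such as sys.argv.

    Hypothetical helper: argv[0] is the script name, argv[1] the file to
    read and argv[2] the file to write. Relative paths are rejected so
    the job never depends on the directory it happens to start in.
    """
    if len(argv) != 3:
        raise ValueError("expected: <script> <input path> <output path>")
    input_path, output_path = argv[1], argv[2]
    for path in (input_path, output_path):
        if not is_full_path(path):
            raise ValueError("not a full path: " + path)
    return input_path, output_path
```

A script would call parse_paths(sys.argv) at start-up, and the line in run.bat would then name both files in full, for example \\networkshare\job\init.txt and \\networkshare\job\out.txt.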

If you would like to create a network share on your desktop, then simply…

  • Right click on the folder you’d like to share, and choose “Share”.
  • The next page shows who has rights to the folder – by default, you!
  • Click on Share.
  • A share is created called \\your-computer-name\the-share-name
  • And you’ll be able to access this from the cluster.

BUT, there are limits on how many connections can be made to your desktop, so a desktop share may be useful for testing, but not for running lots of jobs.

If you really need to map a network drive letter, then at the top of your “run.bat” file:- net use X: \\your-computer-name\the-share-name

Summary Comment: Try and use a project share on one of the proper servers.

Interactivity

The job must run in an entirely scripted, unattended way. It must not require any key presses or mouse clicks, or other live interactivity while it runs. So jobs generally will read some sort of input (from a file, or from the way you run the job), do some processing, and write the results somewhere for you - all without intervention.

Launching jobs

Jobs can be launched either through the Job Manager interface, or through the command line tools, which offer greater flexibility. We'll describe the command line method here; if you want to use the GUI, then it'll be a case of finding the matching boxes... Below are the specifics for our clusters. For more information, simply type job on the commandline, or job submit for the list of submission-related commands.

On FI--DIDECLUSTHN

Job submissions from the command line can take the following form:-

job submit /scheduler:fi--dideclusthn /stdout:\\path\to\stdout.txt /stderr:\\path\to\stderr.txt /numcores:1-1 /jobtemplate:4Core \\path\to\run.bat

The job template is not strictly necessary, but it's good to specify it anyway. It can be either 4Core, 8Core, or GeneralNodes if you don't mind which nodes get used.

On FI--DIDEMRCHNB

job submit /scheduler:fi--didemrchnb /stdout:\\path\to\stdout.txt /stderr:\\path\to\stderr.txt /numcores:1-1 /jobtemplate:8Core \\path\to\run.bat

The job template here is compulsory. It can be one of 8Core, 12Core, 12and16Core, 16Core, or 24Core.
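If you script your submissions, it can help to assemble the command line in one place and check the template name against the cluster before submitting. The following is a minimal sketch in Python, assuming only the template names and arguments listed in this section; the helper name build_submit and the TEMPLATES table are hypothetical, and the function only builds the text of the command, it does not run it.

```python
# Template names per cluster, as listed in this section.
TEMPLATES = {
    "fi--dideclusthn": ["4Core", "8Core", "GeneralNodes"],
    "fi--didemrchnb": ["8Core", "12Core", "12and16Core", "16Core", "24Core"],
}


def build_submit(scheduler, template, cores, run_bat, stdout, stderr):
    """Return a 'job submit' command line as one string (hypothetical helper)."""
    if template not in TEMPLATES[scheduler]:
        raise ValueError("{0} is not a template on {1}".format(template, scheduler))
    parts = [
        "job submit",
        "/scheduler:" + scheduler,
        "/stdout:" + stdout,
        "/stderr:" + stderr,
        "/numcores:{0}-{0}".format(cores),  # same minimum and maximum
        "/jobtemplate:" + template,
        run_bat,
    ]
    return " ".join(parts)


print(build_submit("fi--didemrchnb", "8Core", 1,
                   r"\\path\to\run.bat",
                   r"\\path\to\stdout.txt",
                   r"\\path\to\stderr.txt"))
```

The printed line can be pasted into a command prompt or written into a launch.bat file.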

The /singlenode argument

In Microsoft HPC 2012, Microsoft finally added an argument to allow you to say that the 'n' cores you requested must be on the same computer. Therefore, if you know precisely how many cores you want, then use the following:-

job submit /scheduler:fi--didemrchnb /singlenode:true /stdout:\\path\to\stdout.txt /stderr:\\path\to\stderr.txt /numcores:8-8 /jobtemplate:8Core \\path\to\run.bat

However, there is one bug with this, for the specific case where you request a whole node, regardless of how many cores it has. In this case, oddly, you have to disable single node:-

job submit /scheduler:fi--didemrchnb /singlenode:false /stdout:\\path\to\stdout.txt /stderr:\\path\to\stderr.txt /numnodes:1 /jobtemplate:8Core \\path\to\run.bat
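The rule above is easy to get the wrong way round, so if you generate submit lines from a script it may be worth encoding it once. This is a small sketch in Python under that assumption; core_request is a hypothetical name, and it only returns the relevant part of the command line.

```python
def core_request(cores=None, whole_node=False):
    """Return the core/node arguments for fi--didemrchnb (hypothetical helper).

    A fixed number of cores on one machine uses /singlenode:true.
    A whole node, whatever its size, needs /singlenode:false with /numnodes:1.
    """
    if whole_node:
        return "/singlenode:false /numnodes:1"
    if cores is None:
        raise ValueError("give a core count, or set whole_node=True")
    return "/singlenode:true /numcores:{0}-{0}".format(cores)
```

For example, core_request(8) gives the arguments for eight cores on one machine, and core_request(whole_node=True) gives the arguments for a whole node.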

Languages and libraries supported

A number of languages, and different versions of languages, are available on the cluster. The sections below refer to your "run.bat" file - a batch file that you will create, which will get run by a cluster node when a suitable one is available. The commands described below are to be put in this "run.bat" file; they add the relevant directories to the path, so that the software you want can be found.

C/C++

A number of Microsoft C++ and Intel C++ runtimes are installed, but it's usually better to try and avoid using them, and make your executable as stand-alone as possible. If it requires any external libraries that you've had to download, then put the .dll file in the same directory as the .exe file. If you use Microsoft Visual Studio, in Project Properties, C/C++, Code Generation, make sure the Runtime Library is Multi-threaded (/MT) – the ones with DLL files won’t work. Even so, on recent versions of the Intel and Microsoft C compilers, "static linking" doesn’t really mean static when it comes to OpenMP, and you’ll have to copy some DLLs and put them next to your EXE file. See the OpenMP section below.

The cluster nodes are all 64-bit, but they will run 32-bit compiled code. Just make sure you provide the right DLLs!

Java

Java doesn't really need installing: you can just put whichever version of the JRE you want somewhere that the cluster can see, and run java.exe directly. However, for convenience, you can write call setJava, and subsequent lines mentioning java will run Oracle's 64-bit Java 1.7.0u25.

Perl

Strawberry Perl 32-bit portable, v5.12.3.0. Put call setPerl at the top of your script.

R

You can run R jobs on all the clusters. This is the latest wisdom on how to do so – thanks to James, Jeff, Hannah and others.

First, if you are wanting to use packages, then set up a package repository on your home directory by running this in R.

install.packages("<package>",lib="U:/R/",type="win.binary")

Now, write your run script. Suppose you have an R script in your home directory: U:\R-scripts\test.R. And suppose you’ve set up your repository as above. Your run.bat should then be:-

	call <script for the R version you want – see below>
	net use U: \\fi--san01\homes\user
	set R_LIBS=U:\R
	set R_LIBS_USER=U:\R
	Rscript U:\R-scripts\test.R

Various packages require various R versions, and the cluster supports a few versions. To choose which one, change the first line of the script above to one of these – 32-bit or 64-bit versions of R releases.

call setr32_2_13_1.bat		call setr64_2_13_1.bat
call setr32_2_14_2.bat		call setr64_2_14_2.bat
call setr32_2_15_2.bat		call setr64_2_15_2.bat
call setr32_3_0_1.bat           call setr64_3_0_1.bat
call setr32_3_0_2.bat           call setr64_3_0_2.bat

It also seems that different R versions put their packages in different structures – sometimes adding "win-library" into there for fun. Basically, R_LIBS and R_LIBS_USER should be paths to a folder that contains a list of other folders, one for each package you've installed.

IMPORTANT: R_LIBS and R_LIBS_USER paths must NOT contain quotes, nor spaces. If the path to your library contains spaces, you need to use old-fashioned 8-character names. If in your command window, you type dir /x, you’ll see the names – Program Files tends to become PROGRA~1 for instance.

Passing parameters to R scripts

Passing parameters to R scripts means you can have fewer versions of your R and bat files and easily run whole sets of jobs. You can get parameters into R using Rscript (but not Rcmd BATCH, I think) as follows. In the run.bat example above, the Rscript statement becomes:

Rscript U:\R-scripts\test.R arg1 arg2 arg3

Within the R code, the arguments can be recovered using

args <- commandArgs(trailingOnly = TRUE)

outFileName <- args[1]     ## name of output file.
dataFileName <- args[2]    ## name of local data file. 
currentR0 <- as.numeric(args[3]) ## convert this explicitly to number. 

If you want your arguments to have a particular type, best to explicitly convert (see R0 above). Better still, you can pass parameters directly into the batch file that runs the R script. Command line arguments can be referenced within the batch file using %1, %2, etc. For example, if you have a batch file, runArgs.bat:

call <script for the R version you want – see above>
net use U: \\fi--san01\homes\user
set R_LIBS=U:\R
set R_LIBS_USER=U:\R
Rscript U:\R-scripts\%1.R %2 %3 %4

then

runArgs.bat myRScript arg1 arg2 arg3

will run the R script myRScript.R and pass it the parameters arg1, arg2 and arg3. The batch file runArgs.bat is now almost a generic R script runner.

OpenMP

This has been tested with C/C++ - but the same should apply to other languages, such as Fortran, that achieve multi-threading with a DLL file.

Microsoft's C/C++ Compiler

You will need to copy vcomp100.dll (for Visual Studio 2010 – it may be vcomp90.dll for 2008) into the same directory as your executable. You can usually find the DLL file in a directory similar to:- C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\redist\x64\Microsoft.VC100.OpenMP. Also make sure that you've enabled OpenMP in the Properties (C++/Language).

GCC or MinGW

Usually, this applies to Eclipse use. In project properties, set the C++ command to “g++ -fopenmp” and the C command to “gcc -fopenmp”, and in the Linker options, under Miscellaneous, Linker flags, put “-fopenmp” too. Copy the OpenMP DLLs from MinGW into the same directory as your final executable. You may find the DLLs are in C:\MinGW\Bin\libgomp-1.dll and in the same place, libpthread-2.dll, libstdc++-6.dll and libgcc_s_dw2-1.dll.

Intel C++ Compiler

Copy the libiomp5md.dll file from somewhere like C:\Program Files (x86)\Intel\ComposerXE-2011\redist\intel64\compiler\libiomp5md.dll to the same place as your executable. And in Visual Studio, make sure you've enabled OpenMP in the project properties.

How many threads?

The OpenMP function omp_get_max_threads() by default reflects the number of cores on the machine, not the number of cores allocated to your job. To determine how many cores the scheduler actually allocated to you, use the following code to dig up the environment variable CCP_NUMCPUS, which will be set by the cluster management software:-

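		#include <stdlib.h>   /* getenv, strtol */
		#include <omp.h>      /* omp_get_max_threads, omp_set_num_threads */

		/* Set the OpenMP thread count from CCP_NUMCPUS and return it. */
		int set_NCores() {
		   char * val;
		   char * endchar;
		   val = getenv("CCP_NUMCPUS");
		   int cores = omp_get_max_threads();   /* fallback when CCP_NUMCPUS is not set */
		   if (val!=NULL) cores = (int) strtol (val,&endchar,10);
		   omp_set_num_threads(cores);
		   return cores;
		}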
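The same environment variable can be read from other languages when deciding how many workers to start. As a sketch only, here is the equivalent lookup in Python; allocated_cores is a hypothetical name, the fallback to the machine's core count mirrors the C code above, and the code avoids newer syntax because the cluster's Python versions are old.

```python
import multiprocessing
import os


def allocated_cores(environ=None):
    """Return the number of cores this job may use (hypothetical helper).

    Reads CCP_NUMCPUS, which the cluster management software sets, and
    falls back to the machine's core count when the variable is absent,
    for example when testing on a desktop.
    """
    if environ is None:
        environ = os.environ
    value = environ.get("CCP_NUMCPUS")
    if value is None:
        return multiprocessing.cpu_count()
    return int(value)
```

The result could be used, for instance, as the size of a multiprocessing.Pool.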

Considerations for FI--DIDECLUSTHN

Microsoft HPC 2008 R2 and earlier had a major oversight: if you submitted a job requesting a number of cores, HPC would allocate them to you, but not necessarily on the same node. For this reason, previous advice was to use /numnodes:1 on the job submission line to ensure it would work properly, at the cost of cluster-wide efficiency.

A complex workaround was made using activation filters, to ensure that jobs requesting a certain number of cores always get them on the same node. So please use job submit /numcores: to specify how many cores you would like. You can specify /numcores:min-max, and the job will start with as many cores as possible in that range.

  • Note that if you do submit with /numcores:min-max, then when you see the job running in the job manager, the min and max will have been adjusted to how many cores it gave you.
  • While the cluster manager is arranging the job to run, it may flicker, change state briefly between configuring and queued, before running… this is all normal.
  • Jobs may sit in the queue for up to 60 seconds even while free nodes are available – this is the maximum time that HPC will wait before checking if it can run queued jobs.

WinBugs (or OpenBugs)

They are similar, but OpenBugs 3.2.1 is the one we've gone for. Something like this as your run.bat script will work:-

call setOpenBugs
openbugs \\path\to\script.txt /HEADLESS

MatLab

There are various ways of producing a non-interactive executable from Matlab. Perhaps the simplest (not necessarily the best performance) way is to use “mcc.exe” – supplied with most full versions of Matlab, including the Imperial site licence version that you've probably got.

Use mcc.exe to compile your code

Use windows explorer to navigate to the folder where the “.m” files are for the project you want to compile. Now use a good text editor to create a file called “compile.bat” in that folder. It should contain something similar to the following:-

mcc -m file1.m file2.m file3.m -o myexe

Don’t copy/paste the text from this document by the way – Word has a different idea of what a dash is to most other software, and will probably replace the two dashes with funny characters. You have to list every single .m file that your project needs after the '-m'. If you save this file and then double-click on it, it will think for some while, and produce “myexe.exe” in this example. Copy your .exe file into a network accessible place as usual.

Launch on the cluster.

The launch.bat file will be exactly the same as before – see “Launching and cancelling jobs” above. The run script will then start with a line that tells the cluster which version of Matlab you used to compile your code. Below is the table of different versions. The runtimes are huge and cumbersome to install on the cluster, so as a result I haven’t installed every single one. If you need one that’s not listed, get in touch.


Matlab Version     First line of run script
-----------------  ------------------------
R2009b             call useMatLab79
R2010a             call useMatLab713
R2011a             call useMatLab715
R2011b             call useMatLab716
R2012b (64-bit)    call useMatLab80_64

Python

Python Version     First line of run script
-----------------  ------------------------
2.6.6              call setPython26
2.7.2 (64-bit)     call setPython

Launching a job

Submitting many jobs

Suppose you write an exe that you might run with… Mysim.exe init.txt 1 2 3, and you want to run it many times with a range of parameters. One way of many, is to write a launch.bat file that will run “job submit” separately, for example (thanks Tini/James!):-

@echo off
set SubDetails=job submit /scheduler:fi--dideclusthn /jobtemplate:GeneralNodes /numcores:1
set initFile=\\networkshare\job\init.txt
set exeFile=\\networkshare\job\mysim.exe

%SubDetails% %exeFile% %initFile% 1 2 3
%SubDetails% %exeFile% %initFile% 4 5 6
%SubDetails% %exeFile% %initFile% 7 8 9
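For a larger sweep, typing one line per parameter set gets tedious, and the lines can be generated instead. The sketch below, in Python, prints one "job submit" line per combination of parameters; the helper name sweep_lines and the parameter values are made up for illustration, while the submit arguments and paths are the ones from the example above.

```python
import itertools

SUBMIT = ("job submit /scheduler:fi--dideclusthn "
          "/jobtemplate:GeneralNodes /numcores:1")
EXE_FILE = r"\\networkshare\job\mysim.exe"
INIT_FILE = r"\\networkshare\job\init.txt"


def sweep_lines(parameter_sets):
    """Return one 'job submit' line per parameter set (hypothetical helper)."""
    lines = []
    for params in parameter_sets:
        args = " ".join(str(p) for p in params)
        lines.append("{0} {1} {2} {3}".format(SUBMIT, EXE_FILE, INIT_FILE, args))
    return lines


# Every combination of two illustrative parameter ranges (made-up values).
grid = list(itertools.product([1, 2], [0.5, 1.5]))
for line in sweep_lines(grid):
    print(line)
```

Redirecting the output of this script to a file gives you a launch.bat that you can read through before running it.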

Suppose the job you want to run is an R script. To specify arguments to an R script, you have to add '--args a=1 b=2' - so… you might make launch.bat like this:-

@echo off
set SubDetails=job submit /scheduler:fi--dideclusthn /jobtemplate:GeneralNodes /numcores:1
set rbatFile=\\networkshare\R-scripts\run.bat

%SubDetails% %rbatFile% 1 2
%SubDetails% %rbatFile% 3 4

And make the significant line of your run.bat:-

Rcmd BATCH '--args a=%1 b=%2' U:\R-scripts\test.R

%1 and %2 will map to the first and second thing after the batch file. You can go all the way up to %9.

IMPORTANT NOTE

Make sure you get the apostrophe character right in the above example – NEVER copy and paste from a Word document into a script. It will go hideously wrong. Type the apostrophes (and dashes for that matter) in a good text editor – you want just the standard old-fashioned characters.


Requesting resources

The following modifiers next to the “/scheduler:” part of the job submit line (before your app.exe 1 2 3 part), will request things you might want…

/numcores:8 - number of cores you want

/numcores:8-12 - minimum and maximum cores appropriate for your job

/memorypernode:1024 - amount of memory needed, in megabytes

/workdir:\\networkshare\work - set working directory

/stdout:\\networkshare\out.txt - divert stdout to a file

/stderr: or /stdin: - similar for stderr and stdin

Troubleshooting / Miscellany / Q & A

  • My job doesn’t work.
    • Run the HPC Job Manager application. Find your job id, possibly under “My Jobs, Failed”. Double click on it, then on “View All Tasks”. Perhaps something in the output section will help.
  • Check that the path to your job is visible everywhere.
    • Avoid spaces in your paths - “job submit” doesn’t like them very much. If you must have them, they’ll be ok in the “run.bat” batch file that the cluster will run – in which case, put the path in standard double-quotes ("). But avoid them in your “launch.bat” file – you may have to relocate your run.bat to a simple non-space-containing directory.
    • Rather than putting the full application and parameters on the job submit line, you might want to write a batch file to do all that, and submit the batch file to the cluster. (See the R section above for an example.) But make sure the batch file is somewhere visible to the cluster.
  • My job seems to work, but reports as having failed.
    • The success/failure depends on the error code returned. If you’re running C code, end with “return 0;” for success.
  • job submit ..blah blah.. app.exe >out.txt doesn’t work!
    • The contents of out.txt will be the result of “job submit”, not the result of “app.exe”. You meant to say: job submit ..blah blah.. /stdout:out.txt app.exe (correcting out.txt and app.exe to network paths, of course).

Contributing Authors

Wes Hinsley, James Truscott, Tini Garske, Jeff Eaton, Hannah Clapham