Cluster Documentation: Difference between revisions

From MRC Centre for Outbreak Analysis and Modelling
 
(34 intermediate revisions by the same user not shown)
Line 1: Line 1:
We have two high performance clusters, both running Microsoft HPC 2012 R2, Update 3 The smaller older cluster is [[fi--dideclusthn]] and the larger is [[fi--didemrchnb]]. The latest upgrade was done in January 2016 - if you have an older HPC client, then I recommend uninstalling it through Control Panel, and following the instructions below to install the most recent client tools. If you're not a windows user, see the [[HPC Web Portal]]. COMING SOON: we're working on a linux cluster, [[fi--didelxhn]].  
* We have two high performance clusters for general use, both running Microsoft HPC 2012 R2, Update 3. The smaller, older cluster is [[fi--dideclusthn]] and the larger is [[fi--didemrchnb]].  
 
* Many users in the department write code and packages in R. We have in-house packages that can allow an almost-interactive experience of running jobs on our high performance clusters, so if that sounds like what you'd like, you might as well stop reading here, and go straight to [https://github.com/mrc-ide/didehpc The DIDEHPC package] - or if you need to join the mrc-ide github org first, go to [https://github.com/mrc-ide/welcome The welcome page].
 
If you're not using R, or you are interested in other ways, then continue...


== Windows Users ==
== Windows Users ==


The cluster nodes are Windows 2012-R2 based. This is good news if you're a Windows user. You have the option of using the MS client for launching jobs (see below), or you can use try the new [[HPC Web Portal]]. If you're preparing to use the cluster, and have doubts about whether it can run what you want - and whether it will be straightforward to develop and run on it, best talk to me and/or the IT guys first!
The cluster nodes are Windows 2012-R2 based. This is good news if you're a Windows user, and especially easy for PCs within the departmental DIDE domain. You have the option of using the MS client for launching jobs (see below), or you can try the new [[HPC Web Portal]]. If you're preparing to use the cluster, and have doubts about whether it can run what you want, or whether it will be straightforward to develop and run on it, best talk to me and/or the IT guys first!


== Linux and MAC Users ==
== Linux and MAC Users ==
Line 11: Line 15:
=== Launching jobs on Linux/Mac ===
=== Launching jobs on Linux/Mac ===


The best way might be the new [[HPC Web Portal]], which lets you submit jobs through a webpage. It's very new, but quite neat. Give it a try. And feedback to me any issues!  
* The best way might be the [[HPC Web Portal]], which lets you submit jobs through a webpage. Give it a try, and send me feedback on any issues!


Alternatively, you can either (1) go down the VM route, installing a windows virtual machine, eg VMWare or Parallels, and follow the instructions below. (2) Remote Desktop to your friend's Windows machine. These tend not to be very convenient, but some users have submitted thousands of jobs this way.
* Alternatively, you can either (1) go down the VM route, installing a windows virtual machine (e.g. VMWare or Parallels) and following the instructions below, or (2) Remote Desktop to your friend's Windows machine. These tend not to be very convenient, but some users have submitted thousands of jobs this way.


=== What Linux/Mac jobs can I run? ===
=== What Linux/Mac jobs can I run? ===


Many jobs will be platform indepedent (R, Java, Perl, Python). You may need to check you have the right dependencies - packages in R for example can vary between different platforms, so make sure you've got a windows binary folder in your package repositry. Also, for many linux applications, there is a windows port. Mileage can vary, but it's worth a try before you give up hope. Some of these might be installed on the cluster already - see below. Or some you may be able to download and call them from your run script.
* Many jobs may be platform independent. Where compiled code is involved though, you'll have some extra work to make sure the binary parts can be run in windows. For R, [https://github.com/mrc-ide/didehpc the tools] help you sort that out. For others, you may need to run a job on the cluster to do your builds. Also, for many linux applications, there is a windows port. Mileage can vary, but it's worth searching for one before you give up hope. Some of these might be installed on the cluster already - see below. Others you may be able to download and call from your run script.


For C code, (including R with RCpp I think), we'd have to cross-compile. So either you compile them on a windows virtual machine, or you work out how to cross-compile and produce a windows excecutable from linux. (I believe you do that by installing MinGW for linux. Yes I know, MinGW is a conversion of GCC to Windows, which produces windows binaries. Well, there's been a back-port which does the same thing in linux I believe.) Other C compilers may have options for this. We'll solve these problems if we get to them.
* For C code, again, we can run scripts on the cluster to build the binaries for you, if your code can be compiled with (say) the MinGW port of gcc. Cygwin can also be used (although seems less common these days) with accompanying dlls. Get in touch if you need further ideas; most things can be solved.  


Lastly, compiled Matlab is a complex one; I don't think there's any way you can run Matlab on Windows/Linux/Mac, and produce an executable for a different operating system that you're running in. The only way for that, I'm afraid, is the VM/borrow a Windows machine.
* Matlab is a little more tricky - we have a range of colossal matlab runtimes on the cluster, but you'd need to compile a windows binary to use them, since the matlab platforms are distinct in every way. For this, probably find a friend, or get in touch, and we'll work out how you can compile for it.


=== Still feeling optimistic? ===
* There is also the option of running a Windows VM, which might solve all of the above problems.


Well done - and all this is a work in progress - if you want something the cluster can't do here, let me know, and we'll solve it together. Most of the rest of this page will be relevant; you'll still need to create a file called "run.bat" (or "run.cmd") - sorry if this makes you feel unclean, but Windows won't execute it otherwise! What's in that file will often be similar to what's in a shell script.
=== But I really want a unix cluster. ===
 
* We have run linux clusters in the past ([[fi--didelxhn]]), but there was very low demand, whereas the other clusters are consistently well used. Accessing DIDE network shares from linux was a little harder in some aspects, but we did not implement the more traditional linux approach of having dedicated separate cluster storage that inputs and outputs would have to be staged on to, in order for the cluster to see them. Generally, the flexibility of the other clusters makes such things quite straightforward, and may even feel a pleasure if your background is with the less domain-integrated style of cluster architecture.
 
* It has also become far easier to write software that builds on multiple platforms with good practice. Tools that only run on one platform are rare, and in some circumstances have raised questions about the quality of those packages. Alternatives are often available.
 
* Where compilation is the issue - for example if you want to run binaries on the cluster, but your compilation machine runs linux, or is a mac, (Matlab via MCC, or C++ code for example), then get in touch, and there are a few ways we can think of solving that, either by getting someone else, or the cluster itself, to build your code.
 
* There are also resources in [https://www.imperial.ac.uk/admin-services/ict/self-service/research-support/rcs/computing/high-throughput-computing/ ICT], who run a linux cluster with the file-staging approach, and also (being shared between a much larger pool of users) have rather stricter rules on job durations and resources than we provide locally in DIDE.


== Getting Started ==
== Getting Started ==
Line 63: Line 75:
If you're using a machine that isn't actually logged into DIDE, then the client software will have a problem working out who you are. In this case, there are two things you need to do.
If you're using a machine that isn't actually logged into DIDE, then the client software will have a problem working out who you are. In this case, there are two things you need to do.
* Check you've really got a DIDE username. If you don't know what it is, talk to your lovely IT team.
* Check you've really got a DIDE username. If you don't know what it is, talk to your lovely IT team.
* Connect to the DIDE VPN (see http://www1.imperial.ac.uk/publichealth/departments/ide/it/remote/) - login to that with your DIDE account.
* Connect to the DIDE VPN using ZScaler from https://uafiles.cc.ic.ac.uk/, and login to that with your IC account - but you'll still need your DIDE account to access DIDE servers after the ZScaler connection is up and running.
* Alternatively to the DIDE VPN, you can install ICT's Juniper client from http://secureaccess.imperial.ac.uk, and login to that with your IC account - but you'll still need your DIDE account to access DIDE servers after the Juniper connection is up and running.
* Now we'll open a command window using "runas" - which lets you choose which identity the system thinks you are within that command window:-
* Now we'll open a command window using "runas" - which lets you choose which identity the system thinks you are within that command window:-
<pre>
<pre>
Line 94: Line 105:


The job management software will be on your start menu, as above. All the features are under the “Actions” menu, and hopefully it will be self explanatory. Read the details below about launching, and you'll find the interface bits that do it in the Job Manager. However, you may find over time, especially if you run lots of jobs, that learning to do it the scripted way with the command-line can be quicker.
The job management software will be on your start menu, as above. All the features are under the “Actions” menu, and hopefully it will be self explanatory. Read the details below about launching, and you'll find the interface bits that do it in the Job Manager. However, you may find over time, especially if you run lots of jobs, that learning to do it the scripted way with the command-line can be quicker.
=== All platforms: The Web Portal ===




Line 105: Line 111:
=== Visibility ===
=== Visibility ===


First rule: the executable (or batch) file that the cluster will run must be somewhere on the network that the cluster has access to, when it logs in as you. This amounts to any network accessible drive that you have access to when you login – including network shares, such as the temp drive, your home directory, and any specific network space set up for your project.
Any executable/batch file that the cluster will run must be somewhere on the network that the cluster has access to, when it logs in as you. This amounts to any network accessible drive that you have access to when you login – including network shares, such as the temp drive, your home directory, and any specific network space set up for your project.


Do not assume that the program will run in any specific directory – even though there are ways that are meant to do that. Use full paths to specify where files should be read from, or written to. You may want to write code that takes the paths either from a parameter file, or as a command-line parameter, to give as much flexibility as possible. In the long run, this will help you more.
After a job is launched, all paths are relative to the '''/workdir:''' specified on the job submit command-line. Windows does not usually allow a network path to be the current directory, and MS-HPC somehow gets around this. But if in doubt, use fully-qualified paths to specify where files should be read from, or written to. You may want to write code that takes the paths from a parameter file, or as a command-line parameter, to give as much flexibility as possible.  
REMEMBER THAT your home directory is backed up every day – and it’s not generally very big. So please avoid filling it with enormous sets of results that you don’t actually want to keep – it will make lots of people happy if you can rather write your files to somewhere that doesn’t get backed up. Even a network share on your desktop will do…


If you would like to create a network share on your desktop, then simply…
Your home directory and the temp drive are on departmental servers that don't have an especially fast connection to the cluster. So avoid having thousands of jobs all reading/writing to those places at once. Additionally, the temp drive is shared between all users, and your home drive has a limited quota. Your home drive is also backed up every day, so avoid using it as scratch storage for results you don’t actually want to keep. Generally, the shares you'll want to use from the clusters will be ones containing the word "NAS", which the big cluster has fast access to.
Right Click on the folder you’d like to share, and choose “Share”
The next page shows who has rights to the folder – by default, you!
Click on Share.
A share is created called \\your-computer-name\the-share-name
And you’ll be able to access this from the cluster.
 
BUT, there are limits on how many connections can be made to your desktop, so a desktop share may be useful for testing, but not for running lots of jobs.
 
If you really need to map a network drive letter, then at the top of your “run.bat” file:-
net use X: \\your-computer-name\the-share-name
 
Summary Comment: Try and use a project share on one of the proper servers.


=== Interactivity ===
=== Interactivity ===
Line 180: Line 173:
     set path=%JAVA_HOME%\bin;%JAVA_HOME%\bin\server;%path%
     set path=%JAVA_HOME%\bin;%JAVA_HOME%\bin\server;%path%


For convenience, if you're happy with Oracle's Java 1.8.0 update 181 (July 2018), then <code>call setJava64</code> in your run script will set this up for you. Incidentally the above will also add what's needed for using the rJava package in R - in which case you'll perhaps also want something like <code>call setR64_3_5_1</code> in your run script.
For convenience, if you're happy with Azul's OpenJDK Java 1.8.0 update 308 (September 2021), then <code>call setJava64</code> in your run script will set this up for you. Incidentally the above will also add what's needed for using the rJava package in R - in which case you'll perhaps also want something like <code>call setR64_4_1_0</code> in your run script.


=== Perl ===
=== Perl ===
Line 229: Line 222:
call setr32_3_6_3.bat          call setr64_3_6_3.bat    (Holding the Windsock)
call setr32_3_6_3.bat          call setr64_3_6_3.bat    (Holding the Windsock)
call setr32_4_0_2.bat          call setr64_4_0_2.bat    (Taking Off Again)
call setr32_4_0_2.bat          call setr64_4_0_2.bat    (Taking Off Again)
call setr32_4_0_3.bat          call setr64_4_0_3.bat    (Bunny-Wunnies Freak Out)
call setr32_4_0_5.bat          call setr64_4_0_5.bat    (Shake and Throw)
</pre>
</pre>


Line 254: Line 249:
4.0.0    Arbor Day
4.0.0    Arbor Day
4.0.1    See Things Now
4.0.1    See Things Now
4.0.4    Lost Library Book
</pre>
</pre>


Line 385: Line 381:
| R2019a (64-bit)
| R2019a (64-bit)
| call useMatLab96_64
| call useMatLab96_64
|-
| R2021a (64-bit)
| call useMatLab910_64
|-
| R2022a (64-bit)
| call useMatLab912_64
|}
|}


=== Python ===
=== Python ===


* Current Python versions available on the cluster are 3.5, 3.6 and 3.7 based. Different python versions within the same release (e.g., 3.5.1 and 3.5.2) do not behave nicely together in automated installation, so if you have a desperate need for a previous minor version, get in touch. While 2.7.17 is available, we '''strongly''' recommend migrating away from Python 2.7, as it has been officially discontinued on 1st January 2020. Experiences may vary, and future success is not guaranteed. You have been warned...
* The most recent version of each major release is on the cluster. Different python versions within the same release (e.g., 3.5.1 and 3.5.2) do not behave nicely together in automated installation, so if you have a desperate need for a previous minor version, get in touch. While 2.7.18 is available, we '''strongly''' recommend migrating away from Python 2.7, as it has been officially discontinued on 1st January 2020. Experiences may vary, and future success is not guaranteed.


<pre>
<pre>
2.7.17 64-bit  call set_python_27_64.bat
2.7.18  64-bit  call set_python_27_64.bat
3.5.64-bit  call set_python_35_64.bat
3.5.64-bit  call set_python_35_64.bat
3.6.8 64-bit  call set_python_36_64.bat
3.6.15 64-bit  call set_python_36_64.bat
3.7.5 64-bit  call set_python_37_64.bat</pre>
3.7.17 64-bit  call set_python_37_64.bat
3.8.17  64-bit  call set_python_38_64.bat
3.9.17  64-bit  call set_python_39_64.bat
3.10.12 64-bit  call set_python_310_64.bat
3.11.4  64-bit  call set_python_311_64.bat
 
</pre>


* You'll most likely want to use some packages, so you'll need to have a network-accessible location to put your packages in. If you're running a version of Windows with Python that matches one installed on the cluster, then you'll be able to share that package repository between the two. But if you're running a different version, or a different operating system, or a different Python version, then you'll need to keep your local package repository separate from the one the cluster uses.
* You'll most likely want to use some packages, so you'll need to have a network-accessible location to put your packages in. If you're running a version of Windows with Python that matches one installed on the cluster, then you'll be able to share that package repository between the two. But if you're running a different version, or a different operating system, or a different Python version, then you'll need to keep your local package repository separate from the one the cluster uses.
Line 457: Line 465:
=== BOW (BioInformatics on Windows) ===
=== BOW (BioInformatics on Windows) ===


Start your run.bat file with <code>call setBOW</code> to add these to the path:-
A number of Bio-informatics tools are ready to go. These used to be provided together on codeplex, but that has since been retired. Ugene is one project that
maintains windows builds of them. Start your run.bat file with <code>call setBOW</code> to add these to the path:-


{| border="1" cellspacing="0" cellpadding="5" align="center"
{| border="1" cellspacing="0" cellpadding="5" align="center"
Line 465: Line 474:
|-  
|-  
| SamTools.exe
| SamTools.exe
| 0.1.18 (r982:295)
| 0.1.19-44429cd
|-  
|-  
| BCFTools.exe
| BCFTools.exe
| 0.1.17-dev (r973:277)
| 0.1.19-44428cd
|-  
|-  
| bgzip
| bgzip
Line 477: Line 486:
|-  
|-  
| bwa
| bwa
| 0.6.1-r104
| 0.7.17-r1188
|-
|-
| tabix
| tabix
| 0.2.5 (r1005)
| 0.2.5 (r1005)
|-
| minimap2
| 2.24 (r1122)
|-
| raxml
| 8.2.12 (best for each node)
|-
| raxml_sse3
| 8.2.12 (pthreads+sse3)
|-
| raxml_avx
| 8.2.12 (pthreads+avx)
|-
| raxml_avx2
| 8.2.12 (pthreads+avx2)
|}
|}
Note that BWA-MEM is hardwired to use a linux-only feature, so will produce an odd message and not do what you want if you use that mode on Windows. However, it seems that minimap2 may be a faster and more accurate way
of doing what bwa does in any case, and that seems to have no such limitation.
Also, note that raxml.exe is a copy of the most optimised build for the particular cluster node the job is running on. fi--dideclusthn only supports SSE3;
fi--didemrchnb varies a little; the 32-core nodes support AVX2, the 24, 20 and 16 core machines support AVX, and the oldest 12-core nodes support only SSE3. You don't have to worry about this, except to expect performance to vary between them.
=== GATK and Picard ===
These both rely on Java, so put this in your run script:
<pre>
call setJava64
call setgatk
</pre>
and then <code>gatk</code> in your scripts will use Python 3.7 to call the argument handling wrapper for GATK 4.2.2.0, and <code>picard</code> maps to the picard (2.26.2) jar file.
=== FastQC ===
This one has a slight eccentricity that instead of passing normal command-line args, it uses java's -D command-line options to buffer them out of an environment variable
called _JAVA_OPTIONS. Strange but true. The ZIP comes with a bash script, which I'll reimplement if anyone actually wants me to. But in the meantime, it's easy to work
out the arguments by inspection. For example, most likely usage:-
<pre>
call setFastQC
set _JAVA_OPTIONS='-Dfastqc.output_dir=%OUT%'
call fastqc thing.fq
</pre>


=== MAFFT ===
=== MAFFT ===
Line 562: Line 616:
   set TMPDIR=%TEMP%
   set TMPDIR=%TEMP%
   set PYTHONPATH=\\qdrive.dide.ic.ac.uk\homes\USER\python_37_repo
   set PYTHONPATH=\\qdrive.dide.ic.ac.uk\homes\USER\python_37_repo
   set R_LIBS=Q:\R
   set R_LIBS=Q:\R4
   set R_LIBS_USER=Q:\R
   set R_LIBS_USER=Q:\R4
   call set_python_37_64.bat
   call set_python_37_64.bat
   call setr64_4_0_3.bat
   call setr64_4_0_3.bat
Line 575: Line 629:
</pre>
</pre>


* Now submit the prepare job to the cluster with <code>job submit /scheduler:fi--dideclusthn /jobtemplate:GeneralNodes \\fi--san03.dide.ic.ac.uk\homes\USER\tensorflow\prepare.bat</code> - or if you use the portal, just <code>\\fi--san03.dide.ic.ac.uk\homes\USER\tensorflow\prepare.bat</code> in the big window will do it. That will take a while, but you shuld notice lots of new files in your Q:/python_37_repo folder. You'll only need to do this once (unless you delete your Python repo folder, or it randomly breaks and you start again...)
* Now submit the prepare job to the cluster with <code>job submit /scheduler:fi--dideclusthn /jobtemplate:GeneralNodes \\fi--san03.dide.ic.ac.uk\homes\USER\tensorflow\prepare.bat</code> - or if you use the portal, just <code>\\fi--san03.dide.ic.ac.uk\homes\USER\tensorflow\prepare.bat</code> in the big window will do it. That will take a while, but you should notice lots of new files in your Q:/python_37_repo folder. You'll only need to do this once (unless you delete your Python repo folder, or it randomly breaks and you start again...)


* Next, create a file <code>Q:/Tensorflow/run.bat</code> containing the following. The two "call" lines near the end are the ones that do some "work" in python or R, so this script is basically a template for running your future jobs. Just change those two calls to set off the python or R code you want to run that uses Tensorflow.
* Next, create a file <code>Q:/Tensorflow/run.bat</code> containing the following. The two "call" lines near the end are the ones that do some "work" in python or R, so this script is basically a template for running your future jobs. Just change those two calls to set off the python or R code you want to run that uses Tensorflow.
Line 608: Line 662:
</pre>
</pre>


Both should give you some sensible output to stdout.
* Now submit the run job to the cluster with <code>job submit /scheduler:fi--dideclusthn /jobtemplate:GeneralNodes \\fi--san03.dide.ic.ac.uk\homes\USER\tensorflow\run.bat</code> - or if you use the portal, just <code>\\fi--san03.dide.ic.ac.uk\homes\USER\tensorflow\run.bat</code> in the big window will do it - and perhaps paste <code>\\fi--san03.dide.ic.ac.uk\homes\USER\tensorflow\output.txt</code> in the stdout box to collect the output and see how it gets on. You should see sensible output from both tests.


== Launching a job ==
== Launching a job ==

Latest revision as of 17:58, 28 February 2024

  • We have two high performance clusters for general use, both running Microsoft HPC 2012 R2, Update 3. The smaller, older cluster is fi--dideclusthn and the larger is fi--didemrchnb.
  • Many users in the department write code and packages in R. We have in-house packages that can allow an almost-interactive experience of running jobs on our high performance clusters, so if that sounds like what you'd like, you might as well stop reading here, and go straight to The DIDEHPC package - or if you need to join the mrc-ide github org first, go to The welcome page.

If you're not using R, or you are interested in other ways, then continue...

Windows Users

The cluster nodes are Windows 2012-R2 based. This is good news if you're a Windows user, and especially easy for PCs within the departmental DIDE domain. You have the option of using the MS client for launching jobs (see below), or you can try the new HPC Web Portal. If you're preparing to use the cluster, and have doubts about whether it can run what you want, or whether it will be straightforward to develop and run on it, best talk to me and/or the IT guys first!

Linux and MAC Users

There are two problems, but both can be at least partially overcome. The first is that there is no client for linux. The second is that it can only run windows binaries.

Launching jobs on Linux/Mac

  • The best way might be the HPC Web Portal, which lets you submit jobs through a webpage. Give it a try, and send me feedback on any issues!
  • Alternatively, you can either (1) go down the VM route, installing a windows virtual machine (e.g. VMWare or Parallels) and following the instructions below, or (2) Remote Desktop to your friend's Windows machine. These tend not to be very convenient, but some users have submitted thousands of jobs this way.

What Linux/Mac jobs can I run?

  • Many jobs may be platform independent. Where compiled code is involved though, you'll have some extra work to make sure the binary parts can be run in windows. For R, the tools help you sort that out. For others, you may need to run a job on the cluster to do your builds. Also, for many linux applications, there is a windows port. Mileage can vary, but it's worth searching for one before you give up hope. Some of these might be installed on the cluster already - see below. Others you may be able to download and call from your run script.
  • For C code, again, we can run scripts on the cluster to build the binaries for you, if your code can be compiled with (say) the MinGW port of gcc. Cygwin can also be used (although seems less common these days) with accompanying dlls. Get in touch if you need further ideas; most things can be solved.
  • Matlab is a little more tricky - we have a range of colossal matlab runtimes on the cluster, but you'd need to compile a windows binary to use them, since the matlab platforms are distinct in every way. For this, probably find a friend, or get in touch, and we'll work out how you can compile for it.
  • There is also the option of running a Windows VM, which might solve all of the above problems.

But I really want a unix cluster.

  • We have run linux clusters in the past (fi--didelxhn), but there was very low demand, whereas the other clusters are consistently well used. Accessing DIDE network shares from linux was a little harder in some aspects, but we did not implement the more traditional linux approach of having dedicated separate cluster storage that inputs and outputs would have to be staged on to, in order for the cluster to see them. Generally, the flexibility of the other clusters makes such things quite straightforward, and may even feel a pleasure if your background is with the less domain-integrated style of cluster architecture.
  • It has also become far easier to write software that builds on multiple platforms with good practice. Tools that only run on one platform are rare, and in some circumstances have raised questions about the quality of those packages. Alternatives are often available.
  • Where compilation is the issue - for example if you want to run binaries on the cluster, but your compilation machine runs linux, or is a mac, (Matlab via MCC, or C++ code for example), then get in touch, and there are a few ways we can think of solving that, either by getting someone else, or the cluster itself, to build your code.

Getting Started

Getting access to the cluster

Send a mail to Wes (w.hinsley@imperial.ac.uk) requesting access to the cluster. He will add you to the necessary groups, and work out what cluster you should be running on. Unless you are told otherwise, this will be fi--dideclusthn. But if you have been told otherwise, then whenever you see fi--dideclusthn, replace it with the cluster name you’ve been given, either mentally or (less effectively) on your screen with Tipp-Ex.

Windows: Install the HPC client software

  • Run \\fi--dideclusthn.dide.ic.ac.uk\REMINST\setup.exe.
  • Confirm Yes, I really want to run this.
  • There's an introductory screen, and you click Next
  • You have to tick the Accept, then click Next
  • If it gives you any choice, select Install only the client utilities
  • Next, Next, Next
  • Select Use Microsoft Update and then Next
  • And then Install
  • And possibly Install again – if it had to install any pre-requisites first.
  • And Finish, and the client installation is done.
  • To check everything is alright so far, open a NEW command prompt window.
  • (Because the installer adds things to the path, which only take effect on NEW command windows! :-)
  • (Find it under Accessories, or you can use Start, Run, cmd).
  • Type job list /scheduler:fi--dideclusthn.dide.ic.ac.uk

And it will list the jobs you have running, which will be none so far!

Id         Owner         Name              State        Priority    Resource Request
---------- ------------- ----------------- ------------ ----------- -------- -------

0 matching jobs for DIDE\user

Windows: Using a non-domain machine

If you're using a machine that isn't actually logged into DIDE, then the client software will have a problem working out who you are. In this case, there are two things you need to do.

  • Check you've really got a DIDE username. If you don't know what it is, talk to your lovely IT team.
  • Connect to the DIDE VPN using ZScaler from https://uafiles.cc.ic.ac.uk/, and login to that with your IC account - but you'll still need your DIDE account to access DIDE servers after the ZScaler connection is up and running.
  • Now we'll open a command window using "runas" - which lets you choose which identity the system thinks you are within that command window:-
runas /netonly /user:DIDE\user cmd
  • (Change user to your own login obviously). It will ask for your password, then open a new command window. In that window, you'll be able to do cluster-related things as your DIDE username.

Windows: Launching and cancelling jobs

Windows: Command Line

Suppose you have a file called "run.bat" in your home directory, which does what you want to run on the cluster. Let's say it's a single-core, very simple job. We’ll discuss what should be inside "run.bat" later. To submit your job on the cluster, at the command prompt, type this (all on one line) - or put it in a file called "launch.bat", and run it:-

job submit /scheduler:fi--dideclusthn.dide.ic.ac.uk /jobtemplate:GeneralNodes /numcores:1 \\qdrive.dide.ic.ac.uk\homes\user\run.bat

If it's the first time you’ve run a job - or if you've recently changed your password, then it might ask you for your DIDE password and offer to remember it. Otherwise, it will just tell you the ID number of your job.

Enter the password for ‘DIDE\user’ to connect to ‘FI--DIDECLUSTHN’:
Remember this password? (Y/N)
job has been submitted. ID: 123

If you want to remove the job, then:- job cancel 123 /scheduler:fi--dideclusthn.dide.ic.ac.uk

Or view its details with job view 123 /scheduler:fi--dideclusthn.dide.ic.ac.uk
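
If you end up launching lots of jobs, it can help to build these command lines in a script rather than typing them. The following is only a sketch in Python: the function names are made up for illustration and are not part of the HPC client, and the only switches used are the ones that appear on this page (/scheduler:, /jobtemplate:, /numcores: and /workdir:). It builds the argument lists and prints one; it does not contact the cluster.

```python
# Hypothetical helpers for building HPC client command lines.
# Names are illustrative only; nothing here is provided by the HPC client.
SCHEDULER = "fi--dideclusthn.dide.ic.ac.uk"


def submit_command(script, scheduler=SCHEDULER, template="GeneralNodes",
                   cores=1, workdir=None):
    """Return the argument list for a 'job submit' call."""
    cmd = [
        "job", "submit",
        f"/scheduler:{scheduler}",
        f"/jobtemplate:{template}",
        f"/numcores:{cores}",
    ]
    if workdir is not None:
        # Optional working directory for the job.
        cmd.append(f"/workdir:{workdir}")
    cmd.append(script)  # the batch file goes last, after all the switches
    return cmd


def cancel_command(job_id, scheduler=SCHEDULER):
    """Return the argument list for cancelling a job by its ID."""
    return ["job", "cancel", str(job_id), f"/scheduler:{scheduler}"]


def view_command(job_id, scheduler=SCHEDULER):
    """Return the argument list for viewing a job's details."""
    return ["job", "view", str(job_id), f"/scheduler:{scheduler}"]


if __name__ == "__main__":
    # Print the same submit line as the example above.
    print(" ".join(submit_command(r"\\qdrive.dide.ic.ac.uk\homes\user\run.bat")))
```

On a Windows machine with the client tools installed, each list could presumably be handed to subprocess.run to actually run the command; paths containing spaces would need quoting if you paste the printed line into a command window instead.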

Windows: Job Manager GUI

As an alternative to the command-line, you can use the job management software. The advantage is that it’s a GUI. The disadvantage is that, as with all GUIs, you may not feel totally sure you know what it’s up to, whereas most of the time the things you want to do are not very complex, as above.

The job management software will be on your start menu, as above. All the features are under the “Actions” menu, and hopefully it will be self explanatory. Read the details below about launching, and you'll find the interface bits that do it in the Job Manager. However, you may find over time, especially if you run lots of jobs, that learning to do it the scripted way with the command-line can be quicker.


Information for running any job

Visibility

Any executable/batch file that the cluster will run must be somewhere on the network that the cluster has access to, when it logs in as you. This amounts to any network accessible drive that you have access to when you login – including network shares, such as the temp drive, your home directory, and any specific network space set up for your project.

After a job is launched, all paths are relative to the /workdir: specified on the job submit command-line. Windows does not usually allow a network path to be the current directory, and MS-HPC somehow gets around this. But if in doubt, use fully-qualified paths to specify where files should be read from, or written to. You may want to write code that takes the paths from a parameter file, or as a command-line parameter, to give as much flexibility as possible.
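As a rough illustration of the parameter-file idea, the sketch below reads paths from a simple key=value file and checks that they are fully qualified. The file format and the function names are assumptions made for this example, not something the cluster requires.

```python
from pathlib import PureWindowsPath


def read_params(text):
    """Parse 'key=value' lines; blank lines and lines starting with # are skipped."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


def is_fully_qualified(path):
    """True for UNC paths (\\\\server\\share\\...) and drive-letter absolute paths."""
    return PureWindowsPath(path).is_absolute()


if __name__ == "__main__":
    # Hypothetical parameter file contents, using the share names from this page.
    demo = "input=\\\\networkshare\\job\\init.txt\noutput=\\\\networkshare\\job\\out.txt\n"
    for name, path in read_params(demo).items():
        print(name, path, is_fully_qualified(path))
```

PureWindowsPath is used so that the check behaves the same whether you run it on Windows or elsewhere.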

Your home directory, and the temp drive are on departmental servers that don't have an especially fast connection to the cluster. So avoid having thousands of jobs all reading/writing to those places at once. Additionally, the temp drive is shared between all users, and your home drive has a limited quota. It is also backed up every day, so avoid using it as scratch storage for results you don’t actually want to keep. Generally, the shares you'll want to use from the clusters will be ones containing the word "NAS", which the big cluster has fast access to.

Interactivity

The job must run in an entirely scripted, unattended way. It must not require any key presses or mouse clicks, or other live interactivity while it runs. So jobs generally will read some sort of input (from a file, or from the way you run the job), do some processing, and write the results somewhere for you - all without intervention.

Launching jobs

Jobs can be launched either through the Job Manager interface, or through the command line tools, which offer greater flexibility. We'll describe the command line method here; if you want to use the GUI, then it'll be a case of finding the matching boxes... Below are the specifics for our clusters. For more information, simply type job on the command line, or job submit for the list of submission-related commands.

FI--DIDECLUSTHN vs FI--DIDEMRCHNB

Job submissions, as shown below, must specify a "job template", which sets up a load of default things to make the job work. On fi--dideclusthn, the job templates are called 4Core, 8Core and GeneralNodes, which will respectively force jobs to run on the 4-core nodes, the 8-core nodes, or on any machine available.

On fi--didemrchnb, you can set the job template to be... 8Core, 12Core, 16Core, 12and16Core, and GeneralNodes - which hopefully are fairly self-explanatory. There are a couple of other job templates, (24Core and Phi), but those are a bit special purpose for now, so don't use them!
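If you generate submissions from a script, a small lookup like the following can catch a template that doesn't exist on the chosen cluster before the job is sent. This is a sketch: the names come from the two paragraphs above, the function name is made up, and matching the template name case-sensitively is an assumption.

```python
# Template names as listed above; the special-purpose 24Core and Phi
# templates are deliberately left out.
JOB_TEMPLATES = {
    "fi--dideclusthn": {"4Core", "8Core", "GeneralNodes"},
    "fi--didemrchnb": {"8Core", "12Core", "16Core", "12and16Core", "GeneralNodes"},
}


def template_is_valid(scheduler, template):
    """Check a template name against the list for the given head node.

    The scheduler may be given with or without the .dide.ic.ac.uk suffix.
    """
    host = scheduler.lower().split(".")[0]
    return template in JOB_TEMPLATES.get(host, set())
```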

Job Submission

Job submissions from the command line can take the following form (all on one line):-

job submit /scheduler:fi--dideclusthn.dide.ic.ac.uk /stdout:\\path\to\stdout.txt
   /stderr:\\path\to\stderr.txt /numcores:1-1 /jobtemplate:4Core \\path\to\run.bat

The /singlenode argument

In MS HPC (2012), Microsoft finally added a tag to allow you to say that the 'n' cores you requested must be on the same computer. Therefore, if you know precisely how many cores you want, then use the following (on one line):-

job submit /scheduler:fi--dideclusthn.dide.ic.ac.uk /singlenode:true 
   /jobtemplate:8Core \\path\to\run.bat

However, there is one bug with this, for the specific case where you request a whole node, regardless of how many cores it has. In this case, oddly, you have to disable single node:-

job submit /scheduler:fi--dideclusthn.dide.ic.ac.uk /singlenode:false 
   /numnodes:1 /jobtemplate:8Core \\path\to\run.bat
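The two cases above can be wrapped in one helper so the whole-node quirk only has to be remembered once. This is a sketch with a made-up function name; combining /singlenode:true with /numcores is an assumption based on the description above, since the first example does not show a core count.

```python
def node_flags(cores=None, whole_node=False):
    """Return the flags for 'n cores on one machine' or 'one whole machine'.

    whole_node=True follows the workaround described above:
    single node is disabled and exactly one node is requested.
    """
    if whole_node:
        return ["/singlenode:false", "/numnodes:1"]
    if cores is None or cores < 1:
        raise ValueError("cores must be a positive integer unless whole_node is set")
    # Assumption: the core count is passed with /numcores alongside /singlenode:true.
    return ["/singlenode:true", f"/numcores:{cores}"]
```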

Languages and libraries supported

A number of languages, and different versions of languages, are available on the cluster. The sections below refer to your "run.bat" file - a batch file that you will create, which will get run by a cluster node when a suitable one is available. The commands described below go in this "run.bat" file; they add the directories for the software you want to the path.

C/C++

A number of Microsoft C++ and Intel C++ runtimes are installed, but it's usually better to try and avoid using them, and make your executable as stand-alone as possible. If it requires any external libraries that you've had to download, then put the .dll file in the same directory as the .exe file. If you use Microsoft Visual Studio, in Project Preferences, C/C++, Code Generation, make sure the Runtime Library is Multi-threaded (/MT) – the ones with DLL files won’t work. Even so, on recent versions of the Intel and Microsoft C compilers, "static linking" doesn’t really mean static when it comes to OpenMP, and you’ll have to copy some DLLs and put them next to your EXE file. See the OpenMP section below.

The cluster nodes are all 64-bit, but they will run 32-bit compiled code. Just make sure you provide the right DLLs!

Here is a document about Using the Xeon Phi

Here is a document about Using MPI

And if you want to use libraries such as GSL, then try C/C++ Libraries for Windows

Java

Java doesn't strictly need installing; you can set up any Java Runtime Environment you like, from Oracle or anywhere else. If you want to do this (using, say, the Server JRE from Oracle), something similar to the following will add java.exe and jvm.dll to the path, and set the JAVA_HOME environment variable, which most Java-using software will appreciate.

   set JAVA_HOME=\\my\path\to\jre1.8.0
   set path=%JAVA_HOME%\bin;%JAVA_HOME%\bin\server;%path%

For convenience, if you're happy with Azul's OpenJDK Java 1.8.0 update 308 (September 2021), then putting call setJava64 in your run script will set this up for you. Incidentally, the above will also add what's needed for using the rJava package in R - in which case you'll perhaps also want something like call setR64_4_1_0 in your run script.

Perl

Strawberry Perl 64-bit portable, v5.30.1.1. Put call setPerl at the top of your script.

R

New tools are coming: see here. The instructions below are still valid if you want to do this manually.


You can run R jobs on all the clusters. This is the latest wisdom on how to do so – thanks to James, Jeff, Hannah and others.

First, if you are wanting to use packages, then set up a package repository on your home directory by running this in R.

install.packages("<package>",lib="Q:/R")

Now, write your run script. Suppose you have an R script in your home directory: Q:\R-scripts\test.R. And suppose you’ve set up your repository as above. Your run.bat should then be:-

	call <script for the R version you want – see below>
	net use Q: \\qdrive.dide.ic.ac.uk\homes\user
	set R_LIBS=Q:\R
	set R_LIBS_USER=Q:\R
	Rscript Q:\R-scripts\test.R

Various packages require various R versions, and the cluster supports a few versions. To choose which one, change the first line of the script above to one of these – 32-bit or 64-bit versions of R releases. Purely for amusement, R's codenames are shown here too.

call setr32_2_13_1.bat		call setr64_2_13_1.bat    (anyone know the codename?)
call setr32_2_14_2.bat		call setr64_2_14_2.bat    (Gift-getting season)
call setr32_2_15_2.bat		call setr64_2_15_2.bat    (Trick or treat)
call setr32_3_0_1.bat           call setr64_3_0_1.bat     (Good Sport)
call setr32_3_0_2.bat           call setr64_3_0_2.bat     (Frisbee Sailing)
call setr32_3_1_0.bat           call setr64_3_1_0.bat     (Spring Dance)
call setr32_3_1_2.bat           call setr64_3_1_2.bat     (Pumpkin Helmet)
call setr32_3_2_2.bat           call setr64_3_2_2.bat     (Fire Safety)
call setr32_3_2_3.bat           call setr64_3_2_3.bat     (Wooden Christmas Tree)
call setr32_3_2_4.bat           call setr64_3_2_4.bat     (Very Secure Dishes - Revised Version - R later renamed this to 3.2.5)
call setr32_3_3_1.bat           call setr64_3_3_1.bat     (Bug in your hair)
call setr32_3_3_2.bat           call setr64_3_3_2.bat     (Sincere Pumpkin Patch)
call setr32_3_4_0.bat           call setr64_3_4_0.bat     (You Stupid Darkness)
call setr32_3_4_2.bat           call setr64_3_4_2.bat     (Short Summer)
call setr32_3_4_4.bat           call setr64_3_4_4.bat     (Someone to lean on)
call setr32_3_5_0.bat           call setr64_3_5_0.bat     (Joy in Playing)
call setr32_3_5_1.bat           call setr64_3_5_1.bat     (Feather Spray)
call setr32_3_5_3.bat           call setr64_3_5_3.bat     (Great Truth)
call setr32_3_6_0.bat           call setr64_3_6_0.bat     (Planting of a Tree)
call setr32_3_6_1.bat           call setr64_3_6_1.bat     (Action of the Toes)
call setr32_3_6_3.bat           call setr64_3_6_3.bat     (Holding the Windsock)
call setr32_4_0_2.bat           call setr64_4_0_2.bat     (Taking Off Again)
call setr32_4_0_3.bat           call setr64_4_0_3.bat     (Bunny-Wunnies Freak Out)
call setr32_4_0_5.bat           call setr64_4_0_5.bat     (Shake and Throw)

The following versions are *not* available on the cluster at the moment - but here are their names anyway, as it seemed a shame to miss them out. Let me know if you desperately need one of these.

2.14.0    Great Pumpkin                   
2.14.1    December Snowflakes
2.15.0    Easter Beagle
2.15.1    Roasted Marshmallow
2.15.3    Security Blanket
3.0.0     Masked Marvel
3.0.3     Warm Puppy
3.1.1     Sock it to me
3.1.3     Smooth Sidewalk
3.2.0     Full of Ingredients
3.2.1     World famous astronaut
3.2.4     Very Secure Dishes (the non-revised version)
3.3.0     Supposedly Educational
3.3.3     Another Canoe
3.4.1     Single Candle
3.4.3     Kite-Eating Tree
3.5.2     Eggshell Igloo
3.6.2     Dark and Stormy Night
4.0.0     Arbor Day
4.0.1     See Things Now
4.0.4     Lost Library Book

It also seems that different R versions put their packages in different structures – sometimes adding "win-library" into there for fun. Basically, R_LIBS and R_LIBS_USER should be paths to a folder that contains a list of other folders, one for each package you've installed.

IMPORTANT: R_LIBS and R_LIBS_USER paths must NOT contain quotes, nor spaces. If the path to your library contains spaces, you need to use old-fashioned 8-character names. If in your command window, you type dir /x, you’ll see the names – Program Files tends to become PROGRA~1 for instance.
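A quick way to check a candidate library path against the rule above before putting it in run.bat. This is a minimal sketch; the function name is made up for this example, and it only tests for the two problems named in the paragraph.

```python
def r_libs_problems(path):
    """Return a list of reasons why a path is unsuitable for R_LIBS / R_LIBS_USER."""
    problems = []
    if " " in path:
        problems.append("contains a space")
    if '"' in path or "'" in path:
        problems.append("contains a quote")
    return problems


if __name__ == "__main__":
    for candidate in ["Q:\\R", "C:\\Program Files\\R", "C:\\PROGRA~1\\R"]:
        print(candidate, r_libs_problems(candidate) or "looks fine")
```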

Passing parameters to R scripts

Passing parameters to R scripts means you can have fewer versions of your R and bat files and easily run whole sets of jobs. You can get parameters into R using Rscript (but not Rcmd BATCH, I think) as follows. In the run.bat example above, the Rscript statement becomes:

Rscript Q:\R-scripts\test.R arg1 arg2 arg3

Within the R code, the arguments can be recovered using

args <- commandArgs(trailingOnly = TRUE)

outFileName <- args[1]     ## name of output file.
dataFileName <- args[2]    ## name of local data file. 
currentR0 <- as.numeric(args[3]) ## convert this explicitly to number. 

If you want your arguments to have a particular type, best to explicitly convert (see R0 above). Better still, you can pass parameters directly into the batch file that runs the R script. Command line arguments can be referenced within the batch file using %1, %2, etc. For example, if you have a batch file, runArgs.bat:

call <script for the R version you want – see above>
net use Q: \\qdrive.dide.ic.ac.uk\homes\user
set R_LIBS=Q:\R
set R_LIBS_USER=Q:\R
Rscript Q:\R-scripts\%1.R %2 %3 %4

then

runArgs.bat myRScript arg1 arg2 arg3

will run the R script myRScript.R and pass it the parameters arg1, arg2 and arg3. The batch file runArgs.bat is now almost a generic R script runner.
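The same idea carries over if the script you launch is Python rather than R. For comparison, here is a sketch that mirrors the three arguments of the R example above, with the same explicit conversion of the third one; the function name is hypothetical.

```python
def parse_job_args(argv):
    """Split three positional arguments and convert the third to a number.

    argv is the argument list without the script name,
    i.e. sys.argv[1:] in a real script.
    """
    if len(argv) != 3:
        raise ValueError("expected: <outFileName> <dataFileName> <R0>")
    out_file_name, data_file_name, r0_text = argv
    # Command-line arguments always arrive as text, so convert explicitly.
    return out_file_name, data_file_name, float(r0_text)
```

In the script itself you would import sys and call parse_job_args(sys.argv[1:]) near the top.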

OpenMP

This has been tested with C/C++ - but the same should apply to other languages such as fortran, that achieve multi-threading with a DLL file.

Microsoft's C/C++ Compiler

You will need to copy vcomp100.dll (for Visual Studio 2010 – it may be vcomp90.dll for 2008) into the same directory as your executable. You can usually find the dll file in a directory similar to:- C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\redist\x64\Microsoft.VC100.OpenMP. Extrapolate for future versions! Also make sure that you've enabled OpenMP in the Properties (C++/Language).

GCC or MinGW

Usually, this applies to Eclipse use. In project properties, set the C++ command to “g++ -fopenmp” and the C command to “gcc -fopenmp”, and in the Linker options, under Miscellaneous, Linker flags, put “-fopenmp” too. Copy the OpenMP DLLs from MinGW into the same directory as your final executable. You may find the DLLs in C:\MinGW\Bin: libgomp-1.dll, and in the same place, libpthread-2.dll, libstdc++-6.dll and libgcc_s_dw2-1.dll.

Intel C++ Compiler

Copy the libiomp5md.dll file from somewhere like C:\Program Files (x86)\Intel\ComposerXE-2011\redist\intel64\compiler\libiomp5md.dll to the same place as your executable. And in Visual Studio, make sure you've enabled OpenMP in the project properties.

How many threads?

The OpenMP function omp_get_max_threads() by default reports the number of cores on the machine, not the number of cores allocated to your job. To determine how many cores the scheduler actually allocated to you, use the following code to look up the environment variable CCP_NUMCPUS, which will be set by the cluster management software:-

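		#include <stdlib.h>   /* getenv, strtol */
		#include <omp.h>      /* omp_get_max_threads, omp_set_num_threads */

		/* Use the core count the scheduler allocated (CCP_NUMCPUS), if it is set. */
		int set_NCores() {
		   char * val;
		   char * endchar;
		   val = getenv("CCP_NUMCPUS");
		   int cores = omp_get_max_threads();
		   if (val!=NULL) cores = strtol (val,&endchar,10);
		   omp_set_num_threads(cores);
		   return cores;
		}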
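The same environment variable can be read from other languages. As a sketch, here is a Python equivalent with a made-up function name; falling back to os.cpu_count() when CCP_NUMCPUS is missing or unreadable is an assumption that mirrors the C code's fallback to the machine's core count.

```python
import os


def allocated_cores(environ=None):
    """Return the core count from CCP_NUMCPUS, or the machine's count if unset."""
    environ = os.environ if environ is None else environ
    value = environ.get("CCP_NUMCPUS")
    if value is not None:
        try:
            cores = int(value)
        except ValueError:
            cores = 0  # unreadable value: fall through to the default
        if cores > 0:
            return cores
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("Using", allocated_cores(), "cores")
```

The result can then be handed to whatever does the parallel work, for example as the size of a multiprocessing pool.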

WinBugs (or OpenBugs)

They are similar, but OpenBugs 3.2.2 is the one we've gone for. Something like this as your run.bat script will work:-

call setOpenBugs
openbugs \\path\to\script.txt /HEADLESS

MatLab

There are various ways of producing a non-interactive executable from Matlab. Perhaps the simplest (not necessarily the best performance) way is to use “mcc.exe” – supplied with most full versions of Matlab, including the Imperial site licence version that you've probably got.

Use mcc.exe to compile your code

Use windows explorer to navigate to the folder where the “.m” files are for the project you want to compile. Now use a good text editor to create a file called “compile.bat” in that folder. It should contain something similar to the following:-

mcc -m file1.m file2.m file3.m -o myexe

Don’t copy/paste the text from this document by the way – Word has a different idea of what a dash is to most other software, and will probably replace the dashes with funny characters. You have to list every single .m file that your project needs after the '-m'. If you save this file and then double-click on it, it will think for some while and produce “myexe.exe” in this example. Copy your .exe file into a network accessible place as usual.

Launch on the cluster.

The launch.bat file will be exactly the same as before – see the Windows: Command Line section above. The run script will then start with a line that tells the cluster which version of Matlab you used to compile your code. Below is the table of different versions. The runtimes are huge and cumbersome to install on the cluster, so as a result I haven’t installed every single one. If you need one that’s not listed, get in touch.


Matlab Version      First line of run script
R2009b              call useMatLab79
R2010a              call useMatLab713
R2011a              call useMatLab715
R2011b              call useMatLab716
R2012a (64-bit)     call useMatLab717_64
R2012b (64-bit)     call useMatLab80_64
R2013a (64-bit)     call useMatLab81_64
R2014a (64-bit)     call useMatLab83_64
R2014b (64-bit)     call useMatLab84_64
R2015a (64-bit)     call useMatLab85_64
R2015b (64-bit)     call useMatLab90_64
R2016b (64-bit)     call useMatLab91_64
R2017a (64-bit)     call useMatLab92_64
R2019a (64-bit)     call useMatLab96_64
R2021a (64-bit)     call useMatLab910_64
R2022a (64-bit)     call useMatLab912_64

Python

  • The most recent version of each major release is on the cluster. Different python versions within the same release (e.g., 3.5.1 and 3.5.2) do not behave nicely together in automated installation, so if you have a desperate need for a previous minor version, get in touch. While 2.7.18 is available, we strongly recommend migrating away from Python 2.7, as it was officially discontinued on 1st January 2020. Experiences may vary, and future success is not guaranteed.
2.7.18  64-bit   call set_python_27_64.bat
3.5.9   64-bit   call set_python_35_64.bat
3.6.15  64-bit   call set_python_36_64.bat
3.7.17  64-bit   call set_python_37_64.bat
3.8.17  64-bit   call set_python_38_64.bat
3.9.17  64-bit   call set_python_39_64.bat
3.10.12 64-bit   call set_python_310_64.bat
3.11.4  64-bit   call set_python_311_64.bat

  • You'll most likely want to use some packages, so you'll need to have a network-accessible location to put your packages in. If you're running a version of Windows with Python that matches one installed on the cluster, then you'll be able to share that package repository between the two. But if you're running a different version, or a different operating system, or a different Python version, then you'll need to keep your local package repository separate from the one the cluster uses.
  • Here comes the basic script to run some python. Suppose I decide to put my Python packages for the cluster in \\qdrive.dide.ic.ac.uk\homes\username\PythonCluster37_64 (which is Q:\PythonCluster37_64 on a DIDE Windows PC). To run some python, something like the following:-
call set_python_37_64
set PYTHONPATH=\\qdrive\homes\username\PythonCluster37_64
python \\qdrive\homes\username\Python\MyScript.py

or if you want to do this in a programmatic way...

import sys
sys.path = ['\\\\qdrive\\homes\\username\\PythonCluster37_64'] + sys.path
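If the same setup code might run more than once (for example in a module imported by several scripts), a slightly more defensive variant avoids adding the folder twice. This is just a sketch; the function name is made up, and the path in the demo is the example folder from above.

```python
import sys


def add_package_dir(package_dir, path_list=None):
    """Put package_dir at the front of the search path unless it is already there."""
    path_list = sys.path if path_list is None else path_list
    if package_dir not in path_list:
        path_list.insert(0, package_dir)
    return path_list


if __name__ == "__main__":
    add_package_dir("\\\\qdrive\\homes\\username\\PythonCluster37_64")
    print(sys.path[0])
```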

Packages

  • If you want to install a package like numpy, that's readily available through pip, then you can launch a batch file like this:-
call set_python_36_64
pip install numpy --target \\qdrive\homes\username\PythonCluster36_64
  • or alternatively, you could write a python script that looks like this, and launch it as normal:-
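import subprocess, sys
# pip.main() was removed in pip 10, so run pip as a module of the current interpreter instead
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'numpy', '--target', '\\\\qdrive\\homes\\username\\PythonCluster36_64'])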
  • And if you're running the same Python version on a Windows machine, then you can do these two locally, without submitting a cluster job to do them for you.
  • And next time you submit a python job on the cluster, python will look in the place set by PYTHONPATH, and will find the packages you've just installed when you import them.

Beast and Beagle

We've got a couple of versions available. For BEAST 1.7.5 with BEAGLE 1.0 (the x y z represents all the command-line arguments you'll want to pass to beast):

call setbeast
call beast-beagle x y z

If you want BEAST 1.8.2 or 1.8.3 with BEAGLE 2.1:

call setbeast_182                        call setbeast_183
call beast-beagle_182 x y z              call beast-beagle_183 x y z

And if you want BEAST 2.3.2 with BEAGLE 2.1 (note that BEAST 2 is a different, er, beast to BEAST 1, regarding packages and compatibilities, so only use BEAST 2 if you know that's what you want to use).

call setbeast_232
call beast-beagle_232 x y z

OpenCL is installed on all the nodes, which means Beagle will pick up the Intel Xeon processors and use what it can on them. It will also detect the Xeon Phis where they're available. If that's something you'd be particularly interested in using, drop me a line.

Blast

Drop me a line to join the party testing this, and to get added to the (large) BlastDB network share.

call setblast

will set the `BLASTDB` environment variable to point to what we currently have of the `v5` database, and the Blast tools (2.9.0, 64-bit) will be added to the path. So you can then call `blastp`, `blastn`, `blastx` etc.

BOW (BioInformatics on Windows)

A number of Bio-informatics tools are ready to go. These used to be provided together on codeplex, but that has since been retired. Ugene is one project that maintains windows builds of them. Start your run.bat file with call setBOW to add these to the path:-

EXE File        Version
SamTools.exe    0.1.19-44429cd
BCFTools.exe    0.1.19-44428cd
bgzip           ?
razip           ?
bwa             0.7.17-r1188
tabix           0.2.5 (r1005)
minimap2        2.24 (r1122)
raxml           8.2.12 (best for each node)
raxml_sse3      8.2.12 (pthreads+sse3)
raxml_avx       8.2.12 (pthreads+avx)
raxml_avx2      8.2.12 (pthreads+avx2)

Note that BWA-MEM is hardwired to use a linux-only feature, so will produce an odd message and not do what you want if you use that mode on Windows. However, it seems that minimap2 may be a faster and more accurate way of doing what bwa does in any case, and that seems to have no such limitation.

Also, note that raxml.exe is a copy of the most optimised build for the particular cluster node the job is running on. fi--dideclusthn only supports SSE3; fi--didemrchnb varies a little: the 32-core nodes support AVX2, the 24, 20 and 16 core machines support AVX, and the oldest 12-core nodes only support SSE3. You don't have to worry about this, except to expect performance to vary between them.

GATK and Picard

These both rely on Java, so put this in your run script:

call setJava64
call setgatk

and then gatk in your scripts will use Python 3.7 to call the argument handling wrapper for GATK 4.2.2.0, and picard maps to the picard (2.26.2) jar file.

FastQC

This one has a slight eccentricity that instead of passing normal command-line args, it uses java's -D command-line options to buffer them out of an environment variable called _JAVA_OPTIONS. Strange but true. The ZIP comes with a bash script, which I'll reimplement if anyone actually wants me to. But in the meantime, it's easy to work out the arguments by inspection. For example, most likely usage:-

call setFastQC

set _JAVA_OPTIONS='-Dfastqc.output_dir=%OUT%'
call fastqc thing.fq

MAFFT

For version 7.212 (Win64), write call setMAFFT.bat at the top of your run script.

Applied-Maths Open Source

Write call setAppliedMaths.bat at the top of your run script, to add these to the path:

EXE File(s)          Description                                  Version
velvetg_mt_x86.exe   Graph Construction Multi-threaded, 32-bit    1.01.04
velvetg_mt_x64.exe   Graph Construction Multi-threaded, 64-bit    1.01.04
velvetg_st_x86.exe   Graph Construction Single-threaded, 32-bit   1.01.04
velvetg_st_x64.exe   Graph Construction Single-threaded, 64-bit   1.01.04
velveth_mt_x86.exe   Hashing Multi-threaded, 32-bit               1.01.04
velveth_mt_x64.exe   Hashing Multi-threaded, 64-bit               1.01.04
velveth_st_x86.exe   Hashing Single-threaded, 32-bit              1.01.04
velveth_st_x64.exe   Hashing Single-threaded, 64-bit              1.01.04
ray_x64.exe          Ray 64-bit (for MPI)                         1.6
ray_x86.exe          Ray 32-bit (for MPI)                         1.6
Mothur_x64.exe       Mothur 64-bit (for MPI)                      1.25.1
Mothur_x86.exe       Mothur 32-bit (for MPI)                      1.25.1

Tensorflow

Here is an example of getting Tensorflow (as a python package) available from an R script, using the cluster to install everything for you. It only works for Tensorflow 2.0 - older versions and newer versions seem to want combinations of DLLs that I can't satisfy. The following are I think minimal steps to get Tensorflow running from Python and R.

  • Create a folder for your python packages in your home directory (so the cluster can see it). I'm calling mine Q:/python_37_repo - and the cluster will call it \\fi--san03.dide.ic.ac.uk\homes\wrh1\python_37_repo - change wrh1 to your DIDE username of course.
  • Create a similar folder for R packages, which for me is Q:/R4
  • And a third folder, Q:/Tensorflow to do this work in.
  • Download msvcp140.dll and msvcp140_1.dll and put them in the Q:/Tensorflow folder - at present they seem to be the ones Python/Tensorflow wants. Let's hope for the best...
  • Now, create a script Q:/Tensorflow/prepare.bat. We're going to launch this on the cluster, and get it to wire up the packages we want. Replace USER with your username...
  net use Q: /delete /y
  net use Q: \\qdrive.dide.ic.ac.uk\homes\USER
  set TMPDIR=%TEMP%
  set PYTHONPATH=\\qdrive.dide.ic.ac.uk\homes\USER\python_37_repo
  set R_LIBS=Q:\R4
  set R_LIBS_USER=Q:\R4
  call set_python_37_64.bat
  call setr64_4_0_3.bat
  pip install --no-cache-dir tensorflow==2.00 --target Q:\python_37_repo
  RScript Q:\Tensorflow\install_tensorflow.R
  • Also create Q:/Tensorflow/install_tensorflow.R, containing:
  install.packages(c("reticulate", "tensorflow"),lib="Q:/R4",repos="https://cran.ma.imperial.ac.uk/")
  • Now submit the prepare job to the cluster with job submit /scheduler:fi--dideclusthn /jobtemplate:GeneralNodes \\fi--san03.dide.ic.ac.uk\homes\USER\tensorflow\prepare.bat - or if you use the portal, just \\fi--san03.dide.ic.ac.uk\homes\USER\tensorflow\prepare.bat in the big window will do it. That will take a while, but you should notice lots of new files in your Q:/python_37_repo folder. You'll only need to do this once (unless you delete your Python repo folder, or it randomly breaks and you start again...)
  • Next, create a file Q:/Tensorflow/run.bat containing the following. The two "call" lines near the end are the ones that do some "work" in python or R, so this script is basically a template for running your future jobs. Just change those two calls to set off the python or R code you want to run that uses Tensorflow.
call set_python_37_64.bat
call setr64_4_0_3.bat
net use Q: /delete /y
net use Q: \\qdrive.dide.ic.ac.uk\homes\USER
set PYTHONPATH=Q:/python_37_repo
set R_LIBS=Q:\R4
set R_LIBS_USER=Q:\R4
call python Q:\tensorflow\test_tensorflow.py
call RScript Q:\tensorflow\test_tensorflow.R
  • And finally, you'll need Q:\tensorflow\test_tensorflow.py and Q:\tensorflow\test_tensorflow.R to exist. Here are minimal tests that everything is wired up. test_tensorflow.py first:-
import tensorflow as tf
print(tf.reduce_sum(tf.random.normal([1000, 1000])))

and Q:\Tensorflow\test_tensorflow.R contains:

library(reticulate)
reticulate::py_config()
library(tensorflow)
tf$constant('Hello, TensorFlow!')
tf$Variable(tf$zeros(shape(1L)))
  • Now submit the run job to the cluster with job submit /scheduler:fi--dideclusthn /jobtemplate:GeneralNodes \\fi--san03.dide.ic.ac.uk\homes\USER\tensorflow\run.bat - or if you use the portal, just \\fi--san03.dide.ic.ac.uk\homes\USER\tensorflow\run.bat in the big window will do it - and perhaps paste \\fi--san03.dide.ic.ac.uk\homes\USER\tensorflow\output.txt in the stdout box to collect the output and see how it gets on. You should see sensible output from both tests.

Launching a job

Submitting many jobs

Suppose you write an exe that you might run with… Mysim.exe init.txt 1 2 3, and you want to run it many times with a range of parameters. One way of many, is to write a launch.bat file that will run “job submit” separately, for example (thanks Tini/James!):-

@echo off
set SubDetails=job submit /scheduler:fi--dideclusthn.dide.ic.ac.uk /jobtemplate:GeneralNodes /numcores:1
set initFile=\\networkshare\job\init.txt
set exeFile=\\networkshare\job\mysim.exe

%SubDetails% %exeFile% %initfile% 1 2 3
%SubDetails% %exeFile% %initfile% 4 5 6
%SubDetails% %exeFile% %initfile% 7 8 9
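For larger sweeps, typing one line per parameter set gets tedious, and the lines can be generated instead. The sketch below produces the same submit lines as the batch file above from a list of parameter sets; the function name is made up, and the share paths are the placeholders from the example.

```python
def sweep_commands(exe_file, init_file, parameter_sets,
                   scheduler="fi--dideclusthn.dide.ic.ac.uk",
                   template="GeneralNodes"):
    """Return one 'job submit' line per parameter set."""
    prefix = (f"job submit /scheduler:{scheduler} "
              f"/jobtemplate:{template} /numcores:1")
    commands = []
    for params in parameter_sets:
        args = " ".join(str(p) for p in params)
        commands.append(f"{prefix} {exe_file} {init_file} {args}")
    return commands


if __name__ == "__main__":
    # Same three parameter sets as the batch file above.
    for line in sweep_commands(r"\\networkshare\job\mysim.exe",
                               r"\\networkshare\job\init.txt",
                               [(1, 2, 3), (4, 5, 6), (7, 8, 9)]):
        print(line)
```

Redirecting the printed lines into a file gives you a launch.bat that you can inspect before running it.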

Suppose the job you want to run is an R script? To specify arguments to an R script, you have to add '--args a=1 b=2' - so… you might make launch.bat like this:-

@echo off
set SubDetails=job submit /scheduler:fi--dideclusthn.dide.ic.ac.uk /jobtemplate:GeneralNodes /numcores:1
set rbatFile=\\networkshare\R-scripts\run.bat

%SubDetails% %rbatFile% 1 2
%SubDetails% %rbatFile% 3 4

And make the significant line of your run.bat:- Rcmd BATCH '--args a=%1 b=%2' U:\R-scripts\test.R - here %1 and %2 will map to the first and second thing after the batch file. You can go all the way up to %9.

IMPORTANT NOTE

Make sure you get the apostrophe character right in the above example – NEVER copy and paste from a word document into a script. It will go hideously wrong. Type the apostrophes (and dashes for that matter) in a good text editor – you want just the standard old-fashioned characters.

Requesting resources

The following modifiers next to the “/scheduler:” part of the job submit line (before your app.exe 1 2 3 part), will request things you might want…

/numcores:8 - number of cores you want

/numcores:8-12 - minimum and maximum cores appropriate for your job

/memorypernode:1024 - amount of mem in MegaBytes needed.

/workdir:\\networkshare\work - set working directory

/stdout:\\networkshare\out.txt - divert stdout to a file

/stderr: or /stdin: - similar for stderr and stdin
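If you build submit lines in a script, the modifiers above can be assembled from named values so that only the ones you set appear. This is a sketch with a made-up function name; it covers only the options listed in this section.

```python
def resource_flags(min_cores=None, max_cores=None, memory_mb=None,
                   workdir=None, stdout=None, stderr=None, stdin=None):
    """Return the 'job submit' modifiers for whichever resources were given."""
    flags = []
    if min_cores is not None:
        if max_cores is not None and max_cores < min_cores:
            raise ValueError("max_cores must not be smaller than min_cores")
        if max_cores is None or max_cores == min_cores:
            flags.append(f"/numcores:{min_cores}")
        else:
            flags.append(f"/numcores:{min_cores}-{max_cores}")
    if memory_mb is not None:
        flags.append(f"/memorypernode:{memory_mb}")
    for name, value in (("workdir", workdir), ("stdout", stdout),
                        ("stderr", stderr), ("stdin", stdin)):
        if value is not None:
            flags.append(f"/{name}:{value}")
    return flags
```

For example, resource_flags(min_cores=8, max_cores=12, memory_mb=1024) gives the two modifiers /numcores:8-12 and /memorypernode:1024.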

Troubleshooting / Miscellany / Q & A

  • My job doesn’t work.
    • Run the HPC Job Manager application. Find your job id, possibly under “My Jobs, Failed”. Double click on it, then on “View All Tasks”. Perhaps something in the output section will help.
  • Check that the path to your job is visible everywhere.
    • Avoid spaces in your paths - “job submit” doesn’t like them very much. If you must have them, they’ll be ok in the “run.bat” batch file that the cluster will run – in which case, put the path in standard double-quotes ("). But avoid them in your “launch.bat” file – you may have to relocate your run.bat to a simple non-space-containing directory.
    • Rather than putting the full application and parameters on the job submit line, you might want to write a batch-file to do all that, and submit the batch file to the cluster. (See the R section above for an example). But make sure the batch file is somewhere visible to the cluster.
  • My job seems to work, but reports as having failed.
    • The success/failure depends on the error code returned. If you’re running C code, end with “return 0;” for success.
  • job submit ..blah blah.. app.exe >out.txt doesn’t work!
    • The contents of out.txt will be the result of “job submit”, not the result of “app.exe”. You meant to say this: job submit ..blah blah.. /stdout:out.txt app.exe - correcting out.txt and app.exe to network paths of course.
  • My C/C++ Executable works perfectly on my desktop, but doesn't work on the cluster. What's wrong with the cluster?
    • Your EXE file probably relies on some DLL files that the cluster doesn't have. If you use Visual Studio, check that in Code Generation, you have "Multi-Threaded (/MT)" as your target - with no mention of DLLs. If your code uses OpenMP, check you've copied the right OpenMP DLL(s) (go back to the C section above). As a last resort, give your EXE to someone who doesn't have Visual Studio installed, and get them to run it, and see what DLL files it seems to ask for.
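On the point above about jobs that work but report as failed: the same exit-code convention applies whatever language you use. Python already exits with a non-zero code after an uncaught exception, so the sketch below only makes the convention explicit; the function name is hypothetical.

```python
import traceback


def run_job(work):
    """Run work() and return 0 on success, or 1 if it raised an exception."""
    try:
        work()
    except Exception:
        # Print the error so it ends up in the file given to /stderr.
        traceback.print_exc()
        return 1
    return 0

# In a real script, finish with:  sys.exit(run_job(my_work))
```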

Contributing Authors

Wes Hinsley, James Truscott, Tini Garske, Jeff Eaton, Hannah Clapham