Using MPI

From MRC Centre for Outbreak Analysis and Modelling

MPI (Message-Passing Interface) allows multiple computers (nodes) to communicate by passing messages (blocks of memory) to each other. Traditionally, it's used with C and Fortran; here I'll talk about how to get it running in Visual Studio using C/C++ on our clusters.

Using Visual Studio for MPI

Prerequisites

Set up the Project

  • Create a new C/C++ Empty project. Add a new helloworld.cpp file to it.
  • Project Properties, C/C++, General, Additional Include Directories. Add C:\Program Files (x86)\Microsoft SDKs\MPI\Include
  • Project Properties, Linker, General, Additional Library Directories. Add C:\Program Files (x86)\Microsoft SDKs\MPI\Lib\x64 (for 64-bit - x86 if you really want 32-bit).
  • Project Properties, Linker, Input, Additional Dependencies: insert msmpi.lib;
  • Project Properties, C/C++, Code Generation, set Runtime Library to a non-DLL version (/MT)

Write MPI Hello World

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
  int mpi_size,mpi_rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
  printf("Hello World %d out of %d\n",mpi_rank,mpi_size);
  MPI_Finalize();
  return 0;
}

Launching the Job

Let's assume you've built the above as a test in your home directory, \\fi--san02\homes\user\mpitest. We'll write a script called run.bat, which takes two arguments: (1) the number of nodes you want to use, and (2) the working directory.

mpiexec -n %1 -wdir %2 helloworld.exe

And then we'll write another text file launch.bat to set the job running on the cluster:

set NODES=2
set WORKDIR=\\qdrive.dide.ic.ac.uk\homes\user\mpitest
job submit /scheduler:fi--didemrchnb /jobtemplate:GeneralNodes /numnodes:%NODES% /singlenode:false /workdir:%WORKDIR% /stdout:out.txt run.bat %NODES% %WORKDIR%

A bit more MPI

Just a taster: you can easily find plenty of MPI examples online, but just so you know what you're in for, it's quite low level. You can do things like:

  • Scatter data from one node to all the others
  • Gather data back to one node from all the others.
  • Scatterv and Gatherv allow you to scatter or gather non-equally-sized portions of data, but you have to know how big each bit is in advance. So, commonly, you might do a pair of MPI operations, the first to tell everyone the sizes (one integer per node), and the second to deal with the variable-size data, since you now know how big it is.
  • Allgather and Allgatherv cause all of the nodes to end up with all of the data, rather than just one node accumulating it all.
  • And the data we have been speaking of is... an array of ints, or floats of various sizes.
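Scatterv needs a count for each node decided up front. As a rough sketch of one way to pick those counts (the function name block_counts and the "as even as possible" policy are assumptions for illustration, not something MPI provides), you could give every node the same share and hand the leftover items to the first few nodes:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of MPI): split n_items across n_nodes as
// evenly as possible. The first (n_items % n_nodes) nodes get one extra item.
std::vector<int> block_counts(int n_items, int n_nodes) {
  assert(n_items >= 0 && n_nodes > 0);
  std::vector<int> counts(n_nodes, n_items / n_nodes);
  for (int i = 0; i < n_items % n_nodes; i++) counts[i]++;
  return counts;
}
```

For 10 items over 4 nodes this gives counts of 3, 3, 2 and 2. The resulting array (counts.data()) is the kind of thing you would pass as the send-counts argument of MPI_Scatterv.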

A very simple example. Arrange it so that all the MPI nodes know the names of all the MPI nodes. First: get the name.

  char name[MPI_MAX_PROCESSOR_NAME];
  int len;
  MPI_Get_processor_name(name, &len);

Now, the names could be different lengths, so first, get every node to tell all the others how long its name is.

  int* sizes = new int[mpi_size];
  MPI_Allgather(&len, 1, MPI_INT, sizes, 1, MPI_INT, MPI_COMM_WORLD);

  /*  MPI_Allgather's arguments are: 
        &len    = Address of data to send
        1       = How many items to send
        MPI_INT = The data type to send.
        sizes   = Address to receive results into
        1       = How many items *per node* to receive
        MPI_INT = The data type to receive
        MPI_COMM_WORLD = The communicator (here, all the nodes in the job).     */
  

So after this, all the nodes know the length of all the nodes' names. They could differ, so...

  int totalsize = 0;
  for (int i = 0; i < mpi_size; i++) totalsize += sizes[i];
  char* incoming = new char[totalsize];
  int* displs = new int[mpi_size];
  displs[0] = 0;
  for (int i = 1; i < mpi_size; i++) displs[i] = displs[i - 1] + sizes[i - 1];

Here, we've worked out the total incoming buffer size, and made memory space for it. We know the size of each name, and I've calculated the displacement for each one, i.e. displs[n] is the place in my receive buffer where the data coming from node n will begin.
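To make the displacement arithmetic concrete, here is the same calculation sketched as a standalone function. The name compute_displs and the use of std::vector are my own choices for illustration; the logic is the running sum described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: displs[n] is the offset in the receive buffer where
// node n's data begins, i.e. the sum of the sizes of nodes 0..n-1.
std::vector<int> compute_displs(const std::vector<int>& sizes) {
  std::vector<int> displs(sizes.size(), 0);
  for (std::size_t i = 1; i < sizes.size(); i++) {
    assert(sizes[i - 1] >= 0);  // MPI counts cannot be negative
    displs[i] = displs[i - 1] + sizes[i - 1];
  }
  return displs;
}
```

With three nodes whose names are 5, 7 and 6 characters long, the displacements come out as 0, 5 and 12, and the receive buffer needs 18 characters in total.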

  MPI_Allgatherv(name, len, MPI_CHAR, incoming, sizes, displs, MPI_CHAR, MPI_COMM_WORLD);

  /*  MPI_Allgatherv's arguments are: 
        name     = Address of data to send
        len      = How many items to send
        MPI_CHAR = The data type to send.
        incoming = Address to receive data into
        sizes    = Array of size mpi_size - size to receive from each node.
        displs   = Array of size mpi_size - starting point of data for each node. 
        MPI_CHAR = The data type to receive.
        MPI_COMM_WORLD = The communicator (here, all the nodes in the job).     */

And if you want to get the names out one by one, then, perhaps something like this...

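  // Needs #include <string>. The lengths from MPI_Get_processor_name exclude
  // the terminating '\0', so incoming is not null-terminated: pass its length.
  std::string allresults(incoming, totalsize);
  std::string* array_results = new std::string[mpi_size];
  for (int i = 0; i < mpi_size; i++) {
    array_results[i] = allresults.substr(displs[i], sizes[i]);
  }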
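If you would rather not manage arrays of strings by hand, a variation on the same idea can be sketched as a standalone function. The name split_names and the std::vector interface are assumptions for illustration; it relies only on the sizes and displacements already worked out, and builds each string from a pointer and a length so the buffer does not need to be null-terminated.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: cut the gathered character buffer into one string per
// node, using the per-node sizes and displacements.
std::vector<std::string> split_names(const char* incoming,
                                     const std::vector<int>& sizes,
                                     const std::vector<int>& displs) {
  assert(sizes.size() == displs.size());
  std::vector<std::string> names;
  names.reserve(sizes.size());
  for (std::size_t i = 0; i < sizes.size(); i++) {
    // Pointer + length constructor: no terminating '\0' is required.
    names.emplace_back(incoming + displs[i],
                       static_cast<std::size_t>(sizes[i]));
  }
  return names;
}
```

For example, a buffer holding "alphabeta" with sizes 5 and 4 and displacements 0 and 5 is split into "alpha" and "beta".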

Don't forget to

  MPI_Finalize();

at the end.