Using the Xeon Phi

From MRC Centre for Outbreak Analysis and Modelling
Latest revision as of 13:52, 22 June 2020

The Xeon Phi is essentially Linux on a chip, on a special card, inside some of our HPC nodes. It's extremely multi-core, so it lends itself to algorithms that have reasonably simple but very parallel sections. Here is the experience I've collected so far in using it.

Using Visual Studio with Intel C++ Parallel Compiler

Prerequisites

  • You'll need Visual Studio, and the Intel C++ Parallel Compiler - probably most recent editions.
  • You'll also need the Phi coprocessor software - https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss . According to the docs, you need both the Coprocessor, and the Essentials installer. (Possibly the latest local copy will be at \\fi--didef3\Software\HPC\PhiDrivers)

Set up the Project

  • Create a new 64-bit project.
  • Choose to use the Intel C++ Compiler from the Projects menu.
  • Get into Release, x64 mode with the menus at the top.
  • In Project Preferences:-
    • Linker, General [Intel C++], Additional Options for MIC Offload Linker, add -no-fortlib
    • (The above is a workaround - the compiler assumes everyone has Fortran installed and cries if we don't.)
    • C/C++, Code Generation [Intel C++], Enable OpenMP Offloading Compilation... - choose Intel MIC Architecture
    • And check the Target Device below is also set to Intel MIC Architecture.
    • C/C++, Language [Intel C++], OpenMP Support: Generate Parallel Code
    • C/C++, Language, Runtime Library: Multi-threaded (/MT) - we'll copy more DLLs across later.

Some Code

In this example, I've set up a project called PhiTest, with a single file, main.cpp.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  
  int thread_id;

  // Here's an OpenMP parallel region using the main (host) processor

  #pragma omp parallel private (thread_id)
  {
    thread_id = omp_get_thread_num();
    printf("Local thread %d\n", thread_id);
    #pragma omp barrier
    #pragma omp single
    printf("There are %d local threads\n", omp_get_num_threads());
  }

  // Insert an extra offload pragma to do OpenMP on the Phi.

  #pragma offload target(mic)
  #pragma omp parallel private (thread_id)
  {
    thread_id = omp_get_thread_num();
    printf("Offload thread %d\n", thread_id);
    #pragma omp barrier
    #pragma omp single
    printf("There are %d offload threads\n", omp_get_num_threads());
  }
  return EXIT_SUCCESS;
}

And compile it. You might get some "warning #3335: *MIC* offload features on this platform currently require that RTTI be disabled" messages - don't worry about them.
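The example above only prints thread numbers. As a minimal sketch of offloading some actual work, here is a sum of squares done with an OpenMP reduction. The function name sum_of_squares is made up for illustration, and the in/inout clauses are my assumption about how the data should be passed, so check them against the Intel compiler documentation. The __INTEL_OFFLOAD guard means that without the Intel offload compiler the same loop simply runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not from the original page): sums the squares of
// the input. The offload pragma is only seen when the Intel compiler has
// offload enabled (__INTEL_OFFLOAD defined); otherwise this runs locally.
double sum_of_squares(const std::vector<double>& values) {
    // The loop counter is an int, so make sure the size fits.
    assert(values.size() <=
           static_cast<std::size_t>(std::numeric_limits<int>::max()));

    const double* data = values.data();
    const int n = static_cast<int>(values.size());
    double total = 0.0;

#ifdef __INTEL_OFFLOAD
    // Assumption: copy n doubles to the card, and copy total there and back.
    #pragma offload target(mic) in(data : length(n)) inout(total)
#endif
    {
        // Each thread accumulates a private partial sum; OpenMP combines them.
        #pragma omp parallel for reduction(+ : total)
        for (int i = 0; i < n; ++i) {
            total += data[i] * data[i];
        }
    }
    return total;
}
```

Calling sum_of_squares on a vector holding 1.0, 2.0 and 3.0 gives 14.0 on the host. Whether offloading is worth it for a loop this small is another matter, since the data has to be copied to the card first.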

Prepare a cluster job

  • Decide where to run the job as normal. I'm going for T:\Wes\Phi (which is \\fi--didef3\Tmp\Wes\Phi) in this example.
  • Copy the executable there, which will be in Release\x64 in your project folder. Mine is called PhiTest.exe.
  • Find a folder something like: C:\Program Files (x86)\IntelSWTools\parallel_studio_xe_2016.1.051\compilers_and_libraries_2016\windows\redist\intel64_win\compiler
    • Copy cilkrts20.dll, libiomp5md.dll and liboffload.dll to your run folder (T:\Wes\Phi for me)
  • We also need to copy the library for the Phi itself to have. Find a folder something like: C:\Program Files (x86)\IntelSWTools\parallel_studio_xe_2016.1.051\compilers_and_libraries_2016\windows\compiler\lib\mic
    • Make a folder called lib in your run folder (i.e. T:\Wes\Phi\lib), and copy all the files you just found into it, including the "locale" folder. If you like command-line copying, then something like
      xcopy *.* /e T:\Wes\Phi\Lib

A batch file to run the job

I'll call this run.bat, and put it in T:\Wes\Phi. I'll assume we'll have the working directory set, so...

set MIC_LD_LIBRARY_PATH=\\fi--didef3\Tmp\Wes\Phi\lib
PhiTest.exe
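
If the variable is missing or misspelt, the failure can be hard to spot from the job output. As a rough sketch, the program could check for it at startup before attempting any offload. The helper names env_or_empty and check_mic_library_path are invented for this example, and the wording of the warning is only my guess at what would be useful.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the value of an environment variable,
// or an empty string if it is not set.
std::string env_or_empty(const char* name) {
    assert(name != nullptr);
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string();
}

// Hypothetical helper: reports on stderr whether MIC_LD_LIBRARY_PATH is
// set, and returns true if it is.
bool check_mic_library_path() {
    const std::string path = env_or_empty("MIC_LD_LIBRARY_PATH");
    if (path.empty()) {
        std::fprintf(stderr,
                     "MIC_LD_LIBRARY_PATH is not set; the offload code may "
                     "not find its libraries\n");
        return false;
    }
    std::fprintf(stderr, "MIC_LD_LIBRARY_PATH=%s\n", path.c_str());
    return true;
}
```

Calling check_mic_library_path() as the first line of main would put the result in err.txt, next to any later offload errors.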

And my launch file

job submit /scheduler:fi--didemrchnb /numnodes:1 /singlenode:false /jobtemplate:Phi /workdir:\\fi--didef3\Tmp\Wes\Phi /stdout:out.txt /stderr:err.txt run.bat

Remember that /singlenode:false is the silly hack we have to use when we ask for a single, whole node.

And the result

In my out.txt, I have...

\\fi--didef2\Tmp\Wes\Phi>set MIC_LD_LIBRARY_PATH=\\fi--didef3\Tmp\Wes\Phi\lib 

\\fi--didef2\Tmp\Wes\Phi>\\fi--didef3\Tmp\Wes\Phi\PhiTest.exe
Offload thread 112
Offload thread 43
Offload thread 117
....
Offload thread 45
There are 240 offload threads

Local thread 12
Local thread 5
Local thread 9
....
There are 16 local threads

That's a lot of threads. Note that my code ran the local section first, but its output appears after the offload output. There may also be interleaving issues when many threads write to stdout, so real code shouldn't rely on print order.
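
One way to avoid the ordering problem, sketched here for the host side only, is to have each thread write its result into its own slot of an array and then print everything from a single thread afterwards. The function names collect_thread_ids and print_thread_ids are made up for this example. For the offloaded version, the array would also need copying back from the card, which I haven't shown.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: each thread stores its id in its own slot, so no
// two threads ever write to the same element. Without OpenMP this simply
// returns a vector containing 0.
std::vector<int> collect_thread_ids() {
    int max_threads = 1;
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif
    std::vector<int> ids(max_threads, -1);
    int* slots = ids.data();
    int team_size = 1;

    #pragma omp parallel shared(team_size)
    {
        int id = 0;
#ifdef _OPENMP
        id = omp_get_thread_num();
        // One thread records how many threads actually ran.
        #pragma omp single
        team_size = omp_get_num_threads();
#endif
        assert(id < max_threads);
        slots[id] = id;
    }

    // Drop any slots that were never used.
    ids.resize(team_size);
    return ids;
}

// Prints the collected ids in order, from the calling thread only.
void print_thread_ids(const std::vector<int>& ids) {
    for (std::size_t i = 0; i < ids.size(); ++i) {
        std::printf("Local thread %d\n", ids[i]);
    }
    std::printf("There are %d local threads\n", static_cast<int>(ids.size()));
}
```

Calling print_thread_ids(collect_thread_ids()) gives the thread lines in numerical order, followed by the count.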