Parallelism
In this article, I will introduce the installation and basic use of the MPI framework through Octave. Matlab is widely used in many scientific domains because it lets developers focus on algorithms rather than programming details. However, Matlab's support for parallel computing is limited and expensive, especially for computing on clusters. Currently, two parallelism frameworks are widely used: MPI (Message Passing Interface) and MapReduce. Hadoop, a Java implementation of MapReduce, does not currently support Matlab or Octave. Therefore, I chose MPI + Octave as a parallelism solution for Matlab users.
1 Installation
1.1 Local Machine
- Install open-mpi.
- On Mac OS X with homebrew,
brew install open-mpi
- On Ubuntu, the direct way is to install the packages via the package manager
sudo apt-get install openmpi-bin openmpi-common openmpi-checkpoint libopenmpi-dev libopenmpi-X.X
. However, the OpenMPI installed through this method does not work with the octave-mpi package due to a library linkage problem. A quick fix is to compile and install it manually with a specific configuration. Download the openmpi package openmpi-1.8.3.tar.gz, then:
tar xzvf openmpi-1.8.3.tar.gz
cd openmpi-1.8.3
./configure --disable-dlopen
make
sudo make install
- Install octave
- On Mac OS X with homebrew
- install XQuartz from the image file downloaded from http://xquartz.macosforge.org/landing/.
brew tap homebrew/science
- install gfortran (provided by the gcc formula)
brew install gcc
brew install octave
- On Ubuntu
sudo apt-get install octave
- Install octave mpi package
- Inside octave
putenv ("PATH", ["path/to/mpic++/:" getenv("PATH")]) % optional pkg install -forge mpi % install mpi from internet or pkg install PATH/TO/PACKAGE % install mpi from downloaded package
- Through package manager in Ubuntu (deprecated), run
apt-get install octave-openmpi-ext
. As the package manager automatically installs the OpenMPI package that does not actually work with octave-mpi, we suggest NOT using this method.
1.2 Virtual Machine
2 Octave
2.1 Run Octave code with MPI
- The first way is to run the code from a shell/terminal/ssh session.
mpirun [OPTIONS] octave -q --eval '[octave code, function or script name]'
- The second way is to call the system command inside Octave.
system ("mpirun octave -q --eval '[octave code, function or script name]'");
- The Octave mpi package contains some simple examples that show
how to write Octave code that runs in parallel via MPI. The
example code is in one of two possible locations if you install
the package globally:
/usr/share/doc/octave-mpi/examples/
/usr/share/octave/packages/mpi-1.2.0/
If you install the package under the user's home folder, you can find the example code under
$HOME/octave/packages/mpi-1.2.0/
. To test the example inside Octave, run the command
[info, res] = system ("mpirun octave -q --eval 'pkg load mpi; helloworld ()'");
. The string displayed by the function, e.g. helloworld(), is returned by this invocation in res. Therefore you can embed it in another serial Octave script (use str2num() to parse the output); see the sketch after the option list below.
- Frequently used mpirun options:
--hostfile: indicates the file that contains the hostnames of the cluster;
-np: the number of processes to run.
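As a concrete illustration of embedding the parallel result in a serial script, here is a minimal sketch; the script name mysum.m (assumed to print a single number) and the process count are my own assumptions, not part of the package:
% mysum.m is a hypothetical parallel script that disp()s a single number
[info, res] = system ("mpirun -np 4 octave -q --eval 'pkg load mpi; mysum ()'");
result = str2num (res);   % parse the printed output back into a numeric value
printf ("parallel result: %g\n", result);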
2.2 Programming
The interface provided by the Octave mpi package is a subset of the full set of MPI functions.
Example 1: a basic example of initializing the environment and sending and receiving messages/variables. These functions are basically enough to create a parallel program.
MPI_Init();
communicator = MPI_Comm_Load("a label string") % one can use multiple communicators identified by different labels
my_rank = MPI_Comm_rank(communicator)          % rank is the indicator (id) of the current process
p = MPI_Comm_size(communicator)                % the size of a communicator is the number of processes it handles
[info] = MPI_Send(value, ranks, TAG, communicator);
% value: an octave variable
% ranks: a vector containing the ids of the destination processes; can be a single integer
% TAG: an integer identifying the message (useful when many messages are sent in the same context)
% info: an integer indicating the success or failure state of the operation
[value, info] = MPI_Recv(source, TAG, communicator)
% source: the id of the process that sent this message;
%         receive from any process if source is -1;
%         ranks start from 0 and process 0 is usually used as the master process
% value: the received variable
MPI_Finalize()
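To make the calling sequence concrete, here is a minimal sketch built only from the functions listed above (my own illustration, assuming they behave as described): every worker sends its rank to process 0, which prints the values.
MPI_Init ();
CW   = MPI_Comm_Load ("demo");  % "demo" is an arbitrary label
rank = MPI_Comm_rank (CW);
np   = MPI_Comm_size (CW);
TAG  = 1;
if rank == 0
  for src = 1:np-1
    [val, info] = MPI_Recv (src, TAG, CW);   % collect one value from each worker
    printf ("received %d from process %d\n", val, src);
  endfor
else
  MPI_Send (rank, 0, TAG, CW);               % workers send their rank to the master
endif
MPI_Finalize ();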
Process sequence control functions:
- MPI_Barrier(communicator) blocks the process until all processes have reached this location.
- MPI_Probe(RANK, TAG, COMM) tests whether process RANK has sent a message with TAG on communicator COMM, and blocks the process if no message has arrived.
- MPI_Iprobe(RANK, TAG, COMM) tests the same thing but does NOT block the process if no message has arrived.
2.3 Summary
If you run Octave with MPI, you will find multiple Octave processes in the process monitor. Once you run your MPI program, multiple clones of your program are created. Each of them is a full clone, which means the code before MPI_Init or after MPI_Finalize is executed in every process. You can test this by putting some random function call before MPI_Init and watching the result. So, if you have a complex application and only want to use MPI to accelerate a specific function, you should call the MPI program as an independent application and get the result in an indirect way. For example, you can write the result into a file and read it in your main application; on a cluster that would require a distributed file system, which is beyond the scope of this article.
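As a sketch of that file-based approach (my own illustration; the script name my_mpi_job.m, the file name result.txt, and the variable names are assumptions), the MPI program saves its result on the master process and the serial application reads it back afterwards:
% inside the parallel program (my_mpi_job.m, hypothetical), the master process
% writes its final result to a text file:
%   if my_rank == 0
%     save ("-text", "result.txt", "res");
%   endif
%
% the main serial application then launches the MPI job and reads the file back:
system ("mpirun -np 4 octave -q --eval 'pkg load mpi; my_mpi_job ()'");
load ("result.txt");      % loads the variable res into the workspace
disp (res);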
3 C
The MPI interface for the C language is much more powerful than that for Octave, but it follows the same basic procedure: initialize the MPI environment (world), then communicate using process ids and message tags. I recommend the tutorial at [https://computing.llnl.gov/tutorials/mpi/] as learning material; it provides many exercises covering the basic features of MPI in C.
Compile a C source file using the command below; for more options, please check the documentation of the version you installed.
mpicc [YOUR_C_SOURCE_FILE]
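A typical compile-and-run cycle looks like this (the file and program names are placeholders of my own):
mpicc -o my_program my_program.c
mpirun -np 4 ./my_program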
- Point-to-point communication
- blocking
MPI_Send(buffer,count,type,dest,tag,comm)
MPI_Recv(buffer,count,type,source,tag,comm, status)
- buffer
The buffer is the variable you want to send, or in which you want to receive the data. In C, you have to pass the buffer by reference, i.e. &var.
- count The number of data elements to be sent; since you send an address/reference, the receiver can read an array.
- type MPI defines a series of data types for each C type. These types are used for transferring data between processes (over the network on clusters).
- dest/source The id (rank) of the process you want to send to or receive from.
- tag The tag is used to identify the data you received. Since MPI can transfer data over a network, data sent later may arrive earlier. If the processes of your MPI application have to communicate multiple times, tags are necessary to receive the data and process it in the correct order.
- comm The communicator, e.g. MPI_COMM_WORLD, which is the MPI context created at the beginning of the application.
- status A structure of type MPI_Status that MPI_Recv fills with information about the received message, such as its actual source and tag.
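Below is a minimal sketch of a blocking exchange using these parameters (my own illustration, assuming exactly two processes are launched): process 0 sends an integer to process 1.
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, value, tag = 0;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    value = 42;                                                    /* data to send */
    MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);          /* blocking send to process 1 */
  } else if (rank == 1) {
    MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status); /* blocking receive from process 0 */
    printf("process 1 received %d\n", value);
  }

  MPI_Finalize();
  return 0;
}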
- non-blocking
MPI_Isend(buffer,count,type,dest,tag,comm,request)
MPI_Irecv(buffer,count,type,source,tag,comm,request)
First, I have to clarify the blocking/non-blocking concept in this context. It is different from blocking/non-blocking in multi-threaded programming. Blocking means the application will not continue until the data in the application buffer has been completely sent out (through the network protocol). Non-blocking means the application continues without waiting. Pay attention: the blocking call waits until the data is sent out, not until it is received by the other process. I think that is why the tag is necessary in both cases: if the application blocked until the data was received, the tag would not be necessary and the application would become sequential. Besides, with blocking calls you cannot post the receive before the send, because the receive blocks. What is the benefit of non-blocking? You do not have to wait for the data to be received. In the example code below, we can see that the processes post their receives before their sends.
#include "mpi.h" #include <stdio.h> main(int argc, char *argv[]) { int numtasks, rank, next, prev, buf[2], tag1=1, tag2=2; MPI_Request reqs[4]; MPI_Status stats[4]; MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD, &numtasks); MPI_Comm_rank(MPI_COMM_WORLD, &rank); prev = rank-1; next = rank+1; if (rank == 0) prev = numtasks - 1; if (rank == (numtasks - 1)) next = 0; MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0]); MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1]); MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2]); MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3]); { do some work } MPI_Waitall(4, reqs, stats); MPI_Finalize(); }
- Collective communication
Basically, all the collective communications can be implemented using the point-to-point communication methods; a short MPI_Reduce sketch follows the list below.
- MPI_Reduce Combine data from all processes into one process with a reduction operation, such as computing the sum of an array.
- MPI_Scatter Distribute chunks of data from one process to all processes.
- MPI_Gather Gather data from all processes to one process.
- MPI_Barrier For synchronizing.
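As a minimal sketch of a collective call (my own illustration): each process contributes its rank and MPI_Reduce sums the contributions on process 0.
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, sum;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* every process contributes its rank; the sum ends up on process 0 */
  MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("sum of all ranks: %d\n", sum);

  MPI_Finalize();
  return 0;
}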
- Derived data type
Though MPI provides a series of primitive types, it is also useful to let users define their own data types. You define a new type, commit it, and then use it like any other MPI type; a sketch of the contiguous case follows the list below.
- Contiguous TODO
- Vector TODO
- Indexed TODO
- Struct TODO
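For the contiguous case, a minimal sketch (my own illustration, not from the tutorial, assuming two processes) could look like this: a type describing a row of four floats is built, committed, and used to send one row at once.
#include "mpi.h"

int main(int argc, char *argv[]) {
  float row[4] = {1.0f, 2.0f, 3.0f, 4.0f};
  int rank;
  MPI_Datatype rowtype;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Type_contiguous(4, MPI_FLOAT, &rowtype);  /* four consecutive floats */
  MPI_Type_commit(&rowtype);                    /* must commit the type before use */

  if (rank == 0)
    MPI_Send(row, 1, rowtype, 1, 0, MPI_COMM_WORLD);           /* send one "row" element */
  else if (rank == 1)
    MPI_Recv(row, 1, rowtype, 0, 0, MPI_COMM_WORLD, &status);

  MPI_Type_free(&rowtype);
  MPI_Finalize();
  return 0;
}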
- Group and communicator management
A group is an ordered set of processes, where the ids (ranks) of the processes go from zero to N-1. A communicator encompasses a group of processes; the communicator that comprises all tasks is represented by MPI_COMM_WORLD. For the programmer, a group and its communicator are effectively the same thing.
Groups can be created and destroyed during execution, and one process can belong to multiple groups; see the MPI_Comm_split sketch below.
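A common way to create new communicators is MPI_Comm_split; the sketch below (my own illustration) splits MPI_COMM_WORLD into two communicators containing the even-ranked and odd-ranked processes.
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
  int world_rank, sub_rank;
  MPI_Comm subcomm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* processes with the same color end up in the same new communicator */
  MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);
  MPI_Comm_rank(subcomm, &sub_rank);

  printf("world rank %d has rank %d in its sub-communicator\n", world_rank, sub_rank);

  MPI_Comm_free(&subcomm);
  MPI_Finalize();
  return 0;
}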
- Virtual topology routines Organize the processes in a geometric way, for example arranging them in a two-dimensional grid; a small example follows.
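For instance, MPI_Cart_create can arrange the processes of a communicator into a 2D grid; the sketch below (my own illustration, assuming the job is launched with exactly 4 processes to match the 2x2 grid) prints the grid coordinates of each process.
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, coords[2];
  int dims[2] = {2, 2};      /* 2x2 grid, so run with 4 processes */
  int periods[2] = {0, 0};   /* no wrap-around in either dimension */
  MPI_Comm cartcomm;

  MPI_Init(&argc, &argv);

  MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cartcomm);
  MPI_Comm_rank(cartcomm, &rank);
  MPI_Cart_coords(cartcomm, rank, 2, coords);

  printf("rank %d sits at grid position (%d, %d)\n", rank, coords[0], coords[1]);

  MPI_Comm_free(&cartcomm);
  MPI_Finalize();
  return 0;
}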
4 Performance Comparison