In this article, I will introduce the installation and basic use of the MPI framework through Octave. Matlab is widely used in many scientific domains because it lets developers focus on algorithms rather than programming details. However, Matlab's support for parallel computing is limited and expensive, especially for computing on clusters. Currently, two parallelism frameworks are widely used: MPI (Message Passing Interface) and MapReduce. Hadoop, a Java implementation of MapReduce, does not currently support Matlab or Octave. Therefore, I chose MPI + Octave as a parallel computing solution for Matlab users.

1 Installation

1.1 Local Machine

  • Install open-mpi.
    • On Mac OS X with homebrew, brew install open-mpi
    • On Ubuntu, the direct way is to install the packages via the package manager: sudo apt-get install openmpi-bin openmpi-common openmpi-checkpoint libopenmpi-dev libopenmpi-X.X. However, the OpenMPI installed this way does not work with the octave-mpi package due to a library linkage problem. A quick fix is to compile and install it manually with a specific configuration. Download the OpenMPI package openmpi-1.8.3.tar.gz.
      tar xzvf openmpi-1.8.3.tar.gz
      cd openmpi-1.8.3
      ./configure --disable-dlopen
      make
      sudo make install
      
  • Install octave
    • On Mac OS X with homebrew
      1. Install XQuartz from the disk image downloaded from http://xquartz.macosforge.org/landing/.
      2. brew tap homebrew/science
      3. Install gfortran: brew install gcc
      4. brew install octave
    • On Ubuntu sudo apt-get install octave
  • Install octave mpi package
    • Inside octave
      putenv ("PATH", ["path/to/mpic++/:" getenv("PATH")]) % optional
      pkg install -forge mpi      % install mpi from internet or
      pkg install PATH/TO/PACKAGE % install mpi from downloaded package
      
    • Through the package manager on Ubuntu (deprecated): run apt-get install octave-openmpi-ext. Because the package manager automatically installs the stock OpenMPI package, which does not actually work with octave-mpi, we suggest NOT using this method.

1.2 Virtual Machine

2 Octave

2.1 Running Octave code with MPI

  • The first way is to run the code from a shell/terminal/ssh session.
    mpirun [OPTIONS] octave -q --eval '[octave code, function or script name]'
    
  • The second way is to call the system command inside Octave.
    system ("mpirun octave -q --eval '[octave code, function or script name]'");
    
  • The Octave mpi package contains some simple examples that show how to write Octave code parallelized via MPI. They are in one of two locations if you install the package globally:
    • /usr/share/doc/octave-mpi/examples/
    • /usr/share/octave/packages/mpi-1.2.0/

    If you install the package under the user’s home folder, you can find the example code under $HOME/octave/packages/mpi-1.2.0/. To test an example inside Octave, run the command [info res] = system ("mpirun octave -q --eval 'pkg load mpi; helloworld ()'");. The string displayed by the function, e.g. helloworld(), is captured by this invocation, so you can embed the call in another serial Octave script and parse the result with str2num(); a sketch of this pattern follows the list below.

  • Frequently used mpirun options: --hostfile specifies the file that contains the hostnames of the cluster nodes; -np specifies the number of processes to run.
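
Below is a minimal sketch of the embedding pattern mentioned above. The function name parsum and the process count are placeholders; substitute whatever MPI-enabled function your project actually provides, as long as it prints a numeric result.

% Hedged sketch: call a hypothetical MPI-parallelized function "parsum"
% from a serial Octave script and parse its printed output.
cmd = "mpirun -np 4 octave -q --eval 'pkg load mpi; parsum ()'";
[info, out] = system (cmd);   % out collects everything the MPI run prints
if info == 0
  result = str2num (out)      % convert the printed text back into a number/matrix
else
  error ("mpirun failed with exit code %d", info);
end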

2.2 Programming

The interface provided by the Octave mpi package is a subset of the full set of MPI functions.

Example 1: a basic example of initializing the environment and sending and receiving messages/variables. These functions alone are enough to build a parallel program.

MPI_Init();
communicator = MPI_Comm_Load("a label string") % one can use multiple communicators identified by different labels
my_rank = MPI_Comm_rank(communicator) % rank is the indicator (id) of the current process
p = MPI_Comm_size(communicator) % the size of a communicator is the number of processes it handles
[info] = MPI_Send(value, ranks, TAG, communicator);
% value:  an Octave variable
% ranks:  a vector containing the ids of the destination processes; can be a single integer
% TAG:    an integer identifying the message (when many messages are sent in the same context)
% info:   an integer indicating the success or failure state of the operation
[value, info] = MPI_Recv(source, TAG, communicator)
% source: the id of the process that sent this message;
%         receive from any process if source is -1;
%         ranks start from 0 and process 0 is usually used as the master process
% value:  the received variable
MPI_Finalize()
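
Putting these calls together, here is a minimal sketch of a complete program that uses only the functions listed above. The label string, the file name pingpong.m, and the payload are arbitrary choices for illustration.

% Minimal sketch: the master (rank 0) sends a vector to every worker.
% Save as pingpong.m somewhere on the Octave path and run, e.g., with:
%   mpirun -np 3 octave -q --eval 'pkg load mpi; pingpong ()'
function pingpong ()
  MPI_Init ();
  comm    = MPI_Comm_Load ("pingpong");          % communicator identified by a label
  my_rank = MPI_Comm_rank (comm);
  nproc   = MPI_Comm_size (comm);
  TAG = 1;
  if my_rank == 0
    for dest = 1:(nproc - 1)
      MPI_Send (rand (1, 5), dest, TAG, comm);   % master sends a variable to each worker
    end
  else
    [vec, info] = MPI_Recv (0, TAG, comm);       % worker receives from the master (rank 0)
    disp (vec);
  end
  MPI_Finalize ();
end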

Process sequence control functions:

  • Function MPI_Barrier(communicator) blocks the calling process until all processes have reached this point.
  • Function MPI_Probe(RANK, TAG, COMM) tests whether process RANK has sent a message with tag TAG on communicator COMM; it blocks until such a message arrives.
  • Function MPI_Iprobe(RANK, TAG, COMM) performs the same test but does NOT block if no message has arrived.
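
For example, a minimal sketch of MPI_Barrier (the label string and phase names are arbitrary choices):

% Hedged sketch: no process starts phase 2 before all have finished phase 1.
MPI_Init ();
comm = MPI_Comm_Load ("phases");
rank = MPI_Comm_rank (comm);
printf ("process %d finished phase 1\n", rank);
MPI_Barrier (comm);                 % wait here until every process reaches this line
printf ("process %d starting phase 2\n", rank);
MPI_Finalize ();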

2.3 Summary

If you run Octave with MPI, you will find multiple Octave processes in the process monitor. Once you run your MPI program, multiple clones of your program are created. Each of them is a full clone, which means code before MPI_Init or after MPI_Finalize runs in every process. You can test this by putting some arbitrary function call before MPI_Init and observing the result. So, if you have a complex application and only want to use MPI to accelerate a specific function, you should call the MPI program as an independent application and get the result back in an indirect way. For example, you can write the result into a file and read it from your main application. On a cluster, however, that requires a distributed file system, which is beyond the scope of this article.

3 C

The MPI interface for the C language is much more powerful than the one for Octave, but it follows the same procedure: initialize the MPI environment (world), then communicate using process ids and message tags. I recommend the tutorial at https://computing.llnl.gov/tutorials/mpi/ as learning material; it provides many exercises covering the basic features of MPI in C.

Compile a C source file using the command below; for more options, check the documentation of the version you installed.

mpicc [YOUR_C_SOURCE_FILE]
  • Point-to-point communication
    • Blocking: MPI_Send(buffer,count,type,dest,tag,comm) and MPI_Recv(buffer,count,type,source,tag,comm,status)
      • buffer The variable you want to send, or the variable into which the data will be received. In C, you pass the buffer by reference, e.g. &var.
      • count The number of data elements to be sent; since an address/reference is passed, the receiver can read a whole array.
      • type MPI defines a series of data types corresponding to the C types. These types are used when transferring data between processes (over the network on clusters).
      • dest/source The id (rank) of the process you want to send to or receive from.
      • tag An integer used to identify the data you receive. Since MPI can transfer data over a network, data sent later may arrive earlier. If the processes of your MPI application communicate multiple times, tags are necessary to receive the data and process it in the correct order.
      • comm The communicator, e.g. MPI_COMM_WORLD, the MPI context created at the beginning of the application.
      • status A structure filled in by MPI_Recv with information about the received message, such as its actual source and tag.
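
      As a minimal sketch of the blocking calls (variable names and the tag value are arbitrary; run with at least two processes, e.g. mpirun -np 2 ./a.out):

      #include "mpi.h"
      #include <stdio.h>

      /* Hedged sketch: rank 0 sends one integer to rank 1 with blocking calls. */
      int main(int argc, char *argv[]) {
        int rank, value, tag = 7;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
          value = 42;
          /* returns once the buffer "value" may be reused */
          MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
          /* returns once the data has actually arrived */
          MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
          printf("rank 1 received %d from rank %d\n", value, status.MPI_SOURCE);
        }

        MPI_Finalize();
        return 0;
      }
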
    • Non-blocking: MPI_Isend(buffer,count,type,dest,tag,comm,request) and MPI_Irecv(buffer,count,type,source,tag,comm,request). First, I have to clarify the blocking/non-blocking concept in this context; it is different from blocking/non-blocking in multi-threaded programming. Blocking means the application will not continue until the data in the application buffer has been completely sent out (handed over to the network layer), while non-blocking means the application continues without waiting. Pay attention: a blocking send waits until the data has left the application buffer, not until it has been received by the other process. And I think that is why the tag is necessary in both cases: if the application blocked until the data were received, the tag would not be needed, but the application would become sequential. Besides, you could not post a receive before the matching send, because it would block.

      What is the benefit of non-blocking? You do not have to wait for the data transfer to finish before doing other work. In the example code below, the processes post their receives before their sends.

      #include "mpi.h"
      #include <stdio.h>
      
      int main(int argc, char *argv[])  {
      int numtasks, rank, next, prev, buf[2], tag1=1, tag2=2;
      MPI_Request reqs[4];
      MPI_Status stats[4];
      
      MPI_Init(&argc,&argv);
      MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      
      prev = rank-1;
      next = rank+1;
      if (rank == 0)  prev = numtasks - 1;
      if (rank == (numtasks - 1))  next = 0;
      
      MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0]);
      MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1]);
      
      MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2]);
      MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3]);
      
      /* ... do some work here ... */
      
      MPI_Waitall(4, reqs, stats);
      
      MPI_Finalize();
      }
      
  • Collective communication. Basically, all collective communications can be implemented using the point-to-point methods; a minimal MPI_Reduce sketch follows this list.
    • MPI_Reduce Combines data from all processes with a reduction operation, such as summing an array, and delivers the result to one process.
    • MPI_Scatter Splits data held by one process and distributes the pieces to all processes.
    • MPI_Gather Gathers data from all processes onto one process.
    • MPI_Barrier For synchronization.
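
    A minimal sketch of MPI_Reduce (every process contributes one integer and rank 0 receives the sum; the values are arbitrary):

    #include "mpi.h"
    #include <stdio.h>

    /* Hedged sketch: sum one value per process onto rank 0. */
    int main(int argc, char *argv[]) {
      int rank, numtasks, local, global;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

      local = rank + 1;    /* each process contributes its own value */
      MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0)
        printf("sum of 1..%d = %d\n", numtasks, global);

      MPI_Finalize();
      return 0;
    }
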
  • Derived data types. Though MPI provides a series of primitive types, it is also useful to let users define their own data types. You define a derived type and then use it wherever an MPI type is expected; a small sketch of the contiguous case follows the list below.
    1. Contiguous TODO
    2. Vector TODO
    3. Indexed TODO
    4. Struct TODO
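
    As a minimal sketch of the idea, here is the contiguous case only; the other constructors follow the same define-commit-use pattern (the payload and tag are arbitrary, run with at least two processes):

    #include "mpi.h"
    #include <stdio.h>

    /* Hedged sketch: a derived type describing 3 contiguous ints,
       sent from rank 0 to rank 1. */
    int main(int argc, char *argv[]) {
      int rank, data[3] = {1, 2, 3};
      MPI_Datatype triple;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Type_contiguous(3, MPI_INT, &triple);   /* 3 consecutive MPI_INTs */
      MPI_Type_commit(&triple);                   /* must commit before use */

      if (rank == 0)
        MPI_Send(data, 1, triple, 1, 0, MPI_COMM_WORLD);           /* one element of the new type */
      else if (rank == 1)
        MPI_Recv(data, 1, triple, 0, 0, MPI_COMM_WORLD, &status);

      MPI_Type_free(&triple);
      MPI_Finalize();
      return 0;
    }
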
  • Group and communicator management. A group is an ordered set of processes in which the process ids (ranks) go from zero to N-1. A communicator encompasses a group of processes; the communicator that comprises all tasks is represented by MPI_COMM_WORLD. From the programmer's point of view, group and communicator are essentially the same thing.

    Groups can be destroyed and created during execution. One process can be in multiple groups.
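
    A minimal sketch using MPI_Comm_split to divide MPI_COMM_WORLD into two smaller communicators (splitting by rank parity is an arbitrary choice for illustration):

    #include "mpi.h"
    #include <stdio.h>

    /* Hedged sketch: split the world communicator into even and odd ranks. */
    int main(int argc, char *argv[]) {
      int world_rank, sub_rank, color;
      MPI_Comm sub_comm;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

      color = world_rank % 2;    /* processes with the same color share a new communicator */
      MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);
      MPI_Comm_rank(sub_comm, &sub_rank);    /* rank within the new, smaller group */

      printf("world rank %d -> group %d, sub rank %d\n", world_rank, color, sub_rank);

      MPI_Comm_free(&sub_comm);
      MPI_Finalize();
      return 0;
    }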

  • Virtual topology routines. These organize the processes geometrically, for example arranging them in a two-dimensional grid; a minimal sketch follows below.
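
    A minimal sketch of a Cartesian topology (the 2x2 grid is an arbitrary choice; run with exactly 4 processes, e.g. mpirun -np 4 ./a.out):

    #include "mpi.h"
    #include <stdio.h>

    /* Hedged sketch: arrange 4 processes in a 2x2 grid and print each position. */
    int main(int argc, char *argv[]) {
      int rank, coords[2];
      int dims[2] = {2, 2}, periods[2] = {0, 0};   /* 2x2 grid, no wrap-around */
      MPI_Comm grid;

      MPI_Init(&argc, &argv);
      MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
      MPI_Comm_rank(grid, &rank);
      MPI_Cart_coords(grid, rank, 2, coords);      /* my (row, column) position */
      printf("rank %d sits at (%d, %d)\n", rank, coords[0], coords[1]);

      MPI_Comm_free(&grid);
      MPI_Finalize();
      return 0;
    }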

4 Performance Comparison