CPU usage when an MPI rank waits during blocking communication
An answer to this question on the Scientific Computing Stack Exchange.
Question
A typical way of dealing with I/O in MPI parallel programs is to either read all data to a single node and dispatch to the other nodes accordingly, or send all data to a single node and write from this node.
I currently use blocking communication.
Since the "master" process can only communicate with one node at a time, most nodes are stuck in the blocking communication step: they essentially "do nothing" while they wait for the master to send/receive. This is completely understandable.
However, when I have a look at CPU usage using "top", all copies of the program show a full 99-100% CPU usage. This means that (contrary to my initial guess) waiting for the other side of the communication is not an "idle task" taking up a fairly low amount of CPU, but rather a "busy" one.
How can I reduce the CPU usage for these "waiting" nodes? Is non-blocking communication relevant here?
Thanks for any answer,
Answer
It depends on the communication settings you use for MPI. For blocking communication, an MPI implementation typically offers three wait modes.
1. Aggressive busy wait. This is the usual default mode. Open MPI, at least, uses it when it thinks it is exactly- or under-subscribed (number of processes <= number of processors). In this mode processes never voluntarily give up the processor: Open MPI spins in tight loops, attempting to make message-passing progress as fast as possible. Other processes do not get any CPU cycles and cannot progress. Force this mode with:

   mpirun -np N --mca mpi_yield_when_idle 0 ./a.out

2. Degraded busy wait. Useful if you are oversubscribed (number of processes > number of processors): processes frequently yield the processor, thereby allowing multiple processes to progress. Slightly slower than aggressive mode if you are not oversubscribed. Force this mode with:

   mpirun -np N --mca mpi_yield_when_idle 1 ./a.out

3. Polling. This one you have to set up yourself by, e.g., calling MPI_Iprobe() in a loop with a sleep call.
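The polling approach can be sketched as below. This is a minimal, illustrative example, not the asker's actual program: the master/worker roles, the message tag, the double payload, and the 1 ms sleep interval are all assumptions chosen for demonstration.

```c
/* Sketch of polling: instead of blocking in MPI_Recv, a worker
 * checks for a pending message with MPI_Iprobe and sleeps between
 * checks, so it consumes little CPU while waiting. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>   /* usleep */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {
        /* Worker: poll for a message from the master (rank 0). */
        int flag = 0;
        MPI_Status status;
        while (!flag) {
            MPI_Iprobe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
            if (!flag)
                usleep(1000);  /* sleep 1 ms between probes; tune as needed */
        }
        /* A message is now pending, so this receive returns immediately. */
        double buf;
        MPI_Recv(&buf, 1, MPI_DOUBLE, 0, MPI_ANY_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received %g\n", rank, buf);
    } else {
        /* Master: send one value to each worker. */
        for (int dest = 1; dest < size; ++dest) {
            double val = 1.0;  /* placeholder payload */
            MPI_Send(&val, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```

The trade-off is latency: each probe-and-sleep cycle can delay message delivery by up to the sleep interval, so pick an interval that balances CPU usage against responsiveness.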