CPU usage when an MPI rank waits during blocking communication
An answer to this question on the Scientific Computing Stack Exchange.
Question
A typical way of dealing with I/O in MPI parallel programs is to either read all data to a single node and dispatch to the other nodes accordingly, or send all data to a single node and write from this node.
I currently use blocking communication.
Since the "master" process can only communicate with one node at a time, most nodes are stuck in the blocking communication step: they essentially "do nothing" while they wait for the master to send/receive. This is completely understandable.
However, when I have a look at CPU usage using "top", all copies of the program show a full 99-100% CPU usage. This means that (contrary to my initial guess) waiting for the other side of the communication is not an "idle task" taking up a fairly low amount of CPU, but rather a "busy" one.
How can I reduce the CPU usage for these "waiting" nodes? Is non-blocking communication relevant here?
Thanks for any answer,
Answer
It depends on the communication settings you use for MPI. For blocking communication, an MPI implementation typically offers three wait modes.
1. Aggressive busy wait. This is the usual default mode. Open MPI, at least, uses it when it thinks it is exactly- or under-subscribed (number of processes <= number of processors). In this mode processes never voluntarily give up the processor: Open MPI spins in tight loops, attempting to make message-passing progress as fast as possible. Other processes do not get any CPU cycles and cannot progress. Force this mode with:

   mpirun -np N --mca mpi_yield_when_idle 0 ./a.out

2. Degraded busy wait. Useful if you are oversubscribed (number of processes > number of processors): processes frequently yield the processor, thereby allowing multiple processes to progress. Slightly slower than aggressive mode if you are not oversubscribed. Force this mode with:

   mpirun -np N --mca mpi_yield_when_idle 1 ./a.out

3. Polling. This one you have to set up yourself by, e.g., calling MPI_Iprobe() in a loop with a sleep call.
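The polling approach can be sketched as below. This is a minimal, illustrative example, not the asker's actual program: the master/worker roles, the message tag, the double payload, and the 1 ms sleep interval are all assumptions chosen for demonstration.

```c
/* Sketch of polling: instead of blocking in MPI_Recv, a worker
 * checks for a pending message with MPI_Iprobe and sleeps between
 * checks, so it consumes little CPU while waiting. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>   /* usleep */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {
        /* Worker: poll for a message from the master (rank 0). */
        int flag = 0;
        MPI_Status status;
        while (!flag) {
            MPI_Iprobe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
            if (!flag)
                usleep(1000);  /* sleep 1 ms between probes; tune as needed */
        }
        /* A message is now pending, so this receive returns immediately. */
        double buf;
        MPI_Recv(&buf, 1, MPI_DOUBLE, 0, MPI_ANY_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received %g\n", rank, buf);
    } else {
        /* Master: send one value to each worker. */
        for (int dest = 1; dest < size; ++dest) {
            double val = 1.0;  /* placeholder payload */
            MPI_Send(&val, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```

The trade-off is latency: each probe-and-sleep cycle can delay message delivery by up to the sleep interval, so pick an interval that balances CPU usage against responsiveness.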