Understanding memory sharing on a GPGPU using a lattice example
An answer to this question on the Scientific Computing Stack Exchange.
Question
I am new to the world of GPUs; I have only used them from the Matlab environment, so I never needed to appreciate the subtleties of these devices.
I know that a GPU can be divided into multiprocessors (also called streaming multiprocessors) whose cores share a cache memory. What I don't understand is whether the host is required for communication between multiprocessors. To focus my question, I will give an example.
Take a fluid dynamics simulation in which I discretize my domain into $N$ cells and assign $M$ cells to each multiprocessor. In this ultra-generic simulation I have some continuity equations between each cell and its neighbors, which must be updated at each time step. If the cells belong to the same multiprocessor, everything is fine, since they can share data through the common cache. The situation is different for cells some of whose neighbors belong to another multiprocessor. My question is:
How can cells belonging to different multiprocessors communicate for each time step?
I could transfer the data through the CPU, but I suspect that would be inefficient.
In my opinion, this is the central point of the computational aspect of Lattice Boltzmann Methods.
Help me to understand that.
Answer
The GPU consists of several streaming multiprocessors.
Each SM has 64-96 kB of shared memory that can be accessed by the up to 1024-2048 threads resident on it. This shared memory allows those threads to communicate.
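As a minimal sketch of this intra-SM communication (hypothetical kernel and array names, assuming a 1D grid of cells), threads of one block stage their cells into shared memory, synchronize, and then read values that other threads of the same block loaded:

```cuda
// Sketch: threads within one block communicate through shared memory.
// Each block stages its tile of `in` into shared memory, synchronizes,
// then each thread reads its neighbors' staged values without touching
// global memory again.
__global__ void tile_average(const float *in, float *out, int n)
{
    __shared__ float tile[256];                 // one tile per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Every thread of the block must reach __syncthreads(), so guard
    // the load rather than returning early.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // tile now fully visible
    if (i >= n) return;

    // Interior threads average with values staged by OTHER threads of
    // the same block -- communication through shared memory.
    if (threadIdx.x > 0 && threadIdx.x < blockDim.x - 1 && i < n - 1)
        out[i] = (tile[threadIdx.x - 1] + tile[threadIdx.x]
                  + tile[threadIdx.x + 1]) / 3.0f;
    else
        out[i] = tile[threadIdx.x];             // boundary: plain copy
}
```

Note that `__syncthreads()` only synchronizes threads of one block; it cannot coordinate threads running on different SMs, which is where the question's problem begins.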
To communicate between SMs you must write to and read from the GPU's global memory, which is 4-32GB in size.
It is better, though, to think of your problem as a large number of 1D or 2D chunks that may be assigned to arbitrary SMs, and to think of the "communication" as the storage of intermediate results in global memory. That is, an SM may "communicate" with itself from one kernel launch to the next.
Thus, because the GPU's global memory is used for storing these intermediates, communication with the host in the middle of a calculation is not necessary.
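A hedged sketch of what this looks like for a lattice time loop (all names hypothetical; `step_kernel` stands for whatever update your continuity equations require, reading every neighbor it needs from global memory as written during the previous step):

```cuda
// Assumed update kernel: reads neighbors of each cell from `curr`
// (global memory, written during the previous step), writes `next`.
__global__ void step_kernel(const float *curr, float *next, int n);

void run_simulation(const float *h_init, float *h_result,
                    int N, int n_steps)
{
    float *d_curr, *d_next;                     // global-memory buffers
    cudaMalloc(&d_curr, N * sizeof(float));
    cudaMalloc(&d_next, N * sizeof(float));
    cudaMemcpy(d_curr, h_init, N * sizeof(float),
               cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    for (int t = 0; t < n_steps; ++t) {
        // The kernel-launch boundary acts as a device-wide barrier:
        // when step t+1 starts, every block -- on whichever SM it
        // runs -- sees the complete results of step t in global memory.
        step_kernel<<<blocks, threads>>>(d_curr, d_next, N);

        // Swap pointers on the host; the data never leaves the GPU.
        float *tmp = d_curr; d_curr = d_next; d_next = tmp;
    }

    // Only the final state crosses back to the host.
    cudaMemcpy(h_result, d_curr, N * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_curr);
    cudaFree(d_next);
}
```

The ping-pong buffer pair is the "storage of intermediate results" described above: cells whose neighbors live in blocks on other SMs simply read those neighbors' previous-step values from global memory, and no host transfer occurs inside the loop.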