Arm DDT's Message Queue debugging feature shows the status of MPI's message buffers. For example, it shows the messages that have been sent by a process but not yet received by the target.
You can use DDT to detect common errors such as deadlock, where all processes are waiting for each other. You can also use it to detect unexpected messages, which can indicate that two processes disagree about how far the program has progressed.
This capability relies on a debugging support library supplied by the MPI implementation, which the majority of MPIs provide. Not all implementations support the capability to the same degree, so expect the information provided to vary between implementations.
Open the Message Queues window by selecting Message Queues from the Tools menu. The Message Queues window will query the MPI processes for information about the state of the queues.
While the window is open, click Update to refresh the current queue information. Note that this will stop all playing processes. While DDT is gathering the data a "Please Wait" dialog may be displayed and you can cancel the request at any time.
DDT will automatically load the message queue support library from your MPI implementation (provided one exists). If it fails, an error message will be shown. Common reasons for failure to load include:
The support library does not exist, or its use must be explicitly enabled.
Most MPIs will build the library by default, without additional configuration flags. MPICH 2 and MPICH 3 must be configured with the --enable-debuginfo argument. MPICH 1.2.x must be configured with the --enable-debug argument. MVAPICH 2 must be configured with the --enable-debug and --enable-sharedlib arguments. Some MPIs, notably Cray's MPI, do not support message queue debugging at all.
Intel MPI includes the library, but debug mode must be enabled. See E.6 Intel MPI for details.
LAM and Open MPI automatically compile the library.
The support library is not available on the compute nodes where the MPI processes are running.
Ensure the library is available on the compute nodes, and set the environment variable ALLINEA_QUEUE_DLL if necessary to force DDT to use the library in its new location.
The support library has moved from its original installation location.
Ensure the proper procedure for the MPI configuration is used. This may require you to specify the installation directory as a configuration option.
Alternatively, you can include the path to the support library in LD_LIBRARY_PATH, or if this is not convenient you can set the environment variable ALLINEA_QUEUE_DLL to the absolute path of the library itself (for example, /usr/local/mpich-1.2.7/lib/libtvmpich.so).
The MPI is built to a different bit-size to the debugger.
In the unlikely case that the MPI is not built to the bit-size of the operating system, the debugger may not be able to find a support library of the correct size. This configuration is unsupported.
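Several of the failure causes above (library missing on the node, library moved, wrong bit-size) can be checked by hand before opening the Message Queues window. The following is a minimal sketch of such a pre-flight check, not something DDT provides: the helper names `elf_bits` and `check_queue_library` are hypothetical, and the only facts it relies on are the ALLINEA_QUEUE_DLL variable described above and the ELF file header, whose fifth byte records whether the file is 32-bit or 64-bit. It inspects the file without loading it.

```python
import os


def elf_bits(path):
    """Return 32 or 64 for an ELF file, or None if the file is not ELF."""
    with open(path, "rb") as f:
        header = f.read(5)
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None
    # The fifth byte of an ELF file (EI_CLASS) is 1 for 32-bit, 2 for 64-bit.
    return {1: 32, 2: 64}.get(header[4])


def check_queue_library(env=os.environ):
    """Describe what ALLINEA_QUEUE_DLL points at (hypothetical helper)."""
    path = env.get("ALLINEA_QUEUE_DLL")
    if not path:
        return "ALLINEA_QUEUE_DLL is not set"
    if not os.path.isabs(path):
        return f"{path}: not an absolute path"
    if not os.path.isfile(path):
        return f"{path}: no such file on this node"
    bits = elf_bits(path)
    if bits is None:
        return f"{path}: not an ELF shared library"
    return f"{path}: {bits}-bit ELF"


if __name__ == "__main__":
    print(check_queue_library())
```

Running the script on a compute node, rather than on the login node, is what tests the "not available on the compute nodes" case. Compare the reported bit-size with that of the debugger installation.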
To see the messages, you must select a communicator. The messages shown are those within that communicator's group. If the Show Local Ranks option is selected, the ranks displayed in the diagram are the ranks within the communicator (not MPI_COMM_WORLD). To see the 'usual' ranks, select Show Global Ranks.
The messages displayed can be restricted to particular processes or groups of processes. To restrict the display in the grid to a single process, select Individual Processes in the Display mode selector, and select the rank of the process. To select a group of processes, select Process Groups in the Display mode selector and select the ring arc corresponding to the required group. Both of these display modes support multiple selections.
Information is available for three different types of message queue. Different colors are used to display messages from each type of queue.
Send Queue: Calls to MPI send functions that have not yet completed.
Receive Queue: Calls to MPI receive functions that have not yet completed.
Unexpected Message Queue: Represents messages received by the system but the corresponding receive function call has not yet been made.
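The difference between the Receive queue and the Unexpected queue can be illustrated with a toy model. This is only a sketch of the definitions above, not MPI itself and not how DDT obtains its data. It assumes each message and each posted receive is described by a hypothetical (source, destination, tag) tuple, and it ignores communicators, message ordering, and wildcards such as MPI_ANY_SOURCE.

```python
from collections import Counter


def unmatched(arrived, posted_receives):
    """Split arrived messages and posted receives into unmatched remainders.

    Both arguments are lists of (source, destination, tag) tuples.
    Returns (unexpected, receive_queue) as sorted lists.
    """
    arrived_count = Counter(arrived)
    posted_count = Counter(posted_receives)
    # Arrived at the destination, but no matching receive has been posted.
    unexpected = arrived_count - posted_count
    # Receive has been posted, but no matching message has arrived.
    receive_queue = posted_count - arrived_count
    return sorted(unexpected.elements()), sorted(receive_queue.elements())


if __name__ == "__main__":
    arrived = [(0, 1, 7), (2, 1, 7)]
    posted = [(0, 1, 7), (3, 1, 9)]
    unexpected, receive_queue = unmatched(arrived, posted)
    print("unexpected:", unexpected)        # message from rank 2 to rank 1
    print("receive queue:", receive_queue)  # rank 1 waiting on rank 3
```

In this model, the message from rank 0 is matched and appears in neither list. The message from rank 2 has no posted receive, so it is unexpected, and the receive that rank 1 posted for rank 3 is still pending.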
Messages in the Send queue are represented by a red arrow, pointing from the sender to the recipient. The line is solid on the sender side, but dashed on the recipient side, to represent a message that has been sent but not yet received.
Messages in the Receive queue are represented by a green arrow, pointing from the sender to the recipient. The line is dashed on the sender side, but solid on the recipient side, to represent the recipient being ready to receive a message that has not yet been sent.
Messages in the Unexpected queue are represented by a dashed blue arrow, pointing from sender of the unexpected message to the recipient.
A message to self is indicated by a line with one end at the centre of the diagram.
A loop in the graph can indicate deadlock, where every process is waiting to receive from the preceding process in the loop. For synchronous communications, such as with MPI_Ssend, this is a common problem.
For other types of communication, such as with MPI_Send, messages can get stuck, for example in an O/S buffer, so that the send part of the communication is complete but the receive has not started. If the loop persists after playing the processes and interrupting them again, a deadlock is likely.
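The idea of a loop in the graph can be sketched as cycle detection in a "waits for" relation. The example below is an illustration under stated assumptions, not DDT's implementation. It assumes each blocked rank waits on exactly one other rank, and the function name `find_wait_cycle` and the dictionary input are hypothetical. As noted above, a cycle found in one snapshot is only a candidate deadlock, and it should persist across a second snapshot before being treated as one.

```python
def find_wait_cycle(waits_for):
    """Return one cycle of ranks as a list, or None if there is no cycle.

    waits_for maps each blocked rank to the single rank it is waiting on.
    Ranks that are not blocked do not appear as keys.
    """
    for start in waits_for:
        seen = []
        rank = start
        # Follow the chain until it leaves the blocked set or repeats a rank.
        while rank in waits_for and rank not in seen:
            seen.append(rank)
            rank = waits_for[rank]
        if rank in seen:
            # The chain came back to a rank already visited: that is the loop.
            return seen[seen.index(rank):]
    return None


if __name__ == "__main__":
    # Ranks 0, 1 and 2 each wait to receive from the next rank in a ring.
    print(find_wait_cycle({0: 1, 1: 2, 2: 0}))
    # Rank 2 is not blocked, so the chain 0 -> 1 -> 2 ends without a loop.
    print(find_wait_cycle({0: 1, 1: 2}))
```

Comparing the result from two snapshots, taken before and after playing and interrupting the processes, mirrors the persistence check described above.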