Arm DDT has a powerful parallel memory debugging capability. This feature intercepts calls to the system memory allocation library, recording memory usage and confirming correct usage of the library by performing heap and bounds checking.
Typical problems which can be resolved by using Arm DDT with memory debugging enabled include:
- Memory exhaustion due to memory leaks can be prevented by examining the Current Memory Usage display, which groups and quantifies memory according to the location at which blocks have been allocated.
- Persistent but random crashes caused by access of memory beyond the bounds of an allocation block can be diagnosed by using the Guard Pages feature.
- Crashes caused by deallocating the same memory block twice, deallocating via invalid pointers, and other invalid deallocations (for example, deallocating a pointer that is not at the start of an allocation) can be detected and reported.
To enable memory debugging within Arm DDT, from the Run window click on the Memory Debugging checkbox.
The default options are usually sufficient, but you may need to configure extra options (described in the following sections) if you have a multithreaded application or multithreaded MPI, such as that found on systems using Open MPI with Infiniband, or a Cray XE6 system.
With the Memory Debugging setting enabled, start your application as normal. Arm DDT will take care of ensuring that the settings are propagated through your MPI or batch system when your application starts.
Arm DDT provides two options for debugging memory errors in CUDA programs, which are found in the CUDA section of the Run window. See section 14.2 Preparing to debug GPU code before debugging the memory of a CUDA application.
When the Track GPU allocations option is enabled, Arm DDT tracks CUDA memory allocations made by the host (that is, allocations made using functions such as cudaMalloc). You can find out how much memory is allocated and where it was allocated from in the Current Memory Usage window.
Allocations are tracked separately for each GPU and the host (enabling Track GPU allocations will automatically track host-only memory allocations made using functions such as malloc as well). You can select between GPUs using the drop-down list in the top-right corner of the Memory Usage and Memory Statistics windows.
The Detect invalid accesses (memcheck) option turns on the CUDA-MEMCHECK error detection tool, which can detect problems such as out-of-bounds and misaligned global memory accesses, and syscall errors, such as calling free() in a kernel on an already freed pointer.
The other CUDA hardware exceptions (such as a stack overflow) are detected regardless of whether this option is checked or not.
For further details about CUDA hardware exceptions, you should refer to NVIDIA's documentation.
While manual configuration is often unnecessary, it can be used to adjust the memory checks and protection, or to alter the information which is gathered. A summary of the settings is displayed on the Run dialog in the Memory Debugging section.
To examine or change the options, select the Details button adjacent to the Memory Debugging checkbox in the Run dialog, which then displays the Memory Debugging Options window.
The two most significant options are:
Preload the memory debugging library. When this is checked, Arm DDT will automatically load the memory debugging library. Arm DDT can only preload the memory debugging library when you start a program in Arm DDT and it uses shared libraries.
Preloading is not possible with statically-linked programs or when attaching to a running process. See section 12.3.1 Static linking for more information on static linking.
When attaching, you can set the DMALLOC_OPTIONS environment variable before running your program, or see section 12.3.3 Changing settings at run time below.
In the box showing C/Fortran, No Threads in the screenshot, choose the option that best matches your program. It is often sufficient to leave this set to C++/Threaded rather than continually changing this setting.
The Heap Debugging section allows you to trade speed for thoroughness. The two most important things to remember are:
- Even the fastest (leftmost) setting will catch trivial memory errors such as deallocating memory twice.
- The further right you go, the more slowly your program will execute. In practice, the Balanced setting is still fast enough to use and will catch almost all errors. If you come across a memory error that is difficult to pin down, choosing Thorough might expose the problem earlier, but you will need to be very patient for large, memory-intensive programs. See also 12.3.3 Changing settings at run time.
You can see exactly which checks are enabled for each setting in the Enabled Checks box. See section 12.3.2 Available checks for a complete list of available checks.
You can turn on Heap Overflow/Underflow Detection to detect out-of-bounds heap access. See section 12.4.4 Writing beyond an allocated area for more details.
Almost all users can leave the heap check interval at its default setting. It determines how often the memory debugging library will check the entire heap for consistency. This is a slow operation, so it is normally performed every 100 memory allocations. This figure can be changed manually. A higher setting (1000 or above) is recommended if your program allocates and deallocates memory very frequently, for example, inside a computation loop.
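To make the trade-off concrete, the following is a minimal sketch of the kind of computation loop described above, together with a rough estimate of how many full heap checks a given interval implies. The function names `sum_of_squares` and `estimated_heap_checks` are hypothetical, and the estimate simply assumes one full check per `interval` allocations, as the text states; the memory debugging library's exact counting may differ.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical workload: a scratch buffer is allocated and freed on
 * every iteration, so the allocation count grows with the loop count. */
double sum_of_squares(size_t iterations, size_t n)
{
    double total = 0.0;
    for (size_t it = 0; it < iterations; ++it) {
        double *scratch = malloc(n * sizeof *scratch);
        if (scratch == NULL)
            return -1.0;                 /* allocation failed */
        for (size_t i = 0; i < n; ++i) {
            scratch[i] = (double)i;
            total += scratch[i] * scratch[i];
        }
        free(scratch);
    }
    return total;
}

/* Rough estimate (an assumption, not the library's exact behavior):
 * one full heap check for every `interval` allocations. */
size_t estimated_heap_checks(size_t allocations, size_t interval)
{
    return interval == 0 ? 0 : allocations / interval;
}
```

Under this assumption, a loop that performs one million allocations triggers about 10,000 full heap checks at an interval of 100, but only about 1,000 at an interval of 1000.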
If your program runs particularly slowly with Memory Debugging enabled you may be able to get a modest speed increase by disabling the Store backtraces for memory allocations option. This disables stack backtraces in the View Pointer Details and Current Memory Usage windows, support for custom allocators and cumulative allocation totals.
It is possible to enable Memory Debugging for only selected MPI ranks by checking the Only enable for these processes option and entering the ranks which you want to enable it for.
The Memory Debugging library will still be loaded into the other processes, but no errors will be reported.
Click on OK to save these settings, or Cancel to undo your changes.
Choosing the wrong library to preload or the wrong number of bits may prevent Arm DDT from starting your job, or may make memory debugging unreliable. You should check these settings if you experience problems when memory debugging is enabled.
If your program is statically linked then you must explicitly link the memory debugging library with your program in order to use the Memory Debugging feature in Arm DDT.
To link with the memory debugging library, you must add the appropriate flags from the table below at the very beginning of the link command. This ensures that all instances of allocators, in both user code and libraries, are wrapped. Any definition of a memory allocator preceding the memory debugging link flags can cause partial wrapping, and unexpected runtime errors.
If in doubt, use libdmallocthcxx.a.
--undefined=malloc has the side effect of pulling in all libc-style allocator symbols from the library. --undefined works on a per-object-file level, rather than a per-symbol level, and the C++ and C allocator symbols are in different object files within the library archive. Therefore, you may also need to specify a C++ style allocator such as _ZdaPv below.
--undefined=_ZdaPv has the side effect of pulling in all C++ style allocator symbols. It is the C++ mangled name of operator delete[].
To link the correct library, use the full path to the static library. This is more reliable than using the -l argument of a compiler.
The following heap checks are available and may be enabled in the Enabled Checks box:
Detect invalid pointers passed to memory functions (malloc, free, ALLOCATE, DEALLOCATE, etc.)
Check the arguments of additional functions (mostly string operations) for invalid pointers.
Check for heap corruption, for example, due to writes to invalid memory addresses.
Check the end of an allocation has not been overwritten when it is freed.
Initialize the bytes of new allocations with a known value.
Overwrite the bytes of freed memory with a known value.
Check to see if space that was blanked when a pointer was allocated or when it was freed has been overwritten. Enables alloc-blank and free-blank.
Always copy data to a new pointer when reallocating a memory allocation (for example, due to realloc).
Protect freed memory where possible (using hardware memory protection) so subsequent read/writes cause a fatal error.
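The check that always copies data to a new pointer on reallocation is aimed at code that keeps using the old pointer value after a realloc call. As a minimal sketch of the pattern that stays correct whether or not the block moves, consider the hypothetical helper `grow_ints` below; it is an illustration only and not part of Arm DDT.

```c
#include <stdlib.h>

/* Grow an int array to new_count elements (new_count must be > 0).
 * Returns 1 on success. On failure, *arr is left unchanged and is
 * still valid, so the caller can free it or carry on using it. */
int grow_ints(int **arr, size_t new_count)
{
    int *moved = realloc(*arr, new_count * sizeof **arr);
    if (moved == NULL)
        return 0;
    /* From here on, only the returned pointer may be used. Any copy
     * of the old pointer value held elsewhere is now dangling. */
    *arr = moved;
    return 1;
}
```

Code that stores a second copy of the old pointer and dereferences it after the call may appear to work when realloc happens to extend the block in place. Forcing a copy on every reallocation makes that mistake show up consistently.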
You can change most Memory Debugging settings while your program is running by selecting the Control → Memory Debugging Options menu item. In this way you can enable Memory Debugging with a minimal set of options when your program starts, set a breakpoint at a place you want to investigate for memory errors, then turn on more options when the breakpoint is hit.
Once you have enabled memory debugging and started debugging, all calls to the allocation and deallocation routines of heap memory will be intercepted and monitored. This allows both for automatic monitoring for errors, and for user driven inspection of pointers.
If the memory debugging library reports an error, Arm DDT will display a window similar to the one shown below. This briefly reports the type of error detected and gives the option of continuing to play the program, or pausing execution.
If you choose to pause the program then Arm DDT will highlight the line of your code which was being executed when the error was reported.
Often this is enough to debug simple memory errors, such as freeing or dereferencing an unallocated variable, iterating past the end of an array and so on, as the local variables and variables on the current line will provide insight into what is happening.
If the cause of the issue is still not clear, then it is possible to examine some of the pointers referenced to see whether they are valid and which line they were allocated on, as is explained in the following sections.
Any of the variables or expressions in the Evaluate window can be right-clicked on to bring up a menu. If memory debugging is enabled, View Pointer Details will be available. This will display the amount of memory allocated to the pointer and which part of your code originally allocated and deallocated that memory:
Clicking on any of the stack frames displays the relevant section of your code, so that you can see where the variable was allocated or deallocated.
Only a single stack frame will be displayed if the Store stack backtraces for memory allocations option is disabled.
This feature can also be used to check the validity of heap-allocated memory.
Memory allocated on the heap refers to memory allocated by malloc, ALLOCATE, new and so on. A pointer may also point to a local variable, in which case Arm DDT will tell you it does not point to data on the heap. This can be useful, since a common error is taking a pointer to a local variable that later goes out of scope.
This is particularly useful for checking function arguments, and key variables when things seem to be going awry. Of course, just because memory is valid does not mean it is the same type as you were expecting, or of the same size and dimensions, and so on.
As well as invalid addresses, Arm DDT can often indicate the type and location of the memory being pointed to. The different types are listed here:
- Null pointer.
- Valid heap allocation.
- Fence-post area before the beginning of an allocation.
- Fence-post area beyond the end of an allocation.
- Freed heap allocation.
- Fence-post area before the beginning of a freed allocation.
- Fence-post area beyond the end of a freed allocation.
- A valid GPU heap allocation.
- An address on the stack.
- The program's code section (or a shared library).
- The program's data section (or a shared library).
- The program's bss section or Fortran COMMON block (or a shared library).
- The program's executable (or a shared library).
- A memory mapped file.
- High Bandwidth Memory.
Arm DDT may only be able to identify certain memory types with higher levels of memory debugging enabled. See 12.3 Configuration for more information.
For more information on fencepost checking, see 12.4.5 Fencepost checking.
Enabling memory debugging has an impact on the Cross-Process Comparison and Cross-Thread Comparison windows. See 8.16 Cross-process and cross-thread comparison.
If you are evaluating a pointer variable then the Cross-Process Comparison window shows a column with the location of the pointer.
Pointers to locations in heap memory are highlighted in green. Dangling pointers, that is pointers to locations in heap memory that have been deallocated, are shown in red.
The Cross-Process Comparison of pointers helps you to identify:
- Processes with different addresses for the same pointer.
- The location of a pointer (heap, stack, .bss, .data, .text or other locations).
- Processes that have freed a pointer while other processes have not, null pointers, and so on.
If the Cross-Process Comparison shows the value of what is being pointed at when the value of the pointer itself is wanted, then modify the pointer expression. For example, if you see the string that a char* pointer is pointing at when you actually want information concerning the pointer itself, then add (void *) to the beginning of the pointer expression.
Use the Heap Overflow / Underflow Detection option to detect reads and writes beyond or before an allocated block. Any attempts to read or write to the specified number of pages before or after the block will cause a segmentation violation which stops your program.
Add the guard pages after the block to detect heap overflows, or before to detect heap underflows. The default value of one page will catch most heap overflow errors, but if this does not work a good rule of thumb is to set the number of guard pages according to the size of a row in your largest array.
The exact size of a memory page depends on your operating system, but a typical size is 4 KiB. In this case, if a row of your largest array is 64 KiB, then set the number of pages to 64/4 = 16. Note that small overflows/underflows (for example, of less than 16 bytes) may not be detected by guard pages. This is a result of maintaining correct memory alignment; without this alignment, vectorized code may crash or generate false positives. To detect small overflows or underflows, enable fencepost checking (see section 12.4.5 Fencepost checking). Note that with fencepost checking your program is not stopped at the exact location at which it wrote beyond the allocated data; it stops only at the next heap consistency check.
On systems with larger page sizes (for example, 2 MB or 1 GB) guard pages should be disabled or used with care, as at least two pages will be used per allocation. On most systems you can check the page size with getconf PAGESIZE.
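The rule of thumb above is a simple rounding-up division. As a minimal sketch, assuming a POSIX system where sysconf(_SC_PAGESIZE) reports the page size, the hypothetical helper `guard_pages_for_row` below computes how many guard pages cover one row of an array.

```c
#include <stddef.h>
#include <unistd.h>

/* Page size as reported by the operating system (POSIX sysconf). */
long system_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Number of whole pages needed to cover row_bytes, rounding up. */
size_t guard_pages_for_row(size_t row_bytes, size_t page_bytes)
{
    if (page_bytes == 0)
        return 0;
    return (row_bytes + page_bytes - 1) / page_bytes;
}
```

For a 64 KiB row and 4 KiB pages, `guard_pages_for_row(64 * 1024, 4096)` gives 16, matching the figure in the text. With 2 MB pages the same row needs only one guard page, but each guard page then costs 2 MB per allocation.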
Arm DDT will also perform fencepost checking whenever the Heap Debugging setting is not set to Fast.
In this mode, an extra portion of memory is allocated at the start and/or end of your allocated block, and a pattern is written into this area.
If your program attempts to write beyond your data, say by a few elements, then this will be noticed by Arm DDT. However, your program will not be stopped at the exact location at which it wrote beyond the allocated data; it will only be stopped at the next heap consistency check.
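The idea behind fencepost checking can be illustrated with a toy allocator. This is a sketch of the general technique only, not the memory debugging library's implementation: the names `fenced_alloc`, `fenced_data`, `fences_intact` and `fenced_free`, the 16-byte fence size and the 0xA5 pattern are all assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define FENCE_BYTES   16     /* assumed fence size, for illustration */
#define FENCE_PATTERN 0xA5   /* assumed fill pattern, for illustration */

typedef struct {
    unsigned char *raw;      /* start of the real allocation */
    size_t size;             /* bytes requested by the user */
} fenced_block;

/* Allocate size bytes with a patterned fence before and after. */
int fenced_alloc(fenced_block *b, size_t size)
{
    b->raw = malloc(size + 2 * FENCE_BYTES);
    if (b->raw == NULL)
        return 0;
    b->size = size;
    memset(b->raw, FENCE_PATTERN, FENCE_BYTES);
    memset(b->raw + FENCE_BYTES + size, FENCE_PATTERN, FENCE_BYTES);
    return 1;
}

/* Pointer to the user-visible part of the block. */
unsigned char *fenced_data(fenced_block *b)
{
    return b->raw + FENCE_BYTES;
}

/* Consistency check: returns 1 if both fences still hold the pattern. */
int fences_intact(const fenced_block *b)
{
    for (size_t i = 0; i < FENCE_BYTES; ++i) {
        if (b->raw[i] != FENCE_PATTERN)
            return 0;
        if (b->raw[FENCE_BYTES + b->size + i] != FENCE_PATTERN)
            return 0;
    }
    return 1;
}

void fenced_free(fenced_block *b)
{
    free(b->raw);
    b->raw = NULL;
    b->size = 0;
}
```

A write one byte past the end of the user data lands in the trailing fence and changes the pattern, but nothing happens until `fences_intact` is next called. That is why this style of checking reports the error at the next consistency check rather than at the faulty write.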
If Arm DDT stops at an error but you wish to ignore it (for example, it may be in a third party library which you cannot fix) then you may check Suppress memory errors from this line in future. This will open the Suppress Memory Errors window. Here you may select which function you want to suppress errors from.
Memory leaks can be a significant problem for software developers. If your application's memory usage grows faster than expected, or continues to grow through its execution, then it is possible that memory is being allocated which is not being freed when it is no longer required.
This type of problem is typically difficult to diagnose, and particularly so in a parallel environment, but Arm DDT is able to make this task simple.
At any point in your program you can go to Tools → Current Memory Usage and Arm DDT will then display the currently allocated memory in your program for the currently selected group. For larger process groups, the processes displayed will be the ones that are using the most memory across that process group.
To view graphical representations of memory usage, select the Memory Usage tab.
The pie chart gives an at-a-glance comparison of the total memory allocated to each process. This gives an indication of the balance of memory allocations. Any one process taking an unusually large amount of memory is identifiable here.
The stacked bar chart on the right is where the most interesting information starts. Each process is represented by a bar, and each bar is broken down into blocks of color that represent the total amount of memory allocated by a particular function in your code. Say your program contains a loop that allocates a hundred bytes that are never freed. That is not a lot of memory. But if that loop is executed ten million times, you are looking at a gigabyte of memory being leaked! There are 6 blocks in total. The first 5 represent the 5 functions that allocated the most memory, and the 6th (at the top) represents the rest of the allocated memory, wherever it is from.
As you can see, large allocations show up as large blocks of color. If your program is close to the end of its run, or these blocks continue to grow, then they are likely to be severe memory leaks.
Typically, if the memory leak does not make it into the top 5 allocations under any circumstances then it may not be significant. If you are still concerned you can view the data in the Table View yourself.
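The hundred-byte loop scenario described above can be written down directly. The sketch below is deliberately defective code for illustration: the hypothetical function `leaky_loop` never frees its buffer, and returns the number of bytes it has leaked so the arithmetic is easy to follow.

```c
#include <stddef.h>
#include <stdlib.h>

#define LEAK_BYTES 100   /* bytes allocated per iteration */

/* DELIBERATE BUG for illustration: buf is never freed, so every
 * iteration leaks LEAK_BYTES bytes. Returns the total leaked. */
size_t leaky_loop(size_t iterations)
{
    size_t leaked = 0;
    for (size_t i = 0; i < iterations; ++i) {
        char *buf = malloc(LEAK_BYTES);
        if (buf == NULL)
            break;               /* out of memory: stop early */
        buf[0] = (char)i;        /* pretend to use the buffer */
        leaked += LEAK_BYTES;
        /* missing: free(buf); */
    }
    return leaked;
}
```

At ten million iterations the total is 100 × 10,000,000 = 1,000,000,000 bytes. In the bar chart, all of that memory would be attributed to the function containing the loop, which is why such a leak tends to appear among the largest blocks.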
For more information about a block of color, click on the block. This displays detailed information about the memory allocations comprising it in the bottom-left pane. Scanning down this list gives you a good idea of what size allocations were made, how many, where from and if the allocation resides in High Bandwidth Memory. Double-clicking on any one of these will display the Pointer Details view described above, showing you exactly where that pointer was allocated in your code.
Only a single stack frame will be displayed if the Store stack backtraces for memory allocations option is disabled.
To view the current memory usage in a tabular format, select the Allocation Table tab.
The table is split into five columns:
- Allocated by: Code location of the stack frame or function allocating memory in your program.
- Count: Number of allocations called directly from this location.
- Total Size: Total size (in bytes) of allocations directly from this location.
- Count (including called functions): Number of allocations from this location. This includes any allocations called indirectly, for example, by calling other functions.
- Total Size (including called functions): Total size (in bytes) of allocations from this location, including indirect allocations.
For example, if func1 calls func2, which calls malloc to allocate 50 bytes, Arm DDT will report an allocation of 50 bytes against func2 in the Total Size column of the Current Memory Usage table. Arm DDT will also record a cumulative allocation of 50 bytes against both functions func1 and func2 in the Total Size (including called functions) column of the table.
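The difference between the two Total Size columns can be modeled with a toy bookkeeping scheme. This is only a sketch of the accounting idea, using a hand-maintained call stack; the names `tracked_malloc`, `self_bytes` and `inclusive_bytes` are hypothetical, and Arm DDT derives the same information from real stack backtraces instead.

```c
#include <stddef.h>
#include <stdlib.h>

enum { FUNC1, FUNC2, NUM_FUNCS };

static size_t self_bytes[NUM_FUNCS];       /* "Total Size" */
static size_t inclusive_bytes[NUM_FUNCS];  /* "Total Size (including called functions)" */
static int call_stack[8];                  /* toy call stack of function ids */
static int depth = 0;

/* Charge the allocation to the innermost function directly, and to
 * every function currently on the call stack cumulatively. */
static void *tracked_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL && depth > 0) {
        self_bytes[call_stack[depth - 1]] += size;
        for (int i = 0; i < depth; ++i)
            inclusive_bytes[call_stack[i]] += size;
    }
    return p;
}

void *func2(void)
{
    call_stack[depth++] = FUNC2;
    void *p = tracked_malloc(50);
    --depth;
    return p;
}

void *func1(void)
{
    call_stack[depth++] = FUNC1;
    void *p = func2();
    --depth;
    return p;
}
```

After one call to `func1()`, the direct total is 50 bytes for func2 and 0 for func1, while the cumulative total is 50 bytes for both, as in the example in the text.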
Another valuable use of this feature is to play the program for a while, refresh the window, play it for a bit longer, refresh the window and so on. If you pick the points at which to refresh, for example, after units of work are complete, you can watch as the memory load of the different processes in your job fluctuates and you will see any areas which continue to grow. These are problematic leaks.
Some compilers wrap memory allocations inside many other functions. In this case Arm DDT may find, for example, that all Fortran 90 allocations are inside the same routine. This can also happen if you have written your own wrapper for memory allocation functions.
In these circumstances you will see one large block in the Current Memory Usage view. You can mark such functions as Custom Allocators to exclude them from the bar chart and table by right-clicking on the function and selecting the Add Custom Allocator menu item. Memory allocated by a custom allocator is recorded against its caller instead.
For example, if myfunc calls mymalloc and mymalloc is marked as a custom allocator, then the allocation will be recorded against myfunc instead. You can edit the list of custom allocators by clicking the "Edit Custom Allocators…" button at the bottom of the window.
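A wrapper of the kind described might look like the following sketch. The bodies of `mymalloc` and `myfunc` are assumptions made for illustration; only the names come from the example above.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper: every allocation in the program goes through
 * this function, so without a custom allocator entry all memory would
 * be recorded against mymalloc. */
void *mymalloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL && size != 0) {
        fprintf(stderr, "mymalloc: out of memory (%zu bytes)\n", size);
        abort();
    }
    return p;
}

/* Caller of the wrapper: with mymalloc marked as a custom allocator,
 * this is the function the allocation is recorded against. */
double *myfunc(size_t n)
{
    double *v = mymalloc(n * sizeof *v);
    for (size_t i = 0; i < n; ++i)
        v[i] = 0.0;
    return v;
}
```

In a program where many different routines call `mymalloc`, marking it as a custom allocator splits the single large block back into one block per calling routine.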
The Memory Statistics view (Tools → Overall memory Statistics) shows a total of memory usage across the processes in an application. The processes using the most memory are displayed, along with the mean across all processes in the current group, which is useful for larger process counts.
The contents and location of the memory allocations themselves are not repeated here. Instead this window displays the total amount of memory allocated and freed since the program began, the current number of allocated bytes and the number of calls to allocation and free routines.
These can help show if your application is unbalanced, if particular processes are allocating or failing to free memory and so on. At the end of program execution you can usually expect the total number of calls per process to be similar (depending on how your program divides up work), and memory allocation calls should always be greater than deallocation calls. Anything else indicates serious problems.
If your application is using High Bandwidth Memory, the charts and tables in this dialog will be broken down into each type of memory in use.