G General troubleshooting

If you have problems with any of the Arm Performance Reports products, please read this section carefully.

Additionally, check the support pages on the Arm Developer website, and make sure you have the latest version of the product.

G.1 Starting a program

G.1.1 Problems starting scalar programs

There are a number of possible sources for problems. The most common for users with a multi-process license is that the Run Without MPI Support check box has not been checked.

If the software reports a problem with MPI and you know your program is not using MPI, then this is usually the cause.

If you have checked this box and the software still mentions MPI, please contact Arm support.

Other potential problems are:

  • A previous session is still running, or has not released resources required for the new session. Usually this can be resolved by killing stale processes. The most obvious symptom of this is a delay of approximately 60 seconds and a message stating that not all processes connected. You may also see a QServerSocket message in the terminal.
  • The target program does not exist or is not executable.
  • Arm Performance Reports products' backend daemon, ddt-debugger, is missing from the bin directory. In this case you should check your installation, and contact Arm support for further assistance.
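
The second check in the list above can be done by hand before starting a session. The following is a minimal sketch; check_target is a hypothetical helper written for this example and is not part of Arm Performance Reports.

```shell
# Hypothetical helper (not shipped with the product): report why a target
# program cannot be started.
check_target() {
    if [ ! -e "$1" ]; then
        echo "missing: $1"
        return 1
    fi
    if [ ! -f "$1" ] || [ ! -x "$1" ]; then
        echo "not executable: $1"
        return 1
    fi
    echo "ok: $1"
}

# Example usage:
#   check_target ./a.out
#   ps -u "$USER" -o pid,etime,args   # look for stale processes from an earlier session
```

The ps line only lists your own processes with their elapsed time; deciding which of them are stale and should be killed is left to you.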

G.1.2 Problems starting multi-process programs

If you encounter problems while starting an MPI program, first establish that it is possible to run a single-process (non-MPI) program such as a trivial "Hello, World!", and resolve any issues that arise. Then attempt to run a multi-process job; the symptoms will often allow a reasonable diagnosis to be made.

In the first instance verify that MPI is working correctly by running a job, without Arm Performance Reports products applied, such as the example in the examples directory.

   mpirun -np 8 ./a.out

Verify that mpirun is in the PATH, or the environment variable ALLINEA_MPIRUN is set to the full pathname of mpirun.
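
As a rough sketch of this check, the hypothetical helper below prints the mpirun that would be found. It assumes ALLINEA_MPIRUN takes precedence over PATH when both are available; the text above does not state the order, so treat that as an assumption.

```shell
# Hypothetical helper: print the mpirun to use, or explain why none was found.
# Assumption: ALLINEA_MPIRUN, when set, is preferred over a PATH lookup.
find_mpirun() {
    if [ -n "${ALLINEA_MPIRUN:-}" ]; then
        if [ -x "$ALLINEA_MPIRUN" ]; then
            echo "$ALLINEA_MPIRUN"
            return 0
        fi
        echo "ALLINEA_MPIRUN is set but not executable: $ALLINEA_MPIRUN" >&2
        return 1
    fi
    command -v mpirun || {
        echo "mpirun not found in PATH and ALLINEA_MPIRUN is not set" >&2
        return 1
    }
}

# Example usage:
#   find_mpirun && mpirun -np 8 ./a.out
```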

Sometimes problems are caused by environment variables not propagating to the remote nodes while starting a job. The solution to these problems depends on the MPI implementation that is being used.

In the simplest case, for rsh-based systems such as a default MPICH 1 installation, correct configuration can be verified by rsh-ing to a node and examining the environment. It is worthwhile rsh-ing with the env command to the node, as this will not see any environment variables set inside the .profile file. For example, if your nodes use a .profile instead of a .bashrc for each user, then you may well see different output when running rsh node env than when you run rsh node and then run env inside the new shell.
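
One way to make that comparison concrete is to save both outputs and list the variables that appear only in the interactive shell. This is a sketch only: env_missing and the two file names are hypothetical, and it assumes every variable fits on a single NAME=value line.

```shell
# Hypothetical helper: print the names of variables that are present in the
# first env dump but absent from the second. Each file holds output of `env`.
env_missing() {
    awk -F= '
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # names in the second dump
        /^[A-Za-z_][A-Za-z0-9_]*=/ && !($1 in seen) { print $1 }
    ' "$2" "$1" | sort
}

# Example usage (node is a placeholder host name):
#   rsh node env > noninteractive.txt
#   rsh node            # then, inside the new shell: env > interactive.txt
#   env_missing interactive.txt noninteractive.txt
```

Any name printed is set only in the interactive shell, so it is a candidate for a variable that is not reaching your job.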

If only one, or very few, processes connect, it may be because you have not chosen the correct MPI implementation. Please examine the list and look carefully at the options. Should no other suitable MPI be found, please contact Arm support.

If a large number of processes are reported by the status bar to have connected, then it is possible that some have failed to start due to resource exhaustion, timing out, or, unusually, an unexplained crash. You should verify again that MPI is still working, as some MPI distributions do not release all semaphore resources correctly, for example MPICH 1 on Redhat with SMP support built in.

To check for time-out problems, set the ALLINEA_NO_TIMEOUT environment variable to 1 before launching the GUI and see if further progress is made. This is not a solution, but aids the diagnosis. If all processes now start, please contact Arm support for further advice.

G.1.3 No shared home directory

If your home directory is not accessible by all the nodes in your cluster then your jobs may fail to start.

To resolve the problem open the file ~/.allinea/system.config in a text editor. Change the shared directory option in the [startup] section so it points to a directory that is available and shared by all the nodes. If no such directory exists, change the use session cookies option to no instead.

G.2 Performance Reports specific issues

G.2.1 My compiler is inlining functions

While compilers may inline functions, their ability to include sufficient information to reconstruct the original call tree varies between vendors. Arm has found that the following flags work best:

  • Intel: -g -O3 -fno-inline-functions
  • Intel 17+: -g -fno-inline -no-ip -no-ipo -fno-omit-frame-pointer -O3
  • PGI: -g -O3 -Meh_frame
  • GNU: -g -O3 -fno-inline
  • Cray: -G2 -O3 -h ipa0


Some compilers may still inline functions even when explicitly asked not to.

There is typically a small performance penalty for disabling function inlining or enabling profiling information.
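
To keep these flags in one place in a build script, a small lookup like the sketch below may help. The function name and the family keys (intel, intel17, pgi, gnu, cray) are invented for this example; the flag strings are the ones listed above.

```shell
# Hypothetical helper: print the flags listed above for a compiler family.
profiling_flags() {
    case $1 in
        intel)   echo "-g -O3 -fno-inline-functions" ;;
        intel17) echo "-g -fno-inline -no-ip -no-ipo -fno-omit-frame-pointer -O3" ;;
        pgi)     echo "-g -O3 -Meh_frame" ;;
        gnu)     echo "-g -O3 -fno-inline" ;;
        cray)    echo "-G2 -O3 -h ipa0" ;;
        *)       echo "unknown compiler family: $1" >&2; return 1 ;;
    esac
}

# Example usage with the GNU compiler (wave.c is a placeholder source file):
#   gcc $(profiling_flags gnu) -o wave wave.c
```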

Alternatively, you can let the compiler inline the functions and compile with -g -O3, or -g -O5, or whatever your preferred performance flags are.

Arm Performance Reports will work correctly, but you will often see time inside an inlined function being attributed to its parent in the Stacks view. The Source Code view will be largely unaffected.

Arm Performance Reports will not be affected by function inlining.

G.2.2 Tail recursion optimization

A function may return the result of calling another function, for example:

   int someFunction()
   {
       return otherFunction();
   }

In this case the compiler may change the call to otherFunction into a jump. This means that, when inside otherFunction, the calling function, someFunction, no longer appears on the stack.

This optimization is called tail recursion optimization or, more generally, tail call optimization. It may be disabled for the GNU C compiler by passing the -fno-optimize-sibling-calls argument to gcc.

G.2.3 MPI wrapper libraries

Arm Performance Reports wraps MPI calls in a custom shared library. One is built for your system each time you run Arm Performance Reports.

If this does not work, please contact Arm support.

You can also try setting MPICC directly:

   $ MPICC=my-mpicc-command bin/perf-report --np=16 ./wave_c

G.2.4 Thread support limitations

Performance Reports provides limited support for programs when threading support is set to MPI_THREAD_SERIALIZED or MPI_THREAD_MULTIPLE in the call to MPI_Init_thread.

MPI activity on non-main threads will contribute towards the MPI-time of the program, but not the more detailed MPI metrics.

MPI activity on a non-main thread may result in additional profiling overhead due to the mechanism employed by Performance Reports for detecting MPI activity.

Warnings are displayed when the user initiates and completes profiling a program which sets MPI_THREAD_SERIALIZED or MPI_THREAD_MULTIPLE as the required thread support.

Performance Reports does support calling MPI_Init_thread with either MPI_THREAD_SINGLE or MPI_THREAD_FUNNELED specified as the required thread support.

Note that the requirements the MPI specification makes on programs using MPI_THREAD_FUNNELED are the same as those made by Performance Reports: all MPI calls must be made on the thread that called MPI_Init_thread.

In many cases, multi-threaded MPI programs can be refactored such that they comply with this restriction.

G.2.5 No thread activity while blocking on an MPI call

Unfortunately Arm Performance Reports is currently unable to record thread activity on a process where a long duration MPI call is in progress.

If you have an MPI call that takes a significant amount of time (multiple samples) to complete then Arm Performance Reports will record no thread activity for the process executing that call for most of that MPI call's duration.

G.2.6 I'm not getting enough samples

By default sampling takes place every 20ms initially, but if you get warnings about too few samples on a fast run, or want more detail in the results, you can change that.

To increase the frequency of sampling to every 10ms set environment variable ALLINEA_SAMPLER_INTERVAL=10.

Note that the sampling frequency is automatically decreased over time to ensure a manageable amount of data is collected whatever the length of the run.

Increasing the sampling frequency is not recommended if there are lots of threads or there are deep stacks in the target program as this may not leave sufficient time to complete one sample before the next sample is started.
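
A minimal sketch of setting the interval for a single run is shown below. The wrapper name is hypothetical; it only validates its argument and sets ALLINEA_SAMPLER_INTERVAL in the environment of the command it starts.

```shell
# Hypothetical wrapper: run a command with ALLINEA_SAMPLER_INTERVAL set to
# the given number of milliseconds, rejecting anything that is not a whole
# number.
with_sampler_interval() {
    case $1 in
        ''|*[!0-9]*)
            echo "interval must be a whole number of milliseconds" >&2
            return 2
            ;;
    esac
    sampler_ms=$1
    shift
    env ALLINEA_SAMPLER_INTERVAL="$sampler_ms" "$@"
}

# Example usage:
#   with_sampler_interval 10 perf-report --np=16 ./wave_c
```

Setting the variable per command rather than exporting it in your shell keeps later runs at the default interval.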

G.2.7 Performance Reports is reporting time spent in a function definition

Any overheads involved in setting up a function call (pushing arguments to the stack and so on) are usually assigned to the function definition.

Some compilers may assign them to the opening brace '{' and closing brace '}' instead. If this function has been inlined, the situation becomes further complicated and any setup time (for example, allocating space for arrays) is often assigned to the definition line of the enclosing function.

G.2.8 Performance Reports is not correctly identifying vectorized instructions

The instructions identified as vectorized (packed) are enumerated below. Arm also identifies the AVX-2 variants of these instructions (with a "v" prefix).

Contact Arm support if you believe your code contains vectorized instructions that have not been listed and are not being identified in the CPU floating-point/integer vector metrics.

Packed floating-point instructions: addpd addps addsubpd addsubps andnpd andnps andpd andps divpd divps dppd dpps haddpd haddps hsubpd hsubps maxpd maxps minpd minps mulpd mulps rcpps rsqrtps sqrtpd sqrtps subpd subps

Packed integer instructions: mpsadbw pabsb pabsd pabsw paddb paddd paddq paddsb paddsw paddusb paddusw paddw palignr pavgb pavgw phaddd phaddsw phaddw phminposuw phsubd phsubsw phsubw pmaddubsw pmaddwd pmaxsb pmaxsd pmaxsw pmaxub pmaxud pmaxuw pminsb pminsd pminsw pminub pminud pminuw pmuldq pmulhrsw pmulhuw pmulhw pmulld pmullw pmuludq pshufb pshufw psignb psignd psignw pslld psllq psllw psrad psraw psrld psrlq psrlw psubb psubd psubq psubsb psubsw psubusb psubusw psubw
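
To check by hand whether a binary contains any of the packed floating-point instructions listed above, you can search its disassembly. The sketch below is not how Arm Performance Reports performs its analysis; count_packed_fp is a hypothetical helper that counts disassembly lines containing one of the listed mnemonics, with or without the "v" prefix.

```shell
# Hypothetical helper: read disassembly text on stdin and print the number of
# lines containing a packed floating-point mnemonic from the list above.
count_packed_fp() {
    fp='addpd|addps|addsubpd|addsubps|andnpd|andnps|andpd|andps|divpd|divps'
    fp="$fp|dppd|dpps|haddpd|haddps|hsubpd|hsubps|maxpd|maxps|minpd|minps"
    fp="$fp|mulpd|mulps|rcpps|rsqrtps|sqrtpd|sqrtps|subpd|subps"
    # grep -c exits non-zero when the count is 0, so ignore its exit status.
    grep -E -c "(^|[[:space:]])v?($fp)([[:space:]]|\$)" || true
}

# Example usage:
#   objdump -d ./a.out | count_packed_fp
```

A count of zero suggests the compiler did not emit any of these instructions, which is worth ruling out before reporting a problem with the vector metrics.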

G.2.9 Performance Reports takes a long time to gather and analyze my OpenBLAS-linked application

OpenBLAS versions 0.2.8 and earlier incorrectly stripped symbols from the .symtab section of the library, causing binary analysis tools such as Arm Performance Reports and objdump to see invalid function lengths and addresses.

This causes Arm Performance Reports to take an extremely long time disassembling and analyzing apparently overlapping functions containing millions of instructions.

A fix for this was accepted into the OpenBLAS codebase on October 8th 2013 and versions 0.2.9 and above should not be affected.

To work around this problem without updating OpenBLAS, run "strip libopenblas*.so". This removes the incomplete .symtab section without affecting the operation or linkage of the library.

G.2.10 Performance Reports over-reports MPI, I/O, accelerator or synchronization time

Arm Performance Reports employs a heuristic to determine which function calls should be considered as MPI operations.

If your code defines any function that starts with MPI_ (case insensitive), those functions will be treated as part of the MPI library, resulting in the time spent in MPI calls being over-reported.

Starting your function names with the prefix MPI_ should be avoided and is in fact explicitly forbidden by the MPI specification (page 19, sections 2.6.2 and 2.6.3 of the MPI 3 specification document):

All MPI names have an MPI_ prefix, and all characters are capitals. Programs must not declare names, for example, for variables, subroutines, functions, parameters, derived types, abstract interfaces, or modules, beginning with the prefix MPI_.

Similarly Arm Performance Reports categorizes I/O functions and accelerator functions by name.

Other prefixes to avoid starting your function names with include PMPI_, _PMI_, OMPI_, omp_, GOMP_, shmem_, cuda_, __cuda, cu[A-Z][a-z] and allinea_. All of these prefixes are case-insensitive.

Also avoid naming a function start_pes or any name also used by a standard I/O or synchronization function (write, open, pthread_join, sem_wait, and so on).
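
The prefix rules above can be checked mechanically. The sketch below uses a hypothetical helper, has_reserved_prefix, which is not part of the product. It compares the literal prefixes case-insensitively and, as an assumption, applies the cu[A-Z][a-z] pattern exactly as written. It does not check for start_pes or the standard I/O and synchronization names.

```shell
# Hypothetical helper: succeed if a function name starts with one of the
# prefixes listed above.
has_reserved_prefix() {
    lower_name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case $lower_name in
        mpi_*|pmpi_*|_pmi_*|ompi_*|omp_*|gomp_*|shmem_*|cuda_*|__cuda*|allinea_*)
            return 0
            ;;
    esac
    # Assumption: the cu[A-Z][a-z] pattern is matched case-sensitively.
    case $1 in
        cu[[:upper:]][[:lower:]]*) return 0 ;;
    esac
    return 1
}

# Example usage:
#   has_reserved_prefix MPI_my_helper && echo "rename this function"
```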

G.3 Obtaining support

If this guide has not helped you, then the most effective way to get support is to email Arm support with a detailed report.

If possible, you should obtain a log file for the problem and email this to Arm support.

You can generate a log file by starting Performance Reports with the --debug and --log arguments:

$ perf-report --debug --log=<log>

Where <log> is the name of the log file to generate.

Then simply reproduce the problem using as few processors as possible.

On some systems this log file might be quite large. If this is the case, please compress it using a program such as gzip or bzip2 before attaching it to your email.

If your problem can only be replicated on large process counts, then please omit the --debug argument as this will generate very large log files.
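
The compression step can be scripted. This is a sketch under stated assumptions: compress_if_large is a hypothetical helper, the size threshold is your own choice, and gzip must be installed.

```shell
# Hypothetical helper: gzip a log file if it is larger than the given number
# of bytes, and print the name of the file to attach.
compress_if_large() {
    [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
    log_size=$(( $(wc -c < "$1") ))
    if [ "$log_size" -gt "$2" ]; then
        gzip -f "$1" && echo "$1.gz"
    else
        echo "$1"
    fi
}

# Example usage (1048576 bytes = 1 MiB; the log name is a placeholder):
#   compress_if_large perf-report.log 1048576
```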

G.4 Arm IPMI Energy Agent

The Arm IPMI Energy Agent allows Arm MAP and Arm Performance Reports to measure the total energy consumed by the compute nodes in a job in conjunction with the Arm Advanced Metrics Pack add-on.

The IPMI Energy Agent is available as a separate download from our website.

G.4.1 Requirements

  • The compute nodes must support IPMI.
  • The compute nodes must have an IPMI exposed power sensor.
  • The compute nodes must have an OpenIPMI compatible kernel module installed, such as ipmi_devintf.
  • The compute nodes must have the corresponding device node in /dev, for example /dev/ipmi0.
  • The compute nodes must run a supported operating system.
  • The IPMI Energy Agent must be run as root.

To list the names of possible IPMI power sensors on a compute node use the following command:

    ipmitool sdr | grep 'Watts'
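
Two of the requirements above, the device node and root access, can be checked on a compute node with a short script. The helper names below are hypothetical, and the default path /dev/ipmi0 is the example given in the requirements list.

```shell
# Hypothetical helpers for checking a compute node before starting the agent.

# Succeed if the given path (default /dev/ipmi0) is a character device.
ipmi_device_present() {
    [ -c "${1:-/dev/ipmi0}" ]
}

# Succeed if the current user is root.
running_as_root() {
    [ "$(id -u)" -eq 0 ]
}

# Example usage:
#   ipmi_device_present || echo "no IPMI device node found"
#   running_as_root || echo "the IPMI Energy Agent must be run as root"
```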