This list of FAQs is intended to support the use of Streamline in Arm DS-5 Development Studio. If you can't find the answer you're looking for here, consider posting your question in the Arm Connected Community or contacting Arm Technical Support.
Which targets are supported by Streamline?
Streamline is supported on all Armv7-A and Armv8-A architecture cores, and tested on Arm Cortex-A cores.
How do I determine if Gator is running?
Enter the command:
ps ax | grep gatord
If the output lists no gatord process (other than the grep command itself, which usually matches its own command line), gatord is not running.
How do I run Gator on the target?
See details in README.md located in /ds-5-install-directory/arm/gator/
How do I use the Android 'adb' to forward Gator data on port 8080 over USB?
If you are using the Android Debug Bridge, you may forward Gator data with the command:
adb forward tcp:8080 tcp:8080
Use 'localhost' as the connection address in Streamline.
How do I profile kernel modules?
To profile kernel modules, the modules must be built with debug symbols. The simplest way is to configure the kernel with "Compile the kernel with debug info" enabled (CONFIG_DEBUG_INFO=y), which adds debug symbols to both the kernel and its modules. The kernel itself does not have to be compiled with debug enabled in order to profile a module: you can build the kernel with debug disabled, then build only the module with debug enabled. Otherwise, profiling a kernel module is no different from profiling a regular application: include the module in the 'Program Image' list in the 'Capture Options' dialog.
Why am I unable to process or open a report, or the UI is sluggish?
Please consult the release notes to confirm that your machine meets the minimum specification required to run Streamline. You may not have enough memory allocated to Streamline. Please try launching Eclipse with the launch parameter '-Xmx2G' to allocate more memory for Streamline.
In the timeline, why do the process/thread bars in the heatmap and samples HUD detail bars not match?
The process/thread bars in the heatmap are generated using context switch trace synchronously output from the kernel scheduler, whereas the samples HUD detail bars are based on function call stack samples generated on a periodic interrupt. There is usually good correlation between these metrics so as to be a good guide for mapping telemetry data with the statistical views. However, due to the differences in the measurement approaches, they may not always match.
Why are multiple rectangles shown in the timeline process/thread area (heatmap) for each timeslice at the maximum zoom level? I thought there was one sample per millisecond?
The heatmap/telemetry data is based on synchronous kernel context switch trace with high resolution timestamps. Between each profiling sample interrupt (once per millisecond at a normal sample rate), multiple synchronous trace events could have occurred, each recorded with a high resolution timestamp. Multiple bars will be displayed when more than one thread is active in a timeslot.
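The relationship between trace events and sample intervals can be sketched as follows. This is a minimal illustration, not Streamline's implementation: the trace entries, thread names, and the helper threads_per_timeslot are invented for the example.

```python
from collections import defaultdict

# Hypothetical context-switch trace: (timestamp in nanoseconds, thread name).
# The values are invented for illustration only.
trace = [
    (1_000_100, "ui"),
    (1_250_000, "worker"),
    (1_700_000, "ui"),
    (2_100_000, "worker"),
]

SAMPLE_PERIOD_NS = 1_000_000  # one sampling interrupt per millisecond


def threads_per_timeslot(events, period_ns=SAMPLE_PERIOD_NS):
    """Group trace events by the 1 ms sample interval they fall into."""
    slots = defaultdict(list)
    for timestamp_ns, thread in events:
        slots[timestamp_ns // period_ns].append(thread)
    return dict(slots)


slots = threads_per_timeslot(trace)
for slot, threads in sorted(slots.items()):
    print(slot, threads)
```

In this made-up trace, three context switches land inside a single millisecond, so that timeslot would show more than one bar even though only one profiling sample was taken in it.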
At maximum zoom, why are there samples missing from the detail bars?
There are a couple of rare corner cases in the kernel which cannot be resolved, or the resolution would add too much probe effect. For such events, no sample data is collected, leaving a blank entry in the timeslot.
Why are instructions and cache metrics missing from the timeline?
If the counters are correctly installed in the Gator driver and you are using daemon defaults (or configuration.xml), the most likely cause for the hardware counters not working (typically on Cortex-A8 systems) is that DBGEN (or NIDEN) has not been set by the bootloader, or debug peripherals have not been configured in the kernel.
Why are the call paths (stack unwinding) flat for vmlinux?
The Makefile for the gator driver must be modified and gator.ko rebuilt with GATOR_KERNEL_STACK_UNWINDING defined, otherwise the call paths will be flat. Kernel stack walking is off by default due to Linux issuing a warning when the call path cannot be completed correctly (e.g. a linked in module is not built with the necessary compiler options). This warning is then issued hundreds of times per second, which floods the console or dmesg output.
Why are the call paths shallow or incomplete?
A flat call path or orphaned nodes, i.e. not a complete chain back to main, can occur for several reasons:
- Call Stack Unwinding is not enabled in the Streamline Capture & Analysis Options.
- User space gator is running as opposed to kernel space gator, as user space gator does not support call stack unwinding.
- The image selected in Streamline is not derived from the same image that was run on the target.
- The application is not compiled with frame-pointers. For GCC, use CFLAGS '-fno-omit-frame-pointer' option.
- The application is compiled for Thumb mode. Only Arm mode may be call stack profiled due to compiler limitations, so do not use in parallel with '-mthumb'.
- If you are profiling with library symbols, you may not have access to a correctly compiled library, which will result in many orphaned nodes.
- Hand-written assembly is encountered in the call path which doesn't properly account for frames.
- The compilation was not a clean build, i.e. some .o files are compiled with frame pointers, some without frame pointers.
The statistical data doesn't seem to make any sense in the Code view
The likely cause is that the application image presented to Streamline does not match the image running on the target. Inlining may also be the culprit, as it can make the reported data harder to interpret. Open the disassembly area to see if any metrics are 'hiding' due to being inlined from other functions/source-files.
Why are the Code view function percentages not adding up to 100%?
The percentages are rounded to two decimal places, so you might be seeing rounding errors, which will be the case if >100% is reported. You may also be looking at a function with inlined routines. Click on the disassembly button to view the function's instruction metrics at the disassembly level.
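The rounding effect is easy to reproduce. The sketch below assumes a hypothetical profile in which every function receives the same number of samples; the helper name rounded_percentages is made up for the illustration.

```python
def rounded_percentages(sample_counts):
    """Return each count's share of the total, as a percentage rounded to two decimals."""
    total = sum(sample_counts)
    return [round(100 * count / total, 2) for count in sample_counts]


# Six functions with equal sample counts: each share rounds up.
six_equal = rounded_percentages([1] * 6)
# Three functions with equal sample counts: each share rounds down.
three_equal = rounded_percentages([1] * 3)

print(six_equal, round(sum(six_equal), 2))
print(three_equal, round(sum(three_equal), 2))
```

Six equal shares each round up to 16.67, so the displayed values total slightly more than 100%, while three equal shares each round down to 33.33 and total slightly less.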
Why are the hexadecimal addresses in the Code view disassembly area green?
The green indicates that the instructions have been inlined from other routines. Clicking on the instructions will jump the source area to the related inlined routines. The percentages for the inlined source lines will be marked with the text 'inlined', as the metrics are reflected in the caller, not the callee.
What is the probe effect of gator?
Gator is often reported as having "less than 5% overhead". This is often true and can be much less than 5%. It can also be greater as the probe effect depends a lot on what is being captured. See the next FAQ for details.
How do I reduce the probe effect of gator?
Change the buffer mode to non-Streaming, lower the sample rate, reduce the number of counters collected, reduce annotations, and do not perform call stack unwinding.
What is Event Based Sampling?
Event Based Sampling (EBS) is a statistical profiling feature, where instead of using the timer to generate sampling interrupts (PC+symbol call stacks) we use the PMU counter overflow to generate the event. The resulting samples accumulate according to the event, enabling application/library/function microarchitectural code investigations.
Why does Event Based Sampling produce no data in the timeline?
Event Based Sampling requires an interrupt when the PMU counter overflows, and the routing of that interrupt is SoC configurable. The feature therefore relies on the SoC configuration code in the kernel, which is not always up to date and is sometimes incorrect. In these cases, a pmu device exists but the irq numbers are incorrect, so an interrupt does not occur.
How do I configure events/counters when performing a local capture?
To enable events, update configuration.xml. If a configuration.xml file does not exist, the default counters are used. To override the defaults, create a configuration.xml and place it in the same folder as gatord, or point to it using the -c parameter when launching gatord. See the User's Guide for how to create a configuration.xml file. Any incompatible events, such as a Cortex-A8 event when the target is a Cortex-A9, are ignored by gator and can safely be left in the configuration.xml file. Use the TRM to look up the event numbers for the events you are interested in, or refer to the XML files shipped with the gator daemon. The GUI remains the preferred way to set up the configuration of a target.
How can I call a script or start a workload on my target when a capture is started from Streamline on the host?
Use the Command field in Capture & Analysis Options. For a local capture, one may launch gator from the script which also launches the workload of interest.
What is the purpose of the average and Hertz display types used in the chart configuration expression?
The display type is used to display the underlying data when zoomed out.
- Average - averages the underlying data, e.g. Energy Probe
- Maximum/Minimum - displays the maximum or minimum value of the underlying data, and maintains that value until a new value is received from the target, e.g. Memory
- Accumulate - sums all underlying values, e.g. Branch Mispredicted
- Hertz - accumulates the data and normalizes the data to one second, e.g. Clock Cycles
As an example, if the underlying data at each millisecond is 5, 6, 7, 8, 9, then when zoomed to the 5ms level, Streamline will display this data as 7 for Average, 9 for Maximum, 35 for Accumulate, and 7000 for Hertz. For more details, refer to the Gator-Streamline Protocol document located in /ds-5-install-directory/arm/gator/protocol/.
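The arithmetic behind these numbers can be checked with a short sketch. It is an illustration of the calculation only, under the assumption of one value per millisecond; the function display_values is hypothetical, and the "maintain the value until a new one arrives" behaviour of Maximum/Minimum is not modelled.

```python
def display_values(samples, sample_period_ms=1):
    """Collapse per-sample values into a single bin for each display type."""
    total = sum(samples)
    bin_width_ms = len(samples) * sample_period_ms
    return {
        "average": total / len(samples),
        "maximum": max(samples),
        "minimum": min(samples),
        "accumulate": total,
        # Hertz: accumulate, then normalize to one second (1000 ms).
        "hertz": total * 1000 / bin_width_ms,
    }


values = display_values([5, 6, 7, 8, 9])
print(values)
```

For Hertz, the 35 events accumulated over a 5ms bin scale to 7000 events per second.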
Why does the "high resolution timeline" feature not require a re-capture from the target?
Everything captured from gator is captured with a nanosecond timestamp. To show this data in Streamline, the data is interpolated to fit whatever zoom level the user has selected. To make the UI experience fast, all zoom levels between 1 second and 1 millisecond are calculated prior to showing the report. The "high resolution timeline" feature simply adds the three microsecond zoom levels (100us, 10us, and 1us) when generating the report. NOTE: not all data contains high resolution information even if the "high resolution timeline" feature is enabled. Only scheduler trace and annotation data are displayed at high resolution. Scheduler trace includes the heatmap process area and CPU Activity timeline chart. The other timelines maintain a 1 millisecond resolution even when "high resolution timeline" is enabled.
Can Gator send data over an interface other than Ethernet, like UART or USB?
The Gator daemon supports sending data over any TCP/IP network. Please consult online resources for configuring networking (SLIP, PPP) over a serial connection. A good rule of thumb for a standard setup is that the 'Normal' sample rate will require ~100kB/s. For slow serial links please use the 'Low' sample rate. Disabling counters and the Call Stack Unwinding option will also reduce the bandwidth requirements. If all else fails, Local Capture may be used to store data to the target's filesystem.
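To judge whether a serial link can carry the 'Normal' sample rate, a rough estimate such as the one below may help. It is a sketch under stated assumptions: 8N1 framing (10 bits on the wire per byte), no allowance for SLIP/PPP or TCP/IP overhead, and baud rates chosen purely as examples. The function names are hypothetical.

```python
NORMAL_RATE_BYTES_PER_S = 100_000  # ~100kB/s rule of thumb quoted above


def serial_throughput_bytes_per_s(baud, bits_per_byte=10):
    """Raw bytes per second on a serial line.

    Assumes 8N1 framing: 8 data bits plus one start and one stop bit.
    Protocol overhead (SLIP/PPP, TCP/IP headers) is ignored.
    """
    return baud // bits_per_byte


def link_is_sufficient(baud, required=NORMAL_RATE_BYTES_PER_S):
    """True if the raw line rate meets the required byte rate."""
    return serial_throughput_bytes_per_s(baud) >= required


for baud in (115_200, 921_600, 3_000_000):
    print(baud, serial_throughput_bytes_per_s(baud), link_is_sufficient(baud))
```

Under these assumptions a 115200 baud link carries roughly a tenth of the required rate, which is why the 'Low' sample rate or Local Capture is the safer choice on such links.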
Can Gator send data over a JTAG interface?
While it's certainly possible that a network connection could be established over a JTAG interface, it is probably unwise to use such a configuration due to the shallow FIFO and interrupt overhead of Arm DCC communication. If attempted, the 'Low' sample rate should be used.
Does Streamline support Android targets?
Yes, the same targets are supported as regular Linux, but the daemon will need to be built for Android. Follow the normal Linux compilation instructions for the Gator driver.
Are Android NDK libraries supported by Streamline?
Yes, as are all native libraries, comparable to regular Linux systems.
Can Streamline profile Android Java applications?
Pure Java profiling is not supported. Streamline does support any ELF-compatible file as an image, including OAT files which are part of the Android Runtime (ART). Also, you may still derive a lot of value from using Streamline for profiling Java applications which use native libraries.
How do I offline/online a CPU?
To offline a CPU (cpu1 in this example), write 0 to its online file; writing 1 to the same file brings the CPU back online:
echo 0 > /sys/devices/system/cpu/cpu1/online
What is meant by unknown, unresolved, and anonymous in the call paths view?
Anonymous/unknown samples in the code view are samples in which no symbols could be found. A sample contains two bits of info: a file, and an offset within that file. From this offset, we can associate the sample with a file, function, and source line number using the debug information. If the file was not specified as an image, the sample will be labeled as anonymous. If this offset is out of bounds within the debug information or the file contains no debug information, it will be labeled as anonymous. When the file cannot be resolved by the OS on the target, the data is labeled unknown. In the case of "unknown" appearing under "anonymous" in the call chain, the executable file was resolved but the file associated with the sample cannot be resolved. Frame pointers are in no way related to anonymous code. With frame pointers, we are able to walk the stack. Once we cannot continue walking the stack, e.g. an invalid frame pointer, the analysis stops and that call chain is placed at the thread root level. [unresolved] occurs when the code is running in an unassigned address area and thus cannot resolve a file with this address (this can happen with JIT) or for some reason the OS has not assigned a valid string to the file (called a dentry name).
What are the differences between the XML files used during data collection?
- captured.xml describes the capture environment and what was captured; it is placed in the .apc folder and describes the raw capture data file, 0000000000
- configuration.xml lists the events to use during a capture, located in the same directory as gatord
- counters.xml is an internal file (never written to disk) that simply lists the contents of /dev/gator/events
- events.xml lists all possible events, located in the same directory as gatord, and is used to populate the Counter Configuration dialog
- session.xml describes the options set by the user in Streamline and is stored in the .apc folder
Visual annotation shows corrupt images
Rebuild the binary that issues the visual annotations using the latest streamline_annotate.c and .h files.
Annotations are not working with kernel space gator and Android Lollipop.
This should only happen with applications built against older streamline_annotate.h files. This occurs because SELinux blocks annotations. Either recompile the application using the latest streamline_annotate.c and .h files, or disable SELinux (setenforce 0 from the command line).
Why is CPU Activity showing 100% User activity but the Call Paths view has all samples within the kernel?
All kernel threads (a thread with no address space) are part of the system; all other threads (a thread with a virtual address space) are part of the user. A user thread that performs a system call and runs in privileged kernel space is still logged as User in the CPU Activity chart.
How can CPU Activity + CPU Wait > 100%?
The CPU activity and CPU wait states are independent, and thus each may be at 100%. CPU Wait I/O just indicates that there is a task waiting on I/O which is therefore not currently running. It is more of a binary indicator: when the task is waiting, it is 100%, otherwise 0%. Percentages in between only appear because the value shown is the fraction of the displayed time bin during which the process was waiting. Example: process X waits on I/O from time 5ms to 6ms. The user is zoomed to a 5ms zoom level and is viewing a bin from 5ms to 10ms, thus CPU Wait I/O will show 20%.
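The 20% figure follows from the overlap between the wait interval and the displayed bin. The sketch below reproduces that calculation; wait_percentage is a hypothetical helper, not a Streamline function.

```python
def wait_percentage(wait_start_ms, wait_end_ms, bin_start_ms, bin_end_ms):
    """Percentage of a display bin during which the task was waiting on I/O."""
    overlap = min(wait_end_ms, bin_end_ms) - max(wait_start_ms, bin_start_ms)
    overlap = max(overlap, 0)  # no overlap if the wait lies outside the bin
    return 100 * overlap / (bin_end_ms - bin_start_ms)


# Process X waits from 5ms to 6ms; the bin covers 5ms to 10ms.
pct = wait_percentage(5, 6, 5, 10)
print(pct)
```

A wait that spans the whole bin gives 100%, and a wait entirely outside the bin gives 0%, which matches the binary behaviour described above.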
Why is the Samples HUD empty?
Sampling occurs every 1ms or every 10ms depending on the sample rate. If the zoom level is finer than the sample period, it is likely that no samples will appear in the heads-up display. This can also happen when using the PREEMPT_RT patch to Linux. In this case, default hrtimer execution (which is used to obtain samples) is in softirq context (i.e. it runs in a separate kernel thread instead of interrupting the user thread), so get_irq_regs always returns NULL. Without knowing the value of the PC register, backtraces cannot occur.
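As a rough guide to the zoom-level case, the expected number of samples in one displayed bin is the bin width divided by the sample period. The helper below is a hypothetical illustration of that ratio only.

```python
def expected_samples_per_bin(bin_width_us, sample_period_us):
    """Average number of periodic samples that fall into one display bin."""
    return bin_width_us / sample_period_us


# 1ms sample period viewed at a 100us zoom level.
fine_zoom = expected_samples_per_bin(100, 1_000)
# 10ms sample period viewed at a 1ms zoom level.
low_rate = expected_samples_per_bin(1_000, 10_000)
# 1ms sample period viewed at a 1ms zoom level.
matched = expected_samples_per_bin(1_000, 1_000)

print(fine_zoom, low_rate, matched)
```

When the ratio is below 1, most bins contain no sample at all, so an empty Samples HUD is the expected result for any single bin.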
The value shown in a CSM selection does not make sense.
Certain combinations of options can produce nonsensical data. For example, selecting the 'Percentage' checkbox and unchecking the 'Average Selection' checkbox will lead to a sum of percentages which has no useful meaning. Chart configuration is considered an advanced feature and Streamline does not attempt to prevent every single incompatible combination.
Why is there no energy probe data, and no warning, when the NI-DAQ is disconnected?
When using NI-DAQmx Base, it takes up to 10 seconds to determine if the device is connected or not. So if the device is not connected and a short (less than 10s) Streamline capture is made, there is not enough time to determine that the NI-DAQ is disconnected.
Only 3 of the 4 CCI-400 counters work
There is a bug in the Linaro kernels where only 3 of the CCI-400 counters work. This is expected to be fixed in 13.01; alternatively, a patch is available that can be applied separately.
Many counters on the timeline are zero
Gator uses perf to obtain the hardware PMU data. If perf is not working correctly, gator will not work either. If perf is installed, you can test it by running "sudo perf stat -a -e cycles -e instructions -e branch-misses sleep 1". If perf is working correctly, there should be non-zero values for cycles, instructions, and branch-misses.
Hardware counters are zero or the wrong hardware counters are shown
There is a bug in some Linux kernels where perf misidentifies the CPU type. To see if you are affected by this, run ls /sys/bus/event_source/devices/ and verify the listed processor type matches what is expected. For example, an A9 should show the following.
# ls /sys/bus/event_source/devices/
ARMv7_Cortex_A9 breakpoint software tracepoint
To work around the issue, try upgrading to a later kernel, or comment out the gator_events_perf_pmu_cpu_init(gator_cpu, type); call in gator_events_perf_pmu.c.
Hardware monitor (hwmon) data is incorrect
hwmon data does not appear to be safe against concurrent access. For example, if you run sensors on the same machine in two different terminals at the same time, you can get incorrect results. Streamline uses the same library to read hwmon data, so if you run sensors during a capture, both the output of sensors and the hwmon data you capture may be incorrect.
How can clock cycles be greater than clock frequency?
Clock cycles can never be greater than the clock frequency. However, because of the way gator records the time and collects the hardware performance data, combined with slight inaccuracies in the high resolution timer (it is not a true nanosecond timer), some of the samples can shift from one bin into another. This shifting could be minimized by disabling interrupts and using some other tricks, but these would increase the probe effect, so we have opted not to implement them. Thus, there can be shifting of cycles. Note that this shifting is usually on the order of 1% or less and is generally only seen when zoomed into the maximum zoom level.
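A small numeric illustration of the shifting effect follows. The 1 GHz clock, the 1ms bin width, and the 1% shift are assumptions picked to match the magnitude mentioned above, and apparent_hz is a hypothetical helper.

```python
CLOCK_HZ = 1_000_000_000  # assumed 1 GHz core clock
BIN_NS = 1_000_000        # assumed 1 ms bin at maximum zoom
NS_PER_S = 1_000_000_000


def apparent_hz(cycles_in_bin, bin_ns=BIN_NS):
    """Clock rate implied by the cycles attributed to a single bin."""
    return cycles_in_bin * NS_PER_S // bin_ns


true_cycles_per_bin = CLOCK_HZ * BIN_NS // NS_PER_S

# Assume timestamp jitter moves 1% of one bin's cycles into the next bin.
shifted = true_cycles_per_bin // 100
bins = [true_cycles_per_bin - shifted, true_cycles_per_bin + shifted]

for cycles in bins:
    print(cycles, apparent_hz(cycles))
```

The total across both bins is unchanged, but the second bin on its own implies a rate 1% above the real clock frequency, which is the effect visible at the maximum zoom level.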
What does not work in thumb mode?
Everything in Streamline works without issue with Thumb code except for call stack unwinding. If you want call stack unwinding, you must compile your code as Arm code. To do so, either configure gcc to do it by default (set --with-mode correctly when compiling gcc) or specify -marm when compiling your code. If you use Thumb code instead of Arm code, you will see valid data in all the tabs, but the call paths tree will be flattened and the call graph will not show the relationship between the methods.
Not all clusters/cores appear with Streamline when using my A53/A57 big.LITTLE target, what's wrong?
As of today (June 26, 2015), Linux does not support big.LITTLE. The most up-to-date version of the patches can be found on the external Linux ARM kernel mailing list, and apply atop of the 32-bit patches that have just been merged as part of the v4.2 merge window. Linaro versions of the Linux kernel and perf do support big.LITTLE but only for v7. A12 support was added in Linux 3.15 and A17 in Linux 3.16.
Why are the core names Unknown?
There are several reasons this may occur:
- Using kernel space gator, the core will be unknown if it was never onlined during a capture.
- Using user space gator with v8 architecture cores, Linux does not make the core names available.
- Using user space gator with v7 architecture cores prior to Linux 3.8, Linux does not make the core names available.
Streamline thinks the target Mali-450 GPU device has 8 cores but in reality it only has 4 cores
Part of the requirements with Mali-4xx is that mali.ko must execute before gator.ko for proper initialization. Incorrect number of cores is an example of what can happen when gator.ko executes first.
no such file or directory OR command not found
This error can occur when the ABI of the application binary is incompatible with the system. Two common examples are building an application for Linux and attempting to run it on Android, or building an application for armel and attempting to run it on an armhf system. Linaro toolchains use armhf since 12.05. Read more about the Arm hardfloat ABI. You can get the Linaro toolchain binaries at launchpad.net.
gator: Unknown symbol _GLOBAL_OFFSET_TABLE_ (err 0) in dmesg when insmoding gator.ko
This can happen when using the Android NDK cross compiler (arm-linux-androideabi-). Instead, use the compiler provided by the linaro-toolchain-binaries distributed by Linaro (arm-linux-gnueabihf-).
IKS: no hardware counters on big cluster
There is a known issue with perf on IKS where no data may be collected on the big cluster. See Perf broken on linux-linaro 3.10.x kernel
gator_backtrace.c: error dereferencing pointer to incomplete type
The issue is that 'struct module' is undefined. It is undefined because the kernel was built to not support modules. To resolve the issue, one must either change and rebuild their kernel to support modules, or compile the gator driver directly into the kernel.
The board locks up on startup or other quirky issues
Odd issues like the board locking up can occur if the gator kernel module (gator.ko) was not compiled with exactly the same toolchain, kernel sources and config. It does not always happen so having everything match exactly is not a hard requirement, but it is strongly recommended.
error: "monotonic_to_bootbased" undeclared
Gator 5.19 and previous versions are not compatible with Linux kernel 3.17 and greater. Upgrade to Gator 5.20 or later.
CANNOT LINK EXECUTABLE: cannot locate symbol
If you get an error like CANNOT LINK EXECUTABLE: cannot locate symbol "signal" referenced by "gatord" on Android, please ensure you are using the correct version of the Android NDK: ndk32 for 32-bit targets and ndk64 for 64-bit targets.
How to enable OpenCL
Streamline OpenCL mode provides a visual representation of OpenCL code running on Mali Midgard and Bifrost devices. It is supported by gator version 21 and later.
To enable OpenCL mode in Streamline:
- Build the DDK with the following flags, which are passed to the 'scons' command:
'cl=1; streamline_annotate=1; instr=1; timeline=cl_timeline; gator=2'.
- Modify the autogenerated <process_name>.instr_config file with:
[timeline] enabled=1; uds_lossless_mode=1
- To get Timeline tracepoints from OpenCL the application must enable profiling using the CL_QUEUE_PROFILING_ENABLE property:
queue = clCreateCommandQueue(context, device, properties | CL_QUEUE_PROFILING_ENABLE, &err);
Please refer to Mali DDK Integration Manual for more details about OpenCL profiling.