Sunday, March 27, 2011

Profiling Android NDK Code

Profiling code on embedded systems is tricky. The Google tools for profiling Android applications only cover Java code, not native C code. So I've created a new project on Google Code for profiling Android NDK shared libraries. I had the idea when reading this post on the forum - there coder "sverx" asked how to profile code, and I remembered Visual Boy Advance did it pretty well on the GBA, but there was no way to do so on the DS - nor on Android. Well, the Android situation just improved :-)

I read the VBA source, lifted the relevant chunks of gprof-inspired code, hacked in some timing information, and added a touch of assembly to tie together what gcc -pg does and what the profile-creating code does. If you look at a chunk of the output from gcc with -pg, it looks like this:
0001184c <do_raisedead_cast>:
   1184c:   b5f8        push    {r3, r4, r5, r6, r7, lr}
   1184e:   b500        push    {lr}
   11850:   f004 f830   bl  158b4 <__gnu_mcount_nc>
   1189c:   3a01        subs    r2, #1
   1189e:   701a        strb    r2, [r3, #0]
   118a0:   bdf8        pop {r3, r4, r5, r6, r7, pc}

So the compiler pushes the link register (lr) and the __gnu_mcount_nc function has to pop it off the stack. This isn't a normal calling convention - usually each function uses the stack symmetrically, pushing and popping the same number of times. Also I learned that the mcount function must preserve all the registers. You can't smush a single one or Bad Things happen (I've seen a lot of weird crashes lately, I can tell you). This little conundrum took me quite some time to figure out. It can probably be done better, but the implementation I came up with was this:
        push {r0-r3}
        push {lr}
        ldr r0, [sp, #20]  @ r0 = lr pushed by calling routine
        mov r1, lr    @ address of calling routine
        bl profCount
        pop {r0}
        mov lr, r0
        pop {r0-r3}
        @ pop previous address
        add sp, #4
        bx lr

Since registers r4+ must be preserved by the called function, I can ignore those when the code calls out to profCount. Keeping r0-r3 clean was the problem, as usually these are "scratch" registers and you're free to mess them up. The function profCount takes the C function-just-called and its caller addresses as arguments for use in generating the profile. Anyway, the results are pretty good for the code I've tested it on - you get the usual call counts and timing information that gprof provides. Here's a bit of the information from Chaos...
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 65.31      0.32     0.32      288     1.11     1.11  android_render_screen
 18.37      0.41     0.09      450     0.20     0.20  clear_square
  4.08      0.43     0.02        3     6.67     6.67  clear_bg
  2.04      0.44     0.01       25     0.40     0.40  print_text16
  2.04      0.45     0.01        3     3.33     3.33  draw_decor_border_internal
  2.04      0.46     0.01        2     5.00    33.00  splash_touch
  2.04      0.47     0.01        1    10.00    52.00  options_back
  2.04      0.48     0.01        1    10.00    54.80  show_options
  2.04      0.49     0.01                             chaos_start
  0.00      0.49     0.00   442368     0.00     0.00  draw_8x8_tile
  0.00      0.49     0.00    28800     0.00     0.00  platform_set_tile_entry
  0.00      0.49     0.00     3646     0.00     0.00  platform_set_text_map_entry
  0.00      0.49     0.00     2671     0.00     0.00  platform_key_pressed
  0.00      0.49     0.00     1186     0.00     0.00  platform_set_palette_entry

index % time    self  children    called     name
                0.32    0.00     288/288         chaos_gl_render [2]
[1]     65.3    0.32    0.00     288         android_render_screen [1]
                0.00    0.00  442368/442368      draw_8x8_tile [22]
                0.00    0.00     286/286         InterruptProcess [30]

Functions with no "calls", "self" or "total" are because they got called via JNI. The "draw_8x8_tile" call count there is accurate - it is called for each 8x8 tile on the screen in each frame. There are 32 x 24 tiles, and 2 layers of tiles. 32 * 24 * 2 * 288 = 442368, as expected. And the android_render_screen count is correct, since at 30 fps that is around 9 seconds of use and I only ran the game for a few seconds to check.

More details on the android-ndk-profiler page, including a guide on how to use the library.


  1. Anonymous8:36 pm

    I am trying to use the profiler but when I compile I get undefined reference to monstartup and moncleanup functions. My file is
    LOCAL_PATH := $(call my-dir)

    # remove -ffunction-sections and -fomit-frame-pointer from the default compile flags
    TARGET_thumb_release_CFLAGS := $(filter-out -ffunction-sections,$(TARGET_thumb_release_CFLAGS))
    TARGET_thumb_release_CFLAGS := $(filter-out -fomit-frame-pointer,$(TARGET_thumb_release_CFLAGS))
    TARGET_CFLAGS := $(filter-out -ffunction-sections,$(TARGET_CFLAGS))

    # include libandprof.a in the build
    include $(CLEAR_VARS)
    LOCAL_MODULE := andprof
    LOCAL_SRC_FILES := libandprof.a

    # RayTracer shared library
    include $(CLEAR_VARS)

    LOCAL_MODULE := RayTracer

    # compile with profiling
    LOCAL_CFLAGS := -pg

    # Ray Tracer
    LOCAL_SRC_FILES := RayTracer.cpp
    LOCAL_LDLIBS := -lGLESv1_CM -lstdc++



    The results I get after I run ndk-build are:
    Gdbserver : [arm-linux-androideabi-4.4.3] libs/armeabi/gdbserver
    Gdbsetup : libs/armeabi/gdb.setup
    SharedLibrary :
    C:/.../TempNDK/obj/local/armeabi/objs-debug/RayTracer/TempNDK.o: In function `Java_ADE_RayTracer_GLView_nativeInit':
    C:/.../TempNDK/jni/TempNDK.cpp:185: undefined reference to `monstartup(char const*)'
    C:/.../TempNDK/obj/local/armeabi/objs-debug/RayTracer/TempNDK.o: In function `Java_ADE_RayTracer_GLView_nativePause':
    C:/.../TempNDK/jni/TempNDK.cpp:239: undefined reference to `moncleanup()'
    collect2: ld returned 1 exit status
    make: *** [/cygdrive/c/.../TempNDK/obj/local/armeabi/] Error 1

    Is there anything else I should in order to make it work?

  2. Ah! Looks like you're using C++. Name mangling. I'll upload a fix, but if you declare them with C linkage it should work:

    extern "C" void monstartup(char const*);

    extern "C" void moncleanup();

    1. I'm getting the same undefined reference compilation errors when I attempt to use android-ndk-profiler in my project. I'm using version 3.2, and I've got the extern "C" declarations in my header.

      Without the extern "C" declarations, I get 'monstartup not declared in this scope'. With them, I get 'undefined reference to monstartup'.

      Any tips?

  3. I tried using the pre-compiled profiling library posted on the project page, but my app crashes in System.loadLibrary() if I compile with

    APP_ABI := armeabi-v7a

    in my file. It works fine if I use the default APP_ABI.

    I did not try compiling the library myself with armeabi-v7a, although I imagine that would fix the problem. You may want to add this caveat to the documentation.

  4. @LightStruk - Thanks for the report. I thought that I only needed to supply the lowest-common-denominator armeabi, as the docs/PREBUILTS.html file in the NDK says:

    "Remember that you don't need to provide an armeabi-v7a prebuilt library, since an armeabi one can easily run on the corresponding devices."

    That probably only applies to shared libraries, not when you mix armeabi and v7a static libraries. I've updated the project with v1.3, which has a prebuilt armeabi-v7a library.

  5. Anonymous10:45 pm

    I followed the steps on your ndk-profiler usage, and I'm able to run my program and generate a gmon.out on my nexus one phone. However the prof outpu says "no time accumulated". My program is compiled with pg, and android project is set to debug in its manifest also.

  6. Perhaps the calls are not long enough? Are you getting *any* profiling information? e.g. the flat profile with the number of calls per function? The library hardcodes PC sampling at 100Hz (one sample every 10 ms), if your function is called sporadically and takes less time I guess it won't trigger time accumulation.

  7. Anonymous8:47 pm

    I called my function 100 times to ensure the call is long enough. The measured wall time is 13 s.

    The Flat profile does not have number of calls per function. It shows "Flat profile:
    Each sample counts as 0.01 seconds.
    no time accumulated" and then gives the definition of %time, name etc. terms, that's it.

    I'm using ndk-r5b on Windows, and use cygwin to compile my natvie code.

  8. Are you using armeab-v7a? I've not tested that at all and it probably doesn't work. I don't have a compatible device, but I'll try and fix it if that's the case.

  9. Anonymous8:49 pm

    Just to be clear, I downloaded your latest NDK-profiler 1.3, and only make for armeab .
    But still gmon.out does not produce any function usage information. My device is Nexus One running Android 2.3.3

    Here is PROFILING info LogCat:

    04-27 11:16:16.280: INFO/PROFILING(2632): Proflie 80400000-8
    0409000: 0
    04-27 11:16:23.790: INFO/PROFILING(2632): 1: stopping profile thread
    04-27 11:16:23.800: INFO/PROFILING(2632): 1: writing gmon.out

    One suggestion: it would be useful if you can provide a test project (stating under what platform it has been tested), so we can download it and quickly test if it can work under our platform. That be easier than putting it into our large project and trying to figure out what goes wrong.

  10. Anonymous7:53 pm

    No, I’m using armeab
    Lasted NDK profiler v1.3,
    Android 2.3.3 gingerbread.

    One suggestion, can you provide a test project that we can download and quickly test it under our platform? That would be easier to isolate the problem. Thanks.

  11. Anonymous2:04 pm

    I am trying to use the profiler on my c++ code. However although the flat profile shows the number function calls, there is no accumulated time. When I compile the code it gives me a lot of warning says "-ffunction-sections disabled; it makes profiling impossible". Do you know what might be the problem. Normally it should accumulate time as I have functions that are called a million times and the running time of the application is enough.

  12. I've not used C++, but I bet adding something like this to your would help:

    TARGET_CXXFLAGS := $(filter-out -ffunction-sections,$(TARGET_CXXFLAGS))

    By default the Android build system adds the -ffunction-sections so you have to remove it.

  13. Hmm, actually, seems there isn't actually a TARGET_CXXFLAGS, so if you follow it should Just Work.

  14. Thanks for posting the armeabi-v7a build; sadly, when compiling in armeabi-v7a mode, it doesn't appear to work any better than the armeabi build. I don't think it's even getting to monstartup() before it hangs.

    As you said earlier, it doesn't make sense that armeabi-v7a builds would need a different andprof library. The calling convention is unchanged.

    Do you preserve r2 and r3 because profCount() clobbers them? __gnu_mcount_nc() doesn't otherwise touch them. Looking at the disassembly of profCount(), it isn't obvious to me that profCount is correctly preserving r4, r5, and r6.

    FWIW, these are the TARGET_* flags I have in my

    TARGET_CFLAGS := $(filter-out -ffunction-sections,$(TARGET_CFLAGS))
    TARGET_thumb_release_CFLAGS := $(filter-out -fomit-frame-pointer,$(TARGET_thumb_release_CFLAGS))
    TARGET_arm_release_CFLAGS := $(filter-out -fomit-frame-pointer,$(TARGET_arm_release_CFLAGS))

    These flags seem to make both thumb and arm objects work.

    Finally, I also get "no time accumulated" in my profiling results. This happens both with a JNI-loaded library and with a fully native activity. I'm guessing the problem is caused by the multi-threaded nature of Android apps, even native ones. See for more details.

  15. Yeah, the calling convention is that r0-r3 are free for the called function to reuse (i.e. profCount is free to clobber them). r4 onwards have to be preserved by the called function if it uses them.

    I haven't looked at what armeabi-v7a spits out - I imagine that when you compile with armeabi-v7a and thumb2 in particular it is using a different calling mechanism to __gnu_mcount_nc. It hasn't been a priority for me to really investigate it though.

    My testing is done on a multi-threaded app. I guess that there's something in 2.3 on hardware that's screwing it up, because even that works fine with regular armeabi on emulators.

    The article you linked to is for the real gprof and the Linux kernel hooks it uses, which my own hack here doesn't use. I keep track of apparent time spent on each PC in a separate thread. As long as __gnu_mcount_nc is called, it should be registering time spent in the callee.

    If you get the chance, see the other post I made about the native activity example.

  16. thanks for posting your work. If works, it can be SOOO useful!..

    I have a following issue. When trying to perform the build with -pg directive, I get
    "arm-linux-androideabi-g++: -pg and -fomit-frame-pointer are incompatible". May be you can hint what can be done?..


  17. @santa - you need to remove the -fomit-frame-pointer declarations from the build. To do that, you have to temporarily add some lines to your More details

  18. Thanks! There was another source of CFLAGS in my project...

  19. Thanks again. I managed to build the thing. Yet, when I get into running, I get

    I/PROFILING( 714): 0: No smaps found

    Any chance you got a hint?


  20. corrected the misspelling of the parameter passed to monstartup. It rocks&rolls. Thanks!!

  21. @santa, yep that'll do it. I should add a better debug message in that case.

  22. Anonymous8:01 am

    I try to use the tool to profile my projects. It seems that the bottle neck is the following function:

    28.00 11.51 11.51 4922 0.00 0.00 TaskQueue::Dequeue(Queue*, unsigned int, unsigned int*, unsigned long)

    In this function, pthread_mutex_lock, pthread_cond_wait and pthread_mutex_unlock is used to for atomic operations.

    So, if the call is pending in pthread_cond_wait, is the waiting time included in the final perf counter(because it doesn't take any CPU time)?

  23. With version 2.0 of the library onwards you should get correct timing information when you run it on Gingerbread. Amongst the fixes is just this case: system calls with no overhead report no time. On Froyo the timing info is flaky even in 2.0, that's unfixable.

  24. What would be the changes required to make this work with an x86 android device (Google TV) ?

  25. @Jay: the NDK supports x86 now, so what's missing is a conversion of the assembly glue for the profiling calls. That's this code: which is ARM only.

  26. Thanks for replying to my earlier query. I actually looked at prof.c and there seem to be some symbols which are defined only in ARM: "instruction_pointer". I also wrote some replacement assembly glue for x86, but it seems that __gnu_mcount_nc might not be the function call that is put in by gcc, since at no point did I see a call to profCount (which is ordinarily called by __gnu_mcount_nc). Is there a difference in the nomenclature for gnu_mcount_nc? (I remember reading about this on the gprof website, and it mentioned it could be compiler/OS dependent, _mcount, __mcount etc?) Do you have any idea about this?

  27. @Jay - looks like the glue function is called "mcount" on x86 and it is put in by the compiler, it doesn't appear in the disassembly of the object files like with arm. APart from that, there's the issue of getting the instruction pointer from pt_regs, which on arm is easy (there's a macro for it) but on x86 it's a mystery to me.

  28. Hello!
    Thanks for your effort developing the profiler, I can't wait to give it a try.
    I'm having a slight problem, could you please help? I'm getting this error when I try to compile my lib with your profiler (strictly sticking to your instructions):

    make: *** [/home/alex/the_project/android/obj/local/armeabi-v7a/libandprof.a] Error 1
    make: *** Waiting for unfinished jobs....

    **** Build Finished ****

    The libandprof.a module is indeed at the path shown above.

    Thanks in advance!

  29. Can't do much with just that snippet. The Error 1 is the error code that the program actually erroring out passed to make. I guess it is gcc linking - perhaps you're using armeabi rather than armeabi-v7a?

  30. Nope, it's armeabi-v7a.
    I didnt't manage to link with the .a module, but I can compile the profiler from source. However, I get SIGSEGV upon program startup from somewhere outside (before) of my source code.
    Is it even legitimate to compile your profiler right into my shared module?

  31. You can compile it in your own code, just as long as you make sure not to do so with -pg on the actual profiling code (or it will get confused).

  32. Richard,

    I'm trying to use your profiler on a JNI app. I don't have much experience with makefiles. I can't seem to get the makefile to build no matter where I insert the snippets from the Usage guide. Can you give me some guidance here? Here is my makefile:

    TOP_LOCAL_PATH := $(call my-dir)

    include $(TOP_LOCAL_PATH)/fmod/


    include $(CLEAR_VARS)


    LOCAL_MODULE := xxx

    include $(LOCAL_PATH)/


    ifeq ($(NDK_DEBUG),1)
    LOCAL_USER_CFLAGS += -DNDEBUG -Os -funroll-loops

    ifeq ($(NDK_LITE),1)


    LOCAL_ARM_MODE := arm

    LOCAL_LDLIBS := -llog -lGLESv1_CM -lz #-lthread_db
    LOCAL_SHARED_LIBRARIES := fmod-prebuilt


  33. Is it possible to know how much CPU each thread in Android native code took. in short Android native code thread profiling

  34. To use Android native pofilier. I used the profiling library while compiling .so
    and also used monstartup and mon cleaup in the source file sused to make .so.

    I am able to get first three parameter in Flat pofile but rest of the parameters are zero. May I know is there as way to ge that information as well

  35. Anonymous9:14 am

    output i m getting is

    % cumulative self self total
    time seconds seconds calls Ts/call Ts/call name
    0.00 0.00 0.00 43066 0.00 0.00 std::less::operator()(std::string const&, std::string const&) const

    Followed your instruction. But the output is weird. Why all the values are 0.00?

    Is there something that i missed. Thanks in advance.

    1. Anonymous2:13 pm

      I got similar problem, no time data but only call, I used this library in Android 4.1 emulator and manually compiled it.


Note: only a member of this blog may post a comment.