Sunday, March 27, 2011

Profiling Android NDK Code

Profiling code on embedded systems is tricky. The Google tools for profiling Android applications only cover Java code, not native C code. So I've created a new project on Google Code for profiling Android NDK shared libraries. I had the idea when reading this post on the GBAdev.org forum - there coder "sverx" asked how to profile code, and I remembered Visual Boy Advance did it pretty well on the GBA, but there was no way to do so on the DS - nor on Android. Well, the Android situation just improved :-)

I read the VBA source, lifted the relevant chunks of gprof-inspired code, hacked in some timing information, and added a touch of assembly to tie together what gcc -pg does and what the profile-creating code does. If you look at a chunk of the output from gcc with -pg, it looks like this:
 
0001184c <do_raisedead_cast>:
   1184c:   b5f8        push    {r3, r4, r5, r6, r7, lr}
   1184e:   b500        push    {lr}
   11850:   f004 f830   bl  158b4 <__gnu_mcount_nc>
   ...
   1189c:   3a01        subs    r2, #1
   1189e:   701a        strb    r2, [r3, #0]
   118a0:   bdf8        pop {r3, r4, r5, r6, r7, pc}

So the compiler pushes the link register (lr) and the __gnu_mcount_nc function has to pop it off the stack. This isn't a normal calling convention - usually each function uses the stack symmetrically, pushing and popping the same number of times. Also I learned that the mcount function must preserve all the registers. You can't smush a single one or Bad Things happen (I've seen a lot of weird crashes lately, I can tell you). This little conundrum took me quite some time to figure out. It can probably be done better, but the implementation I came up with was this:
 
        push {r0-r3}
        push {lr}
        ldr r0, [sp, #20]  @ r0 = lr pushed by calling routine
        mov r1, lr    @ address of calling routine
        bl profCount
        pop {r0}
        mov lr, r0
        pop {r0-r3}
        @ pop previous address
        add sp, #4
        bx lr

Since registers r4+ must be preserved by the called function, I can ignore those when the code calls out to profCount. Keeping r0-r3 clean was the problem, as usually these are "scratch" registers and you're free to mess them up. The function profCount takes the C function-just-called and its caller addresses as arguments for use in generating the profile. Anyway, the results are pretty good for the code I've tested it on - you get the usual call counts and timing information that gprof provides. Here's a bit of the information from Chaos...
 
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 65.31      0.32     0.32      288     1.11     1.11  android_render_screen
 18.37      0.41     0.09      450     0.20     0.20  clear_square
  4.08      0.43     0.02        3     6.67     6.67  clear_bg
  2.04      0.44     0.01       25     0.40     0.40  print_text16
  2.04      0.45     0.01        3     3.33     3.33  draw_decor_border_internal
  2.04      0.46     0.01        2     5.00    33.00  splash_touch
  2.04      0.47     0.01        1    10.00    52.00  options_back
  2.04      0.48     0.01        1    10.00    54.80  show_options
  2.04      0.49     0.01                             chaos_start
  0.00      0.49     0.00   442368     0.00     0.00  draw_8x8_tile
  0.00      0.49     0.00    28800     0.00     0.00  platform_set_tile_entry
  0.00      0.49     0.00     3646     0.00     0.00  platform_set_text_map_entry
  0.00      0.49     0.00     2671     0.00     0.00  platform_key_pressed
  0.00      0.49     0.00     1186     0.00     0.00  platform_set_palette_entry

index % time    self  children    called     name
                0.32    0.00     288/288         chaos_gl_render [2]
[1]     65.3    0.32    0.00     288         android_render_screen [1]
                0.00    0.00  442368/442368      draw_8x8_tile [22]
                0.00    0.00     286/286         InterruptProcess [30]

Functions with no "calls", "self" or "total" are because they got called via JNI. The "draw_8x8_tile" call count there is accurate - it is called for each 8x8 tile on the screen in each frame. There are 32 x 24 tiles, and 2 layers of tiles. 32 * 24 * 2 * 288 = 442368, as expected. And the android_render_screen count is correct, since at 30 fps that is around 9 seconds of use and I only ran the game for a few seconds to check.

More details on the android-ndk-profiler page, including a guide on how to use the library.