I read the VBA source, lifted the relevant chunks of gprof-inspired code, hacked in some timing information, and added a touch of assembly to tie together what
gcc -pg
does and what the profile-creating code does. If you look at a chunk of the output from gcc with -pg
, it looks like this:
0001184c <do_raisedead_cast>:
1184c: b5f8 push {r3, r4, r5, r6, r7, lr}
1184e: b500 push {lr}
11850: f004 f830 bl 158b4 <__gnu_mcount_nc>
...
1189c: 3a01 subs r2, #1
1189e: 701a strb r2, [r3, #0]
118a0: bdf8 pop {r3, r4, r5, r6, r7, pc}
So the compiler pushes the link register (lr) and the
__gnu_mcount_nc
function has to pop it off the stack. This isn't a normal calling convention - usually each function uses the stack symmetrically, pushing and popping the same number of times. Also I learned that the mcount function must preserve all the registers. You can't smush a single one or Bad Things happen (I've seen a lot of weird crashes lately, I can tell you). This little conundrum took me quite some time to figure out. It can probably be done better, but the implementation I came up with was this:
push {r0-r3}
push {lr}
ldr r0, [sp, #20] @ r0 = lr pushed by calling routine
mov r1, lr @ address of calling routine
bl profCount
pop {r0}
mov lr, r0
pop {r0-r3}
@ pop previous address
add sp, #4
bx lr
Since registers r4+ must be preserved by the called function, I can ignore those when the code calls out to
profCount
. Keeping r0-r3 clean was the problem, as usually these are "scratch" registers and you're free to mess them up. The function profCount
takes the C function-just-called and its caller addresses as arguments for use in generating the profile. Anyway, the results are pretty good for the code I've tested it on - you get the usual call counts and timing information that gprof provides. Here's a bit of the information from Chaos...
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
65.31 0.32 0.32 288 1.11 1.11 android_render_screen
18.37 0.41 0.09 450 0.20 0.20 clear_square
4.08 0.43 0.02 3 6.67 6.67 clear_bg
2.04 0.44 0.01 25 0.40 0.40 print_text16
2.04 0.45 0.01 3 3.33 3.33 draw_decor_border_internal
2.04 0.46 0.01 2 5.00 33.00 splash_touch
2.04 0.47 0.01 1 10.00 52.00 options_back
2.04 0.48 0.01 1 10.00 54.80 show_options
2.04 0.49 0.01 chaos_start
0.00 0.49 0.00 442368 0.00 0.00 draw_8x8_tile
0.00 0.49 0.00 28800 0.00 0.00 platform_set_tile_entry
0.00 0.49 0.00 3646 0.00 0.00 platform_set_text_map_entry
0.00 0.49 0.00 2671 0.00 0.00 platform_key_pressed
0.00 0.49 0.00 1186 0.00 0.00 platform_set_palette_entry
index % time self children called name
0.32 0.00 288/288 chaos_gl_render [2]
[1] 65.3 0.32 0.00 288 android_render_screen [1]
0.00 0.00 442368/442368 draw_8x8_tile [22]
0.00 0.00 286/286 InterruptProcess [30]
Functions with no "calls", "self" or "total" are because they got called via JNI. The "draw_8x8_tile" call count there is accurate - it is called for each 8x8 tile on the screen in each frame. There are 32 x 24 tiles, and 2 layers of tiles.
32 * 24 * 2 * 288 = 442368
, as expected. And the android_render_screen count is correct, since at 30 fps that is around 9 seconds of use and I only ran the game for a few seconds to check.More details on the android-ndk-profiler page, including a guide on how to use the library.