Sunday, May 31, 2015

Fixing bugs with GCC's -fsanitize

New versions of gcc have added several options to check for undefined behaviour, out-of-bounds behaviour, and other run-time bugs in a program.

To use address sanitizer, you have to compile and link your code with the -fsanitize=address option. Then run your program as normal. If there are any buggy bits, then the compiler-inserted runtime checks will throw a wobbly.

Using the CMake build system I add the flag like this:

include(CheckCCompilerFlag)
set(flag fsanitize=address)
set(CMAKE_REQUIRED_FLAGS -${flag})
check_c_compiler_flag(-${flag} ${flag}_is_ok)
if(${flag}_is_ok)
    set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -${flag}")
endif()

I had to set CMAKE_REQUIRED_FLAGS as otherwise the check_c_compiler_flag macro just uses -fsanitize on the compile stage, skips it on the link stage, and the check subsequently fails with an undefined reference:

CMakeFiles/cmTryCompileExec3487553727.dir/src.c.o: In function `_GLOBAL__sub_I_00099_0_main':
src.c:(.text+0x10): undefined reference to `__asan_init_v4'
collect2: error: ld returned 1 exit status

Anyway… for an out-of-bounds read, a crash looked like this (I compiled in Debug to get line information):

=================================================================
==16223==ERROR: AddressSanitizer: global-buffer-overflow on address 0x0000004833bc at pc 0x000000412a7a bp 0x7ffe5fb416e0 sp 0x7ffe5fb416d0
READ of size 2 at 0x0000004833bc thread T0
    #0 0x412a79 in load_bg_palette chaos/gfx.c:306
    #1 0x412824 in load_all_palettes chaos/gfx.c:292
    #2 0x415d8d in load_bg chaos/main.c:23
    #3 0x415e4c in chaos_main chaos/main.c:37
    #4 0x407758 in main port/linux/main.c:52
    #5 0x7f6c1551778f in __libc_start_main (/usr/lib/libc.so.6+0x2078f)
    #6 0x406e08 in _start (build/linux-cmake/port/linux/chaos+0x406e08)

0x0000004833bc is located 0 bytes to the right of global variable 'palette1Pal' defined in 'gfx/palette1/palette1.c:16:22' (0x4833a0) of size 28
SUMMARY: AddressSanitizer: global-buffer-overflow chaos/gfx.c:306 load_bg_palette
Shadow bytes around the buggy address:
  0x000080088620: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x000080088630: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x000080088640: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x000080088650: 00 00 00 00 00 00 00 00 f9 f9 f9 f9 00 00 00 00
  0x000080088660: 00 00 00 00 00 f9 f9 f9 f9 f9 f9 f9 00 00 00 00
⇾0x000080088670: 00 00 00 00 00 00 00[04]f9 f9 f9 f9 00 00 00 00
  0x000080088680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x000080088690: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0000800886a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0000800886b0: f9 f9 f9 f9 00 00 00 00 00 f9 f9 f9 f9 f9 f9 f9
  0x0000800886c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Heap right redzone:      fb
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack partial redzone:   f4
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
==16223==ABORTING

Chaos has run "valgrind clean" for a while, which means that there should be no out-of-bounds writes. The nice thing about address sanitizer is that it seems to find extra problematic areas over valgrind, and the runtime overhead is less. I've noticed that buggy code that does out-of-bounds reads on an x86 machine tends to work fine, but when you do the same on Android running on an ARM processor it crashes right away.

Hopefully armed with this new tool I'll be able to fix more bugs :-D

Tuesday, May 26, 2015

Chaos update notes

A few weeks ago I released a new version of Chaos that had what looked to be some pretty bad bugs. In fact the bug was just that the controls were enabled when the CPU is moving or casting, but you could really mess things up by tapping the screen during their turns.

This bug came about because I'd switched over to the latest emscripten build for the Javascript version.

As you may recall, I had to make some changes to the code structure of Chaos to avoid busy-loops deep in the guts of the game. Emscripten previously required you to have just one "main loop". Since then, Emscripten has gone completely insane - in a good way. The latest emscripten releases have a way to compile C code down to an emscripten-specific byte-code, that is then run on an interpreter, which itself runs in the Javascript interpreter in the browser!

What this marvellous change means is that the one-main-loop restriction is lifted. You get better performance if you write code to use the preferred way, but the interpreter-byte-code solution lets you keep the old semantics going if you don't mind taking the performance hit. Hurray! So I reverted Chaos back to how it was in release 1.18 more or less. That means no more co-routines to get around the loop restriction, and the code is easier to read and understand.

As a side effect, I added the previously-mentioned game destroying bug by enabling controls in the "wait" code that was called everywhere. The fix was easy - just remove the control polling/updates in the wait. But it broke the game for a lot of people :( To try and prevent this from happening again, I've created a beta group on Google+ and I'll push updates here before making them live to everyone.

I currently have 1 bug on my list - fix the glitched screen that happens after you minimize the game and return to the homescreen, then go back to the game. I'm doing something wrong in the Android lifecycle and the GL screen is not reset correctly.