Monday, December 01, 2008

Things you never wanted to know about assembler but were too afraid to ask

When you write homebrew for the Nintendo DS, or the Game Boy Advance, you can choose to use a friendly language like C or C++. This makes life a bit easier. It's comfortable, familiar. But at the back of your mind there's this nagging doubt... Shouldn't I be writing some of this in assembler? After all, if I'm writing C code, I may as well code for a PC and use SDL or Allegro. But then you think... Assembler, that sounds a bit tricky, I wouldn't know where to begin!

And that's where this post comes in.

Getting started with assembler is probably the hardest part. There's aren't many articles on best practices, what not to do, which idioms to use, and so on, as there are for other languages. Many of the ideas you have about C can't be applied to assembler, or at leas that's what you might think at first, so you probably brush it off as impossible. This is a bit sad, because these Nintendo consoles use an ARM processor and ARM assembler has a nice syntax. Unlike assembler for Intel processors, say, the ARM syntax is fairly small, has few real surprises and can be learned quite easily.

Now, I don't want to claim that I'm some assembler guru. I've written about 6,000 lines of ARM for Pocket Beeb and Elite AGB, plus some in a current "top secret" project I'm working on. It's not really that much, especially compared to the millions of lines of C and C++ I've probably written. But it is a good enough amount to get a feel for how to code in this way, I think.

Before we dive in, let's see at how we go from C to "the magic" that runs on a DS. The first step is to write a function in C. We then use a compiler, which converts this to a file in the ELF format. This binary file is joined together with all the other compiled modules in your program, and the output is a "pure" binary blob with some header information. So that's C to object file, then object files to NDS "ROM" file.

Let's rewind to those first steps. The C function is compiled to an ELF file, which typically ends in ".o", for "object file". This step actually skips out quite a lot. You can see the intermediate files involved by passing --save-temps to your compilation step. Lets try this. Here's a useless function that does some pointless thing. We pass in a couple of variables, and add or subtract them depending on their relative values.

int my_function(int x, int y) {
if (x < y) {
return x + y;
} else {
return x - y;

If we compile this as so: "arm-eabi-gcc --save-temps -c -o my_function.o my_function.c" then the output will be 3 files. These are my_function.o (as expected), my_function.i and my_function.s. The .i file is the post-processed output - what happens after running the C preprocessor on the file. This finds-and-replaces any #defines, pastes in any #include'd files and adds in compiler-specific information. The .s file is what we are after. This is the output of the C code converted to ARM assembler. It contains a lot of "cruft" that you wouldn't write if you were to write this function by hand, but some of the features are important:

.align 2
.global my_function
.type my_function, %function

The lines that start with a dot are called directives. They are psuedo-instructions that tell the assembler to do extra things.

The .text directive adds headers to let the linker know where abouts the code should be placed. It's not too important on the DS, more so on the GBA, where if you omit it then the code may be placed in RAM. You can also force code to go into fast iwram on the GBA by using the directive .section .iwram,"ax",%progbits. Don't ask what it means, just use the voodoo.

The .align directive ensures that our code is padded to 4 byte boundaries. The "2" means the number of bits that must be zero in the location counter at this point. So 2 bits signifies aligned by 4 bytes. Since ARM assembler instructions are 32 bits big (4 bytes) we need to align the code to 4-byte boundaries, or Very Bad Things can happen.

The .global directive makes the symbol name (my_function) visible outside the compilation unit, so you can use it in other parts of your program. Without .global, the symbol would be the equivalent of using "static" in a C definition.

The .type directive is pretty pointless, but I tend to use it to mark what are functions and what are just helper "scraps" of assembler with a named label. It's use is supposed to be for interoperability with other assemblers, but we always use GCC from devkitPro so its a moot point.

Finally, "my_function:" is a label - this is like a goto label in C and is where the program jumps to when we call this function.

The rest of the ARM code spit out by GCC is as good as could be for this example, if you compile with -O2. Normally, the advantages of assembler tend to be minimal for small functions, they are better when you do things that can't be easily done in C. Such as unrolling loops, keeping often-used memory addresses in a single register, that kind of thing. Also, hand coded ARM assembler can make better use of the registers in some cases, more on registers in a minute, but GCC assembler tends to make more use of the stack. This tends to manifest itself only in more complex functions. Anyway, lets just say that it'd complicate matters needlessly to copy paste the rest of the code here, and for this example GCC does a good job.


I mentioned the registers back there. The ARM processor in the NDS and GBA has 16 registers - usually named r0, r1, r2... up to r12, then sp, lr, pc instead of r13, r14 and r15. I say usually, because there is an alternate naming scheme that some documents use. Here certain registers are given special names:

std | alt
r0 | a1
r1 | a2
r2 | a3
r3 | a4
r4 | v1
r5 | v2
r6 | v3
r7 | v4
r8 | v5
r9 | v6
r10 | v7 or sl
r11 | v8 or fp
r12 | ip
r13 | sp
r14 | lr
r15 | pc

It's probably best to use the r0-r12,sp,lr,pc naming scheme as that is what the ARM documentation uses. The trick is to pick one scheme and stick to it. sp is a mnemonic for "stack pointer" and points to, you guessed it, the stack. lr means "link register" and is used as the return address for function calls. pc is the program counter and is the address of the current instruction (i.e. where we are in the program).

The stack is an area of RAM. Initially it points to the end of the RAM area and grows downwards as programs allocate memory (on the stack). When writing assembler I find it's not that usual to use the stack - generally I only use it for storing register values that I may need later, such as the lr when calling functions recursively.

Code set up

Remember how gcc spat out a .s file? Well, that means "pure assembler code". The common way to store ARM code is in a file ending in ".S". This is interpreted by gcc as a source file containing assembler and pre-processor directives. So here we can use #defines and #includes if we really need to. It also distinguishes hand coded assembler files from any --save-temps left-overs.

Lets try writing our my_function in assembler. First I fire up an $EDITOR and write out the preamble to my_function.S. Now lets think about what we want. This function has the following C prototype:

int my_function(int, int);

So that means it accepts 2 input values and returns a value. The input values in ARM are in the first 4 registers - any more than 4 inputs are stored on the stack. The original function wanted an addition if x was less than y, else it subtracted y from x. Here's my final code to do this:

.align 2
.global my_function
.type my_function, %function
cmp r0, r1
addlt r0, r0, r1
subge r0, r0, r1
bx lr

The header is as we saw before. The my_function label is the start of the function. The r0 and r1 registers contain the input values.

The cmp instruction means "compare r0 to r1 and set the flags accordingly"... now I don't really want to copy out the whole ARM assembler guide here, but suffice to say that this sets some flags based on the mathematical result of performing "r0 - r1". These tests are in the form of those "lt", "ge" that follow the end of the other instructions. This is one of the neat parts about ARM - every instruction is conditional!

The default condition is "al" for "always", but may be omitted for obvious reasons. We don't want to write al after every instruction, right? So here we say, if r0 is less than (lt) r1, then add r0 to r1 and store the result in r0. A normal add would just be "add", but here we write "addlt". This makes the processor skip over the instruction if the "less than" state is not set. It's important to note that conditional instructions like this still have some overhead, the processor still has to read the operation in some way, so if you have more than say 3 or 4 conditional instructions with the same condition, it may be faster to use a branch to skip the code completely.

If the result was greather than or equal (ge) then the subtraction is done and the result stored in r0, followed by a return. The bx instruction is used to return from a function - it means "branch to the address in the register lr". As you see, we use the link register to return to the calling code.

Finally, the result is always stored in r0 in this example. This is part of the ARM binary interface - results are returned in registers r0 and r1. This convention allows C and ARM code to interoperate sensibly. If you are returning to your own ARM function, then you can invent any convention you like - return values in r4, r7 and r11 if you want - but sticking to the "C way" means that you can use the function from C later, if that becomes necessary.

You may have noticed that addition and subtraction seem to work "backwards" compared to C convention. Instead of reading left-to-right and the result being stored at the end, the result is actually stored in the first register given. This is fairly common in other flavours of assembler code too, and once you get used to it, it isn't so bad.

We can write a simple test program to run this code:

#include <nds.h>
#include <stdio.h>

int my_function(int x, int y);

int main(void) {
int x = 1;
int y = 2;
iprintf("my_function(%d, %d) returns %d\n", x, y, my_function(x, y));
x = 100;
y = 20;
iprintf("my_function(%d, %d) returns %d\n", x, y, my_function(x, y));

while(1) { swiWaitForVBlank(); }
return 0;

This prints out the results on the DS screen. Nothing very exciting, but you can see that combining C and ARM assembler in a project is not as difficult as you first thought!

Tips and tricks

There are quite a few traps for the unwary - apart from the shift in thought process needed to code assembler, of course. The first of these is that loading a value into a register requires some thought. You can only load values with "mov" that are shifted 8 bit values - 0xff0, 0x1c00, but not 0x101, for example. There is a psuedo opcode to get around this easily - "ldr r1, =0x101" for example - but it may bite you if you didn't know this. The error would look like this "Error: invalid constant (101) after fixup", just in case you were wondering.

When you write ARM, always try to maximise what each instruction does! Shifts can be added onto the end of instructions, so often instead of a shift then logical orr, you can do the lot in one go:

orr r3, r0, r1, lsl r2

This is the same as r3 = r0 | (r1 << r2) all in one instruction! Similarly you can load data from structures using offsets with ldr, with "ldr r0, [r1, r2, lsl #2]". That means r0 = r1[r2*4] - useful for loading 32-bit values from structures into registers. Remember that in assembler pointer arithmetic works as if all pointers were to char, there are no data types. Adding 1 to a pointer really just adds 1 to it, not like in C where it can increase by the size of the data type.

Generally, make use of all the registers you have available and avoid pushing/popping to and from the stack. You can use r0-r12 however you want, make the most of these and you'll find some nice shortcuts that provide speedy, optimised code. Hopefully. If you need to push multiple registers to the stack, then the idiom to use is the stmfd/ldmfd one (store/load multiple full descending). This stores a comma seprated list and/or range of registers to the stack, and decrements/increments the stack by the right amount. It is written as follows, to store here registers r0 to r7 and lr (r14):

stmfd sp!,{r0-r7, lr}
... do stuff with registers r0-r7 and lr ...
ldmfd sp!,{r0-r7, lr}

Always try and reduce duplication - not only is it a maintenance problem, but each copied line is a wasted cycle. Try and reduce code to the minimum number of instructions, that's the name of the game here. Ah, but remember: get a working version first, then optimise heavily. No point having a fast bit of code that doesn't do what you want :-)

I'd recommend this PDF cheat sheet for a quick guide to the ARM instruction set. Keep in mind that the DS doesn't support all the instructions on there - rbit, bfc, and some others - but they are generally the more exotic operations that are used less. If in doubt, compile a quick test file. You'll get the error "Error: selected processor does not support..." if the instruction is not supported by the CPU.


So why would you want to do all this? Isn't this just for head cases?

I think learning how to write code at this level leads to a better understanding of why things are done in certain ways in high level languages. You'll certainly grok pointers, if you haven't already. Learning new coding styles - not just using C-like languages - will make you a better coder too. Heck, maybe a better person even ;-)


  1. I just jumped onto this whole Bunjalloo bandwagon. And just 10 seconds ago I read your farewell post from October.

    I was wondering after seeing this long thesis on coding if you really are calling it quits. I hope you're not because I am expecting my DS in the mail in the next few days and can't wait to start using Bunjalloo.

    I mean who knows what kind of garbage the DSi will come with. I'd bet that if this project continues it will be ported to the DSi just as soon as someone makes a cart for it.

    Anyway, I hope you keep goin.

  2. Anonymous12:56 am

    Top Sercret Project? You make us curious. See ya.

  3. @-Ryan - yeah, I probably ought to do some more stuff on Bunjalloo. I mean, I've said before I'll quit, and I never do :-) I've just updated to DKA r24, and it makes life a lot easier, so we'll see how it goes.

    @Anonymous - I'm completely stuck on my new project, so don't hold your breath. I got really far, and it'd be awesome if I could get past the latest "impossible" bug, but it isn't looking good ATM. Stay tuned though, just in case. Tease, tease. :-)

  4. Anonymous4:39 am

    Greetings, I am working on a homebrew project and I was wondering if you have the time to quickly identify this problem that I am having with this particular assembly code.

    I currently have this macro/function combo which is used to store vertex information in the memory:

    #define VRTPACK_Y3D(y) (((short int)y) << 16)
    #define VRTPACK_X3D(x) (((short int)x) & 0xFFFF)

    void DrawVTX(short int x, short int y, short int z)
    *(vuint32*)0x0400048C = VRTPACK_Y3D(y) | VRTPACK_X3D(x); // Set X and Y coords
    *(vuint32*)0x0400048C = ((short int)z); // Set Z coord

    Since I am pretty new to assembly coding I decided to use this opportunity to learn and understand coding in arm assembly since it'll help out with optimizations. I decided to use the following code that I pasted and convert it into assembly code. Problem is that the assembly code is not producing the same results as the C-code version, so I was wondering if you could help identify and explain my mistake?

    Here's my assembly version of 'DrawVTX'

    @ r0 = x, r1 = y, r2 = z

    ldr r12, =0x0400048C
    ldr r8, =0xFFFF

    and r0, r0, r8
    orr r9, r0, r1, lsl #16

    str r9, [r12]
    str r2, [r12]

    bx lr

    Also, for some odd reason, the rom would randomly crash, or other parts in the memory would get corrupted when I set the r4, and r5 registers with anything (variable, constant, pointer etc). Any idea whats going on with that?

    Heh, that pretty much wraps everything that I am having problems with. I hope you could help shine some light on this subject. I appreciate your time, thanks!

  5. You've given me some hope of programming on my DS again!

    The only reason I haven't started yet is because I'd have to program in C. Is there any way to only program in assembly and avoid C altogether? It's no problem to learn another assembly language, I'm already familiar with several!

  6. @John: just create a .global symbol "main" and away you go. devkitpro r24 has made getting started easier than ever.

  7. I am very curious as to how one gets started with making a game for the Gameboy Advance but I have no idea where to start on the programming end. I am familiar with C++ and ASM, but only with hacking, not writing code from scratch. There is very little accurate information available on the web today documenting the way to go about such a project.

    If you are wondering, I am planning to write a Legend Of Zelda-esque game in terms of exploration but with a story as deep as that of a JRPG. I can sprite, write music, program, and have already worked on the details of the game for months now, but I do not know how to make it work on the GBA. Any help would be welcome. My email is adhdkiki AT gmail d0t com.

    Thanks for having such a cool blog. :)


Note: only a member of this blog may post a comment.