Monday, December 15, 2008

Upgrading to devkitARM r24 (on Linux)

The latest version of the devkitARM toolchain was released last week, and it was a biggie. A whole new API for sprites and backgrounds was added to libnds. A new default ARM7 binary was added that automatically handles wifi, sound and sleep mode. The other big addition was a new fangled sound library that adds mod playback and fancy sound effects.

This is my post on the upgrade process for Bunjalloo on Linux (Ubuntu 8.04 in particular). Hey, if you read all this and get through it, then you could probably help out hacking on Bunjalloo too :-)

A summary for the impatient: not without hassles, but the gain is worth the short term pain.

I assume you have a directory devkitPro somewhere. This will contain devkitARM and libnds. You should set an environment variable DEVKITPRO to point to this directory. Something like this will do:
export DEVKITPRO=$HOME/devkitpro_r24
mkdir -p $DEVKITPRO
I recommend versioning this release at the devkitpro directory level. Release r24 contains some breaking changes and by having the possibility to change between r23 and r24 you may save yourself some headaches. At the very least, it will ensure you don't mix versions, which would be a big no-no.
  1. Download the devkitARM r24 files from the project page
    • devkitARM_r24-i686-linux.tar.bz2
    • libnds-src-1.3.1.tar.bz2
    • dswifi-src-0.3.5.tar.bz2
    • maxmod-src-1.0.1.tar.bz2
    • default_arm7-src-20081210.tar.bz2
    • nds-examples-20081210.tar.bz2
    • libfat-src-1.0.3.tar.bz2
    I saved them all in $HOME/Downloads, and that's what I've put in this post. Change that path as you need.

  2. Install the devkitARM toolchain. This provides the compiler and C/C++ standard libraries. I usually install it in a versioned file, but if you already have named your devkitpro directory "devkitpro_r24" then there's no need.

    tar xvf ~/Downloads/devkitARM_r24-1686-linux.tar.bz2
    mv devkitARM devkitARM_r24

  3. The rest of these steps can be done in any directory. I tend to use $DEVKITPRO/vendor for all this build stuff. You can also use the precompiled binaries, but this way you know you have everything you need installed.

  4. Set the DEVKITARM environment variable to point to our installed toolkit
    export DEVKITARM=$DEVKITPRO/devkitARM_r24

  5. Install libnds
    !!! CARE !!! these source tars don't have a top level directory, so you need to create them manually. This is the case for all of the devkitPro tar balls, except the main devkitARM_r24 one. They will splurge their files in the current directory, instead of creating their own top level one. Yuck!
    mkdir libnds-1.3.1
    cd libnds-1.3.1
    tar xjf ~/Downloads/libnds-src-1.3.1.tar.bz2
    Now to compile libnds - if you have DEVKITPRO and DEVKITARM set correctly then this should compile the library succesfully.
    make -j 3 install

  6. libfat is optional, but recommended. Open up the tar, compile and install it too, if you like. libfat for the nds depends on having libnds installed, so you'll have to do step 4 first.

    mkdir libfat-1.0.2
    cd libfat-1.0.2
    tar xvf ~/Downloads/libfat-src-1.0.2.tar.bz2
    make nds-install

  7. Now install dswifi 0.3.5 in a similar way - untar the release files, compile and install. dswifi depends on libnds, so you have to do the previous steps before this one.

    mkdir dswifi-0.3.5
    cd dswifi-0.3.5
    tar xvf ~/Downloads/dswifi-src-0.3.5.tar.bz2
    make -j 3 install

  8. Install maxmod. Again, this depends on libnds and won't compile if you have skipped a step.
    mkdir maxmod-1.0.1
    cd maxmod-1.0.1
    tar xvf ~/Downloads/maxmod-src-1.0.1.tar.bz2
    make -j 3 install-nds

  9. Install the new default ARM7 core. This requires dswifi and maxmod, if either are missing then you will get compile or link errors.

    mkdir default_arm7-20081210
    cd default_arm7-20081210
    tar xvf ~/Downloads/default_arm7-src-20081210.tar.bz2
    make install

  10. Try out the nds examples! Now that everything is installed, you can compile and run some examples.
    mkdir nds-examples-20081210
    cd nds-examples-20081210
    tar xvf ~/Downloads/nds-examples-20081210.tar.bz2
    cd audio/maxmod
    # test the nds files on your DS!
  11. Write your own program :-) This part is a bit trickier!
Now once I had all this installed, I needed to update Bunjalloo to use the new code (Elite DS coming soon!). The first step was to see what would compile without drastic changes. The list of changes required in my code was:
  • Remove irqInit() calls - this is now done by libnds before your main is called
  • Register name changes: DISPLAY_CR -> REG_DISPCNT, BG0_X0 -> REG_BG0HOFS, BG0_CR -> REG_BG0CNT (and their SUB equivalents, where applicable)
  • Function name changes: powerON, powerOFF -> powerOn, powerOff, touchReadXY() -> touchRead(touch_structure)
  • Deprecated header: nds/jtypes.h -> nds/ndstypes.h
  • Look-up table changes: COS[angle] -> cosLerp(angle), SIN[angle] -> sinLerp(angle)
  • Some #defines have gone: BG_16_COLOR -> BG_COLOR_16, BG_256_COLOR -> BG_COLOR_256, SOUND_8BIT -> SOUND_FORMAT_8BIT
  • powerOn/Off now expects a PM_Bits enum value, not an int
Nothing major there... the big breaking changes are on the ARM7 side. Here trying to use your own hand-coded arm7 causes plenty of problems - most of the old inter-processor communications code has been removed from libnds, or has changed completely. Unfortunately, that means it's quite tricky to migrate to the new libnds and use your own ARM7 core. The easiest thing to do here is ignore your ARM7 code completely and use the new default_arm7. There's no one-stop solution, since most IPC code has been built in an ad hoc way, but here are some pointers.

For wifi code, your old wifi init code should be replaced by a single call to Wifi_InitDefault(true/false) - no need to faff about setting up the interrupts and timers. Then you can either now use sockets (if you passed "true", which means connect using the firmware settings) or connect using a detected AP first - all this is documented in the dswifi headers and by the new examples. Pretty neat, and cuts down on maintenance for everyone.

If you have a debug console, then consoleInitDefault() is a bit trickier to update. It has disappeared, pretty much. In most cases consoleDemoInit() will probably suffice. If you are doing anything much more complicated, then you will have to get to grips with the PrintConsole structure. This seems to be quite a powerful new feature, for example it allows printing to both screens from within the same program, but will take a while to get used to. I doubt many people used the old consoleInitDefault in "real" programs, and the new API looks like it might be possible to use the printf stuff on something other than a black/white screen.

Sound has seen a rather large shakeup in this release. The new MaxMod sound engine seems to fill a huge gap in the homebrewer's library arsenal. It isn't without its drawbacks though, if you did just basic NDS sound effects. Previously you could play individual samples by sending the raw data to the playGenericSound() function, after having previously set the sample rate, volume, panning and format via the setGenericSound() function. That has now been completely removed, replaced by new MaxMod sound engine functions. This is a flexible sound and music system, but requires that the sound effects are in a special format. A tool (mmutil) is provided to convert wav and mod file formats to the expected static structure. Alternatively, a more complex streaming system can be used. This latter does not require input sounds to be converted, and is the only way to play sound from a file, but is slightly trickier to use than the straight forward playGenericSound() function.

The sound streaming approach requires you the coder to implement an audio-filling callback and to call the mmUpdateStream() function often enough to keep the sound buffer full. Fortunately there are some good examples of all the MaxMod library's usage in the nds-examples pack, and the new system is a great addition. Besides music, you can do looped samples, proper panning, mixing, etc, etc. Bah, I'm just miffed because I'll have to rewrite some of my SDL sound code ;-)

Needless to say, on the ARM7 side of things the sound handling stuff has completely changed - this alone is a good enough reason to abandon any hand-rolled arm7 code. Added to the new sleep function - no more will-it-wont-it wondering when you close the lid - and all in all I think this is a great release.

Monday, December 01, 2008

Things you never wanted to know about assembler but were too afraid to ask

When you write homebrew for the Nintendo DS, or the Game Boy Advance, you can choose to use a friendly language like C or C++. This makes life a bit easier. It's comfortable, familiar. But at the back of your mind there's this nagging doubt... Shouldn't I be writing some of this in assembler? After all, if I'm writing C code, I may as well code for a PC and use SDL or Allegro. But then you think... Assembler, that sounds a bit tricky, I wouldn't know where to begin!

And that's where this post comes in.

Getting started with assembler is probably the hardest part. There's aren't many articles on best practices, what not to do, which idioms to use, and so on, as there are for other languages. Many of the ideas you have about C can't be applied to assembler, or at leas that's what you might think at first, so you probably brush it off as impossible. This is a bit sad, because these Nintendo consoles use an ARM processor and ARM assembler has a nice syntax. Unlike assembler for Intel processors, say, the ARM syntax is fairly small, has few real surprises and can be learned quite easily.

Now, I don't want to claim that I'm some assembler guru. I've written about 6,000 lines of ARM for Pocket Beeb and Elite AGB, plus some in a current "top secret" project I'm working on. It's not really that much, especially compared to the millions of lines of C and C++ I've probably written. But it is a good enough amount to get a feel for how to code in this way, I think.

Before we dive in, let's see at how we go from C to "the magic" that runs on a DS. The first step is to write a function in C. We then use a compiler, which converts this to a file in the ELF format. This binary file is joined together with all the other compiled modules in your program, and the output is a "pure" binary blob with some header information. So that's C to object file, then object files to NDS "ROM" file.

Let's rewind to those first steps. The C function is compiled to an ELF file, which typically ends in ".o", for "object file". This step actually skips out quite a lot. You can see the intermediate files involved by passing --save-temps to your compilation step. Lets try this. Here's a useless function that does some pointless thing. We pass in a couple of variables, and add or subtract them depending on their relative values.

int my_function(int x, int y) {
if (x < y) {
return x + y;
} else {
return x - y;

If we compile this as so: "arm-eabi-gcc --save-temps -c -o my_function.o my_function.c" then the output will be 3 files. These are my_function.o (as expected), my_function.i and my_function.s. The .i file is the post-processed output - what happens after running the C preprocessor on the file. This finds-and-replaces any #defines, pastes in any #include'd files and adds in compiler-specific information. The .s file is what we are after. This is the output of the C code converted to ARM assembler. It contains a lot of "cruft" that you wouldn't write if you were to write this function by hand, but some of the features are important:

.align 2
.global my_function
.type my_function, %function

The lines that start with a dot are called directives. They are psuedo-instructions that tell the assembler to do extra things.

The .text directive adds headers to let the linker know where abouts the code should be placed. It's not too important on the DS, more so on the GBA, where if you omit it then the code may be placed in RAM. You can also force code to go into fast iwram on the GBA by using the directive .section .iwram,"ax",%progbits. Don't ask what it means, just use the voodoo.

The .align directive ensures that our code is padded to 4 byte boundaries. The "2" means the number of bits that must be zero in the location counter at this point. So 2 bits signifies aligned by 4 bytes. Since ARM assembler instructions are 32 bits big (4 bytes) we need to align the code to 4-byte boundaries, or Very Bad Things can happen.

The .global directive makes the symbol name (my_function) visible outside the compilation unit, so you can use it in other parts of your program. Without .global, the symbol would be the equivalent of using "static" in a C definition.

The .type directive is pretty pointless, but I tend to use it to mark what are functions and what are just helper "scraps" of assembler with a named label. It's use is supposed to be for interoperability with other assemblers, but we always use GCC from devkitPro so its a moot point.

Finally, "my_function:" is a label - this is like a goto label in C and is where the program jumps to when we call this function.

The rest of the ARM code spit out by GCC is as good as could be for this example, if you compile with -O2. Normally, the advantages of assembler tend to be minimal for small functions, they are better when you do things that can't be easily done in C. Such as unrolling loops, keeping often-used memory addresses in a single register, that kind of thing. Also, hand coded ARM assembler can make better use of the registers in some cases, more on registers in a minute, but GCC assembler tends to make more use of the stack. This tends to manifest itself only in more complex functions. Anyway, lets just say that it'd complicate matters needlessly to copy paste the rest of the code here, and for this example GCC does a good job.


I mentioned the registers back there. The ARM processor in the NDS and GBA has 16 registers - usually named r0, r1, r2... up to r12, then sp, lr, pc instead of r13, r14 and r15. I say usually, because there is an alternate naming scheme that some documents use. Here certain registers are given special names:

std | alt
r0 | a1
r1 | a2
r2 | a3
r3 | a4
r4 | v1
r5 | v2
r6 | v3
r7 | v4
r8 | v5
r9 | v6
r10 | v7 or sl
r11 | v8 or fp
r12 | ip
r13 | sp
r14 | lr
r15 | pc

It's probably best to use the r0-r12,sp,lr,pc naming scheme as that is what the ARM documentation uses. The trick is to pick one scheme and stick to it. sp is a mnemonic for "stack pointer" and points to, you guessed it, the stack. lr means "link register" and is used as the return address for function calls. pc is the program counter and is the address of the current instruction (i.e. where we are in the program).

The stack is an area of RAM. Initially it points to the end of the RAM area and grows downwards as programs allocate memory (on the stack). When writing assembler I find it's not that usual to use the stack - generally I only use it for storing register values that I may need later, such as the lr when calling functions recursively.

Code set up

Remember how gcc spat out a .s file? Well, that means "pure assembler code". The common way to store ARM code is in a file ending in ".S". This is interpreted by gcc as a source file containing assembler and pre-processor directives. So here we can use #defines and #includes if we really need to. It also distinguishes hand coded assembler files from any --save-temps left-overs.

Lets try writing our my_function in assembler. First I fire up an $EDITOR and write out the preamble to my_function.S. Now lets think about what we want. This function has the following C prototype:

int my_function(int, int);

So that means it accepts 2 input values and returns a value. The input values in ARM are in the first 4 registers - any more than 4 inputs are stored on the stack. The original function wanted an addition if x was less than y, else it subtracted y from x. Here's my final code to do this:

.align 2
.global my_function
.type my_function, %function
cmp r0, r1
addlt r0, r0, r1
subge r0, r0, r1
bx lr

The header is as we saw before. The my_function label is the start of the function. The r0 and r1 registers contain the input values.

The cmp instruction means "compare r0 to r1 and set the flags accordingly"... now I don't really want to copy out the whole ARM assembler guide here, but suffice to say that this sets some flags based on the mathematical result of performing "r0 - r1". These tests are in the form of those "lt", "ge" that follow the end of the other instructions. This is one of the neat parts about ARM - every instruction is conditional!

The default condition is "al" for "always", but may be omitted for obvious reasons. We don't want to write al after every instruction, right? So here we say, if r0 is less than (lt) r1, then add r0 to r1 and store the result in r0. A normal add would just be "add", but here we write "addlt". This makes the processor skip over the instruction if the "less than" state is not set. It's important to note that conditional instructions like this still have some overhead, the processor still has to read the operation in some way, so if you have more than say 3 or 4 conditional instructions with the same condition, it may be faster to use a branch to skip the code completely.

If the result was greather than or equal (ge) then the subtraction is done and the result stored in r0, followed by a return. The bx instruction is used to return from a function - it means "branch to the address in the register lr". As you see, we use the link register to return to the calling code.

Finally, the result is always stored in r0 in this example. This is part of the ARM binary interface - results are returned in registers r0 and r1. This convention allows C and ARM code to interoperate sensibly. If you are returning to your own ARM function, then you can invent any convention you like - return values in r4, r7 and r11 if you want - but sticking to the "C way" means that you can use the function from C later, if that becomes necessary.

You may have noticed that addition and subtraction seem to work "backwards" compared to C convention. Instead of reading left-to-right and the result being stored at the end, the result is actually stored in the first register given. This is fairly common in other flavours of assembler code too, and once you get used to it, it isn't so bad.

We can write a simple test program to run this code:

#include <nds.h>
#include <stdio.h>

int my_function(int x, int y);

int main(void) {
int x = 1;
int y = 2;
iprintf("my_function(%d, %d) returns %d\n", x, y, my_function(x, y));
x = 100;
y = 20;
iprintf("my_function(%d, %d) returns %d\n", x, y, my_function(x, y));

while(1) { swiWaitForVBlank(); }
return 0;

This prints out the results on the DS screen. Nothing very exciting, but you can see that combining C and ARM assembler in a project is not as difficult as you first thought!

Tips and tricks

There are quite a few traps for the unwary - apart from the shift in thought process needed to code assembler, of course. The first of these is that loading a value into a register requires some thought. You can only load values with "mov" that are shifted 8 bit values - 0xff0, 0x1c00, but not 0x101, for example. There is a psuedo opcode to get around this easily - "ldr r1, =0x101" for example - but it may bite you if you didn't know this. The error would look like this "Error: invalid constant (101) after fixup", just in case you were wondering.

When you write ARM, always try to maximise what each instruction does! Shifts can be added onto the end of instructions, so often instead of a shift then logical orr, you can do the lot in one go:

orr r3, r0, r1, lsl r2

This is the same as r3 = r0 | (r1 << r2) all in one instruction! Similarly you can load data from structures using offsets with ldr, with "ldr r0, [r1, r2, lsl #2]". That means r0 = r1[r2*4] - useful for loading 32-bit values from structures into registers. Remember that in assembler pointer arithmetic works as if all pointers were to char, there are no data types. Adding 1 to a pointer really just adds 1 to it, not like in C where it can increase by the size of the data type.

Generally, make use of all the registers you have available and avoid pushing/popping to and from the stack. You can use r0-r12 however you want, make the most of these and you'll find some nice shortcuts that provide speedy, optimised code. Hopefully. If you need to push multiple registers to the stack, then the idiom to use is the stmfd/ldmfd one (store/load multiple full descending). This stores a comma seprated list and/or range of registers to the stack, and decrements/increments the stack by the right amount. It is written as follows, to store here registers r0 to r7 and lr (r14):

stmfd sp!,{r0-r7, lr}
... do stuff with registers r0-r7 and lr ...
ldmfd sp!,{r0-r7, lr}

Always try and reduce duplication - not only is it a maintenance problem, but each copied line is a wasted cycle. Try and reduce code to the minimum number of instructions, that's the name of the game here. Ah, but remember: get a working version first, then optimise heavily. No point having a fast bit of code that doesn't do what you want :-)

I'd recommend this PDF cheat sheet for a quick guide to the ARM instruction set. Keep in mind that the DS doesn't support all the instructions on there - rbit, bfc, and some others - but they are generally the more exotic operations that are used less. If in doubt, compile a quick test file. You'll get the error "Error: selected processor does not support..." if the instruction is not supported by the CPU.


So why would you want to do all this? Isn't this just for head cases?

I think learning how to write code at this level leads to a better understanding of why things are done in certain ways in high level languages. You'll certainly grok pointers, if you haven't already. Learning new coding styles - not just using C-like languages - will make you a better coder too. Heck, maybe a better person even ;-)