There's an old rule when it comes to optimising your code; don't. 2nd verse, same as the first. The 3rd rule of optimisation is to profile your code first. So that's what I did.
The good old way to see where the bottle necks are is to use gprof. You compile with profiling on by passing a couple of flags to gcc, run the program and it spits out a profile upon exit. You then run the gprof command line tool and that interprets the profile to give you a table of hot spots. From commit 5beaffd581ed it looked something like this:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
10.00 0.02 0.02 103228 0.00 0.00 Font::valueToIndex(unsigned int) const
10.00 0.04 0.02 68648 0.00 0.00 std::_List_iterator<HtmlElement*>::operator--()
7.50 0.06 0.01 701479 0.00 0.00 std::less<unsigned int>::operator()(unsigned int const&, unsigned int const&) const
Right there you can see that
Font::valueToIndex was taking a lot of time, but I already knew that that method could be improved. The other C++ STL calls were trickier to track down. I use all sorts of STL containers, so the problem could have been anywhere...
A couple of years ago Google released their performance analysis tools. One of the advantages google-perftools' CPU profiler has over gprof is that the output shows function calls line-by-line in a nifty graphical representation. Running the profiler on the same code gave this output:
From there, I could trace the call graph back from the "slow" STL call to my code see that the use of a
std::set was causing problems. It was overkill anyway and a simple change improved things there.
The different output in the perftools analysis is probably due to the way it works compared to gprof. With gprof, you get a call to the profiling library added to each of your program's functions, which means you need to recompile all of your code. With the Google tools you don't need to recompile your code, you just have to link in the profiler library to your executable. It then hooks in to the start of the program and uses timers and signals to instrument the code, adding traces at the current point of execution. The result is that code that takes longer has more "hits", but the result is not exactly the same as gprof's output.
This is why being able to compile NDS code on the PC is a really good idea. These things are not impossible on the DS of course - counting the hblanks your slow function takes by changing the background color is a classic technique - but in general running unit tests, debugging, and doing performance analysis are much tougher to do on an embedded device.
Enough waffle. All this means that the next version of Bunjalloo may be a bit nippier, hopefully.