More than a year ago, I wrote a tool based on CCI that could inline methods in a CIL assembly. My goal was to provide a means of improving runtime performance of XNA games on Xbox 360, where the JIT compiler doesn’t do many optimizations to begin with, and where the XNA Framework and C# language encourage coding patterns that result in particularly inefficient native code. For various reasons, I put that project aside just as I got it working. Recently, though, I blew the dust off my old source code and tried it again.
My initial experiments were flawed: I had tested the optimization on a debug assembly. Most of the performance improvement I saw could be had simply by enabling optimizations in the C# compiler (mainly the elimination of redundant copying). Running my optimizer on an already-optimized C# assembly gave only about half the improvement I had previously reported (<7%). Anyway, that got me doing new experiments.
After two years of working on the code-gen team for the Xbox native compilers, I have a pretty good idea of what the PPC architecture can do, and what good PPC disassembly looks like. I also know what C++ code gets you the best PPC native code. I thought if I pre-optimized the CIL according to what I knew of the PPC architecture, XNA’s Xbox JIT compiler could generate better code even if it didn’t do optimizations of its own. Oh, man, was I ever naïve!
Through blind luck, I managed to eke out a total improvement of 15% in Box2D using my IL optimization tool. Half the time, though, my attempts at optimization made the code run more slowly! This ran counter to my understanding of the architecture, and I decided I needed to look at the generated machine code to find out what was really going on. To do this, I turned to an old XNA project called Artemus, written by Justin Holewinski. The code I found wasn’t exactly working, but I put a bit of elbow grease into it and made a new version for XNA Game Studio 4.
Artemus uses unsafe code on the Xbox to read the JIT-compiled native machine instructions out of memory, then sends them to a client app on the PC, which disassembles them. With this tool, I can now see why some of my optimization efforts resulted in slower execution: the JIT compiler is generating code that is largely agnostic of the target architecture. Specifically, it uses only a subset of the available instructions and registers, which means its generated code is effectively emulating a much simpler, less capable CPU.
I feel like starting a collection to help all those poor, neglected registers! :(
The good news is that the disassembler is giving me insight I can use to improve my CIL optimizer. Before you ask, I am keeping my tools to myself for now. However, I will try to write up some of the performance pitfalls I discover, once I can also offer practical tips for avoiding them. For now, more experimentation is necessary.