I keep working on my hobby project, which is an assembly rewriter whose goal is to optimize assemblies for the XNA runtime on Xbox 360. I probably spend more than half my time debugging obscure errors resulting from invalid CIL in the assemblies. Another big chunk of time is spent on experiments to see what else I can do that might improve runtime performance.
Previously, I wrote about some of NetCF’s code-gen peculiarities around floating-point variables and arithmetic on Xbox 360. In particular, I noted that single-precision arithmetic uses the double-precision native instructions, and then explicitly rounds the result using frsp (floating-point round to single-precision). I also noted that while double-precision arithmetic doesn’t suffer this inefficiency, the JIT compiler emits two frsp instructions when casting from double- to single-precision (e.g., when you store a result in a float variable). Well, I wondered about the trade-off and ran some experiments.
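To make the rounding concrete, here's a rough sketch in Python (not the code NetCF actually generates). It assumes that round-tripping a value through `struct`'s single-precision format is a fair stand-in for frsp, and the helper names `to_single`, `mul_single` and `cast_with_two_roundings` are made up for illustration.

```python
import struct

def to_single(x):
    """Round a double to the nearest single-precision value (stand-in for frsp)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def mul_single(a, b):
    """Model a float multiply: multiply at double precision, then round once."""
    return to_single(a * b)

def cast_with_two_roundings(x):
    """Model a double-to-float cast that rounds twice in a row."""
    return to_single(to_single(x))

x = 0.1  # not exactly representable at either precision
print(to_single(x) == x)                           # False: rounding changed the value
print(cast_with_two_roundings(x) == to_single(x))  # True: the second rounding changes nothing
```

The second rounding never changes the value, so in this model it is pure overhead.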
In one experiment, I found an open-source physics library compatible with the XNA runtime on Xbox 360 called Jitter. This library is meant to be portable, and doesn’t depend on the XNA Framework. In particular, it doesn’t use the XNA Framework math types – Vector2, Vector3, etc. This was interesting because it made it easy to find-and-replace all instances of “float” with “double” in the source code. After fixing a couple hundred compile-time errors, I got the demo running – this time with pure double-precision arithmetic. I had eliminated all float variables and parameters, so there were no float values anywhere (except in the rendering code, which did depend on Vector3 for drawing primitives).
I was ready to be blown away by silky-smooth, realistic physics… But my hopes were dashed when it ran 10-20% slower than before! After looking at the code and thinking real hard, I realized that the slowdown must be due to my effectively doubling the size of all data structures.
The thing that confused me was that the code had already been hand-optimized to avoid passing structures by value, so it wasn’t a matter of copying twice the data. Rather, I believe the problem was twice the distance between data structures passed by reference. Twice the distance means more frequent cache misses. Also, memory operations on Xbox 360 are really slow, so doubling the size of whatever structures were still passed by value didn’t help, either.
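Here's a quick back-of-the-envelope check of the size argument, sketched in Python with `ctypes`. The `Vec3f` and `Vec3d` structs are hypothetical stand-ins (not Jitter's actual types), and the 128-byte cache line is my assumption about the Xbox 360 CPU.

```python
import ctypes

class Vec3f(ctypes.Structure):
    """Hypothetical 3-component vector using single precision."""
    _fields_ = [("x", ctypes.c_float), ("y", ctypes.c_float), ("z", ctypes.c_float)]

class Vec3d(ctypes.Structure):
    """The same vector after replacing float with double."""
    _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_double), ("z", ctypes.c_double)]

CACHE_LINE_BYTES = 128  # assumption, not measured

def per_cache_line(struct_type, line_bytes=CACHE_LINE_BYTES):
    """How many whole structs fit in one cache line."""
    return line_bytes // ctypes.sizeof(struct_type)

print(ctypes.sizeof(Vec3f), per_cache_line(Vec3f))  # 12 10
print(ctypes.sizeof(Vec3d), per_cache_line(Vec3d))  # 24 5
```

Half as many vectors per cache line means walking the same arrays touches roughly twice as many lines, which fits the slowdown I saw (though I didn't measure cache misses directly).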
So that experiment didn’t teach me anything positive, and I got sad and played some games for a while.
The following week, I was over it and I tried a different experiment. I wondered what would happen if I just rewrote the arithmetic to work at double precision, without changing the variables. That would eliminate a bunch of rounding on intermediate computations, but would require redundant double-rounding whenever I needed to store a result into a variable. It’s way too tedious to try that by hand on a big codebase, so I coded up a new feature in my assembly rewriter.
This time, I used my assembly rewriter to rewrite all expressions in an assembly to do floating-point arithmetic at double-precision, but convert the result back to single-precision whenever storing to a float variable. Using Box2D.XNA as my test bed, I saw virtually no change in the performance. Well, the changes were small and the results mixed, really.
So then I added another feature to replace local variables of type float with local variables of type double. That eliminated a lot of the double-rounding that occurred when computing intermediate values inside a function, and I got a small perf improvement in a few tests, and I didn’t see any case that was slower.
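One way to see why the first rewrite was a wash and the second one helped is to count rounding instructions. The sketch below is a toy cost model in Python, not my rewriter: it assumes one frsp per single-precision arithmetic op, two frsp per double-to-float store, and nothing else. The `Store` record and the three counting functions are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Store:
    target: str     # name of the float variable being assigned
    is_local: bool  # True for a local; False for a field, argument or return value
    ops: int        # arithmetic operations in the right-hand expression

def frsp_original(stmts):
    """Single-precision arithmetic: one rounding after every operation."""
    return sum(s.ops for s in stmts)

def frsp_double_expressions(stmts):
    """Arithmetic at double precision, two roundings on every store to a float."""
    return sum(2 for s in stmts if s.ops > 0)

def frsp_double_locals(stmts):
    """Locals promoted to double: only stores that leave the function still round."""
    return sum(2 for s in stmts if s.ops > 0 and not s.is_local)

# Hypothetical function body: scaled squared distance between two 2D points.
body = [
    Store("dx", True, 1),            # dx = a.X - b.X
    Store("dy", True, 1),            # dy = a.Y - b.Y
    Store("lenSq", True, 3),         # lenSq = dx * dx + dy * dy
    Store("result.Value", False, 1), # result.Value = lenSq * scale
]

print(frsp_original(body))            # 6
print(frsp_double_expressions(body))  # 8
print(frsp_double_locals(body))       # 2
```

Under this model, code made of short expressions can actually get more roundings from the first rewrite, while promoting the locals removes most of them. Real timings depend on a lot more than frsp counts, so treat the numbers as a hint, not a prediction.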
This approach seems promising. The next thing I’m working on is a feature to decompose local structs into variables for each component field (e.g., a single Vector2 variable is replaced by two float variables, representing its X and Y fields). Not all local structs are candidates (e.g., if they are used in function calls), but after inlining all the Vector2 operations, many variables can be decomposed this way. After breaking them up, those float variables can be rewritten as doubles, and that eliminates more rounding!
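The decomposition idea can be sketched on a toy instruction list. This is Python over a made-up IR whose opcode names only loosely echo CIL; it is not real CIL and not how my rewriter is implemented. The only escape rule it models is "passed to a call", and the `v_X` naming scheme is an assumption.

```python
def decomposable_locals(struct_locals, instructions):
    """Struct locals that are never passed to a call can be split into scalars."""
    escaping = set()
    for ins in instructions:
        if ins[0] == "call":
            escaping.update(arg for arg in ins[2] if arg in struct_locals)
    return set(struct_locals) - escaping

def decompose(struct_locals, instructions):
    """Rewrite field accesses on decomposable structs into scalar local accesses."""
    ok = decomposable_locals(struct_locals, instructions)
    out = []
    for ins in instructions:
        if ins[0] == "ldfld" and ins[1] in ok:
            out.append(("ldloc", f"{ins[1]}_{ins[2]}"))  # v.X becomes local v_X
        elif ins[0] == "stfld" and ins[1] in ok:
            out.append(("stloc", f"{ins[1]}_{ins[2]}"))
        else:
            out.append(ins)  # leave everything else untouched
    return out

code = [
    ("stfld", "v", "X"),
    ("stfld", "v", "Y"),
    ("ldfld", "v", "X"),
    ("ldfld", "w", "X"),
    ("call", "Foo", ["w"]),  # w is passed to a call, so it stays a struct
]

for ins in decompose({"v", "w"}, code):
    print(ins)
```

Here `v` turns into the scalar locals `v_X` and `v_Y`, which are then fair game for the float-to-double rewrite, while `w` is left alone because it is used in a call.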
I originally began working on decomposing structure variables to make it easier to eliminate single-use variables. This possible secondary use is a bonus, and I think the two together will work nicely. Fingers crossed.