Previously, I said I would describe inefficiencies of the NetCF JIT compiler on Xbox if I could provide a reasonable workaround. This tip is pretty reasonable: avoid using float as a parameter type in function signatures. Use double instead. Here’s why…
NetCF uses a calling convention that passes all arguments on the stack. Furthermore, it passes all float arguments as doubles. That is, callers push each 4-byte float argument onto the stack as an 8-byte double. The called function then converts each double argument value back into a float and writes it back onto the stack in the exact same spot. It does this by loading the double value into a register, rounding it to single-precision, and then storing it back to the stack as a single.
Why? I have no idea; but that’s not important. What’s important is that this superfluous data conversion is expensive, and you’re better off avoiding it.
The expense is incurred upon entry to the function, when each float parameter is loaded into a register, rounded, and stored back into the stack.
If you rewrite the method to receive a double parameter instead of a float, the code generated at the caller will be the same (at the call site, it costs the same to put either a float or a double onto the stack), but the method itself will not include the expensive load-round-store sequence at the beginning of the function body.
Note: There are code examples in the text below, to illustrate.
An important caveat is that float values in complex types (eg, structs) are not converted this way when the containing type is passed by value. So, for example, if you pass a Vector2 instance by value, it will be passed in 8 bytes (2 x 4-bytes for each float member).
You have to pass the whole structure, though, not just a member.
This implies a big difference between the following functions:
float AddXandY(Vector2 v);
float AddXandY(float x, float y);
The first function will receive an 8-byte Vector2 value on the stack. The second function will receive 2 x 8-byte double values on the stack, which it will have to round to single-precision before executing any of its function body.
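For reference, the C# source behind these two signatures might look like the sketch below. The Vector2 here is a minimal stand-in struct (an assumption, so the snippet compiles without the XNA assemblies), and the ParameterShapes class name is made up for illustration.

```csharp
// Minimal stand-in for XNA's Vector2 (assumption: two public float fields).
public struct Vector2
{
    public float X;
    public float Y;

    public Vector2(float x, float y)
    {
        X = x;
        Y = y;
    }
}

// Hypothetical container class, used only for this illustration.
public static class ParameterShapes
{
    // Struct passed by value: per the article, NetCF on Xbox passes it as
    // 8 bytes on the stack with no per-member conversion.
    public static float AddXandY(Vector2 v)
    {
        return v.X + v.Y;
    }

    // Two float parameters: per the article, each one arrives as an 8-byte
    // double and is rounded back to single-precision on function entry.
    public static float AddXandY(float x, float y)
    {
        return x + y;
    }
}
```

Both overloads return the same value for the same inputs; the difference described here is only in the native code that the JIT compiler generates for the function entry.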
Take a look at the native code.
; float AddXandY(Vector2 v)
; (0xBE29F26C) size=92 bytes
mflr r12
stw r12, 8(r15)
addi r1, r1, -44
stw r15, 0(r1)
addi r15, r1, 0
addis r3, r0, 48681
ori r3, r3, 62060
addi r5, r0, 0
stw r5, 20(r15)
stw r3, 12(r15)
stw r15, 0(r30)
addi r4, r15, 44
lfs fr1, 0(r4)
addi r4, r15, 44
lfs fr2, 4(r4)
fadd fr1, fr1, fr2
frsp fr1, fr1
lwz r15, 0(r15)
addi r1, r1, 52
stw r15, 0(r30)
lwz r12, 8(r15)
mtlr r12
blr
; float AddXandY(float x, float y)
; (0xBE29F3CC) size=108 bytes
mflr r12
stw r12, 8(r15)
addi r1, r1, -44
stw r15, 0(r1)
addi r15, r1, 0
addis r3, r0, 48681
ori r3, r3, 62412
addi r5, r0, 0
stw r5, 20(r15)
stw r3, 12(r15)
stw r15, 0(r30)
lfd fr0, 44(r15)
frsp fr0, fr0
stfs fr0, 44(r15)
lfd fr0, 52(r15)
frsp fr0, fr0
stfs fr0, 52(r15)
lfs fr1, 52(r15)
lfs fr2, 44(r15)
fadd fr1, fr1, fr2
frsp fr1, fr1
lwz r15, 0(r15)
addi r1, r1, 60
stw r15, 0(r30)
lwz r12, 8(r15)
mtlr r12
blr
In the two disassembled functions above, the orange highlight (the two lfd / frsp / stfs sequences at the start of the second function’s body) shows the code used to convert the double arguments to floats. That conversion is only present when a function signature contains float parameters.
The green highlights above (the two lfs loads followed by fadd in each function) represent loading the single-precision arguments and adding them. The difference in how the values are loaded is due to the difference in parameter type – fields are loaded via an offset from the address of the containing type.
Looking at the disassembled code, there are other obvious inefficiencies. Unfortunately, most of the inefficiencies are unavoidable, or else you end up trading one for another where there isn’t a clear winner in all situations. That’s why I said I would only write about inefficiencies that have practical workarounds. Avoiding float params is one such workaround.
Here is the function declared with double parameters:
; float AddXandY(double x, double y)
; (0xBE68F4DC) size=88 bytes
mflr r12
stw r12, 8(r15)
addi r1, r1, -44
stw r15, 0(r1)
addi r15, r1, 0
addis r3, r0, 48744
ori r3, r3, 62684
addi r5, r0, 0
stw r5, 20(r15)
stw r3, 12(r15)
stw r15, 0(r30)
lfd fr1, 52(r15)
lfd fr2, 44(r15)
fadd fr1, fr1, fr2
frsp fr1, fr1
frsp fr1, fr1
lwz r15, 0(r15)
addi r1, r1, 60
stw r15, 0(r30)
lwz r12, 8(r15)
mtlr r12
blr
The main difference between this function and the one using floats is that the arguments are not converted to single-precision and written back to the stack before adding them (green highlight: the lfd / lfd / fadd sequence).
A word of caution: before converting all your float parameters to double, take a look at the orange highlight (the two consecutive frsp instructions after fadd). I’m pretty certain this is a bug in the JIT compiler, as what I’ve highlighted is a redundant rounding operation (frsp is “floating-point round to single-precision”). This happens whenever you cast from double to float. I’m pointing this out because if you need to store the result of floating-point arithmetic in a float variable (like a field in a struct), then using doubles could do more harm than good (depending on how much arithmetic you need to do before storing the result).
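As a rough illustration of that trade-off, here is a minimal sketch. The class and method names are hypothetical. It shows the double-parameter version of the function, plus a variant that does more arithmetic in double before narrowing once, which is the case where the cost of the final cast is spread over more work.

```csharp
// Hypothetical names; only the shape of the code matters here.
public static class DoubleParameterSketch
{
    // Double parameters avoid the load-round-store prologue described in the
    // article, but the (float) cast on the result is where the article
    // reports the redundant second rounding instruction.
    public static float AddXandY(double x, double y)
    {
        return (float)(x + y);
    }

    // More arithmetic per cast: all intermediate work stays in double and
    // there is a single narrowing cast at the very end.
    public static float SumOfProducts(double a, double b, double c, double d)
    {
        double sum = a * b + c * d; // no narrowing while computing
        return (float)sum;          // one double-to-float cast
    }
}
```

Whether this is a net win on a given function is something to measure rather than assume, since it depends on how often results are cast back to float.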
It’s worth noting that casting from float to double in an expression incurs no cost. This is because all floating-point values are automatically converted to double-precision when loaded into a register (this is done by the CPU). For this reason, when you mix double- and single-precision values in an expression, it is preferable to perform the arithmetic at double-precision.
For example, prefer this:
float result = (float)(((double)singleValue + doubleValue1) * doubleValue2);
over this:
float result = (singleValue + (float)doubleValue1) * (float)doubleValue2;
In the first case, casting singleValue to double doesn’t use an instruction, but the explicit cast for the result emits two rounding instructions. In the second case, casting doubleValue1 and doubleValue2 to singles causes explicit rounding, as well as an additional rounding before storing the result (required before storing any float value).
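To make the two forms easy to compare side by side (or to drop into a test project), here they are wrapped in methods. The MixedPrecision class and its method names are hypothetical. The parameter names follow the expressions above.

```csharp
// Hypothetical wrapper class for the two expressions discussed above.
public static class MixedPrecision
{
    // Preferred form: widen the float, do the arithmetic in double,
    // and narrow to float once at the end.
    public static float ComputeInDouble(float singleValue, double doubleValue1, double doubleValue2)
    {
        return (float)(((double)singleValue + doubleValue1) * doubleValue2);
    }

    // Discouraged form: each double operand is narrowed to float
    // before the arithmetic is performed.
    public static float ComputeInSingle(float singleValue, double doubleValue1, double doubleValue2)
    {
        return (singleValue + (float)doubleValue1) * (float)doubleValue2;
    }
}
```

Besides the instruction count, keep in mind that the two forms are not guaranteed to produce bit-identical results for arbitrary inputs, because the second one discards precision from the double operands before the arithmetic happens.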
Avoiding float parameters is a reasonable perf tip, but if you do it, you must be cautious about mixing float and double values.
Another tip I can provide is to avoid creating very small functions. That should be obvious from the disassembly, but in case it isn’t, there is a lot of painfully expensive overhead in the examples I provided. (Wouldn’t it be great if you could force those little functions to be inlined?)
Happy coding!
PS: My highlights weren’t preserved when I published the article, so I’ve attempted to fix it as best as I could. I apologize if the assembly code is hard to read.
Thank you for taking the time to analyze this. I had heard rumors about the behavior of floats/doubles on the Xbox with XNA but had nothing solid to go on and never took the time to look into it as carefully as you did. Useful stuff!
Brilliant! Please consider posting your tool, it sounds amazingly useful!
I’ll think about it. The utility of viewing the disassembly is probably less helpful than you imagine. It certainly is less helpful than *I* imagined! It’s quite easy to see inefficiencies in the native code gen, but difficult to see practical ways to avoid those inefficiencies. You almost can’t predict the impact of any code change without writing the function different ways and constructing a perf test to measure it… Which means you are about as well off with trial and error.
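A perf test of that kind can be as simple as the sketch below. It assumes System.Diagnostics.Stopwatch is available on your target, and the PerfSketch class, its methods, and the iteration count are all made up for illustration. The two variants being compared are the float and double versions of a tiny add function.

```csharp
using System.Diagnostics;

// Hypothetical micro-benchmark harness; names are illustrative only.
public static class PerfSketch
{
    // Results are stored here so the timed calls have an observable effect.
    public static float LastSink;

    public static float AddFloat(float x, float y) { return x + y; }
    public static float AddDouble(double x, double y) { return (float)(x + y); }

    // Calls each variant 'iterations' times and reports elapsed Stopwatch ticks.
    public static void Measure(int iterations, out long floatTicks, out long doubleTicks)
    {
        float sink = 0f;

        // Warm-up calls so both methods are JIT compiled before timing starts.
        sink += AddFloat(1f, 2f);
        sink += AddDouble(1.0, 2.0);

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            sink += AddFloat(i, 1f);
        }
        sw.Stop();
        floatTicks = sw.ElapsedTicks;

        sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            sink += AddDouble(i, 1.0);
        }
        sw.Stop();
        doubleTicks = sw.ElapsedTicks;

        LastSink = sink;
    }
}
```

Run it on the console itself, with a large iteration count and several repetitions. Numbers from a desktop runtime say nothing about the NetCF JIT compiler, and a desktop JIT compiler may inline functions this small.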
Yes indeed, I can see how that could be difficult. But the way I see it, this is only less helpful from a directly optimising point of view – I really feel it could be useful in other performance tracking tools or even educationally.
My understanding of the build process is that MSIL produced under the .NET CF on the 360 can be different to MSIL produced on a full Framework build. If this is the case, being able to verify whether or not there is actually a difference could at least provide answers/insight and have possible diagnostic uses.
I am personally from the ‘educational’ camp. Low level optimisation is a bit of a hobby of mine, something I’d like to maybe go into one day and I’m trying to get as close to the nuts and bolts of the 360 as I can via XBLIG. Unfortunately, I don’t have a proper D3D devkit kicking around my house.
Building assemblies for Xbox is the same as building them for the desktop, or for Windows Phone. All of them use the exact same C# compiler, which is what produces the CIL. You can verify this by examining the MSBuild targets used by each type of project, as well as by disassembling the assemblies with ildasm.exe. The CIL code generation is not platform-specific unless you specify a target platform to the compiler (eg, add /platform:x86 to the csc.exe command line); even then, I’m not aware of any specific differences aside from metadata.
The thing that makes Xbox assemblies specific to Xbox are the dependencies. On the console, the runtime assemblies mscorlib, System, and others, do not contain the same APIs as the desktop framework. This doesn’t mean you can’t run an assembly that referenced the desktop mscorlib when it was built, but it means that such an assembly might not load or execute (for any of a variety of reasons). To avoid those problems, the XNA Xbox projects reference a custom set of assemblies that provide only the APIs that will be available at runtime on Xbox.
To view CIL differences between assemblies, the easiest thing to do is to use ildasm.exe to disassemble the assemblies to CIL text files, then use a diff tool to diff the text files.
I inferred too much, it seems.
I suppose my misunderstanding was based on changes to StringBuilder. I knew that the desktop version had been upgraded from the original to use a rope internally. I had also read that depending on whether you execute code on the 360 or Windows you would somehow end up with a reference that was internally either the new or old original implementation.
I am still a way from properly understanding the CLR so the way I assumed this was dealt with was that different platforms generated different MSIL instructions. Your explanation makes things more clear though, so thanks a lot.
I wonder whether I might have misunderstood the context of your tool, though. If MSIL is generated identically on all platforms, then why generate instructions on the fixed hardware and send the result to a server process on Windows?
The assemblies *you* build for either Windows or Xbox or Windows Phone are all built with the same compiler, so the same source code ought to end up compiling into the same CIL no matter what platform you’re targeting. Where you seem to be confused is how the compiler consumes reference assemblies. When you use a StringBuilder class in your code, the assembly generated by the compiler only has a reference to that class, not the implementation. A reference is a fully-qualified type name (class name + namespace + assembly + assembly version + assembly public key token, etc.). The actual implementation of a StringBuilder comes from whatever assembly the CLR resolves at runtime. StringBuilder is in mscorlib, and on Xbox, mscorlib is not the same as the one on Windows. So, you are correct in that the implementations of StringBuilder can vary from one platform to another – but not because the compiler generated different CIL. Rather, it’s because of completely different versions of mscorlib. The code that *you* compile isn’t affected by your choice of platform, unless you explicitly compile different source of course.
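One way to see this for yourself is to ask the runtime which assembly actually supplied StringBuilder. The sketch below is only an illustration. The ReferenceSketch class name is made up, and it assumes the reflection members used here (Type.Assembly and Assembly.FullName) are available on your target runtime.

```csharp
using System.Text;

// Hypothetical helper; the class and method names are illustrative only.
public static class ReferenceSketch
{
    // Returns the full name (name, version, culture, public key token) of the
    // assembly that the runtime resolved for StringBuilder.
    public static string StringBuilderAssembly()
    {
        return typeof(StringBuilder).Assembly.FullName;
    }
}
```

Printing the returned string from the same compiled assembly on two different runtimes should show two different assembly identities, even though the CIL that refers to StringBuilder is unchanged.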
Now, something that I’m not sure you understood is that CIL is an intermediate language. It is binary rather than text, but it’s still basically source code. It can’t execute until another compiler compiles it into machine code (aka native code). The machine code is what executes, and that is absolutely specific to each CPU architecture. There’s more than one compiler that can compile CIL assemblies, but the typical case is for the JIT compiler to do it. JIT stands for just-in-time, and it is so named because it compiles one method at a time, right before the method is executed for the first time.
The idea behind JIT compilation is that it allows the same assembly to run on multiple platforms, and also on multiple runtimes. For most of the Windows world, this means you can compile one assembly and run it on 32-bit Windows or 64-bit Windows. Or, you could have one assembly that runs on both CLR 3.5 and CLR 4.0. However, it is often impractical to write a completely portable assembly that can run on different operating systems.
The disassembler that I upgraded to XNA 4.0 is not reading CIL, but the native PowerPC machine code generated *after* the JIT compiler has compiled the CIL at runtime. These are the actual machine instructions that will execute. The C# compiler doesn’t/can’t optimize the CIL very much because it is compiling code for the CLI and not a real machine architecture. It’s the JIT compiler’s job to do most of the optimization work, because most optimizations are architecture-specific. The reason for the client/server architecture in the disassembler is that you can’t see the native code until you execute the assembly (because it doesn’t get JIT compiled until it executes). The server really just reads the binary code out of memory and sends it to the client, which does the disassembling. Because the client is on Windows, I can also copy/paste the disassembled code, or save it to a file, etc.
What my disassembler shows is that the NetCF JIT compiler on Xbox is not tuned at all to generate good code for PowerPC. I didn’t know the extent of how un-tuned it was until I looked at this. To provide some context, my day job is working on the Xbox code-gen team, where our job is to optimize the Xbox compilers… all except for the NetCF JIT compiler. This is how I know the difference between very good native code and very bad native code.
A related project I’m working on is a tool that will rewrite CIL in order to pre-optimize it for a specific runtime. Basically, I’m trying to rewrite Xbox assemblies so that when the JIT compiler turns it into native code, it will run better. So far, I have only managed to get very small improvements, mainly because the JIT compiler is too literal in its translation of CIL, and what makes sense for the CLI is generally the worst possible thing you can do on PowerPC. It’s a bit depressing, really.
Well, after all that, I hope I’ve answered your question and clarified a thing or two.
Thanks for taking the time to write that detailed explanation. It makes perfect sense now. Not sure where I got the impression that CIL was executed itself. It was probably the fact that at first glance it looks a little like assembler and that threw me headlong into an assumption.
My interest still stands, though, even if it’s PPC instructions. I’m looking to learn anything I can learn about native 360 code execution, as I want to be writing native console code as a career. Native instructions go even closer to the nuts and bolts than my original impression of the tool!
I had seen your post on the optimiser. I will be following your progress on it keenly. It looks like a really worthwhile project if you can manage it, but I can see how frustrating it must be for you to have no direct control over the PPC instructions generated by the jitter.