NO EXECUTE! January 1, 2008

NO EXECUTE!

January 1 2008

[Part 1] [Part 2] [Part 3] [Part 4] [Part 5] [Part 6] [Part 7] [Part 8] [Part 9] [Part 10] [Part 11] [Part 12] [Part 13] [Part 14] [Next] [Return to Emulators.com]

Bochs 2.3.6 - Micro-architectural Tuning with C++

As I mentioned last week, Bochs 2.3.6 has been released and is available for download from here: http://bochs.sourceforge.net/. Today I am going to give you my timings of the performance differences between the official released builds of Bochs 2.3.5 and 2.3.6 and explain some more of the optimization tricks that were used in 2.3.6.

It is very tempting when working on an emulation style project to use assembly language, either in the form of inline assembly code mixed into the C/C++ sources, as standalone .ASM code, or using some form of dynamic recompilation (a.k.a. "jitting"). I am more than guilty of doing all three of these myself, since much of the emulation code in PC Xformer, Gemulator, Fusion PC 3.0, and SoftMac is hand-tuned 486 and Pentium assembly language code. And the code in my earlier Atari products such as Graphics Utility Package, Quick ST, and ST Xformer was similarly almost all assembly language. This dates back to my early computer days in the 1980's, when C compilers absolutely sucked and tuning assembly language for the in-order microprocessors of the day was relatively easy.

It took me a long time to break that habit of instinctively wanting to just write everything in raw machine code on the bare metal. C and C++ compilers have come a long way in 20 years, and now with what is called whole program optimization (the -O3 switch in GCC, and the -GL switch in Microsoft Visual C/C++) compilers can even optimize across function call boundaries, which has traditionally been a major bottleneck for compiled code. Working in conjunction with whole program optimization is an additional technique called PGO (profile guided optimization), which uses profile data from a first-generation build of an executable to feed back into the compiler to generate a second-generation optimized executable.

Whole program optimization allows the compiler to pull some static compilation tricks such as using additional registers to pass parameters when it knows those registers are not in use, and to find small functions to inline that the programmer did not explicitly mark in the source code to inline. PGO additionally gives the compiler information that allows it to make decisions whether to even inline specific functions, to know which branch of an "if/else" block is more frequently used, and whether to compile some functions for size (at the expense of some execution speed) while compiling "hot" functions for maximum speed. These are decisions that cannot necessarily be made statically at compile time using only whole program optimization.

Too many programmers try to solve their problems with compiler switches, telling the compiler to just optimize everything to the max, resulting in a lot of bloated and unnecessarily slow code. This is also naive. Bloated code means less of the code first into the L1 code cache, more code hogs the L2 cache, and thus both code and data suffer from more cache misses. It is often desirable to compact the code, even generate calls to common code instead of inlining it, in order to lower the overall memory footprint of an application. As I described a few postings ago, I unfortunately discovered that the PGO in Visual Studio 2008 itself gets a little too aggressive at inlining, and can end up producing "optimized" second-generation executable code that runs slower than the "unoptimized" first-generation executable. For this reason I avoided relying much on PGO, but did make use of the data that it provides.

One type of code pattern that gives the hardware nightmares are branches and indirect jumps. In source code, these commonly appear as "switch" statements, "if" statement, and object method calls. Under the hood, these language constructs are compiled into either conditional jumps, indirect jumps, or indirect calls. In any case, the microprocessor has to use its branch predictor unit to try to guess which of the two or more possible code paths is likely going to be taken.

When branch prediction is working well, it's the best thing since sliced bread. The overhead of switch statements and function calls can mostly be hidden (which is another reason to not go too crazy at inlining). When branch prediction starts to miss a lot, it results in the microprocessor having to backtrack to the branch instruction and re-fetch the correct code path, decode it, and refill the execution pipeline. On most modern microprocessors this is an operation that adds about 20 clock cycles of penalty. Plus or minus, this can get even worse on long pipeline architectures. Given that branch prediction is needed every dozen instructions or so, frequent branch misprediction can cause significant slowdowns in any code.

In Part 10, I discussed a useful piece of information that the Visual Studio 2008 profile provided to me, and that is the actual counts and percentages of branch targets in the profiled code. Visual Studio 2005, Intel's C/C++ compiler, and other compiler toolsets provide this kind of information. In Part 10, I showed how one can take numerous targets of an indirect jump, in that case several dozen individual little C++ functions, and fold them down into just a handful of functions such that the probability of hitting a hot function increases. By combining the functionality of several of the smaller functions into one generic function that how had an over 80% chance of being hit, it means that most of the time the call to that function is almost free.

Another example I gave had to do with the code that simulates the x86 "Jump Condition" instructions, of which there are 16 but only two of which are frequently used (the JZ "Jump Zero" and JNZ "Jump Not Zero"). The optimization in this case was the opposite - to break apart the Jcc opcode handler function which used a switch statement to handle the 16 individual cases, and to break that up into 16 individual functions such that the JZ and JNZ handlers can be called directly the CPU execution loop. The logic here is that the CPU execution loop is itself an indirect jump that is hard to predict, and then the JCC handler's switch statement is an another indirect jump that is hard to predict, and thus rather than having two back-to-back mispredicted jumps, it is preferable to at least just have one. You know that you will likely mispredict, so just flatted the levels of nested jumps.

Since then, Stanislav took that concept further to many more instructions. In Bochs 2.3.5, the beginning of most instruction handlers start with an "if" statement which checks whether the instruction is operating on a register or a memory location. And then two completely separate blocks of code are used, one to handle the memory operation, one to handler the register operation. For example, the x86 instructions MOV, ADD, SUB, INC, DEC, XOR, MUL, DIV, AND, OR, TEST, SHL, SHR, and many others have this check. The PGO profile data was showing that many of these "if" statements had probabilities close to 50/50 of branching one way or the other. This is the worst kind of "if" statement, one where the odds are 50/50 and the pattern of the condition is irregular (i.e. the pattern of whether the "if" statement expression is true or false).

When one comes across code like this in C++ - an indirect jump to a hard to predict "if" or "switch" statement - the solution is to flatten this to a single indirect jump. So in this case, many of the x86 opcode handlers were duplicated into two separate functions - one to handle the register case, one to handle the memory case. This is code bloat, yes, but with the benefit of eliminating about 20 clock cycles worth of misprediction.

Another speed win that made it into Bochs 2.3.6 is to change the calling convention of memory accessor functions. The x86 opcode handlers that deal with memory operations ultimately make a call into a set of functions in the source file cpu\access.cpp with the descriptive names like read_virtual_dword, write_virtual_byte, etc. As with the real hardware, these handlers have to perform a series of checks on each and every memory access to make sure the access is permitted, that the memory is not write protected, that the access isn't going off the end of a page, etc. If this sounds a lot like the "software TLB" code from Gemulator that I was discussing in Part 8, well, that's because this is in fact the Bochs equivalent of the software TLB. More checks need to be performed when simulating x86 than 68040, so the code is little hairier but serves the same function as the memory accessors in Gemulator.

The most common memory access is a 32-bit read of memory, which is handled by the read_virtual_dword function. Let's compare the Bochs 2.3.5 and Bochs 2.3.6 versions to see how the code was streamline with branch prediction in mind (I have edit the code down to the relevant parts):

Table 11-1: Bochs 2.3.5 vs. Bochs 2.3.6 memory accessor comparison
Bochs 2.3.5 source Bochs 2.3.6 source
void read_virtual_dword(unsigned s, bx_address offset, Bit32u *data)
{
bx_address laddr;
bx_segment_reg_t *seg = &BX_CPU_THIS_PTR sregs[s];
if (seg->cache.valid & SegAccessROK) {
    if ((Is64BitMode() && IsCanonical(offset)) || (offset < (seg->cache.u.segment.limit_scaled-2))) {
      unsigned pl =(CPL==3);
      laddr = BX_CPU_THIS_PTR get_segment_base(s) + offset;
      if (pl && BX_CPU_THIS_PTR alignment_check) {
        if (laddr & 3) {
          exception(BX_AC_EXCEPTION, 0, 0);
        }
      }
    Bit32u *hostAddr = v2h_read_dword(laddr, pl);
    if (hostAddr) {
      ReadHostDWordFromLittleEndian(hostAddr, *data);
      return;
    }
...

Bit8u* v2h_read_dword(bx_address laddr, unsigned pl)
{
Bit32u pageOffset = laddr & 0xfff;
if (pageOffset <= 0xffc) { // Make sure access does not span 2 pages
    Bit32u tlbIndex = BX_TLB_INDEX_OF(laddr);
    bx_address lpf = LPFOf(laddr);
    bx_TLB_entry *tlbEntry = &BX_CPU_THIS_PTR TLB.entry[tlbIndex];
    if (tlbEntry->lpf == BX_TLB_LPF_VALUE(lpf)) {
      Bit32u accessBits = tlbEntry->accessBits;
      if (accessBits & (1<<pl)) { // Read this pl OK.
        bx_hostpageaddr_t hostPageAddr = tlbEntry->hostPageAddr;
        Bit32u *hostAddr = (Bit32u*) (hostPageAddr | pageOffset);
        return hostAddr;
      }
    }
}
return 0;
}
void read_virtual_dword(unsigned s, bx_address offset)
{
bx_address laddr;
bx_segment_reg_t *seg = &BX_CPU_THIS_PTR sregs[s];
Bit32u data;

if ((seg->cache.valid & SegAccessROK4G) == SegAccessROK4G) {
    laddr = BX_CPU_THIS_PTR get_segment_base(s) + offset;
    if (BX_CPU_THIS_PTR alignment_check) {
      if (laddr & 3) {
      exception(BX_AC_EXCEPTION, 0, 0);
    }
}
Bit32u tlbIndex = BX_TLB_INDEX_OF(laddr, 3);
bx_address lpf = LPFOf(laddr);
bx_TLB_entry *tlbEntry = &BX_CPU_THIS_PTR TLB.entry[tlbIndex];
if (tlbEntry->lpf == lpf) {
    if (tlbEntry->accessBits & (1<<CPL)) { // Read this pl OK.
      bx_hostpageaddr_t hostPageAddr = tlbEntry->hostPageAddr;
      Bit32u pageOffset = laddr & 0xfff;
      Bit32u *hostAddr = (Bit32u*) (hostPageAddr | pageOffset);
      ReadHostDWordFromLittleEndian(hostAddr, data);
      return data;
    }
}
...

The actual read operation in both version is the line with the call to the ReadHostDWordFromLittleEndian macro. Ideally, you want to reach that line of code as quickly as possible. To get to that line, as you can see in Bochs 2.3.5, there is one conditional branch after another - is the guest code in 32-bit or 64-bit mode, is it exceeding the segment limit, then a check of the privilege level (user mode or kernel mode), does it pass the alignment check, then a call to the TLB index calculation, then an access check, and then finally, the read operation. Whew!

The first optimization is to roll the 64-bit check and the segment limit check into a single check of a flag that states "this is a segment with a maximum limit". The flag therefore states that a segment check is unnecessary regardless of the mode of execution, period. The cost of this check is as cheap as the original check of the mode since Linux and Windows code generally runs in a flat memory mode in both 32-bit and 64-bit modes. One highly predictable branch is substituted by another, but the segment limit check which was always done in Bochs 2.3.5 is now avoided for the vast majority of accesses.

The next optimization is the "pl == (CPL==3)" line. This simply determines whether the memory access has user-mode or kernel-mode privileges, and compresses the CPL value (which specifies the "ring level" of execution) from the range 0..3 to the values 0 and 1. Depending on the compiler though, this kind of expression can actually result in a conditional branch! The compiler turns such an expression into the equivalent of an "if/else". Oops. The best thing to do in a case where a small set of values (4) are being transformed into a slightly smaller set of values (2) is to just stick with the original set of values and adjust the rest of the code accordingly. Otherwise it is just unnecessary computation.

The alignment check for now is left as is, although in a future revision it may be rolled into the Gemulator-style TLB index lookup. Notice the definition of the BX_TLB_INDEX_OF macro has been adjustment to take into account the page span, which eliminates the need for the explicit "pageOffset < 0xffc" bounds check from before.

The final change is to pass the data being read by value. Most computer languages use two common methods of passing parameters and return values around - a method called "by reference" where you pass the address of the data, and a method called "by value" where you just pass the data itself. The original memory accessors in Bochs were designed to be symmetric in terms of all accepting exactly 3 parameters - the first parameter holds the segment of the memory, the second parameter holds the offset into the segment (i.e. the effective address), and the third parameter is the pointer to the data. For memory write operations, this is a pointer to a buffer that holds the data to write to memory. For read operations as above, this is a pointer to a buffer that will receive the data.

Object oriented languages such as Java and C# love to implicitly pass objects around by reference, whereas in C and C++ you need to use the "&" address of operator at the call site and the "*" operator to indirect the pointer. It's all nice and object oriented and polymorphic and all such pie-in-the-sky computer science stuff, but it's also horribly inefficient. Passing data by reference means that each such function call to read memory performs an additional write (when the caller pushes the address of the data buffer), an extra read (when the called function reads that address), another extra write (when the called function writes the data to that buffer), and another extra read (when the caller reads that data in). If you wonder why object oriented code sucks, that's a good reason why. Each parameter passed by reference instead of by value can generate at least four extra memory operations! Simply returning the data directly (which gets returned via the EAX register) eliminates four memory operations and some stack space.

As a real world example, compare the code for a simple function call to increment a variable in C, first a function that does it by value and then a function that does it by reference. On the right I show the compiled code from Visual Studio 2008. Which code do you prefer? Now of course features such as whole program optimization will optimize away a lot of this function calling overhead, but it's better to make the compiler's job easy in the first place.

Table 11-2: Comparing pass-by-value vs. pass-by-reference
Source code The resulting assembly code as compiled by Visual Studio 2008
unsigned int ReadByValue();
void ReadByReference(unsigned int *px);

    ...
    unsigned temp1, temp2;
    temp1 = ReadByValue();
    ReadByReference(&temp2);
    ...     ...
    call _ReadByValue
    mov esi, eax

    lea eax, DWORD PTR _temp2$[esp+8]
    push eax
    call _ReadByReference
    mov ecx, DWORD PTR _temp2$[esp+12]
    ...
unsigned int ReadByValue()
{
    return 1;
}

void ReadByReference(unsigned int *px)
{
    *px = 1;
} _ReadByValue:
    mov eax, 1
    ret
_ReadByReference:
    mov eax, DWORD PTR _px$[esp-4]
    mov DWORD PTR [eax], 1
    ret

The four lines of assembly code that I highlighted in red are the four extra memory operations mentioned above. No thank you computer science eggheads and your mumbo jumbo about the benefits of object oriented languages. As most engineers who care about performance know by now, OO is all b.s. I will stick to plain old C and C++ and passing parameters by value.

Bochs 2.3.6 - The Numbers

I started writing down the performance numbers of Bochs two months ago in Part 10 as I was starting to get some significant speedups implemented, and it's really stunning how much further the numbers have improved even since then, thanks in great part to the great coding by Stanislav Shwartsman of the various optimization tricks that I just described. Let's put some optimization theory into practice and measure the actual results!

Below is a copy of the data I presented two months ago, showing the boot time of Windows XP SP2 on my Mac Pro using the official Bochs 2.3.5 release at close to 4 minutes. I re-ran the Bochs 2.3.5 tests again a few hours ago to verify that they still match the results of two months ago (they do within a 2% a margin), and then timed 2.3.6. Here are the results for my 2.66 GHz Core 2 based Apple Mac Pro:

Table 11-3: Bochs results on Core 2
Mac Pro (Core 2 2.66 GHz) timings in seconds Virtual PC 2007 QEMU 0.90 Bochs 2.3.5 release build
(run Nov 2007) Bochs 2.3.5 release build
(run Jan 2008) Bochs 2.3.6 release build
(run Jan 2008) approximate speedup
Stage A - splash screen 7 3 11 11 11 0
Stage B - defrag - 7 26 26 26 0
Stage C - "Windows is starting up..." 17 16 67 66 58 +14%
Stage D - "Welcome" 20 23 95 95 81 +17%
Stage E - desktop wallpaper renders 34 33 221 216 124 +74%
Stage F - Start button and taskbar render 34 36 236 232 148 +57%
test program T1FAST.EXE 0.27 10.5 25 23 13 +77%
test program T1SLOW.EXE 0.27 12 31 32 20 +60%

Once it gets past the first 30 seconds of boot or so, the speed begins to take off. If you recall the T1FAST and T1SLOW tests from Part 11, they are vastly faster. Some real computational bound applications, such as my Gemulator 2008 beta, speeds up from a neighborhood of about 90 seconds in some tests to under 45 seconds. Download Bochs and try it out yourself.

This next table is a brief summary of running Bochs 2.3.5 and 2.3.6 on my 1000 MHz Pentium III system which I've mentioned before. This is home built machine that's many years old, running Windows XP as a host OS, and booting the exact same virtual disk image file on Bochs as was tested on the Mac Pro. With it's tiny L2 cache and much slower bus than the Core 2, the older Pentium III doesn't show as dramatic a speedup as the Core 2 but it does show a very consistent one:

Table 11-4: Bochs results on Pentium III
1000 MHz Pentium III timings in seconds Bochs 2.3.5 release build
(run Nov 2007) Bochs 2.3.6 release build
(run Jan 2008) approximate speedup
Stage A - splash screen 37 28 +32%
Stage B - defrag 104 88 +18%
Stage C - "Windows is starting up..." 287 223 +28%
Stage D - "Welcome" 330 265 +24%
Stage E - desktop wallpaper renders 445 352 +35%
Stage F - Start button and taskbar render 682 528 +29%

In a future posting I'll go into the 64-bit build and other tweaks that I've made privately to my sources (which aren't yet merged into the official Bochs sources) which beat even these numbers. Right now it's time to drink some champagne and celebrate the bright year for emulation that 2008 will be.

This will also hopefully be a good year for the "$100 laptop". I placed an order for a second XO prior to the midnight deadline which just passed, so in a few weeks I plan to do a follow up on the XO and test out its peer-to-peer capabilities when two such units are within "ear"-shot of each other.

As always, keep those comments and ideas coming by emailing me at darek@emulators.com or simply click on one of the voting links below to send your comments:

Darek, keep writing, this is better than egg nog and whiskey!

Darek, shut up and go back to playing on your Atari 800!

Source code	The resulting assembly code as compiled by Visual Studio 2008
unsigned int ReadByValue(); void ReadByReference(unsigned int *px); ... unsigned temp1, temp2; temp1 = ReadByValue(); ReadByReference(&temp2); ...	... call _ReadByValue mov esi, eax lea eax, DWORD PTR _temp2$[esp+8] push eax call _ReadByReference mov ecx, DWORD PTR _temp2$[esp+12] ...
unsigned int ReadByValue() { return 1; } void ReadByReference(unsigned int px) { px = 1; }	_ReadByValue: mov eax, 1 ret _ReadByReference: mov eax, DWORD PTR _px$[esp-4] mov DWORD PTR [eax], 1 ret

Mac Pro (Core 2 2.66 GHz) timings in seconds	Virtual PC 2007	QEMU 0.90	Bochs 2.3.5 release build (run Nov 2007)	Bochs 2.3.5 release build (run Jan 2008)	Bochs 2.3.6 release build (run Jan 2008)	approximate speedup
Stage A - splash screen	7	3	11	11	11	0
Stage B - defrag	-	7	26	26	26	0
Stage C - "Windows is starting up..."	17	16	67	66	58	+14%
Stage D - "Welcome"	20	23	95	95	81	+17%
Stage E - desktop wallpaper renders	34	33	221	216	124	+74%
Stage F - Start button and taskbar render	34	36	236	232	148	+57%
test program T1FAST.EXE	0.27	10.5	25	23	13	+77%
test program T1SLOW.EXE	0.27	12	31	32	20	+60%

1000 MHz Pentium III timings in seconds	Bochs 2.3.5 release build (run Nov 2007)	Bochs 2.3.6 release build (run Jan 2008)	approximate speedup
Stage A - splash screen	37	28	+32%
Stage B - defrag	104	88	+18%
Stage C - "Windows is starting up..."	287	223	+28%
Stage D - "Welcome"	330	265	+24%
Stage E - desktop wallpaper renders	445	352	+35%
Stage F - Start button and taskbar render	682	528	+29%