NO EXECUTE! December 01, 2007

NO EXECUTE! A weekly look at personal computer technology issues.

December 01 2007

[Part 1] [Part 2] [Part 3] [Part 4] [Part 5] [Part 6] [Part 7] [Part 8] [Part 9] [Part 10] [Part 11] [Part 12] [Next] [Return to Emulators.com]

Bochs 2.3.5 1/2 "Rochs" With Less Cache

While not all of my optimizations to Bochs have yet been merged in to the main source tree on SourceForge, I've made some additional progress. Windows XP is now booting in 2 minutes 20 seconds (200 seconds) on my Mac Pro, while the T1FAST.EXE test loop is executing in 12 seconds, just a shade slower than the dynamically recompiled x86 emulation engine in QEMU. From what I've seen on the discussion list, instead of the usual year-long wait between Bochs refreshes, there will be an official Bochs 2.3.6 release shortly which incorporates many of the speed optimizations so far.

An even more dramatic reduction in Windows XP boot time occurs on my 1000 MHz Pentium III machine, where XP is now booting up in my custom build of Bochs in just 7 minutes 20 seconds. This is down from about 9 minutes a few weeks ago, and from over 11 minutes in Bochs 2.3.5. There is a small trick I needed to use in order to accomplish this, and as with the lazy flags code from the last posting, the trick is to trade memory accesses for some extra integer computations.

If you look at two files in Bochs, our old friend cpu\cpu.cpp and a the header file cpu\icache.h, you will see the mechanism by which previously seen x86 instructions are cached. There is a flat array of decoded instructions, in Bochs 2.3.5 of length 32K (32768) entries, and more recently bumped up to 64K (65536) entries. The size of the array, which is used as an instruction cache, is defined by the BxICacheEntries macro in icache.h:

#define BxICacheEntries (32 * 1024) // Must be a power of 2.

An inline function naturally named hash maps a 32-bit or 64-bit x86 physical address into a 32-bit index into that array:

BX_CPP_INLINE unsigned hash(bx_phy_address pAddr) const
{
// A pretty dumb hash function for now.
return pAddr & (BxICacheEntries-1);
}

This is what is commonly referred to as a direct mapped cache, because consecutive addresses map directly into consecutive entries in the array, wrapping around every 32768 or 65536 entries. Some real microprocessors do in fact use a direct mapped cache for their actual hardware L1 instruction cache. More commonly, L1 and L2 caches in hardware use a set associative caching scheme (http://en.wikipedia.org/wiki/CPU_cache) which permits, 2, 4, or more entries to occupy the same "set" in the cache. A set in an associate cache is like the slot in a hash table and it's associated buckets.

Because a direct mapped cache is essentially a truncation operation implemented using a masking operation, it is the easiest to implement in software. Unfortunately, what one finds is that the distribution in a direct mapped cache is anything but uniform. In Windows XP, large memory allocation tends to cluster around 64K-aligned addresses. This is because the Win32 API implementation in all versions of Windows since Windows 95 carve up address space in 64-kilobyte increments. This is the granularity of allocations made with the Win32 VirtualAlloc calls, and is also the granularity of memory mapped files and shared memory regions.

Windows executable files (.EXE files) and dynamic libraries (.DLL files) are memory mapped from disk to the application's address space. That means that EXE and DLL modules always appear at 64K-aligned addresses. In fact, if you have Microsoft Visual Studio or similar Windows development tools, you can dump the header of any EXE or DLL file and look at its preferred load address. In a Visual Studio command prompt, this is easily done by using the linker and this linker command line, say, for the module KERNEL32.DLL:

link -dump -headers kernel32.dll

which gives this partial output:

Dump of file kernel32.dll
PE signature found
File Type: DLL

FILE HEADER VALUES
14C machine (x86)
4 number of sections
E0 size of optional header
2102 characteristics
Executable
32 bit word machine
DLL

OPTIONAL HEADER VALUES
10B magic # (PE32)
8.00 linker version
CA000 size of code
CC00 size of initialized data
0 size of uninitialized data
4B55C entry point (77E3B55C)
1000 base of code
C3000 base of data
76CF0000 image base (76CF0000 to 76DC8FFF)

Notice the bolded line. It states that the load address of the DLL is up high in user address space, just under the 2 gigabyte address boundary, starting at a 64K aligned address. The portable executable ("PE") header is 4096 bytes in size, and generally the first byte of code in any EXE or DLL is 4096 bytes into the file, or 0x1000 bytes in hexadecimal, as shown in that output above. This means that there will be a lot of clustering of addresses of the form xxxx1xxx in Windows programs.

A hashing function which simply truncates the low, say, 15 bits of the address (for a 32768-entry table) is going to experience a lot of this clustering starting at index 4096 and higher into the cache. Other parts of the cache will rarely be used. This is a common pitfall of hashing, which is why most hashing function involve a more complex set of operations, such as multiplications and shifts and such.

I instrumented my build of Bochs to actually count the number of instruction cache hits and misses during the boot of Windows XP. In repeated runs, I measured that it takes approximately 4.4 billion x86 instructions to get Windows booted and logged in to the default user's desktop. Using the current size of a 65536-entry instruction cache with simple truncation, approximately 140 million of those 4.4 billion instructions are missed in the cache, which is a pretty decent hit rate of about 97%. That means that 97% of the time, the inner loop in cpu.cpp will find an instruction that is already fetched and decoded, and save the expensive of calling the x86 decoder code in cpu\fetchdecode.cpp. But this 97% hit rate comes at a cost. The internal Bochs instruction cache entry structure is dozens of bytes in size. Roughly every 16K entries adds a megabyte of runtime memory footprint to Bochs. Simply doubling the size of the instruction cache from 32768 entries to 65536 entries blows up just the instruction cache memory footprint from about 2 megabytes to 4 megabytes. This easily makes Bochs's memory footprint balloon well past even the large L2 hardware cache of an Intel Core 2 microprocessor.

What I found both on the Pentium III machine and on my Core 2 based Mac Pro, is that my Windows XP boot scenario appeared to boot faster if I made the cache smaller. So, experimenting with cache sizes ranging from 16K (16384) entries up to well over half a million entries, I found that the cache miss rate is about half at 64K compared to 16K, and another one quarter less at 512K entries. Windows XP experiences barely 30 million cache misses out of 4.4 billion instructions when using a 512K cache, but at the expense of making the actual boot take longer.

The drastic dropoff in cache misses going from 64K entries to 512K entries is due to that 64K file mapping granularity. Using a cache with more than 64K entries pulls additional address bits into the hash. If only this could be done without using a large cache.

An obvious approach is to simply merge more of the upper bits of the address down into the low bits. A common approach when using a 16-bit truncation is to do this:

hash = (address) XOR (address >> 16);

This exclusive-ORs the upper and lower parts of the instruction address to give a 16-bit result suitable for indexing a 65536-entry cache. But doing this actually makes the hit rate drop! The problem is that the 64K aliasing hardly affects the hash index. If two DLLs are calling each other, say, an instruction at address 0x10001234 calls address 0x11001184, the hash index for both instructions is very similar. What you want is the effect of what a wider mask gives, which is that 64-bit aliasing affects the higher bits of the hash index.

So in my implementation I shift only by 4 bits, and use a 32768 entry table. This is the actual code from my cpp.cpp:

unsigned iCacheHash = BX_CPU_THIS_PTR iCache.hash(pAddr) ^ BX_CPU_THIS_PTR iCache.hash(pAddr >> 4);
bxICacheEntry_c *cache_entry = &(BX_CPU_THIS_PTR iCache.entry[iCacheHash]);

Notice I shift the hash index itself, not do the shift in the hash function, so as to avoid performing 64-bit wide shifts. This gives you mostly the effect of a 19-bit wide hash table index using only a 15-bit index. Not quite, but the hit rate does improve. In fact, using this hash formula and a 32768-entry table, the hit rate is 97%, just as good as it was using the simpler approach and a 65536-entry table.

Reducing the memory footprint by 2 megabytes like this is critical in keeping performance high on older host machines, such as my old Pentium III system. That Pentium III microprocessor only has a 256-kilobyte L2 hardware cache, so at any given moment, only a tiny fraction of Bochs' instruction cache is actually cached by the hardware. Most cache "hits" really become cache misses on the actual host machine, defeating much of the purpose of increasing that software instruction cache size.

The PGO Riddle

A few weeks ago I mentioned that in playing around with the Visual Studio 2008 Beta 2 release, I was noticing that a Profile Guided Optimization, or "PGO", build of Bochs for Windows was appearing to run slower than the normal un-PGO-ed build. This was very unexpected, since I know first hand from using other compilers that profile guided optimization is almost always a good thing as it tends to re-arrange code to improve branch prediction, and also performs less optimizations on rarely used function in order to reduce the overall code size and memory footprint.

Did you figure out this perplexing problem?

The secret is to look at the actual compiled code of the "optimized" build of BOCHS.EXE. This is done by adding the -FAsc switch to the compiler command line and then looking at the generated CPU.COD file. When I looked at the line of code generated for the one single indirection function call for the decoded instruction dispatch, I was shocked to see dozens of actual compiled instructions:

; 280 : BX_CPU_CALL_METHOD(i->execute, (i)); // might iterate repeat instruction
mov eax, DWORD PTR [ebx+4]
cmp eax, OFFSET ?MOV_GdEEd@bx_cpu_c@@SIXPAVbxInstruction_c@@@Z ; bx_cpu_c::MOV_GdEEd
je $LN1205@cpu_loop
; taken 88485987(8%), not-taken 900816992(91%)
cmp eax, OFFSET ?PUSH_ERX@bx_cpu_c@@SIXPAVbxInstruction_c@@@Z ; bx_cpu_c::PUSH_ERX
je $LN1212@cpu_loop
; taken 55874110(6%), not-taken 844942882(93%)
cmp eax, OFFSET ?JZ_Jd@bx_cpu_c@@SIXPAVbxInstruction_c@@@Z ; bx_cpu_c::JZ_Jd
je $LN1224@cpu_loop
; taken 51595973(6%), not-taken 793346909(93%)
cmp eax, OFFSET ?MOV_EEdGd@bx_cpu_c@@SIXPAVbxInstruction_c@@@Z ; bx_cpu_c::MOV_EEdGd
je $LN1226@cpu_loop
...
cmp eax, OFFSET ?AND_EAXId@bx_cpu_c@@SIXPAVbxInstruction_c@@@Z ; bx_cpu_c::AND_EAXId
je $LN1400@cpu_loop
; taken 2995737(4%), not-taken 70088692(95%)
mov ecx, ebx
call eax
jmp SHORT $LN1121@cpu_loop

What is all this! It appears that PGO is being too smart. The profiled information is telling the compiler that there are relatively few, a few dozen, frequently called target addresses for that indirect function call. This makes sense, and the actual symbols of those function are shown in the listing file: the MOV memory-to-register instruction handler, the PUSH register handler, the conditional branch handler, etc.

The PGO-optimized listing file also shows the actual taken and not-taken counts for each of those target functions, indicating for example, that out of 900 million instruction dispatches, over 88 million, or about 9% of the calls, were to the MOV memory-to-register handler. Almost as frequent, the PUSH register handler and the conditional branch handlers. This makes perfect sense, as those are two of the most common instructions one will see in any typical Windows code.

The taken and not-taken counts are a very powerful pieces of data for performance optimization, because it tells you exactly how frequently and with what odds a particular branch is taken. But I take issue with the fact that the Visual Studio 2008 compiler took liberty to "unroll" an indirect function call into over 4 dozen, 51 to be exact, pairs of "compare and jump equal" instructions before actually then making the indirection function call. If you read that data above, of the 900 million executions of line 280 in cpu.cpp, 88.4 million jumped to the MOV handler. Then of the remaining ones about 6% jumped to the PUSH handler. Then of the remaining ones, about 6% jumped to the conditional branch handlers and so on.

The numbers get a little misleading by the time you get to the 51st pair of CMP/JE because only about 2.99 million out of the original 900 million instructions jump to the "AND EAX" handler. The percentage shows 4%, but that's 4% of the time that you actually get that far.

I believe that it is a fundamental bug for VS2008 to be unrolling indirect calls to this great extent. A mispredicted CALL instruction might cost 15 to 30 clock cycles on most x86 microprocessors. Falling through 51 sets of CMP / JE instructions and then finally reaching the indirect CALL instruction involves up to 52 branch predictions, with almost a 100% guarantee that the one you need will mispredict!

Therefore, unless you are in a tight loop simulating the exact same x86 instruction over and over again, the expanded code above is not only fatter but guaranteed to be slower. The "AND EAX" handler for example, has a 3 in 900, or 0.3% change of being taken. Is it really necessary to special case it then? In fact, I will argue that even special casing an 8% probability is too low a probability to bother with. If it was a probability of 60%, 70%, 60%, even then, is it necessary to special case something that is so frequent that the hardware is likely to correctly branch predict the CALL anyway?

This is a tricky technical issue that I actually don't know the exact answer to. At what thresholds of branch probabilities does one want to make this code expansion optimization?

Anyone Can Be A Critic, But...

I held off posting the solution to this riddle last time because as it happened, Visual Studio 2008 went golden recently, or as they say, it RTM-ed. It was released to manufacturing, and the final version of Visual Studio 2008 was made available to developers here on MSDN (http://msdn2.microsoft.com/). Unfortunately, this same over-expanding of the indirect call problem still exists in the RTM version. Doh!

However, my constant spamming of this and other Visual Studio 2008 issues to my former colleagues in the Visual Studio group actually bore some unexpected fruit this past week. You know the old saying about anyone can be a critic. But given the opportunity, and my desire to buy even more new Apple Macintosh computers and Playstation games this Christmas, would I be willing to help with the problems? Absolutely!

Starting this Monday, and this is why I am making this week's post on a weekend instead of the usual Monday morning, I will be returning to Microsoft on a temporary consulting basis to help identify more of these little types of compiler issues. Or as I call looking at raw x86 bytecode... FUN! This gig might last a few weeks or it might last a few months, I don't know at this time. But it is certainly the kind of opportunity I should not pass up on since it stands to have longer term benefits to projects such as, well, Bochs and other C/C++ based virtual machine projects.

My blogging may be intermittent over the next month or two due to both the Christmas holidays and my detour back to Microsoft, so bear with me. I will continue my work on Bochs and on this blog series in the new year.

I was also hoping to review the new One Laptop Per Child XO laptop this week, but it has not arrived yet. The status emails are saying before Christmas, so when it does I will write up my impressions of it. Until then, don't forget to download Bochs 2.3.6 as it gets posted, to enjoy the holidays with some Guitar Hero III, and to keep those comments and ideas coming by emailing me at darek@emulators.com.