NO EXECUTE! A weekly look at personal computer technology issues.

(c) 2007 by Darek Mihocka, founder, Emulators.com.

November 19 2007


[Part 1]  [Part 2]  [Part 3]  [Part 4]  [Part 5]  [Part 6]  [Part 7]  [Part 8]  [Part 9]  [Part 10]  [Part 11]   [Fast Lazy Flags]  [Next]

The Arithmetic Flags Challenge

Let me show you a piece of code from a very simple C test program, which I call T1.C:

static int foo(int i) {
return(i+1);
}

int main(void) {
int i;
int t = 0;

for(i = 0; i < 100000000; i++)
    t += foo(i);
}

This is actually a test program I've used for about 10 years. I compile this test program using various compilers and use it for two different purposes: to compare the code generation of competing C/C++ compilers and their optimization switches on the same platform, and to generate test programs for use in correctness tests and benchmarking of various emulators. For example, this was the first piece of compiled PowerPC test code I got running back when I started developing my never ending PowerPC emulator.

I like this code because it is simple, always takes exactly the same number of instructions to execute, and exercises many common features such as arithmetic operations, loop index incrementing and checking, function calls, memory reads and writes and store-forwarding issues, stack pushes and stack adjusts. These are all extremely common elements of any typical C or C++ program. I compile two versions of this code: T1FAST.EXE is compiled with maximum compiler optimization turned on, and T1SLOW.EXE is compiled with compiler optimizations disabled, which tends to generate many more memory references and thus slower code. You would be surprised how much Windows code out there ships without full compiler optimizations, so it is necessary to test both variants of the code. The full source code and the two compiled x86 Windows-compatible test programs can be downloaded here (nx11_t1.zip).
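
(For reference, with the Microsoft compiler the two variants can be produced with command lines along these lines - illustrative switches, not necessarily the exact ones I used:

cl /O2 /FeT1FAST.EXE t1.c
cl /Od /FeT1SLOW.EXE t1.c

where /O2 optimizes for speed and /Od disables optimization entirely.)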

Download that ZIP file if you will and try running those two executables from a Windows command prompt. You should find that T1FAST.EXE takes about 3/10 of a second to execute on most current Windows machines, and T1SLOW.EXE will be as fast or perhaps a tenth of a second slower. The Intel Core 2 is actually quite impressive in that it executes both versions in about the same time of under 300 milliseconds, indicating that the internal CPU pipeline is doing a lot of clever tricks to make memory accesses appear to be as fast as ordinary register operations. On most other architectures such as Pentium III and Pentium 4, there is a fixed extra execution time of about 100 milliseconds for the unoptimized executable, due to a slightly larger real cost in accessing memory over registers, even when that memory is known to be cached in the L1 data cache.

There is another reason why I like this simple test program. Each iteration of the loop executes about a dozen instructions (plus or minus, depending on the compiler), of which 6 either use or modify the arithmetic flags, which I began describing to you last week. On x86, these are the 6 bits within the EFLAGS register which record the Sign flag, the Zero flag, the Carry flag, the Overflow flag, the Adjust flag, and the Parity flag, written in shorthand as SF ZF CF OF AF and PF.

The first four I listed - Sign, Zero, Carry, Overflow - are common on most CPU architectures. On the 68000 and 68040, they exist as the "NZVC" bits (Negative, Zero, oVerflow, Carry) in the Condition Codes Register, which is the Motorola equivalent of the EFLAGS register. On PowerPC, these bits exist as five bits - Less Than, Greater Than, Equal, Overflow, and Carry - but actually convey the same information.

The other two flags - Adjust (also known as Auxiliary Carry) and Parity - are unique to the x86 architecture but no less important. Windows would not boot correctly without these flags.

Any kind of "if" statement in C, C++, Java, or other language, generally compiles into code which first executes one or more arithmetic instructions to set these condition flags, followed by a conditional branch instruction such as JE (Jump If Equal). Conditional flags are as fundamental to computer code as memory reads and writes.

One of the great challenges in developing any efficient virtual machine which emulates real legacy CPU hardware is the accurate and efficient simulation of arithmetic condition flags. All too often I have seen otherwise good emulators slow to a crawl due to a poor implementation of the flags simulation. I believe that this is due to most programmers, even those who develop virtual machines, not having the basic understanding of exactly how condition flags are calculated and therefore they do not know how to efficiently simulate them. This shall be this week's topic.


Flags - The Emulator Litmus Test

Two of the ongoing debates in the virtual machine community center on two arguments:

  • Direct execution or binary translation?
  • Interpretation or dynamic recompilation?

If you have been reading this series in detail you should already have a good idea of what each of these terms mean:

Direct execution is the concept of allowing code that is in a sandbox to be executed natively by the host CPU, whether that code is inside of a virtual machine or just user-mode code in a multi-tasking protected operating system. For example, when you run a Windows application, the CPU switches to user mode, or what is called "Ring 3", and executes your Windows application natively. The hardware (in theory at least) catches any attempt by the sandboxed code to do things it is not supposed to do - writing beyond the end of the stack, accessing memory that does not belong to it, etc. Direct execution is utilized when the virtual machine's guest bytecode is the same as the bytecode of the host CPU, and is found in most "PC-on-PC" emulation products such as Microsoft's Virtual PC 2007 and VMware Fusion.

Binary translation is the opposite of that, and covers a wide range of non-native code execution mechanisms. A Java virtual machine for example has to use binary translation since Java bytecode is not understood directly by the x86 architecture. Within the realm of binary translation are a variety of techniques, ranging from the trivial interpretation of bytecode (as I have discussed previously with the interpreter loop example), to full dynamic recompilation (also called "Just-In-Time" compilation or "jitting"), and various hybrid schemes in between. Binary translation is necessary when the guest CPU and host CPU are not identical, such as when emulating a 68040 or PowerPC Macintosh computer on a Windows PC.

In some virtual machines there may be a mixture of direct execution and binary translation techniques in use. For example, in one older technique used by PC-on-PC virtual machines, kernel mode code (or "Ring 0" code) is sandboxed via some form of binary translation such that accesses to hardware and other protected resources can be properly controlled in software, while user mode code (that "Ring 3" code I mentioned previously) is allowed to execute directly and is protected by the hardware itself. A related technique, called ring compression, directly executes both user and kernel mode code, but kernel mode code is pushed down to "Ring 1", which you can think of as a less privileged, user-level kernel mode. When that code in Ring 1 attempts to access a protected resource, it triggers a hardware exception which is caught by the virtual machine's monitor (a.k.a. "hypervisor") that is actually executing at Ring 0. Within the Ring 0 exception handler the Ring 1 instruction is usually interpreted, then control is given directly back to the next instruction in Ring 1.

Ring compression fell out of favor recently due to its complexity - some code executes directly, but other code throws exceptions which then require it to be interpreted. Older versions of Virtual PC, Virtual Server, VMware, and Xen all used variants of ring compression or other hybrid schemes.

Today the hot technology is "hardware virtualization", called "Vanderpool" and "VT" by Intel and "Pacifica" and "AMD-V" by AMD. While I agree that ring compression is an overly complex and error prone solution, I disagree with the direction that the industry has taken to move away from all forms of binary translation and go to almost purely direct execution schemes based on hardware virtualization.

If anything, I view the decision to use VT or even ring compression as a copout by the developer of the virtual machine; an admission that they really didn't know how to solve performance issues without relying on additional hardware. Microsoft recently dropped support of ring compression, choosing to force its VM customers to upgrade to Intel and AMD's latest VT-enabled chips (http://blogs.technet.com/jhoward/archive/2006/04/28/426703.aspx). Forcing customers to buy new hardware is being spun as a "feature" when in fact it is an admission of utter failure.

If one asks most developers to rank various virtual machine techniques in order of performance, from fastest to slowest, they will likely give you a list similar to this:

  • Direct execution using VT hardware virtualization
  • Direct execution of ring 3 and ring compression
  • Direct execution of ring 3 and binary translation based jitting of ring 0
  • Binary translation based jitting of both user and kernel mode code
  • Interpretation of both user and kernel mode code

Traditionally, virtual machines are first developed as a pure interpreter. Once that is working they are optimized using jitting. When the guest and host architectures are the same, the virtual machine is further optimized to use one of the forms of direct execution. People naively assume that each step up the list equates to a large performance speedup. This is not always the case, as the performance of any virtual machine is only as good as the implementation. Current research from both VMware and people involved with the QEMU emulator indicates that today's VT technology from both Intel and AMD is actually slower than older ring compression or binary translation techniques! I even confirmed this with the little test program I discussed in Part 4 of this series. And in the Macintosh emulation world, I have seen third-party jitting-based Mac emulators get trounced by purely interpretation-based implementations written by myself.

So you might understand why I still care about discussing interpreters and why I tend to dislike sexy-sounding but purely misled efforts to move the whole world to VT. The data just does not back up people's assumptions.

Take the example of the QEMU virtual machine, which bills itself as a fast x86 emulator which uses jitting (http://fabrice.bellard.free.fr/qemu/about.html). Fair enough, I would expect that to be fast. However, QEMU also has an accelerated mode of operation that uses direct execution techniques. This would seem to imply that the jitted mode is slow enough to warrant a direct execution alternative.

So earlier this summer, I took my little T1.C test program and ran it on a variety of Windows virtual machines running on my Intel Core 2-based Mac Pro to get an idea of what these slowdowns are like compared to natively running the test program.

VM and Technique                                                T1FAST.EXE time in seconds   T1SLOW.EXE time in seconds
Native                                                          0.26                         0.26
VPC2007 using VT                                                0.27                         0.27
VPC2007 using ring compression                                  0.27                         0.27
QEMU 0.9.0 using jitting                                        10.5                         12
Bochs 2.3.5 using interpretation with inline ASM for flags      25                           31
Bochs 2.3.5 using interpretation with lazy flags code in C++    34                           42
My current Bochs sources using interpretation
  and new flags code in C++                                     14                           18

Table 11-1: Timings of the simple integer benchmark.

As you can see from the above data, QEMU's dynamic recompilation is a grossly deficient implementation of jitting. I had expected QEMU's slowdown to be on the order of 2x to 5x slower than native execution. It was actually slower by an order of magnitude!

So what is the bottleneck? What is so difficult in emulating the x86 architecture that makes the jitter in QEMU drop to its knees and barely outperform an x86 interpreter?

One significant factor, one that I consider to be the litmus test of a good virtual machine implementation, is the simulation of the arithmetic condition flags - those 6 little bits I described above. As of two months ago, QEMU was only about 200% faster than Bochs 2.3.5 at running my CPU intensive test program. As of today, I have optimized Bochs down to a point where it is barely 50% slower than QEMU.

What did I do? And what are these "lazy flags" mentioned in the table?

WARNING! As I did last week, I will be discussing sequences of C++ code derived from the Bochs 2.3.5 source code which is covered by LGPL (the GNU Library license). If you have an allergy to GPL or to looking at open source code, I am afraid you will need to stop reading now.


There Is Lazy, And Then There Is Lazy

Arithmetic condition flags are one of those ubiquitous features of any microprocessor machine language. While most programmers quite easily understand the concept that a CMP EAX,EBX instruction sets the Zero flag when the contents of registers EAX and EBX are equal, few really understand how the flags are actually set. For example, if EAX=3 and EBX=-47, what bit values does CMP EAX,EBX set for the Parity flag, the Carry flag, the Overflow flag?

In many virtual machine implementations, that question is side-stepped by the fact that simulating most arithmetic operations requires executing those exact same operations. Huh? In other words, if one is simulating a PowerPC "ADDCO." instruction (which sets the five PowerPC arithmetic condition flags), the host virtual machine can simply execute a native ADD instruction. This is true the other way around, say, trying to emulate a 68040 on PowerPC as Apple once did in its Power Macintosh computers, or emulating PowerPC on x86 as Apple does today in its current Intel-based Mac products.

Most instruction sets, x86 included, provide some means to transfer the arithmetic condition flags into a register. On 68040 for example, one uses the MOVE CCR,D0 instruction to move the N Z V C bits into the lower 4 bits of the D0 register. On PowerPC one uses a similar instruction called mfcr ("Move From Condition Register"). And on x86, there are two common methods: push the flags register to the stack and then pop the stack into a register, using a code sequence such as PUSHF / POP AX, or, copy 5 of the arithmetic condition flags using the LAHF instruction and then read the 6th flag using SETO. What? What? What? Ok, let me explain this in a way that makes sense.

Neither Intel's IA32 nor AMD's AMD64 instruction sets provide an instruction to directly transfer all six condition flags to a general purpose register as is commonly found on other architectures. The shortest and most common x86 code sequence for such purposes is to use the Push Flags instruction PUSHF which pushes the flags to the stack. A POP instruction can then store those flags in a register or in a memory location. This is an inefficient code sequence because it involves at least two memory operations (a write to the stack followed by a read of that value just pushed to the stack). The instruction PUSHF requires several clock cycles, and worse, its counterpart POPF can require several dozen clock cycles. The worst virtual machine implementations I have seen involve using both PUSHF and POPF.

There are other techniques. The x86 instruction set of course has conditional branch instructions which query most of those flags. There is the JZ (Jump If Zero) instruction, also called JE (Jump If Equal), which branches if the ZF is 1 and does not branch when it is 0. There are JC (Jump If Carry), JO (Jump If Overflow), and JS (Jump If Sign) which are pretty self explanatory. These instructions are sufficient to query and record the four common flags ZF CF OF SF, which in turn is sufficient to simulate the condition flags of a 68040. In fact, earlier versions of my Gemulator and SoftMac products did in fact use JZ and JS instructions to record the Zero and Sign flags, and JC and JO to record the Carry and Overflow flags.

The 32-bit x86 instruction set IA32 also contains an LAHF ("Load AH With Flags") instruction, which is a little weird. It copies five of the six arithmetic flags to the AH register, which is bits 15..8 of the larger EAX register. I have no clue why the LAHF instruction exists in the awkward form it does, since it does not fully do the job, and AMD unfortunately removed this instruction from the 64-bit AMD64 instruction set. That was a terrible mistake, as it is actually a handy instruction in cases where the result of the Overflow flag is known by other means. For example, most common arithmetic operations such as AND and XOR are defined to clear both the Carry and Overflow flags. So for those instructions, which thankfully are quite common, one does not need the full brute force strength of a PUSHF instruction, and can rely on LAHF. The most recent versions of Gemulator and SoftMac in fact do use LAHF to record the Zero, Sign, and Carry flags in cases where Overflow is known to be set to zero.

What I am describing here is a technique called "lazy flags". In most direct execution based virtual machines, the host flags are kept in sync with the guest flags state. There is no need to use PUSHF or LAHF or other schemes, as the state of the host CF ZF OF SF AF and PF always reflects the state of the code being simulated. In lazy schemes, that flags state is stored away somewhere for later retrieval, whether in a register or in memory. When a particular arithmetic condition flag needs to be known, that register or memory location is tested in some way in order to recreate the necessary flag.

In the case of Bochs 2.3.5, there are two lazy flags schemes in use. The first one I already showed to you last week, for when the host CPU is also of an x86 architecture. If you look at the Bochs source file cpu\hostasm.h, you will see that there are inline assembly versions of all of the x86 arithmetic operations, with descriptive inline function names such as asmShr16, asmTest32, asmXor8, etc. Each of these functions performs the arithmetic operation and then stores away the result of the operation, executes PUSHF and POP to read the host CPU's EFLAGS register, and stores that flags state away in memory. I don't like this particular scheme, because it introduces very CPU-specific inline ASM code into what is otherwise a very portable emulator, and because the PUSHF/POP code sequence is really not that efficient.
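
The shape of those functions is roughly as follows - a minimal sketch of my own in 32-bit MSVC-style inline assembly, not the actual hostasm.h code, just to make the "do the operation, then PUSHF/POP the flags" pattern concrete:

static unsigned AddAndCaptureFlags(unsigned a, unsigned b, unsigned *savedFlags)
{
    unsigned result, f;
    __asm {
        mov   eax, a
        add   eax, b        ; perform the guest's ADD natively on the host
        mov   result, eax
        pushfd              ; push the 32-bit EFLAGS register to the stack...
        pop   edx           ; ...and immediately pop it back into a register
        mov   f, edx
    }
    *savedFlags = f;        // CF OF SF ZF AF PF are now stashed away for later retrieval
    return result;
}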

The second lazy flags mechanism that the Bochs 2.3.5 source code uses is almost the opposite of the first mechanism, where instead of using PUSHF to know the state of the flags as soon as possible, the evaluation of those flags is pushed out as long as possible. What this portable C++ based method does is to store away 5 values for arithmetic operations - the two input values, the result, the size of the operation (whether it be 8-bit, 16-bit, 32-bit, or 64-bit), and an integer which specifies which operation was performed. If you look at the Bochs 2.3.5 source files cpu\cpu.h, cpu\lazy_flags.h, and cpu\lazy_flags.cpp, you will see this mechanism at work.

Let's look at a typical ADD operation and how it works using this very lazy flags algorithm. We'll look at the code for a common 32-bit register-to-register ADD operation. The code for that is found in the Bochs 2.3.5 source file cpu\arith32.cpp, and I show here an edited down version of the code:

void BX_CPU_C::ADD_GdEGd(bxInstruction_c *i)
{
Bit32u op1_32, op2_32, sum_32;
unsigned nnn = i->nnn();

op1_32 = BX_READ_32BIT_REG(nnn);
op2_32 = BX_READ_32BIT_REG(i->rm());
sum_32 = op1_32 + op2_32;
SET_FLAGS_OSZAPC_32(op1_32, op2_32, sum_32, BX_INSTR_ADD32);

BX_WRITE_32BIT_REGZ(nnn, sum_32);
}

So what does this mysterious "SET_FLAGS_OSZAPC_32" macro do? If you look back at cpu\cpu.h, the various SET_FLAGS* macros boil down to these five lines of code:

oszapc.op1##size = lf_op1;
oszapc.op2##size = lf_op2;
oszapc.result##size = lf_result;
oszapc.instr = ins;
lf_flags_status = BX_LF_MASK_OSZAPC;

There is a data structure oszapc which holds the two input operands, the result, and an enumeration which specifies the x86 instruction which was just simulated. An lf_flags_status variable holds a bitmask which corresponds to the flags bits which need to be lazily evaluated. This scheme is thus very lazy in that this first portion never actually calculates the values of the flags. It simply stores 5 values to memory for later use.
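
In other words, the saved state boils down to something roughly like this (a simplified sketch of my own; the real Bochs structure keeps separate fields for each operand size):

struct lazy_flags_state_t {        // illustrative, not the actual Bochs declaration
    Bit32u   op1_32;               // first input operand
    Bit32u   op2_32;               // second input operand
    Bit32u   result_32;            // result of the arithmetic operation
    unsigned instr;                // which instruction produced it, e.g. BX_INSTR_ADD32
};
// plus lf_flags_status, a bitmask recording which flags still need lazy evaluation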

When the flags actually need to be evaluated, such as when simulating an x86 conditional branch instruction, there are 6 lazy evaluation functions in cpu\lazy_flags.cpp, each corresponding to one of the arithmetic condition flags. The stripped down code for evaluating the state of the Zero flag for the above 32-bit ADD instruction is this:

bx_bool BX_CPU_C::get_ZFLazy(void)
{
unsigned zf;

zf = (oszapc.result_32 == 0);

lf_flags_status &= 0xff0fff;
eflags.val32 &= ~0x40;
eflags.val32 |= zf<<6; // zf always exactly 0 or 1.

return(zf);
}

What this code does is evaluate ZF, and then also propagate that state of ZF to the eflags state variable so as to avoid needing to lazily re-evaluate the ZF. Code that actually needs the ZF does not call get_ZFLazy directly, instead it calls an inline method called get_ZF which expands out to this code:

bx_bool get_ZF(void) {
if ( (lf_flags_status & 0x40) == 0)
    return (eflags.val32 & 0x40);
else
    return get_ZFLazy();
}

So far so good, this looks fairly clean and efficient. The majority of x86 instructions update the arithmetic flags, and so they simply need to do the 5 stores to the temporary state in oszapc and lf_flags_status. A smaller percentage of x86 instructions (such as conditional branches) read the arithmetic flags, and thus call one of the get_*F() functions which in turn may end up calling one of the get_*FLazy() functions.

This appears to be efficient, and is certainly very portable to any non-x86 architecture that has a C++ compiler, but Table 11-1 shows it to be about 30% slower than the Bochs inline asm technique. As I found out, part of this slowdown is actually due to compiler differences between GCC (which was used to build the release version of Bochs 2.3.5) and the VC7.1 C++ compiler in Visual Studio 2003 which I used.

Nevertheless, it bothered me that in order to eliminate the one piece of inline assembly code from Bochs that I would have to pay this significant performance penalty. And so I set out to drill deeper into the lazy flags code and optimize it.


The INC/DEC Wrinkle

I did not show you the full code of the lazy evaluation. If you actually look at the Bochs 2.3.5 version of cpu\lazy_flags.cpp, it is quite a messy source file. The actual structure of each lazy evaluation function is a nested pair of switch statements:

bx_bool get_ZFLazy(void)
{
unsigned zf;

switch ((BX_CPU_THIS_PTR lf_flags_status>>12) & 0x00000f) {
    case BX_LF_INDEX_OSZAPC:
    switch (BX_CPU_THIS_PTR oszapc.instr) {
        ...

This is murder on a modern CPU architecture. When lazily evaluating any of the arithmetic flags, there are three different conditional branches involved: first the "if" statement that checks whether the flag is lazy, then the first "switch" statement to determine which lazy state the flags are in (more on this shortly), and finally a second "switch" statement which is based on the actual x86 instruction that updated the flags. Three different branch predictions have to succeed. If any or all of the three predictions fail, the host CPU will take a penalty of a dozen or more clock cycles as it refills the CPU pipeline.

This is all fine and dandy if the lazy evaluation code is needed infrequently. After all, that is the point of lazy evaluation - put off things that you don't really need to do right away. The problem as it turns out is that the arithmetic flags need to be evaluated very frequently, on the order of every 1 in 8 simulated x86 guest instructions! Putting in some instrumentation I determined that about 1 in 10 instructions is a conditional branch. The other culprits are a few common instructions which both read and update the arithmetic flags state: the INC and DEC instructions which are almost always used in C++ loop iterations, the ADC and SBB add-with-carry and subtract-with-borrow instructions used in multi-precision arithmetic, and certain shift and rotate instructions.

INC and DEC are a couple of ugly instructions. They are commonly used to increment or decrement a register, or in other words, to implicitly add 1 to or subtract 1 from a register. They are identical to the ADD and SUB instructions except for one thing - they do not modify the Carry flag. Almost all other arithmetic operations have the property that they update all six arithmetic condition flags every time. INC and DEC only update five of the flags and need to preserve the 6th flag.

This may seem like a minor and trivial detail, but it actually rears its ugly head in real-world performance. Intel's optimization guidelines even warn developers not to use INC and DEC and instead to use ADD and SUB to avoid pipeline stalls. The reason for this, both in hardware and software emulation, is that not modifying the Carry flag is the same as saying that the instruction reads the Carry flag and then outputs that previous value. In other words, there is a data dependency on the previous value of the Carry flag. This is true for INC and DEC, as well as for ADC and SBB.

Bochs 2.3.5 handles these cases in two ways. For ADC ("Add With Carry") and SBB ("Subtract With Borrow"), the previous Carry state needs to be explicitly evaluated and used, as in this code for a 32-bit ADC operation:

bx_bool temp_CF = getB_CF();
Bit32u op2_32, op1_32, sum_32;
op2_32 = BX_READ_32BIT_REG(i->nnn());
op1_32 = BX_READ_32BIT_REG(i->rm());
sum_32 = op1_32 + op2_32 + temp_CF;
BX_WRITE_32BIT_REGZ(i->rm(), sum_32);
SET_FLAGS_OSZAPC_32(op1_32, op2_32, sum_32, BX_INSTR_ADD_ADC32(temp_CF));

For INC and DEC, the authors of Bochs decided to simply have two saved states, the oszapc state and the oszap state. While most arithmetic operations save their temporary lazy flags state to the oszapc structure, INC and DEC save to the oszap structure, as this code for a 32-bit INC shows:

Bit32u op1_32;
op1_32 = BX_READ_32BIT_REG(i->rm());
op1_32++;
BX_WRITE_32BIT_REGZ(i->rm(), op1_32);
SET_FLAGS_OSZAP_RESULT_32(op1_32, BX_INSTR_INC32);

This necessitates one extra "switch" statement in every lazy evaluate method, such that the correct lazy state is used.

A final wrinkle is evident in the actual conditional branch code, found in the Bochs 2.3.5 source code file cpu\ctrl_xfer32.cpp, when evaluating the flags:

void JCC_Jd(bxInstruction_c *i)
{
bx_bool condition;

switch (i->b1() & 0x0f) {
case 0x00: /* JO */ condition = get_OF(); break;
case 0x01: /* JNO */ condition = !get_OF(); break;
case 0x02: /* JB */ condition = get_CF(); break;
case 0x03: /* JNB */ condition = !get_CF(); break;
case 0x04: /* JZ */ condition = get_ZF(); break;
case 0x05: /* JNZ */ condition = !get_ZF(); break;
case 0x06: /* JBE */ condition = get_CF() || get_ZF(); break;
case 0x07: /* JNBE */ condition = !get_CF() && !get_ZF(); break;
case 0x08: /* JS */ condition = get_SF(); break;
case 0x09: /* JNS */ condition = !get_SF(); break;
case 0x0A: /* JP */ condition = get_PF(); break;
case 0x0B: /* JNP */ condition = !get_PF(); break;
case 0x0C: /* JL */ condition = getB_SF() != getB_OF(); break;
case 0x0D: /* JNL */ condition = getB_SF() == getB_OF(); break;
case 0x0E: /* JLE */ condition = get_ZF() || (getB_SF() != getB_OF()); break;
case 0x0F: /* JNLE */ condition = (getB_SF() == getB_OF()) && !get_ZF(); break;
...

As you can see, some of the conditions, particularly the ones used in C-style "for" loops, require evaluating 2, even 3, conditional flags. That means potentially 6 or even 9 unpredicted branches. If you look at the compiled code for the inner loop of my simple test program, it looks like this:

; 25 : for(i = 0; i < 100000000; i++)
00010 33 ff xor edi, edi
; 26 : t += foo(i);
00012 57 push edi
00013 e8 00 00 00 00 call _foo
00018 83 c4 04 add esp, 4
0001b 03 f0 add esi, eax
0001d 47 inc edi
0001e 81 ff 00 e1 f5 05 cmp edi, 100000000 ; 05f5e100H
00024 7c ec jl SHORT $L83042

_foo PROC NEAR
; 16 : return(i+1);
00000 8b 44 24 04 mov eax, DWORD PTR _i$[esp-4]
00004 40 inc eax
00005 c3 ret 0

Each iteration of the loop involves ten x86 instructions, of which six - the two ADDs, the two INCs, the CMP, and the JL - either read and/or update the flags state. Three of those instructions are the problematic ones that introduce complications: the two INCs for not updating the Carry flag, and the JL for needing to lazily evaluate multiple flags. So what we now have is a pretty complicated mess:

  • Most arithmetic operations save up to 5 pieces of lazy state information to one of two different lazy state data structures.
  • Lazy evaluation of any one arithmetic condition flag requires three potentially unpredicted branches.
  • ADC and SBB need to evaluate the Carry flag during the execution.
  • INC and DEC alone necessitate the second lazy data structure and one of the three unpredicted branches.
  • Common C/C++ "for" loops can hit a perfect storm of unpredicted lazy flags evaluation.

The code above is representative of a lot of existing Windows code. I will leave it up to you the reader now to look through the QEMU source code and see the similar problems that QEMU faces even though it uses dynamic recompilation and claims to be "fast". 40 to 50 times slower than native execution is not fast, especially when Bochs is nipping at its heels.

As I mentioned a few pages ago, a sign of failure in a virtual machine implementation is to abandon dynamic recompilation and take the easy route of direct execution, as QEMU does to work around this performance bottleneck. A better approach (and the kind of problem I like to solve) is to study the real problem and address the root cause of that performance slowdown.


Faster Lazy Evaluation

So what did I do to roughly double the performance of this code in Bochs? The solution is ridiculously obvious and comes in four parts.

The first optimization is to realize that arithmetic condition flags are updated so frequently that when they are used they are typically needed just once. For example, some variable is tested and that is followed by a conditional branch which tests one or more of the flags. So given this reality, is it really necessary to propagate the lazy flags state to the eflags state each and every time? Absolutely not! So the very first optimization is to simply eliminate these three lines of code from each and every one of the lazy evaluation methods:

lf_flags_status &= 0xff0fff;
eflags.val32 &= ~0x40;
eflags.val32 |= zf<<6; // zf always exactly 0 or 1.

The second optimization is to eliminate the two lazy flags temporary structures and replace them with one set of state. This is done by explicitly calling get_CF() in the implementations of the INC and DEC guest instructions. Now all arithmetic operations appear to always update all six arithmetic flags, necessitating just one lazy flags state.

The third optimization is to eliminate the size value needed to be stored in the lazy flags state. The differences between evaluating the Zero flag for 8-bit operations, 16-bit operations, and 32-bit operations for example come down to these three different lines of code in Bochs 2.3.5:

zf = (BX_CPU_THIS_PTR oszapc.result_8 == 0);
zf = (BX_CPU_THIS_PTR oszapc.result_16 == 0);
zf = (BX_CPU_THIS_PTR oszapc.result_32 == 0);

Would it not be much simpler to store a sign-extended form of the temporary state? Duh?
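
For example, if the 8-bit and 16-bit operations sign-extend their results into the full-width lazy state when they store it, a single size-independent comparison then works for every operand width. A sketch of the idea (the variable names are illustrative, not the exact Bochs identifiers):

result_lo = (Bit32u)(Bit32s)(Bit8s)result_8;    // 8-bit op: sign-extend before saving
result_lo = (Bit32u)(Bit32s)(Bit16s)result_16;  // 16-bit op: likewise

zf = (result_lo == 0);     // one evaluation now covers all operand sizes
sf = (result_lo >> 31);    // the sign bit survives the sign-extension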

The final optimization is to realize that the amount of code needed to evaluate a given arithmetic flag is quite small. In fact, far more code has to be executed just to reach the simple "zf = result == 0" line of code than is required for that line of code itself. My final insight therefore was to realize that being too lazy is a bad thing. It is much more efficient to do a little more work up front. If you look at my new version of the ZF lazy evaluation function, it is literally just one line:

bx_bool get_ZFLazy(void)
{
return (BX_CPU_THIS_PTR result_lo == 0);
}

and similarly the Sign flag is evaluated in one line of code as well:

bx_bool get_SFLazy(void)
{
return (result_lo >> 31); // extract high bit of result
}

No switch statements, no unpredictability. This is so efficient that the C++ compiler inlines this into the get_ZF() function, thus even eliminating a function call. Instrumentation on real Windows code shows that the Zero flag is by far the most commonly evaluated flag, and thus simplifying the code path to evaluating ZF is a huge "bang for the buck".

What I have done therefore is front-loaded the lazy flags evaluation to do slightly more work at the time the arithmetic operation happens, in exchange for vastly simplifying and speeding up the evaluation of the flags when they are needed. By eliminating the need to store the enumeration of the last instruction executed, my scheme also reduces the number of temporary values needing to be written down to 4. It also reduces the overall size of the temporary lazy flags state by merging everything into 16 bytes of state.

I have submitted my changes to Stanislav, so expect this code to be checked in to Bochs very soon!

The speedup is so significant - almost doubling the speed of Bochs 2.3.5 on my little test program - that it indicates that the majority of CPU cycles in Bochs are spent evaluating lazy flags and the related branch mispredictions which that causes. The 9- and 12-second reduction in execution time of my new lazy flags code over Bochs 2.3.5, if applied to QEMU, would reduce QEMU's execution times to about 1.5 and 2.0 seconds respectively, which equates to about an 8x to 10x slowdown. I am not saying that will happen, but I am willing to bet that QEMU's dynamic recompilation mode could be made significantly faster with these very simple tweaks to their lazy evaluation code.


XOR To The Rescue Again!

I have shown you how to lazily evaluate two of the flags, ZF and SF, because they are the easiest to understand. But there are four more flags to go. The Parity flag is also relatively straightforward. PF indicates whether the low 8 bits of the result contain an even or odd number of "1" bits (and similarly of "0" bits). Any scheme which counts bits in an integer will work here, and I chose to preserve the Bochs 2.3.5 implementation of using a 256-entry lookup table which returns a 1 or 0 for each of the possible 256 values of those low 8 bits. Parity can therefore be derived by reusing the same temporary result field as is used to determine SF and ZF.
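
If you prefer to see what that table computes, here is a table-free sketch (my own illustration; Bochs keeps the 256-entry lookup, which avoids these shifts entirely):

static bx_bool lazy_PF(Bit32u result_lo)
{
    Bit8u v = (Bit8u)result_lo;  // PF only looks at the low 8 bits of the result
    v ^= v >> 4;                 // XOR-fold the byte down to a single bit
    v ^= v >> 2;
    v ^= v >> 1;
    return (v & 1) == 0;         // PF = 1 when the count of "1" bits is even
}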

The interesting flags are Carry, Overflow, and Adjust, which require some deep thought. Most people understand the concept of a Carry bit. Add two numbers, and if the result of a digit does not fit, add a 1 (a Carry-out bit) to the digit to the left. This works when doing decimal long addition, and it works in a microprocessor when doing binary addition. The only catch is that you require one extra bit of computation. Adding two 8-bit numbers can result in a 9-bit sum. That is workable on a 32-bit host CPU, but what about the result of a 32-bit addition? That requires 33 bits! What about a 64-bit addition? That would require 65 bits. No C++ compiler that I know of performs 65-bit arithmetic operations. A better way has to exist that does not require a wider result than the original operation.
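
The obvious workaround, sketched below, is to compute the sum in a wider integer type. That works fine for 8-, 16- and 32-bit operands but dead-ends at 64 bits, which is why a smarter trick is needed:

Bit64u wide = (Bit64u)op1_32 + (Bit64u)op2_32;
bx_bool cf  = (wide >> 32) != 0;   // the "33rd bit" is the Carry-out
// ...but a 64-bit ADD would need a 65-bit sum, so this approach does not generalize.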

The authors of Bochs have already figured this out, but their clever solution was buried deep away in the ugly lazy evaluation code. I have taken that code and put it inline into the SET_FLAGS* macros. The code needs some explaining to prove to you that it works.

For the time being let's stick to 32-bit arithmetic operations such as ADD, and let's assume that we have the ability to do 33-bit arithmetic so as to be able to record the Carry-out result in the 33rd bit. How from that do we calculate the Overflow flag?

We instinctively understand the concept of signed overflow. Pretend integers are two bits wide. Therefore only 4 possible numbers exist:

Binary number    Decimal unsigned value    Decimal signed value
00               0                         0
01               1                         1
10               2                         -2
11               3                         -1

Table 11-2: A 2-bit signed integer number space

A signed overflow occurs when the sign of the result is incorrect. For example, in this number space, 00+00=00 generates neither a Carry nor an Overflow condition. 01+01=10 is a signed overflow, because in decimal, 1+1 <> -2. Therefore the Overflow flag is set. There is no Carry though, because no unsigned overflow occurred. In decimal, 1+1=2, which is correct.

Look at 11+11=10. As signed numbers there is no Overflow, because -1 + -1 = -2 in decimal. But, as unsigned numbers, 11+11=110 in binary, and 3+3=6 in decimal. That result cannot be represented in just 2 bits, and therefore this is a Carry-out condition.

I hope that intuitively you can see how Carry and Overflow are very similar, and are both indicators that the result does not fit. Carry is such an indicator when performing an unsigned addition, and Overflow is such an indicator when performing a signed addition.

I have simplified the general case of 32-bit integers down to 2-bit integers but the idea is the same. The high bit is considered the "sign" bit, and the remaining bits (in this case, just the low bit) are the actual number. By doing this thought experiment on 2-bit integers it is easy to derive the entire truth table of all 16 possible combinations of 2-bit additions, and thus all 16 flags results:

Binary input A    Binary input B    3-bit Binary Sum    Binary output D    Carry-out flag    Overflow flag
00                00                000                 00                 0                 0
00                01                001                 01                 0                 0
00                10                010                 10                 0                 0
00                11                011                 11                 0                 0
01                00                001                 01                 0                 0
01                01                010                 10                 0                 1
01                10                011                 11                 0                 0
01                11                100                 00                 1                 0
10                00                010                 10                 0                 0
10                01                011                 11                 0                 0
10                10                100                 00                 1                 1
10                11                101                 01                 1                 1
11                00                011                 11                 0                 0
11                01                100                 00                 1                 0
11                10                101                 01                 1                 1
11                11                110                 10                 1                 0

Table 11-3: Complete truth table for 2-bit addition A + B = D.

The Carry flag is easy to see. If the result generates a 3rd output bit, you have a Carry-out condition, and therefore you have the Carry flag set. Overflow is not quite as obvious, and only occurs in four of the conditions, 01+01=10, 10+10=00, 11+10=01, and 10+11=01. You can see that the last two are actually mirror images of each other, so really just three cases, which can be generalized as just two cases:

  • positive + positive = negative is a signed overflow
  • negative + negative = positive is a signed overflow

So if you are adding two input numbers of the same sign, and the result is the opposite sign, you have experienced a signed overflow condition and thus the Overflow flag must be set.

Hmmm, do you remember the XOR trick I showed you in Part 8? The XOR operation is useful to determine when bits change. If you want to test whether two integers have the same sign, just XOR them together and test the high bit of the result. If it is 0, the two integers are of the same sign. If it is 1, the two integers are of opposite signs. We know that the calculation is going to involve "A XOR D" or "B XOR D" to check if the sign of the result is different, and it will involve "A XOR B" to determine if the input signs are the same. The Overflow flag calculation is now solved in fact!

For any two signed integers (A,B) being added, the value of the Overflow flag is equal to OF(A+B) = ((A XOR D) AND NOT (A XOR B)) < 0. In other words, if the signs of the inputs are the same AND the sign of the result is different from the inputs, you have a signed overflow. The expression can also be written as OF(A+B) = ((B XOR D) AND NOT (A XOR B)) < 0 since either of the input values can be XOR-ed with the result.

It is left up to you the reader to derive the truth table for subtraction and prove to yourself that the formula for subtraction is very similar: OF(A-B) = ((A XOR D) AND (A XOR B)) < 0.

The original lazy evaluation code in Bochs 2.3.5 implements exactly these formulas, using a mask to select which bit position to calculate overflow for, as seen in the source file cpu\lazy_flags.cpp:

#define GET_ADD_OVERFLOW(op1, op2, result, mask) (((~((op1) ^ (op2)) & ((op2) ^ (result))) & (mask)) != 0)
#define GET_SUB_OVERFLOW(op1, op2, result, mask) (((((op1) ^ (op2)) & ((op1) ^ (result))) & (mask)) != 0)

To properly understand the Carry flag requires one additional piece of knowledge, which I found in IBM's PowerPC Programmer Reference manual in the description of the PowerPC's Overflow flag. It states that the Overflow flag is simply the XOR of the Carry-out of the high two bits of the result. Therefore, if you can figure out how to get the Carry-out bit of the 31st bit instead of the 32nd bit, it can be XOR-ed with the Overflow value just calculated and presto, you have the correct Carry flag!

A hint to this formula is found in Bochs 2.3.5 already, in the code to lazily evaluate the Adjust flag:

af = ((oszapc.op1_32 ^oszapc.op2_32) ^oszapc.result_32) & 0x10;

The Adjust flag is also known as the Auxiliary Carry in some x86 manuals, since it is merely the Carry-out bit of the lower 4 bits of an addition. AF and CF are calculated in the same way, just in different bit positions. It turns out (and I'll spare you the truth table, but it works), if you simply XOR the input values and then XOR that result with the sum, you get the Carry out bit of the previous bit. In other words, CFprev_bit(A+B) = ((A XOR B) XOR D) < 0. This works because the result of A + B for any given bit is the same as A XOR B unless there is a carry coming from the previous bit.

So now we have the complete formula for the Carry flag:

  • CF(A+B) = (((A XOR B) XOR D) < 0) XOR (((A XOR D) AND NOT (A XOR B)) < 0).
  • CF(A-B) = (((A XOR B) XOR D) < 0) XOR (((A XOR D) AND (A XOR B)) < 0).
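
If you want to convince yourself that these formulas hold, here is a small self-contained test program (standard C++, my own illustration rather than Bochs code) which derives CF and OF for a 32-bit addition using nothing wider than 32-bit operations:

#include <cassert>
#include <cstdint>

static uint32_t add32(uint32_t a, uint32_t b, bool &cf, bool &of)
{
    uint32_t d    = a + b;
    uint32_t xvec = a ^ b;                   // where the two inputs differ
    uint32_t ovec = (a ^ d) & ~xvec;         // signed-overflow vector for ADD
    of = (ovec >> 31) != 0;                  // OF = high bit of the overflow vector
    cf = (((xvec ^ d) ^ ovec) >> 31) != 0;   // carry-out of bit 30, XOR-ed with OF
    return d;
}

int main()
{
    bool cf, of;
    add32(0xFFFFFFFFu, 1u, cf, of);   // -1 + 1: Carry out, no signed overflow
    assert(cf && !of);
    add32(0x7FFFFFFFu, 1u, cf, of);   // INT_MAX + 1: signed overflow, no Carry
    assert(!cf && of);
    add32(3u, 5u, cf, of);            // small positive add: neither flag set
    assert(!cf && !of);
    return 0;
}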

While these seem complicated, there are common values being evaluated multiple times, and in reality the full code for deriving the lazy flags state for a 32-bit ADD instruction is this:

SET_FLAGS_OSZAPC_ADD_32(op1,op2,result) {
Bit32u xvec = ((op1)^(op2));
Bit32u ovec = ((op1)^(result)) & ~xvec;
result_lo = result;
result_ofv = ovec;
result_xor = xvec ^ ovec ^ result;
lf_flags_status = EFlagsValidOACMask | EFlagsOSZAPCMask; }

This compiles into a fairly short piece of code with some instruction-level parallelism performing 6 integer operations on the host. The new lazy flags state in Bochs requires just 4 integer values: the two input operands, the result, and a bit mask which states that all the arithmetic flags are lazy. I use an additional mask which states that the OF AF and CF are valid. If this mask bit (EFlagsValidOACMask) is not set, the lazy evaluation code returns 0 for those three flags. This permits an even more optimal piece of code to be used with AND OR XOR TEST and other instructions that explicitly clear those flags, for example 32-bit XOR:

SET_FLAGS_OSZAPC_LOGICAL_32(result) {
result_lo = result;
lf_flags_status = EFlagsSZPOthersClearedMask; }

Not bad, just two instructions, and actually less code than was being emitted in Bochs 2.3.5! Reading the Carry flag is quite simple; it is almost as efficient as the Zero flag:

bx_bool get_CFLazy(void)
{
return (result_xor & lf_flags_status) >> 31;
}

and similarly with the Overflow flag:

bx_bool get_OFLazy(void)
{
return (result_ofv & lf_flags_status) >> 31;
}

Finally, for the Adjust flag: it is extracted in the same way as the Carry flag, but from the 4th bit of the carry-out vector:

bx_bool get_AFLazy(void)
{
return (result_xor >> 3) & (lf_flags_status >> 31);
}

This new lazy flags algorithm is not only portable because it is written in plain C++, but it is actually significantly faster than even the inline assembly that uses PUSHF/POP! Bochs 2.3.5 already has this clever code, it just had it in the wrong place and surrounded it with too much slow code.

In summary, replacing code that's full of "if" statements and indirect jumps and other forms of poorly branch-predictable code with branchless alternatives is a powerful optimization technique. Many C and C++ compilers today emit code that uses these "branchless" techniques instead of the older ways that were used 10 or 20 years ago. For other neat "bit twiddling" tricks, there is this very cool web page (http://graphics.stanford.edu/~seander/bithacks.html) that shows many other tricks for evaluating common expressions without requiring conditional branches or "if" statements.
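
As one tiny example of the genre (a classic trick of the same family, not code from this article's flags work): computing the absolute value of a 32-bit integer without any conditional branch:

#include <cstdint>

int32_t abs_branchless(int32_t x)
{
    int32_t m = x >> 31;    // arithmetic shift: 0 when x >= 0, all ones when x < 0
    return (x ^ m) - m;     // conditionally negates x without an "if"
}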


I will be out of town for the Thanksgiving long weekend here in the United States and will resume my posts again in December. Keep those comments and ideas coming by emailing me at darek@emulators.com.

[Part 1]  [Part 2]  [Part 3]  [Part 4]  [Part 5]  [Part 6]  [Part 7]  [Part 8]  [Part 9]  [Part 10]  [Part 11]  [Next]