NO EXECUTE!

(c) 2010 by Darek Mihocka, founder, Emulators.com.

July 25 2010

"Beyond debugging, migration to emulated environments opens the door to cross-platform migration of virtual machines.  A virtual machine running on a slow ARM chip in a mobile device could be migrated to a fast desktop or server machine.  Code running in abstract machines such as the JVM or .NET CLR could have the JIT cache invalidated and be recompiled from bytecode for the new native platform.  Currently, virtual-to-emulated (V2E) and emulated-to-virtualized (E2V) migrations using QEMU are still experimental.  They are likely to become a significant feature of Xen in the future, however."

    - The Definitive Guide to the Xen Hypervisor, David Chisnall, Prentice Hall, 2008

Prophetic words! This quotation from a book on virtualization published two years ago alludes to the near future, where someday virtual machines will migrate freely across devices without worry about which chip vendor's hardware is being used. I could not have written a better introduction to this week's discussion on QEMU. Since I have discussed Virtual PC, PowerPC, ARM, and Bochs recently, let's now take that closer look at QEMU.


A Closer Look At QEMU

As I mentioned in previous postings, three years ago I first looked at QEMU version 0.9 hosted on Windows and quickly dismissed it as junk. It was too slow (in some cases even slower than the Bochs interpreter) and too buggy. My interest in QEMU was ignited again last summer when I happened across a new Windows port of QEMU called, appropriately enough, WinQEMU, hosted at this SourceForge page: http://sourceforge.net/projects/winqemu/

I found WinQEMU 0.10.2 to be measurably faster than the previous 0.9 build I had tried two years earlier. As I have alluded to this summer, I have been following QEMU, enlisting in the sources for what have now been the 0.12.4 and 0.12.5 releases, and have seen enough improvement to put my work on Bochs on hold for a while in order to evaluate QEMU more deeply.

As fortune would have it, the latest point release of QEMU, version 0.12.5, was released just days ago on the QEMU web sites:

    http://www.qemu.org/

    http://www.qemu.com/

For QEMU novices, I would suggest grabbing these 0.12.5 sources as your starting point. Until there is a significant new point release in the future, I will also continue to use the 0.12.5 release as my reference for future discussions.

In a recent exchange of emails, Yan Wen, the author of WinQEMU, assured me he is porting version 0.12.4 (hopefully 0.12.5 now) to Windows and will update the WinQEMU release within a month or two.

More adventurous readers can enlist into QEMU directly. To do this on Linux, and I have successfully done this on both Fedora 12/13 and Debian 5.0.5 releases, you will need to make sure that you install the gcc and g++ packages (which you would have already for building Bochs), the git or git-core package for installing the Git source control system, the make package, and the SDL development package (on Debian you will want the libsdl1.2-dev package).

On Fedora 12 and 13, use the "yum" command to install packages, such as:

    yum -y install gcc

while on Debian, use the similar "apt-get" command:

    apt-get install gcc

A few other packages I find very convenient to install include "rdesktop", which permits a Linux machine to Remote Desktop into a Windows machine, and "ntfs-3g", which gives read-write file access to any Windows NTFS partition which may reside on the same machine. Since I dual-boot regularly between Windows 7 and Linux on several of my machines, I like to keep all my virtual machine disk images, test programs, and source on common NTFS partitions, which I can then easily access from both Windows 7 and Linux.

For convenience in building both Bochs and QEMU and other open source projects you might enlist in, I suggest also installing the packages "cvs" and "subversion" (which are two other popular source control systems), "xterm" to make sure that the X11 graphical subsystem is installed, "bochs" and "qemu" themselves in order to install pre-requisites such as BIOS images and device models, "wine" which allows running some Windows applications directly from Linux, and "bximage" for creating Bochs disk image files.

Generally, the configure script of any given project will point out any components which you are missing and will prompt you to install the appropriate missing packages.


Evaluating QEMU

For today I am going to look at three particular builds of QEMU - that terrible 0.9 release from 2007, the WinQEMU 0.10.2 release from 2009, and the current 0.12.5 release which I have built on both 64-bit PowerPC G5 and 64-bit x86-64 Fedora Linux machines from the tip-of-tree sources.

Bochs, QEMU, the Xen hypervisor, and the KVM virtualization built in to current Linux releases are all related. Bochs and QEMU started out sharing very similar BIOS and VGA BIOS sources as well as device models - Bochs being the purely interpreted x86 virtual machine, and QEMU being the "jit" (dynamic binary translation) based x86 virtual machine. Over the years the two have diverged, and QEMU now also supports emulating 68040, PowerPC, ARM, MIPS, SPARC, and other architectures.

More recently, the QEMU framework has been used as the basis for VT-based hypervisors such as KVM, Xen, and VirtualBox. While I disagree with the VT approach of course, the nice thing about using QEMU as a starting point is that virtual machine disk images can easily be reused on different virtualization products. For my testing, I have several Windows 2000, Windows XP, Windows 7, and Linux disk images which I originally created in Bochs which I also test with the various QEMU releases and even KVM. The virtual machines do not have to be re-built from scratch for each hypervisor, but rather, those base images I create in Bochs can be used as-is everywhere.

For my testing I mainly used my oldest Windows 2000 disk image due to its small size, quick boot time, and because it is the least cluttered of my images. The Windows 2000 disk image holds a complete Windows 2000 Workstation Service Pack 4 install, plus Microsoft .NET 2.0 framework, FireFox 3.6, Visual Studio 98, and all of my test programs including the Gemulator 9 sources.

Why Visual Studio 98? Because that is the compiler I have used for over a decade to build the Gemulator product on Windows and because it has a ridiculously small disk footprint (roughly 18 megabytes for the command line build tools, header files, and libraries, and an additional 36 megabytes for the VS98 IDE). Sticking to the same build tools gives me a consistent way to compare benchmarks over the years without worrying that I am introducing variability due to compiler changes and thus code quality changes. For the same reasons, I tend to do most of my Bochs and QEMU testing using the older Windows 2000 and XP images instead of constantly changing what I am testing month to month.

My regular readers will be familiar with the other tests I used to evaluate QEMU, as I have referenced several of them and/or provided source code in past postings of NO EXECUTE!:

  •     CPU_TEST - my framework of hundreds of small (and mostly assembly language) x86 and Windows micro-benchmarks
  •     HDTEST32 - a utility I wrote to measure raw unbuffered disk write throughput using various block write sizes
  •     T1FAST, T1SLOW, SIMP - small Visual Studio compiled C test programs I wrote which measure common function calling and integer code patterns
  •     MEMBAND - a utility I wrote to measure memory copy bandwidth based on block copy size and misalignment between source and destination blocks
  •     LPIPE - a utility I wrote which measures the speed of sending data via Windows pipes
  •     CPU-Z - a third party utility from http://www.cpuid.com/ which displays processor information such as model and x86 instruction capabilities
  •     FRACTAL - my first 32-bit Windows program from about 15 years ago, calculates and displays a fractal image
  •     TESTFLOAT - a third party utility from http://www.jhauser.us/arithmetic/index.html which verifies x87 floating point correctness
  •     111 - a custom test I wrote to check for common virtual machine memory implementation errors
  •     ND32 - a custom set of micro-benchmarks which evaluate branch prediction performance
  •     And a few others, such as managed C# variants of SIMP and some scripted Microsoft Office tests.

When I test and benchmark Bochs, or QEMU, or KVM, I run these tests as if I was evaluating a native x86 processor directly. When I show results below, I will be clear as to which virtual machine product was running on which host x86 or PowerPC hardware, and make it clear in the few cases where I am running a test directly on bare metal.


The Read-Modify-Write Bug

In my 2008 paper "Virtualization Without Direct Execution", I provided CPU_TEST T1FAST and T1SLOW benchmark results to show how QEMU 0.9's performance was barely faster than that of Bochs 2.3.7 after Stanislav and I had gone through and cleaned up the Bochs x86 interpreter engine. What I didn't go into much detail about were some rather blatant and serious x86 correctness errors in QEMU.

The most serious in my opinion is one which is ridiculously easy to reproduce and results in incorrect data being written to memory. The scenario is this:

  •     allocate some memory using the Windows VirtualAlloc API (which I discussed in detail back in Part 4)
  •     perform a memory read-modify-write operation such as an addition operation to an integer in memory
  •     read the integer to verify that the value in memory is correct

It is a pretty simple test, which I originally wrote over 5 years ago to verify a bug I had discovered in Virtual PC 7, which unfortunately I reported to Microsoft too late to get into that final Virtual PC 7.02 release. The core code of the test consists of two inlined assembly language instructions:

    __asm stc ; set Carry flag
    __asm adc dword ptr [eax],1 ; 0 + 1 + Carry should equal 2

The STC instruction (set carry) ensures that the x86 Carry Flag is in a known state; for this test I set the Carry. The ADC (Add with Carry) instruction performs a 3-input addition, taking the value in memory pointed to by register EAX, adding the constant 1 to it, and also adding the Carry Flag. The result is then written to memory, and the new value of the Carry Flag reflects the result.

Now, in the most trivial case, when the test program is freshly launched and any allocated memory is filled with zeroes, the value written to memory will be 0 + 1 + 1 = 2. If this was a global variable in memory, it would start with 0 and end up with the value 2. Easy, how could a virtual machine possibly blow such a simple piece of code?
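The arithmetic that ADC performs can be modeled in a few lines of C. This is only an illustrative sketch: the adc_result type and the adc32 function are hypothetical names, not code from the test program or from any emulator.

```c
#include <assert.h>
#include <stdint.h>

/* Result of a 32-bit add-with-carry: the sum and the outgoing Carry Flag. */
typedef struct {
    uint32_t value;
    int carry_out;
} adc_result;

/* Model of ADC: dst + src + carry_in, truncated to 32 bits. */
adc_result adc32(uint32_t dst, uint32_t src, int carry_in)
{
    assert(carry_in == 0 || carry_in == 1);
    /* Do the addition in 64 bits so that bit 32 holds the carry out. */
    uint64_t wide = (uint64_t)dst + (uint64_t)src + (uint64_t)carry_in;
    adc_result r;
    r.value = (uint32_t)wide;
    r.carry_out = (int)(wide >> 32);
    return r;
}
```

Calling adc32(0, 1, 1) models the test above: it returns the value 2 with the outgoing Carry Flag clear, while adc32(0xFFFFFFFF, 0, 1) wraps around to 0 and sets the Carry Flag.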

Well, when this code sequence is run on Virtual PC 7.02, the value written to memory ends up being 1. Not 2, 1.

This is flat out a correctness error in Virtual PC's x86 integer and memory emulation which could result in something as simple as a program malfunctioning or something as serious as a security exploit. If the value being written to memory is, say, a pointer, and the value of the pointer is now corrupted, a crash or data corruption could easily occur.

Any wonder then that many third party Windows programs simply fail to run in Virtual PC for Mac? There is the likely culprit.

How does this possibly happen?!?!?!?! The key lies in understanding how Windows initializes memory and looking at some extra debugging information I have instrumented into my test program.

Windows, like Linux, runs user mode applications using virtual memory address translation. A pointer in a Windows program does not point to physical memory, rather it is a pointer into virtual address space which is then translated by the TLB and page tables to the actual physical address. In a virtual machine such as Virtual PC or QEMU, this translation may be done in software. Read posting Part 8 for some background into how this is done.

On top of that, Windows and Linux are lazy about even allocating the physical memory until it is actually used, which is sort of the whole point of virtual memory. A Windows program can allocate 100 gigabytes of virtual memory even on a computer with perhaps two gigabytes of actual physical RAM. The operating system allocates the mapping of virtual to physical memory as needed, swapping pages of physical memory out to the pagefile to give the illusion of there being 100 gigabytes of memory on a two-gigabyte machine.

A further optimization which Windows performs is to not even assign the physical memory page until a given page of virtual memory is actually written to. The write operation is the key, because for as long as you merely allocate memory and read from it, you will read zero values. The operating system plays page table tricks to map all of the pages of a block of virtual memory to a common 4K zero page. For example, that 100-gigabyte allocation would consist of over 26 million pages, but Windows merely needs to set those 26 million page table entries to all point to the exact same 4K page that contains nothing but zeroes and which is marked as a read-only page.
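The page count quoted above is easy to check. A minimal sketch, assuming 4K (4096-byte) pages and binary gigabytes (2^30 bytes); pages_for and GUEST_PAGE_SIZE are hypothetical names chosen for this illustration:

```c
#include <assert.h>
#include <stdint.h>

#define GUEST_PAGE_SIZE 4096u   /* assumed 4K x86 page size */

/* Number of pages needed to cover an allocation of the given size. */
uint64_t pages_for(uint64_t bytes)
{
    assert(bytes <= UINT64_MAX - (GUEST_PAGE_SIZE - 1));
    /* Round up so that a partially used page still counts as a page. */
    return (bytes + GUEST_PAGE_SIZE - 1) / GUEST_PAGE_SIZE;
}
```

For a 100-gigabyte allocation, pages_for(100ULL << 30) gives 26,214,400 pages, which is the "over 26 million" page table entries that would all point at the same zero page.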

When a write finally occurs, the x86 processor throws an Access Violation exception, since the page table has that zero page marked as read-only. At this point, the Windows kernel then allocates the new physical page (possibly swapping one out to the pagefile to make room), changes the page table entry in the program's page table to now make that 4K block of virtual memory translate to the newly allocated physical page, and then restarts the faulting x86 instruction to complete the write.

As a further optimization, Windows may not even allocate the page table entries themselves until they are referenced. In other words, page tables themselves are swapped in and out of the pagefile, so just the mere act of reading newly allocated virtual memory can cause an Access Violation as well. Windows uses this trick to lazy allocate page table memory, and it is during this read fault that Windows will initialize the page table entry to point to that common zero page.

With this knowledge, you can see that the ADC instruction could cause two faults, a read Access Violation fault, followed by a write Access Violation fault. This is in fact exactly what both QEMU 0.9 and Virtual PC 7.02 generate - a read fault and a write fault. It is also wrong!

It is wrong because a real x86 processor implements read-modify-write memory operations as a write snoop to the memory bus. Since the ADC instruction knows that it will perform both a read and a write, the ADC (and other arithmetic instructions capable of read-modify-write operations such as ADD SUB INC DEC XOR AND etc.) ask the memory bus for exclusive write access to that memory location. This is so that other cores can flush any cached (and soon to be stale) copies of that memory. Real x86 hardware, whether Intel or AMD and regardless of the processor model, generates only the write fault. The Windows kernel then sees that fault and bundles all of its actions together - allocate a page entry, allocate a physical page, map the entry to the page, and mark it writable - in one single round trip to the kernel. When the ADC instruction is then restarted, the memory is writable and the write succeeds.

Where Virtual PC 7.02 screws up is it seems to treat the read-modify-write operation as three separate operations consisting of the read from memory, the addition operation, and the write. What would then happen is this:

  •     The initial read faults, Windows maps the memory pointed to by EAX to the common zero page.
  •     The block of jitted code corresponding to the ADC instruction is executed again from the beginning, this time the read succeeds and reads the value 0.
  •     The addition is performed, getting a result of 2 and clearing the Carry Flag (since 0+1+1 does not cause arithmetic overflow).
  •     The write of 2 is attempted. This causes a write exception, Windows now goes and maps the page table entry to a fresh physical page.
  •     The block of jitted code corresponding to the ADC instruction is executed yet again from the beginning, repeating the read (which reads zero), repeating the addition, which now gives 0+1+0=1 due to Carry Flag already being cleared, and writing the value of 1.

Memory gets corrupted and the program fails. Virtual PC 7.02 has two gross errors:

  • it treats a read-modify-write operation as two distinct memory operations, and,
  • it updates the guest x86 register state such as the Carry Flag before it knows that the whole guest instruction will succeed.

QEMU 0.9 suffers from the first error as well, generating distinct read and write faults, and thus failing that portion of the test. Sadly, WinQEMU 0.10.2 as well as the latest QEMU 0.12.5 both fail this first simple test.

QEMU does correctly write the value 2 in this case, which at least indicates that it is buffering writing out the arithmetic flags state until after the write has succeeded. Memory at least is not corrupted by QEMU, but the incorrect fault on read could be detected by code running inside of the QEMU virtual machine.

But, it gets worse. If one recalls the AMD Phenom TLB bug, all sorts of ugly things happen on real hardware (causing all sorts of potential race conditions) when a memory access spans two pages of memory. This can be caused when a multi-byte memory access exactly such as the ADC example above is accessing an address at the last byte of a page, which thus causes two virtual pages of memory, two physical pages of memory, two L1 cache lines, and two page table entries to be accessed. The AMD and Intel manuals are full of errata discussing these kinds of problems due to the potential race conditions caused by having to sequence a lot of data operations between caches and memory. The well publicized 2008 bug in the AMD Phenom is an example of such a serious hardware error.
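Whether a given access straddles a boundary is a simple address calculation. The following is a minimal sketch, assuming power-of-two boundaries such as a 4K page or a 64-byte cache line; spans_boundary is a hypothetical helper name, not a function from any emulator:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if an access of `size` bytes starting at `addr` touches two
   different blocks of `boundary` bytes (for example two 4K pages). */
bool spans_boundary(uint32_t addr, uint32_t size, uint32_t boundary)
{
    assert(size > 0);
    assert(boundary > 0 && (boundary & (boundary - 1)) == 0);
    /* Use 64-bit math so an access near the top of memory cannot wrap. */
    uint64_t first_block = (uint64_t)addr / boundary;
    uint64_t last_block = ((uint64_t)addr + size - 1) / boundary;
    return first_block != last_block;
}
```

A 4-byte access at the last byte of a page, such as address 0x1FFF, gives spans_boundary(0x1FFF, 4, 4096) == true, while the same access at 0x1FFC stays within one page.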

So what happens in QEMU and Virtual PC when such an unaligned access is attempted? Using the exact same code sequence of STC and ADC but simply changing the value of EAX to point to the last byte of a page in the allocated memory block interestingly results in the value of 3 in Virtual PC 7.02 and 4 in QEMU 0.9. Yikes!

The problem here again is in not handling the read-modify-write operation as real x86 hardware would. To get a value of three, one has to look deep into the PowerPC manual to discover that the behaviour of misaligned writes across a page boundary is undefined. In other words, just don't do it! And I am guessing Virtual PC does it. What appears to happen is that the PowerPC performs a partial write, writing the value of 1 to memory before faulting on the write to the second page. This would cause additional faults to occur, re-reading the value of 1 and feeding that as the input into the re-execution of the ADC code. Whatever the exact sequence of events, the result should not be 3!

QEMU 0.9, using a software TLB implementation, appears to handle the buffering of the flags correctly, but either through a partial write error or by re-executing the faulting ADC instruction too many times eventually writes a value of 4. Big big mistake.

As in the previous example, WinQEMU 0.10.2 and QEMU 0.12.5 do write the correct answer of 2, but with the extra read faults being generated.

Interestingly all versions of Bochs which I have tested do work correctly. They generate only the write faults, and always write the value 2. This is because in Bochs, read-modify-write operations check access permissions before anything else, before the memory access is attempted, before the arithmetic operation is performed.

This simple change seems obvious to make in QEMU and would help eliminate any kind of memory related ordering bugs.
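The three behaviours described in this section can be reproduced with a toy model. The sketch below is not code from Virtual PC, QEMU, or Bochs; it is a hypothetical simulation of a single lazily allocated page, with invented names, in which a faulting instruction is simply restarted from the beginning after the guest kernel handles the fault.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one lazily allocated guest page holding the ADC target. */
typedef enum { PAGE_UNMAPPED, PAGE_ZERO_READONLY, PAGE_WRITABLE } page_state;

typedef struct {
    page_state page;
    uint32_t mem;      /* the integer in guest memory, initially zero */
    int carry;         /* guest Carry Flag */
    int read_faults;   /* faults the guest kernel saw on reads */
    int write_faults;  /* faults the guest kernel saw on writes */
} guest;

/* Guest state right after STC: page untouched, Carry Flag set. */
guest fresh_guest(void)
{
    guest g = { PAGE_UNMAPPED, 0, 1, 0, 0 };
    return g;
}

/* Guest kernel: a read fault maps the shared zero page read-only,
   a write fault maps a private writable page in a single trip. */
void on_read_fault(guest *g)  { g->read_faults++;  g->page = PAGE_ZERO_READONLY; }
void on_write_fault(guest *g) { g->write_faults++; g->page = PAGE_WRITABLE; }

/* 33-bit sum of memory + 1 + Carry, as ADC dword ptr [eax],1 computes it. */
uint64_t adc_sum(const guest *g)
{
    assert(g->carry == 0 || g->carry == 1);
    return (uint64_t)g->mem + 1u + (uint64_t)g->carry;
}

/* Separate read and write, Carry Flag committed before the write succeeds. */
guest adc_flags_committed_early(void)
{
    guest g = fresh_guest();
    for (;;) {
        if (g.page == PAGE_UNMAPPED) { on_read_fault(&g); continue; }
        uint64_t sum = adc_sum(&g);
        g.carry = (int)(sum >> 32);   /* guest state changed too soon */
        if (g.page != PAGE_WRITABLE) { on_write_fault(&g); continue; }
        g.mem = (uint32_t)sum;
        return g;
    }
}

/* Separate read and write, but flags are committed only after the write. */
guest adc_flags_deferred(void)
{
    guest g = fresh_guest();
    for (;;) {
        if (g.page == PAGE_UNMAPPED) { on_read_fault(&g); continue; }
        uint64_t sum = adc_sum(&g);
        if (g.page != PAGE_WRITABLE) { on_write_fault(&g); continue; }
        g.mem = (uint32_t)sum;
        g.carry = (int)(sum >> 32);
        return g;
    }
}

/* Write permission is checked before anything else is attempted. */
guest adc_write_check_first(void)
{
    guest g = fresh_guest();
    for (;;) {
        if (g.page != PAGE_WRITABLE) { on_write_fault(&g); continue; }
        uint64_t sum = adc_sum(&g);
        g.mem = (uint32_t)sum;
        g.carry = (int)(sum >> 32);
        return g;
    }
}
```

In this model, committing the flags early leaves 1 in memory after one read fault and one write fault, deferring the flags leaves the correct 2 but still takes both faults, and checking write access first leaves 2 after a single write fault.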


Floating Point Compatibility

The next area where both Virtual PC 7.02 and QEMU fail miserably is in the emulation of 80-bit x87 floating point instructions. Virtual PC 7.02 faces a slightly different problem from QEMU and Bochs. In those emulators, x87 floating point is handled by the open source SoftFloat library - http://www.jhauser.us/arithmetic/index.html - an IEEE floating point library implemented purely in portable C using 32-bit and 64-bit integers. SoftFloat is great, because it gives any emulator, regardless of the availability or lack of availability of floating point hardware on a given processor, the ability to implement proper 32-bit, 64-bit, and 80-bit floating point operations. As I mentioned above, SoftFloat comes with a test utility called TestFloat, which contains a series of unit tests for various floating point operations and compares them with the results of the native floating point hardware.

TestFloat runs perfectly with no errors on Bochs, as well as on any recent AMD or Intel processor. However, on Virtual PC 7.02 and on QEMU, many of the tests fail due to rounding errors, floating point status flags errors, or just plain incorrect numeric results. Virtual PC I can understand, as it does not appear to use the library, opting instead to use the native PowerPC floating point hardware. Since PowerPC does not support 80-bit floats, it is not surprising that just about all of the 80-bit floating point tests in TestFloat fail on Virtual PC.

What is not clear to me is why those same tests fail in QEMU. QEMU is using the SoftFloat library, and therefore it is unacceptable that it should fail while Bochs works correctly. From what I can tell, QEMU takes shortcuts in not updating the x87 status flags for Underflow and Inexact results. While these are ignored by most floating point code, it is simply incorrect (and detectable) to set these results incorrectly. In my opinion, this is rather low-hanging fruit that should be fixed in QEMU.


Performance on Real World Build Scenario

Obvious correctness errors aside, the most important thing QEMU needs to focus on is performance. Back in December 2008 in Part 27, I analyzed the performance of the then relatively new Intel Atom and Intel Core i7 processors. I used a couple of tests, a real-world Visual Studio build scenario of building my Gemulator 9.0 emulator, and a set of synthetic micro-benchmarks from my CPU_TEST suite. I recently repeated these tests on a variety of modern 2 GHz class processor hosts, measuring either native performance or the performance when running inside of a KVM, QEMU, or Virtual PC 7.02 virtual machine.

A set of results of the Visual Studio build of Gemulator 9.0 sources is summarized below, with some data repeated from the December 2008 posting, showing the build time, the host clock cycles, and the environment of the build, "native" to indicate the test was run natively in Windows, and everything else running the Windows 2000 virtual machine disk image. The various virtual machine host environments were KVM running on Fedora Linux 13, QEMU 0.9, WinQEMU 0.10.2, the latest QEMU 0.12.5, Virtual PC 7.02 for Mac, or my Windows build of Bochs 2.4.5.
 
Desktop computer specs       Gemulator 9 build time        Total clock cycles   Execution environment
(clock speed, CPU)           (seconds, lower is better)    (billions)
3460 MHz Core i5             15.6                          53.97                native
3460 MHz Core i5             265                           916.9                WinQEMU
2666 MHz Core i7             20.0                          53.32                native
2666 MHz Core 2 (Mac Pro)    22.1                          58.92                native
2666 MHz Core 2 (Mac Pro)    910                           2426                 Bochs 2.4.5
2260 MHz Centrino 2 Penryn   23.7                          53.56                native
2260 MHz Centrino 2 Penryn   37.9                          85.64                KVM
2666 MHz AMD Phenom          24.8                          66.12                native
2400 MHz AMD Phenom          35.4                          85.0                 KVM
2400 MHz Core 2 Q6600        24.0                          60.0                 native
2400 MHz Core 2 Q6600        329                           789.6                QEMU 0.12.5
2400 MHz Core 2 Q6600        383                           919.2                WinQEMU 0.10.2
2400 MHz Core 2 Q6600        477                           1144.8               QEMU 0.9
2500 MHz PowerMac G5         126                           315                  VPC 7.02
2500 MHz PowerMac G5         505                           1262.5               QEMU 0.12.5
2000 MHz PowerMac G5         995                           1990                 QEMU 0.12.5
1250 MHz Mac Mini G4         2092                          2615                 QEMU 0.12.5

The data may look a little confusing at first, but let me walk you through it. From the Core 2, Core i5, and Core i7 native results, the bottom line is that the build of Gemulator 9 requires about 53 to 60 billion host clock cycles. Not surprising given the similarity of those architectures. Regardless of the clock speed, the absolute amount of work required is about the same.
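The "total clock cycles" column is just the build time multiplied by the host clock speed. A minimal sketch of that arithmetic, where total_gcycles is a hypothetical helper name and the result is expressed in billions of cycles:

```c
#include <assert.h>

/* Host clock cycles consumed, in billions, for a run of `seconds`
   on a processor clocked at `clock_mhz` megahertz. */
double total_gcycles(double seconds, double clock_mhz)
{
    assert(seconds >= 0.0 && clock_mhz > 0.0);
    /* seconds * MHz gives millions of cycles; divide by 1000 for billions. */
    return seconds * clock_mhz / 1000.0;
}
```

For the native Core i5 run, total_gcycles(15.6, 3460) comes to roughly 53.98 billion cycles, and for QEMU 0.9 on the Q6600, total_gcycles(477, 2400) comes to 1144.8 billion.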

What is interesting is to compare those results against the virtualized times. On the two systems that I have Fedora 13 running with KVM virtualization (my Penryn and Phenom boxes), the amount of absolute work rises to about 85 billion cycles. This indicates that KVM introduces approximately a 40% to 60% performance overhead for its virtualization. Not quite the zero-overhead cost of hardware virtualization as perpetuated. Real world workloads, which require emulating disk I/O, interrupts, exceptions, ring transitions, and hardware, do experience a slowdown even using VT, something that VMware pointed out four years ago in their excellent paper comparing jitting and VT: http://www.vmware.com/pdf/asplos235_adams.pdf. I first pointed readers at that VMware paper almost three years ago back in Part 3, and as VMware found back then, VT does not quite live up to the hype.

Bottom line: KVM, real workloads, 50% slowdown give or take.

Next datapoint, look at the 2.4 GHz Core 2 Q6600 numbers. The Q6600 is the quad-core Core 2 I discussed two years ago back in Part 19. Two years later, it is still one of my favourite processors, in part because I can easily over-clock it to 3.4 GHz when needed. For these tests I ran it at its specified 2.4 GHz speed, running virtual machines on both 64-bit Windows and 64-bit Debian 5.0.5 Linux. One thing that is very obvious is the efficiency increase of QEMU over the past 3 years from version 0.9 to 0.10.2 to 0.12.5, as you can see the build times dropping from 477 seconds down to 329 seconds. Keep in mind this was tested on exactly the same Windows 2000 disk image, reverted each time to a known snapshot for each of the three QEMU versions. The real-world efficiency of QEMU has truly improved by a good 50% over these past few versions. However, today this still leaves it at about, oh, 13 times slower than native execution, or almost an order of magnitude slower than running under KVM.
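The slowdown factors quoted here are plain ratios of build times taken from the table. As a small sketch (slowdown is a hypothetical helper name):

```c
#include <assert.h>

/* How many times longer a run took compared with a baseline run. */
double slowdown(double run_seconds, double baseline_seconds)
{
    assert(run_seconds > 0.0 && baseline_seconds > 0.0);
    return run_seconds / baseline_seconds;
}
```

With the Q6600 numbers, slowdown(329, 24.0) is about 13.7 for QEMU 0.12.5 against native execution, and slowdown(329, 37.9) is about 8.7 against the KVM time measured on the Penryn machine.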

Another takeaway from this data is the PowerPC numbers, which I ran in emulation using either Virtual PC 7.02 or the latest QEMU 0.12.5 build. Virtual PC requires about 315 billion cycles, equating to about a 5x slowdown over native x86 performance. QEMU requires at least four times the time, or roughly about a 20x slowdown. Interestingly the slowdown gets even worse using the slower G5 processor, as the 2.5 GHz chip contains a 1MB L2 cache, while the 2.0 GHz chip contains only a 512K L2 cache. The size of the L2 cache matters a lot! The slower G4, both in absolute time and absolute clock cycles, is slower yet. So as I was saying last posting, Virtual PC was actually getting quite fast on the G5 right around the time that Microsoft decided to discontinue the product. Pity.

It is interesting that Virtual PC actually holds its own against QEMU. It sets the bar for how fast QEMU could run, and that bar is quite a bit faster than where QEMU is today. Four times faster on PowerPC is possible by the mere existence proof of Virtual PC. This is not really too outlandish an expectation, given that various x86 jit frameworks such as Intel's PIN, DynamoRIO, Mojo, and recent work published at the 2010 CGO conference in Toronto all suggest that x86-to-x86 dynamic binary translation of real world applications can be accomplished with under 2x slowdown, in some cases with as low as 20% slowdown. Given the observed 50% slowdown seen in KVM, this suggests that it is possible (i.e. it is technically plausible) to make QEMU perform at about the same performance level as hardware virtualization.

The challenge is in figuring out the root causes of the existing performance bottlenecks.


Performance on My Benchmarks

In trying to nail down exactly why QEMU is as slow as it is, I ran it through my usual series of other benchmarks that I test processors with. These are tests that as explained above range from simple integer loops to micro-benchmarks of various x86 code sequences to memory bandwidth and hard disk bandwidth tests.

The results of various tests are broken out similarly by host processor and clock speed and the execution environment of the test:
 
Guest OS                   Win2K         Win2K     Win2K     Win2K     Win2K     Win2K     Win2K     Win2K     Win2K     Win2K     Win2K        Win2K         -
VM*                        VPC702        Q.12.5    Q.12.5    Q.12.5    Q.12.5    Q.12.5    Q.10.2    Q.10.2    Q.9.0     Q.12.5    KVM          KVM           native
Host OS                    OS X          Fedora12  Fedora12  Fedora12  Fedora13  Debian5   Win7      Win7      Win7      Win7      Fedora13     Fedora13      Win7
Host CPU                   PPC G5        PPC G5    PPC G5    PPC G4    Phenom    Core2     Core2     Core2/M   Core2     Corei5    Core 2/M     Phenom        Core 2/M
Clock                      2500          2500      2000      1250      2400      2400      2400      2400      2400      3460      2400         2400          2400

Test:             Units
T1FAST            seconds  2.1 .. 4.2**  3.9       4.8       17.3      2.8       2.2       2.9       3.1       4.6       2.5       0.31         0.28          0.29
T1SLOW            seconds  2.7 .. 4.9**  6.1       7.7       20.9      3.9       3.2       4.8       4.8       8.9       3.7       0.31         0.28          0.29
SIMP              seconds  1.8           9.6       12.2      23.6      8.4       6.7       8.1       8.5       8.9       6.6       0.8          0.7           0.8
Office script     seconds  16.7          77.8      120.8     255.0     59.4      60.9      67.7      72.8      68.8      47.3      5.9          4.3           3.6
LPIPE             seconds  10.6          43.5      82.0      164.5     38.4      41.7      46.7      56.1      46.2      33.6      8.1          1.6           2.9
CPU-Z             text     HUNG          PII/SSE3  PII/SSE3  PII/SSE3  PII/SSE3  PII/SSE3  PII/SSE3  PII/SSE3  PII/SSE3  PII/SSE3  Core2/SSSE3  Phenom/SSE4A  Core2/SSSE3
HDTEST32          MB/sec   354416822741486635
111               result   FAIL(2)       FAIL      FAIL      FAIL      FAIL      FAIL      FAIL      FAIL      FAIL(2)   FAIL      PASS         PASS          PASS
TESTFLOAT         result   FAIL          FAIL      FAIL      FAIL      FAIL      FAIL      FAIL      FAIL      FAIL      FAIL      PASS         PASS          PASS
MEMBAND (a,u)     clocks   1, 4          26, 55    33, 74    62, 166   21, 43    14, 40    17, 40    18, 40    20, 33    18, 29    <1, 3        <1, 1         <1, 3
FRACTAL           seconds  7.8           34.3      31.0      63.1      26.7      30.1      34.9      44.3      41.5      24.4      22.0         16.1          4.4
Build Gemulator 9 seconds  126           505       995       2092      357       329       387       432       477       265       37.9         35.4          25
C# short          seconds  3.7           12.8      25.1      40.1      9.4       6.6       8.3       9.2       15.9      9.6       0.9          0.5           0.4
C# long           seconds  30.8          59.9      105.0     176.4     46.5      34.6      45.7      53.5      88.5      40.3      6.4          4.0           3.6

Notes:

These results show the results of 12 different Windows 2000 virtual machine runs on 7 different host computers (3 PowerPC based Macs, one AMD Phenom machine, and three Intel machines). I chose these varied systems to illustrate a few points.

Generally, these results once again show that QEMU is still an order of magnitude slower than hardware virtualization using KVM, which has compatibility and performance issues of its own. And when compared against the native run times of these tests on Core 2 host hardware, many are in fact close to 20 times slower than native speeds. This is just plain unacceptable for the level of performance that dynamic recompilation, "jitting", should be delivering.


Performance on Micro-Benchmarks

Finally, I ran QEMU and Virtual PC through my CPU_TEST and ND32 tests (mentioned in the series of postings in Parts 25, 26, and 27). I will now give you some raw performance comparisons of Virtual PC and QEMU and KVM:
 
Guest OS                       Win2K     Win2K     Win2K     Win2K     Win2K     Win2K     Win2K     Win2K
VM*                            VPC702    Q.12.5    Q.12.5    Q.12.5    Q.10.2    Q.9.0     KVM       Bochs
Host OS                        OS X      Fedora12  Fedora13  Debian5   Win7      Win7      Fedora13  Win7
Host CPU                       PPC G5    PPC G5    Phenom    Core2     Core2/M   Core2     Phenom    Core2
Clock                          2500      2500      2400      2400      2400      2400      2400      2666

Test:                 Units
test 1 int add        clocks   242266114
test 1 int adc        clocks   54825303628117
test 1 mem indir      clocks   31713152020339
test 2 int adc++      clocks   24422112780.518
test 5 zero mem       clocks   210146816152
test 6 and0 mem       clocks   213181114101.554
test 7 divide         clocks   1152507769731244477
test 3 os pg flt      clocks   337839615385714888891043481142853120126952
test 12 PeekMsg       clocks   599510504699711822153841100946927770
test 15 sbb r r       clocks   85617152320137
test 17 read rtc      clocks   681369499106926385
test 23 shld imm      clocks   4361515185220
test 29 call eax      clocks   1742101021131974218122
test 29 call mispred  clocks   17420810211419518811107
test A15 span 64      clocks   39103343539341129
test A15 span 4K      clocks   3814110710496911222
test A19c FXSAVE      clocks   FAIL80852947599860665653
test LAHF/SAHF        clocks   5331010079951006131
test 23b self mod     clocks   2031562510126979513333415468537
test 23d self mod     clocks   *625004137943636521731420113361052
nd32 x86 native loop  clocks   17322621**7*
nd32 simulated        clocks   1596227014671646**90*
nd32 x86 sim ver c    clocks   1072210614181560**131*
nd32 x86 sim ver d    clocks   3056396822642216**215*
nd32 x86 sim ver l    clocks   1384256615801888**49*

* indicates that a particular test was not run, my bad!

Keeping in mind how I looked for differences in the numbers to highlight design differences between the Atom and Nehalem processors in those older postings, let me make some similar observations now:

So far I have shown you that QEMU does well today on simple integer code sequences but really trips over itself when arithmetic flags dependencies are present, which can be fairly common in mainstream x86 code. QEMU does not use as efficient a lazy flags scheme as Bochs, and it shows!
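
To make the lazy flags idea concrete, here is a minimal Python sketch. It is not code from Bochs or QEMU; the class and method names are hypothetical, and only 32-bit add and subtract are modeled. The point it illustrates is that the hot path just records the operands and result of the last arithmetic operation, and individual flags are derived later only if some instruction actually reads them.

```python
MASK32 = 0xFFFFFFFF

class LazyFlags:
    """Hypothetical lazy-flags state: remember the last ALU op, compute flags on demand."""

    def __init__(self):
        self.op = "add"
        self.a = 0
        self.b = 0
        self.result = 0

    def add32(self, a, b):
        # Hot path: a few stores, no flag bits are computed here.
        self.op, self.a, self.b = "add", a & MASK32, b & MASK32
        self.result = (self.a + self.b) & MASK32
        return self.result

    def sub32(self, a, b):
        self.op, self.a, self.b = "sub", a & MASK32, b & MASK32
        self.result = (self.a - self.b) & MASK32
        return self.result

    # Each flag is derived only when a later instruction consumes it.
    def zf(self):
        return self.result == 0

    def sf(self):
        return ((self.result >> 31) & 1) == 1

    def cf(self):
        if self.op == "add":
            return self.result < self.a   # unsigned wrap-around means carry out
        return self.a < self.b            # borrow on subtract

    def of(self):
        if self.op == "add":
            # Signed overflow: operands share a sign and the result's sign differs.
            bits = ~(self.a ^ self.b) & (self.a ^ self.result)
        else:
            # Subtract: operands differ in sign and the result's sign differs from a.
            bits = (self.a ^ self.b) & (self.a ^ self.result)
        return ((bits >> 31) & 1) == 1
```

In a run of adds where only the last result feeds a conditional branch, every earlier add pays for a few stores and nothing else, which is the saving a lazy scheme is after.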

QEMU also misses several "low hanging fruit" opportunities, such as optimizing away unnecessary arithmetic flags dependencies and optimizing space-saving compiler-generated code sequences which appear to introduce false data dependencies.

QEMU also does not appear to try to optimize function call and return sequences, as evidenced by having the same timing for predictable and mispredicted function returns.

Now I will get into some scenarios on which it does miserably:

Self-modifying code, context switches, and frequent indirect jumps are not necessarily mainstream performance scenarios but do occur and should be handled more efficiently than they are today. Indirect jumps especially can rear their head in things like GUI message loops or any other kind of code that contains large C/C++/Java switch statements. Self-modifying code per se may not be occurring frequently, but is very much related to the "double jit" problem that does occur with managed languages such as C# and Java. Given how much server code is written in managed or dynamic languages, that fact alone limits QEMU's appeal for use in server farms.
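
As a rough illustration of why these scenarios hurt a jitter, here is a small Python sketch of a translated-block cache. It is a simplified model and not QEMU's actual data structure: the class name, the per-page invalidation granularity, and the assumption that a block never spans two pages are all mine. An indirect jump costs at least one lookup per execution, and a guest write into a page that holds translated code throws away every block on that page, which is what a guest-side jit triggers over and over.

```python
PAGE_SHIFT = 12  # 4K guest pages, as on x86

class TranslationCache:
    """Hypothetical cache of translated blocks keyed by guest address."""

    def __init__(self, translate):
        self.translate = translate   # callback: guest address -> stand-in for host code
        self.blocks = {}             # guest address -> translated block
        self.by_page = {}            # guest page number -> set of block addresses
        self.translations = 0        # how many times the expensive path ran

    def lookup(self, guest_pc):
        # An indirect jump lands here: one hash probe on a hit,
        # a full translation on a miss.
        block = self.blocks.get(guest_pc)
        if block is None:
            block = self.translate(guest_pc)
            self.translations += 1
            self.blocks[guest_pc] = block
            self.by_page.setdefault(guest_pc >> PAGE_SHIFT, set()).add(guest_pc)
        return block

    def guest_write(self, addr):
        # A store into a page holding translated code discards
        # every block that was translated from that page.
        for pc in self.by_page.pop(addr >> PAGE_SHIFT, ()):
            del self.blocks[pc]
```

With this model, code that keeps rewriting a page it also executes from never gets to reuse a translation, so the counter in `translations` climbs with every pass.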

So unfortunately, the hardware virtualization people try to solve the problem with hardware and perpetuate the myth that emulation has to be slow.

Binary translation frameworks such as PIN, DynamoRIO, and the Nirvana/iDNA infrastructure which I worked on have proven that x86 jit can be done at well under an order of magnitude slowdown. PIN and DynamoRIO have regularly shown that user-mode application-level instrumentation can be accomplished at maybe 20% to 40% slowdown. Not surprising then that at this year's CGO 2010 (Code Generation and Optimization, http://www.cgo.org/cgo2010/program.html) conference in Toronto, I saw a great talk from one of the developers of DynamoRIO on a project called Umbra (http://www.cgo.org/cgo2010/talks/cgo10-QinZhao.pptx) which pulls off full-system dynamic recompilation and memory sandboxing at roughly 2x slowdown, close to the performance of hardware virtualization scenarios. QEMU could certainly do the same thing.

As I have said for years, emulation in itself is not a slow technology. It is the implementation that, if done incorrectly, will mess up the performance. Observe another paper from CGO 2010 (ironically presented the same morning just minutes before the Umbra paper) entitled "PinPlay" (http://www.cgo.org/cgo2010/talks/cgo10-PinPlay.pptx) which describes how the folks over at Intel managed to take PIN and slow it down by a factor of almost 100x to do tracing just like Nirvana/iDNA, except five times slower than Nirvana/iDNA. I say ironic because the Nirvana work was developed almost a decade ago and published four years ago. So, I'm flattered that all the best brains at Intel couldn't even come close to the performance of my old work, but it goes to show that two sets of people can sit down to write exactly the same thing and end up with completely different performance results.


Too Clever For The Sake of Portability?

One really cool thing about QEMU is the way it decodes guest instructions into an intermediate language (IL) format, which it then jits to the appropriate host architecture. It effectively does what a real C compiler does, first representing the code in an abstract form, performing optimizations on that abstract form, and then converting that optimized IL to the native code (whether say, x86, x86-64, or PowerPC). This IL is documented in QEMU's TCG (tiny code generator) folder in the header files tcg-opc.h and tcg-op.h. The IL consists mainly of only 32-bit and 64-bit operations, with some 8-bit and 16-bit operations for loads and stores. The intermediate language form is very RISC-like. The IL operations, especially ones like NAND and EQV and ORC, scream of RISC instruction sets such as PowerPC.

However, what is missing are the more complex operations that one wouldn't necessarily see in RISC. For example, the read-modify-write operations of x86 which cause so many problems. By breaking everything down into loads and stores, the meaning of the read-modify-write is lost, leading to such problems as generating an access violation for both the load and the store. Some of the funkier shifts and rotates and bit twiddling operations in x86 are also not represented, resulting in a sequence of IL containing many simple mask and shift operations. The TCG then does not combine those back into a single instruction.
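
The following Python sketch shows the kind of expansion I am describing. The IL here is made up for illustration: the op names and the tuple layout are hypothetical and are not TCG's actual opcodes. One x86 read-modify-write instruction turns into three IL ops with two separate memory accesses, and one x86 double shift turns into two shifts and an or. A tiny interpreter is included only so the expansions can be checked.

```python
MASK32 = 0xFFFFFFFF

def expand_add_mem_reg(addr_reg, src_reg):
    """x86 'add [addr], src' becomes separate load, add, and store IL ops."""
    return [
        ("load32", "t0", addr_reg),
        ("add", "t0", "t0", src_reg),
        ("store32", "t0", addr_reg),
    ]

def expand_shld(dst, src, count):
    """x86 'shld dst, src, count' (0 < count < 32) becomes two shifts and an or."""
    return [
        ("shl", "t0", dst, count),
        ("shr", "t1", src, 32 - count),
        ("or", dst, "t0", "t1"),
    ]

def run_il(ops, regs, mem):
    """Interpret the hypothetical IL; operands are register names or integer constants."""
    def val(x):
        return regs[x] if isinstance(x, str) else x
    for op in ops:
        name = op[0]
        if name == "load32":
            regs[op[1]] = mem[regs[op[2]]]
        elif name == "store32":
            mem[regs[op[2]]] = regs[op[1]]
        elif name == "add":
            regs[op[1]] = (val(op[2]) + val(op[3])) & MASK32
        elif name == "shl":
            regs[op[1]] = (val(op[2]) << val(op[3])) & MASK32
        elif name == "shr":
            regs[op[1]] = val(op[2]) >> val(op[3])
        elif name == "or":
            regs[op[1]] = val(op[2]) | val(op[3])
        else:
            raise ValueError(name)
    return regs
```

Once the instruction has been split this way, nothing in the IL says that the load and the store belong to the same guest instruction, so each one carries its own address check.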

So ironically, QEMU's attempt to be more portable and its clever use of an intermediate representation also mean that it generates fatter, less efficient code than a jitter whose job is solely to, say, translate instruction set X into instruction set Y (as I suspect Virtual PC 7.02 and many similar jitters do). PIN, DynamoRIO, and VMware binary translation modes, for example, also perform a 1-to-1 translation of many integer instructions. QEMU's IL needs to be able to represent the majority of the x86 instruction set, perhaps by using something like my VX64 form, which I suggested almost three years ago in Part 6 as a means to better represent common x86 code sequences in a fixed-size RISC-like form.

Not generating as many intermediate operations would also speed up jit time, because fewer IL nodes would need to be emitted by the TCG. Complex operations (such as those crazy bit twiddles) would be handled by the individual code generators, and so in the case of x86 hosts the code generator would simply emit the crazy instruction on the host. On something like a PowerPC host you'd get a more RISC-ified sequence, not unlike what it emits today.
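
Here is a minimal Python sketch of that idea, under the assumption of a hypothetical IL that keeps a double shift as one node. The `lower` function, the node layout, and the emitted strings are all illustrative only; the scratch register names in the PowerPC case are arbitrary, and the output is not meant to be what any real back end would produce. It simply shows one IL node becoming one host instruction on x86 and a short sequence on a RISC host.

```python
def lower(il_op, host):
    """Lower one hypothetical complex IL node to a list of host instruction strings."""
    name, dst, src, count = il_op
    if name != "shld" or not 0 < count < 32:
        raise ValueError(il_op)
    if host == "x86":
        # The host has the same instruction, so emit it directly.
        return [f"shld {dst}, {src}, {count}"]
    if host == "ppc":
        # RISC host: shift both halves into scratch registers, then merge.
        return [
            f"slwi r0, {dst}, {count}",
            f"srwi r12, {src}, {32 - count}",
            f"or {dst}, r0, r12",
        ]
    raise ValueError(host)

# One IL node in, one or three host instructions out depending on the host.
node = ("shld", "d", "s", 8)
x86_code = lower(node, "x86")
ppc_code = lower(node, "ppc")
```

The jitter only ever emits the single node, so the cost of the expansion is paid in the back end that needs it rather than on every host.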

But the approach that QEMU takes today is just too simple.


Conclusions

QEMU has come a long way in terms of performance and compatibility in the past couple of years. It's not the piece of junk I first evaluated in 2007. However, the performance leaves a lot to be desired and today it would simply be too slow to deliver reasonable x86 emulation, and thus virtualization and live migration capabilities, to today's ARM-based devices. The Apple iPad and the coming wave of ARM-based netbooks and tablets present a huge potential to do things like offload cloud computing from large server machines and to give people more flexible options for running their Windows based software.

QEMU right now is sort of analogous to PinPlay. It is a great idea, mostly solid, but just grossly inefficient. The authors of PinPlay apparently did not pay any heed to the lessons of the Bochs optimization work two years ago, which clearly pointed out efficient techniques for handling arithmetic flags, memory sandboxing, and flow control in portable virtual machines. Similarly, those same ideas would have benefited QEMU had the past developers of QEMU paid attention to that paper. The fact that QEMU is "portable" is no excuse for it being as slow as it is, and certainly no excuse for it still being as slow as the Bochs interpreter at some operations.

QEMU may be dumbing down its IL format to too simple a form, trying to represent complex x86 instructions in simple RISC-like form, which it then has to convert back to complex x86 instructions. QEMU needs a more robust IL representation if it is to come close. VMware Workstation and Sun/Oracle's VirtualBox, to their credit, do deliver full-system x86 dynamic recompilation virtualization modes which are very comparable to KVM performance, but with some pathological flow control performance problems, as nicely described and documented in various past VMware papers that I've mentioned before. Unfortunately those products are hard-coded for x86 hosting and not anywhere near the level of portability of Bochs or QEMU.

In future postings - which after this flurry of blogging this summer I am going to take a break from for a while - I will discuss my own work on QEMU and give some updated performance results of applying my patches to the QEMU 0.12.5 sources. I trust that my introduction to the POSIX subsystem in Windows 7 as well as the trip to PowerPC based development inspired some people to break out of the usual drudgery of x86 based development on Windows and Linux.

As a reminder, I never refuse free Starbucks coffee in appreciation for my postings. Go to the Starbucks Online Store, purchase a prepaid gift card to cover a few cups of coffee, and send it to me at:

Darek Mihocka c/o Emulators
14150 N.E. 20th Street, Suite 302
Bellevue, WA 98007-3700
U.S.A.

Enjoy the rest of the summer folks!

