NO EXECUTE! A weekly look at personal computer technology issues.

(c) 2007 by Darek Mihocka, founder, Emulators.com.

October 15 2007


[Part 1]  [Part 2]  [Part 3]  [Part 4]  [Part 5]  [Part 6]  [Part 7]  [Next]

Today we delve into the inner workings of virtual machines. My very inquisitive puppy joins us.

One Night In Paris

It is said that necessity is the mother of invention. That could not have been more true when I had a flash of inspiration this summer as my girlfriend and I were driving across Europe. We were on our way from an Iron Maiden concert in Belgium to a Metallica concert in Spain. Sane tourists would have taken the train. Being speed freaks with a passion for Europe's mostly non-existent speed limits, we instead chose to rent a car in Brussels and head south in the general direction of France. No maps, no directions, no prior planning - we literally just got on the highway and started driving. The only tools at our disposal were a Sony VAIO notebook running Windows, and a Macbook running Mac OS X with the VMware Fusion beta hosting a Windows virtual machine. On both systems we had mapping and trip planning software for Windows, and a single USB GPS receiver to share between the two computers.

Getting to Paris was easy enough - all roads in Europe seem to lead to Paris. Getting out of Paris was quite another matter. Despite mapping a path through Paris that would take us past the Eiffel Tower and then back on the road, the combination of my inability to drive stick and the bright sunlight, which made the screen of the Sony VAIO difficult to read, got us stuck numerous times in tiny streets. Time was ticking, the battery on the VAIO was running low, and we needed to switch computers.

As we were frantically re-entering the route from the software on the VAIO into the software on the Macbook, the gears started grinding in my brain. Something I love about Windows XP Pro is the "Remote Desktop" feature - the ability to transfer a desktop session from one Windows computer to another. I routinely do this at home, for example, moving my email session from one computer to another as I go from the kitchen to the backyard to my office. Sometimes I remote desktop in from a laptop, and other times from various desktop machines. In either case, my email session always continues to run on the same computer in my office. What I am doing is merely transferring my mouse, keyboard, and video interaction from one computer to another, moving the user interface portion of my Windows session to one computer while the actual number crunching and email happens on another.

The concept of using Remote Desktop is not unlike using virtualization software such as Microsoft's Virtual Server, which similarly allows one to use a web browser running on a remote client machine to access a virtual machine running on some fixed server. With Virtual Server one can even connect to the same virtual machine session from several client machines simultaneously. This is a very cool feature, but it, like Remote Desktop, has one serious drawback: you need to have network connectivity. Lose the network connection (or battery power), and your remote session stops.

This is the problem we faced in the car. The Macbook has a reflective LCD display which allows it to be used in bright daylight. But using Remote Desktop was not an option since the battery on the VAIO was going dead. Lose the battery and you lose the network connection. What we needed was a way to transfer my entire Windows session from one computer to another as it was running. In virtualization terminology, this is called "live migration", or as VMware calls it, VMotion (http://www.vmware.com/pdf/vmotion_datasheet.pdf).

As far as I know, nobody actually supports useful live migration in a robust manner today. Both VMware and Microsoft blow smoke about how their live migration is better than their competitor's, but that is pure marketing bunk. Microsoft's live migration solution is vaporware at this point. VMware Fusion for Mac OS X does not actually support live migration. VMware's own documentation places very specific limitations on how VMotion works and between which two host CPUs one can migrate a virtual machine (http://www.vmware.com/pdf/vi3_systems_guide.pdf). One cannot migrate a VMware session from an Intel host to an AMD host, or even between different models of Intel microprocessors. Unless you have a farm of almost identical servers running the high-end VMware server software, VMotion is useless.

Yes, there are virtual machines that allow their state to be saved and resumed at some later time. My own Gemulator and SoftMac emulators have supported this functionality for years, allowing one to freeze a running Mac OS session on one Windows computer, copy the saved state data to another computer, and then resume running Mac OS on a completely different computer. It is a form of migration, but is not live migration.

It would be amazing if a virtual machine environment could just be packaged up and moved around as a single file, much like one can copy and email an MP3 audio file, a WMV movie, or a JPEG photo. Bandwidth issues aside, I would love to be able to go on vacation, walk into some internet cafe, and access my computers at home in a Remote Desktop-like fashion. Then I would walk next door into a computer store, rent or purchase a PC - perhaps a Mac, or even a low-cost OLPC machine - connect that new computer to the Internet, and migrate my home computer state to that new machine. I am talking about making Remote Desktop and live migration one and the same feature.

I believe the next killer application would be for some company like Google to provide virtual machine hosting services on the web, for hotels and airports to rent laptop computers, and for yours truly to develop the virtual machine client technology to host virtual machines on any PC or Mac, even my Playstation 3. This would allow one not only to "remote desktop" into a virtual machine, but to actually migrate it (either move it or clone it) to the local computer. This scenario of host-independent live migration of virtual machines has driven my research and design philosophy of the Virtual Execution Runtime over the past four months. It would be a powerful use of the low-cost OLPC laptops about to hit the market, which could be sold or rented to tourists who wish to have their home computing environment along without dragging their personal laptops on vacation. It could make use of the millions of Playstation 3 game consoles out there, which are in fact perfectly usable Linux computers. Imagine this scenario: you are using your laptop computer, you are stranded somewhere in Paris, your battery is running low, yet with one press of a button you wirelessly transfer your entire running Windows session to another laptop computer. Treating an entire Windows or Linux session the same way one treats a song or photo is in my opinion a very powerful concept which the big companies have failed to deliver, and do not have the technology or will to deliver.

Neither VMware nor Microsoft will be able to pull off such a live migration trick anytime soon. Too bad for them, because they are not only screwing themselves out of revenue from users of older hardware, but they are also leaving the market wide open for companies like Google (acting as the giant server cloud) and emulators.com. This is what I intend to develop over the next 2 years or so.

I have run across one web site, YouOS (https://www.youos.com/), which bears some resemblance to what I just described, except that it is limited to newly written applets running on their custom OS platform. YouOS is not a true virtualization environment that supports existing operating systems and applications such as Windows and Office, nor does it appear to function if the network connection is lost. But the baby step they are taking to promote the general idea of portable virtual machine sessions is in the right direction. Try the YouOS demo, then imagine if their demo consisted of running a real legacy OS such as Windows XP which could also be pulled offline, just like a true virtual machine with true live migration.
 


The Hardware Virtualization Trap

There are some very basic technical reasons why live migration and VMotion have such severe constraints. Why is it that one can suspend and resume a Mac OS virtual machine across different host computers but cannot do the same to a Windows session? Microsoft, VMware, XenSource, and other virtualization companies have fallen into a technical trap set for them by chip makers. A trap that forces the use of virtual machine techniques which in my opinion are just plain wrong.

Let me explain the difference in how I have approached the problem of emulating a Macintosh virtual machine compared to how most of these companies above approached the problem of emulating a Windows virtual machine, and you will clearly see how Microsoft and VMware are on a technological dead-end road.

A Macintosh virtual machine runs an operating system written for a non-x86 microprocessor - the 68000, 68020, 68030, 68040, or PowerPC. Such a virtual machine uses a technique called "binary translation" (BT for short), or "emulation", to translate the non-x86 code into native x86 code that your PC can execute. I will use BT as a general term to encompass the various sub-groups of such virtual machines, including interpreters, dynamic recompilation engines, offline recompilation engines, and other techniques which never execute the guest code directly. A Java virtual machine, for example, is a binary translation type of virtual machine, since Java bytecode is not native x86 code.

A Windows virtual machine on the other hand - whether QEMU, VirtualBox, VMware Fusion, VMware Server, Virtual PC 2007, or Virtual Server - when running on a PC generally uses sandboxing techniques that go by names such as "direct execution", "ring compression", and "VT". In these techniques, the actual original guest code is allowed to execute directly. As is explained in this Microsoft blog (http://blogs.msdn.com/virtual_pc_guy/archive/2006/03/14/550712.aspx), ring 3 (a.k.a. "user mode") code is allowed in Virtual PC to execute directly, since ring 3 is usually protected by the hardware itself from accessing memory or executing code that it does not have permission for. Ring 0 (a.k.a. "kernel mode") is a little trickier, since code executing directly in ring 0 needs to be trusted. The virtual machine engine itself, known as a Virtual Machine Monitor (VMM) or "hypervisor", lives in ring 0, and either pushes the guest ring 0 code down to execute in ring 1 (thus "ring compression"), or it runs "enlightened" operating system code which is trusted. This is the code that is installed, for example, when Virtual Machine Additions are installed into a particular supported guest operating system. AMD and Intel have recently pushed unnecessary hardware virtualization technologies on us, with such catchy names as "Vanderpool" and "Pacifica", which eliminate the need for hypervisors to use the ring compression trick, but which do little to increase stability or security and actually slow down overall performance.

The terms "virtualization" (the verb) and "virtual machine" (the noun) have been hijacked of late, perhaps by the popularity of products with such names as Virtual PC and VMware, to refer to this latter type of direct execution-based technology. Do not be fooled! These terms can refer to both groups of technologies, and I for one, have believed for over 20 years now that the former group, the binary translation based techniques are ultimately technically superior and more secure.

In reality, most virtual machines today use a hybrid combination of binary translation and direct execution techniques. Even something that appears to be purely binary translation based, such as a Java virtual machine, relies on the host MMU and the ring protection mechanism to prevent Java code from errantly trashing memory it does not have access to. Similarly, even Virtual PC and VMware use a little bit of binary translation as necessary to patch code that would otherwise escape its sandbox or cause significant performance latencies as was described in the VMware BT vs. VT paper.

And therein lies the problem with the direct execution style of virtual machines - as soon as you allow the guest code to run natively you have to trust that your hypervisor or VMM has plugged all the holes by which that code could escape its sandbox. Do they? Apparently not. As these recent postings describe (http://www.offensivecomputing.net/files/active/0/vm.pdf and https://www.sans.org/webcasts/show.php?webcastid=90652), it is possible for code to easily detect that it is running inside a virtual machine. And with that knowledge, malicious code can actually exploit the virtual machine, potentially leading to a new class of malware that is even more stealthy than a rootkit.

The dirty little secret that virtualization companies do not want you to know about VT-based virtual machines is that they are not the magic solution that IT departments think they are paying for. They do not increase security. They do not fully isolate virtual machines from each other. They do not truly isolate the virtual machine from the host hardware. For example, most hypervisors pass the CPUID information through to the guest, allowing code in the guest virtual machine to query the kind of host hardware it is executing on. From there, that code could exploit specific known bugs, such as various Intel Core 2 errata, to potentially escape the sandbox of the virtual machine.

Knowing the type of host hardware also allows code running in virtual machines to detect that it is in fact being virtualized. For example, the simple user mode Windows test code I showed you three weeks ago to measure the latency of memory allocations can even discern whether Virtual PC is being run in ring compression or hardware virtualization modes.

Intel themselves document dozens of different ways that sandboxed guest code can fault into the VMM - anything from setting reserved bits in certain registers, to executing certain instructions, to accessing memory. The job of the VMM (the hypervisor) is to decode the information of what is called the "VM exit event" or the "general protection fault" and determine whether it needs to abort the guest code, patch it, binary translate it, re-execute it, or do who knows what. The VMM generally needs to disassemble the guest instruction that caused the event. Therefore you need to hope that the VMM's disassembler is decoding the guest instructions exactly as the host microprocessor does. That is not always the case, especially as new instructions keep being added. When you use hardware virtualization, you are putting your trust (and your company's machines) in the hands of both the virtual machine vendor (Microsoft, VMware, Xen, etc.) and the chip makers (AMD, Intel, IBM, etc.), and you trust that both sets of parties have delivered product that works as specified. The vendor may introduce bugs in the VMM due to incorrect handling of an event which was not anticipated. Errata in the chip itself may cause a perfectly written VMM to still malfunction due to hardware bugs.

The bottom line is this: hardware virtualization leaks out far too much information about the host hardware to allow true virtual machine migration across the web to really work. It is too easy to detect the presence of a virtual machine, and too easy to exploit a virtual machine. Any company that spends money migrating servers and desktops to any of the popular commercial virtualization products thinking that it is somehow increasing security and reliability is just plain stupid.

What is the correct approach? An emulation solution in the form of a binary translation based virtual machine - one that makes no reliance on the host hardware MMU or VT capabilities and one that can truly sandbox all code and data accesses - is the only plausible way that I see to bring seamless and secure virtual machine migration across different host platforms. When Microsoft or VMware gives us products that can migrate a running Windows session from a Mac to a PC to a Playstation 3, without the guest code even being able to detect that it has been migrated to different host platforms, I will change my opinion. I do not think such a day will ever come.
 


Emulation 101 - Fetch and Dispatch

And this leads us to the classic BT vs. VT debate. Those of us who work on BT-based emulation constantly have to butt heads with the misinformation from the hardware virtualization VT camp who claim that direct execution of code has to be the better approach. I myself used to believe that myth at one time, and in the late 1990's rewrote the inner workings of my own emulators to make use of more VT-style techniques. I struggled for years to track down and fix random reliability and stability issues, which is ironic considering that it took me less time to write the original emulators for MS-DOS than it did to work out all the weird issues of the Windows port.

I finally came to the realization this summer that I simply should have stuck to using BT techniques all along, for both MS-DOS and Windows releases of my products. Not only are pure-BT emulators easier to write, they require less powerful host hardware capabilities, they are easier to debug, they are more portable, they are less prone to hardware bugs and errata, and they open up a world of possible applications not possible with VT-based implementations. Applications such as reliable tracing, profiling, and deterministic re-execution of code for use during code development. And killer applications such as the "host-independent cloud of virtual machines" on the web which can be moved, cloned, and accessed by almost any computer on the planet regardless of its actual architecture.

So how does one write a virtual machine based on binary translation? What does a virtual machine actually do inside of the sandbox that it creates?

Fundamentally, all virtual machines do only two things: they intercept the execution of guest code, and they intercept memory accesses to guest data. That is it, those two things. They simply intercept everything the guest software is trying to do and in turn lie to that software by feeding it intercepted data. A true virtual machine intercepts everything such that no unintended state information from the host machine leaks into the guest machine.

In the early days of the PC and Mac, the days of MS-DOS and classic Mac OS, all programs (operating system code and application code) executed with full control over every single byte of memory and every single hardware port. There was no memory protection, either on the part of the operating system or on the part of the microprocessor hardware itself, to stop a program from trashing every byte of RAM or writing complete garbage to the video screen or to the disk. This was the age when a single errant "POKE" command from BASIC could take down a computer.

This lack of protection allowed developers to easily hack operating systems, sometimes replacing entire portions of an OS with their own code. I paid my way through college in the late 1980's by selling a utility called "Quick ST" (http://www.atarimagazines.com/startv4n5/speed.html) which hooked and replaced many of GEM's text and graphics rendering operations with my own (and much faster) hand-tuned assembly code. Similar utilities existed for MS-DOS which were called TSRs (Terminate and Stay Resident programs) which for example hooked into certain keystrokes so as to bring up useful applets at any time from within any program. Today such programs would be considered rootkits and be branded malware, but back in those innocent days before computers were being networked together, taking control of the whole system was how you got things done efficiently.

In the early summer of 1986 while on break from classes I purchased an Atari 1040ST. This Mac-like computer from Atari featured an incredible 1 megabyte of RAM, a high resolution display, a mouse, a graphical operating system (GEM), and a built-in 720K floppy disk drive. This machine blew the IBM PC and the Apple Macintosh away in price and capabilities. As my first programming project for my new Atari ST I set out to write an emulator to run programs from my older Atari 400 and from my friend Ig's Apple II. I didn't know anything about the technology of emulation or even the terminology, but what I was setting out to do was to write my first virtual machine.

The only emulation type products I was aware of were IBM PC co-processor cards. These were actual pieces of hardware costing many hundreds of dollars. That was not the approach I wanted to take; I felt that the problem could be solved entirely in software. So I compared the two "guest" computers which I was hoping to emulate - the Atari 400 and the Apple II, both based on the MOS Technology 6502 microprocessor (http://en.wikipedia.org/wiki/6502) - against my Motorola 68000-based (http://en.wikipedia.org/wiki/68000) Atari ST "host" computer that would run the emulation:
 
Computer:        Apple II                     Atari 400                      Atari ST
CPU:             6502                         6502                           68000
Clock speed:     1.0 MHz                      1.8 MHz                        8 MHz
RAM:             48K                          48K                            1024K
ROM:             12K                          18K (8K BASIC, 2K FP, 8K OS)   192K
Hardware space:  8 bytes                      6K (2K hardware, 4K unused)    megabytes
Byte order:      little-endian                little-endian                  big-endian
Address space:   16 bits (64K)                16 bits (64K)                  24 bits (16 MB)
Register width:  8/16 bits                    8/16 bits                      32 bits
Register state:  8-bit A/X/Y data registers   8-bit A/X/Y data registers     sixteen 32-bit general purpose registers
                 8-bit P register             8-bit P register               16-bit Status Register
                 16-bit Stack Pointer (SP)    16-bit Stack Pointer (SP)      32-bit Program Counter
                 16-bit Program Counter (PC)  16-bit Program Counter (PC)

Things really could not have been more different between the two guest machines and the host. An Atari 400 and an Atari ST were similar in name only. The actual hardware was as different as night and day, not unlike how a Mac Classic, a PowerMac G5, and a quad-core Mac Pro are based on three entirely different hardware designs. Thankfully, the Atari 400 and the Apple II themselves were very similar, and with the Apple II being the much simpler design I tackled its emulation first. The Apple II contains all of 8 one-byte hardware registers, which are used to set the video mode, beep the speaker, and read keystrokes.

The first step seemed pretty obvious - subdivide the relatively huge RAM available on the Atari ST into small blocks which correspond to the 48 kilobytes of Apple II RAM and to the AppleSoft BASIC ROM. Guest RAM and guest ROM can be accessed by calculating a 16-bit guest effective address and using that address as an index into those two blocks of host memory. An easy trick is to simply allocate a contiguous 64-kilobyte block of memory on the Atari ST and use that to represent the combined RAM, ROM, and hardware state of the Apple II.

The second step was also obvious - how to represent the 6502's register state, which consisted entirely of 8 bytes: the 8-bit A/X/Y data registers, the 8-bit P register containing control and status bits, and the 16-bit SP (stack pointer) and PC (program counter) registers. Using the C compiler's "register" keyword, I could declare these as "char" and "short" integer variables within the main C function of the emulator and trust the compiler to enregister these 8 bytes of guest state.

The third step was to write what is called an "interpreter loop". This is an infinite loop in C which dereferences the current guest PC address, reads an 8-bit opcode, executes one of 256 possible handlers which simulate each of the possible 6502 opcodes, and updates the guest register state variables after each instruction. This is a fetch-and-dispatch loop, a common but inefficient technique of interpreting bytecode.

While this interpreter loop is not the most efficient way to go, it is very easy to implement. The fetching of 6502 codebytes and dispatching to handler code for each opcode is the first means by which you achieve the goal of intercepting code. I will return to this later, but for now just consider that our interpreter loop looks something like this:

#include <string.h>

/* the three opcodes handled in this fragment (actual 6502 encodings) */
enum { opNop = 0xEA, opIncX = 0xE8, opLdaAbs16 = 0xAD };

static unsigned char memory[65536];   /* the 64K guest address space */

void load_rom(void);                  /* fills in the ROM region of memory[] */

/* peekb and peekw read guest memory - these are explained below */
unsigned char  peekb(unsigned short addr);
unsigned short peekw(unsigned short addr);

void Emulate6502()
{
    register unsigned short PC, SP, addr;
    register unsigned char A, X, Y, P;

    memset(memory, 0, 65536);
    load_rom();

    /* set initial power-on values */
    A = X = Y = P = 0;
    SP = 0x1FF;
    PC = peekw(0xFFFC);   /* the 6502 reset vector */

    for(;;)
        {
        switch(peekb(PC++))
            {
        default: /* undefined opcode! treat as nop */
        case opNop:
            break;

        case opIncX:      /* INX */
            X++;
            break;

        case opLdaAbs16:  /* LDA absolute */
            addr = peekw(PC);
            PC += 2;
            A = peekb(addr);
            break;

...
            }
        }
}

Intercepting data is much more difficult. The endian difference (http://en.wikipedia.org/wiki/Big_endian) meant that only single bytes of 6502 memory at a time could be accessed by the 68000. Any larger access, such as a 16-bit or 32-bit integer, requires reading the individual bytes and reversing their order. This was easy to set up, as I was writing the code in C and so could allocate memory and write small functions to perform this byte swapping. The C runtime library even includes a function, swab(), which performs this kind of byte swapping.
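As a minimal sketch (reusing the memory[] block from the interpreter above, with a hypothetical helper name), a cross-endian 16-bit read simply assembles the value one byte at a time, so the host byte order never enters into it:

/* Read a 16-bit little-endian 6502 word from guest memory on a
   big-endian (or any) host. Each byte is fetched individually and
   combined with shifts, so no host byte reversal is ever needed.
   The cast on the second index wraps around the 64K address space. */
unsigned short read_guest_word(unsigned short addr)
{
    return (unsigned short)(memory[addr] |
                            (memory[(unsigned short)(addr + 1)] << 8));
}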

There is also an issue with how one intercepts the data in the first place. A very simple 6502 machine language instruction is LDA (Load Accumulator) which can be followed by a 16-bit value that is the absolute 16-bit address to load 8 bits of data from. If the load is occurring from Apple II RAM or ROM, there is no problem. The 16-bit value is directly an index into the 64-kilobyte block which was set up in the first step. But what if the memory location being loaded is in the address space between RAM and ROM? What if the value is say, the 8-bit hardware register that reads a keyboard keystroke?

One approach is to write "data load" and "data store" handlers which are similar in concept to the fetch-and-dispatch loop, except that they operate on a 16-bit address and either write an 8-bit data value to guest memory or read and return an 8-bit data value from guest memory. These are the magic peekb and peekw handler functions in the code above which I have not explained yet. If each of the 256 bytecode handlers (and the fetch-and-dispatch code itself) passed all memory loads and stores through these data load and data store handlers, it would guarantee that all data is intercepted. For a 6502, one only needs four handlers: peekb, peekw, pokeb, and pokew. Each data handler could be implemented either as a large C switch statement, or as a series of if-else statements which check every possible address that could be accessed. That would also guarantee extremely slow performance.
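To make that naive approach concrete, here is a minimal sketch of what such a pokeb handler for the Apple II might look like. The write_hardware() helper is hypothetical, and the address ranges are the Apple II's 48K of RAM, 4K of I/O space, and 12K of ROM:

/* Every guest store funnels through this handler, so the guest can
   never touch memory we did not intend it to. */
void pokeb(unsigned short addr, unsigned char data)
{
    if (addr < 0xC000)
        memory[addr] = data;           /* 48K of plain writable RAM */
    else if (addr < 0xD000)
        write_hardware(addr, data);    /* hypothetical hardware handler */
    /* else: 0xD000-0xFFFF is ROM, so the store is silently discarded */
}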

So with a bunch of design details resolved, we hit a big issue: how does one effectively and efficiently sandbox memory accesses? This is not just an emulation issue; it is something that any multi-tasking operating system has to solve. How do you allow code to access only some portions of memory and not others? A virtual machine has the additional problem of not only determining the permissions on each portion of memory, but of then having to simulate accesses to the different portions of memory. For example, storing data to ROM needs to be prevented, but is not necessarily an error or fault on the part of the code doing the store. Storing data to RAM can be as simple as indexing a large block of memory, or, if that memory corresponds to an emulated video frame buffer, it can trigger graphics operations on the host.
 


Emulation 102 - Sandboxing Memory

Even though an Apple II's memory space is mostly plain old RAM, and even though only 8 of those 65536 bytes of memory map to hardware registers, I needed to solve this problem. There are various techniques used in various virtual machines:

  • Bounds checks: assume that guest RAM starts at address 0 and stops at some "end of RAM" address, for example 49152, which corresponds to the end of the 48 kilobytes of RAM. Make each data handler do a fast bounds check by comparing the effective address of the access against the constant 49152. If the address is smaller, write directly to the large block of guest RAM. Otherwise call a secondary hardware emulation handler. This is also very similar to what the hardware does for segmented memory models, where the hardware itself performs the bounds check.
  • Trap-and-emulate: this is commonly done by today's MMU-based virtual machines such as Virtual PC and VMware. No bounds check is performed by software. Instead, the host MMU is configured to throw an exception when an address is accessed that does not map to writable guest RAM. Memory is generally mapped in 4-kilobyte "pages", the common "4K page" used by Windows, Linux, and other operating systems. An exception handler inside the virtual machine monitor then determines the cause of the exception - which guest instruction was trying to access which guest memory location - and calls the secondary hardware emulation handler. The trap-and-emulate technique is a double-edged sword, for while it is faster than a bounds check on code with a low fault rate, it can cause severe slowdowns when faults are frequent. This is my whole argument against hardware virtualization, as it can add many thousands of clock cycles of cost to each and every such fault.
  • Range checks: a hybrid of software bounds checks and the kind of per-page granularity of protection offered by an MMU, a range check is also known as a "software TLB" because it closely mimics the actions of the real hardware MMU. This technique breaks up the guest address space into equal sized chunks - perhaps on a granularity of a megabyte, or of a 4K page, or even of individual bytes. Each guest data load or store is munged to calculate the TLB index - which block number in the array of memory blocks the access refers to - and an offset into that block. Each block can have different protection permissions, such that one range of memory can be flagged as writable RAM while the block of memory right after it maps to hardware. This is exactly what happens on the Apple II, where RAM ends and is immediately followed by the hardware registers, and then by ROM.

Over the past 20 years I have used all three techniques in my virtual machines. For the Apple emulator, the trap-and-emulate technique was out of the question. There simply was no hardware memory protection provided by the Motorola 68000 microprocessor or the Atari ST chipset. I had to use either bounds checks or range checks.

My first version of the Apple II and Atari 400 emulator used the first technique - bounds checks. Since RAM on both the Apple II and Atari 400 started at memory location 0 (0 in the 6502's address space that is), and since most memory loads and stores occur to RAM, it is sufficient to have a fast path of code in each opcode handler which does a bounds check and then writes directly to the guest RAM buffer. In the rarer case that hardware is being accessed, the bounds check fails and calls the hardware emulation handler.

This worked beautifully for the Apple II. It was a little more problematic for Atari 400 emulation because of the way Atari overlaid ROM over RAM. You may have already noticed a small problem in the Atari 400 information I listed. The 6502 generates 16-bit addresses, meaning that it can access 64 kilobytes, or exactly 65536 bytes, of address space. This address space ranges (in hexadecimal) from 0x0000 to 0xFFFF. However, adding up the RAM + ROM + hardware ranges of an Atari 400 gives 48 + 8 + 2 + 8 + 2 = 68 kilobytes of memory! What gives? It turns out that the memory map of the Atari 400 is not quite linear. When the BASIC ROM cartridge is installed it maps on top of the upper 8 kilobytes of RAM, giving only 40 kilobytes of RAM. When BASIC is switched off, the full 48 kilobytes of RAM are available. There is also a 4 kilobyte gap of address space (which was later used by additional ROM in the Atari 800XL and Atari 130XE models).

I had to make a compromise and set my bounds check to allow only the first 40 kilobytes of RAM to be accessed directly. Accesses at or above the 6502 address 0xA000 (byte 40960 and up) had to go through the slower hardware handler. This is so that addresses 0xA000 to 0xBFFF, which correspond to the 40K to 48K range of address space, could be mapped either to the Atari BASIC ROM or to RAM, depending on which the hardware had selected.

In early 1987 I released my combination Apple II / Atari 400 emulator to the world as "ST Xformer 1.0" and posted the program to "the web", which at the time meant Compuserve, Delphi, and some bulletin boards. As I started working on emulating the Atari 800XL and Atari 130XE computers, I ran into additional ROM-over-RAM problems which made the bounds check very inefficient. I therefore switched to using a software TLB technique, which I described for the Atari magazine "ST Log" in this article (http://www.atarimagazines.com/st-log/issue18/71_1_INSIDE_THE_ST_XFORMER.php).

To accelerate the check of whether it was possible to load or store to a particular memory location, and to accelerate the call to the hardware handler, I set up a separate 64-kilobyte status array which mapped byte-for-byte with the 64-kilobyte memory block representing the address space of the 6502. On each memory access, the 16-bit effective address is used as an index into the status array and that particular status byte is loaded. If the status byte is zero, it signifies writable RAM, and a load or a store can proceed to operate directly on the guest memory block. If the status byte is positive - that is, it contains a value from 0x01 through 0x7F - it means the memory is OK to load directly, but store operations have to go through a handler. The index of one of 127 possible "store handlers" is the value of that status byte! No further address decoding is needed, as the status byte tells you exactly what kind of hardware is being accessed - a sound chip register, a video buffer location, etc. Status values of 0x80 through 0xFF signify hardware locations that need to call a handler for either a load or a store.
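In C terms, the byte store path of this status-byte scheme would look something like the sketch below (hypothetical names; my actual implementation was in 68000 assembly, shown a little further down):

static unsigned char status[65536];  /* one status byte per guest address */

typedef void (*StoreHandler)(unsigned short addr, unsigned char data);
StoreHandler store_handlers[256];    /* slot 0 is never used */

/* memory[] is the 64K guest block from earlier */
void store_byte(unsigned short addr, unsigned char data)
{
    unsigned char st = status[addr];
    if (st == 0)
        memory[addr] = data;             /* writable RAM: the fast path */
    else
        store_handlers[st](addr, data);  /* ROM, sound chip, video, etc. */
}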

These status bytes are not unlike the actual permission bits used in page tables and segment descriptors. Although I did not know the terminology at the time, what I had implemented was a type of software TLB. This allowed guest RAM, ROM, and hardware to be arranged in any order without skewing performance toward the lower ranges of addresses. In 68000 assembly code, a typical byte store operation is coded up like this:

    move.b 0(a1,d7),d0          ; fetch the status byte for this address
    bmi.s .jump_write_handler   ; not plain RAM: go through a write handler
    move.b d3,(a0)              ; plain RAM: store the data byte directly
    dispatch                    ; macro: fetch and dispatch the next opcode
.jump_write_handler:
    lsl.w #8,d0                 ; scale the status byte into a table offset
    move.b #$80,d0              ; set the low byte of the offset
    pea 0(a4,d0.w)              ; push the handler address (a4 = table base)
    move.b d3,d0                ; pass the data byte to the handler
    rts                         ; jump to the write handler

Keep in mind that on the 68000 the MOVE instruction also tests the data and updates condition flags, so there is no explicit test of the status byte. The write handler dispatch is a calculated jump which is written in a way to be 68030 friendly.

Fast forward a few years to when I was writing Gemulator 1.0, an Atari ST virtual machine for MS-DOS which emulates the 68000 microprocessor on the Intel 386. Since MS-DOS provides the same completely unprotected memory model as the Atari ST had, and since my analysis had shown that over 99% of memory accesses on the Atari ST were to RAM, I used the simple bounds check scheme in Gemulator release 1.x through 3.x for MS-DOS.

With the release of Windows NT and the upcoming release of Windows 95, I ported Gemulator to Win32, the 32-bit execution model of Windows applications. I modified the memory load/store code to use the trap-and-emulate technique. Using the Windows functions VirtualAlloc() and VirtualProtect(), it is possible to allocate a 16-megabyte block of memory and to make individual 4K pages of memory writable, read-only, or fault on any access.

By 1996, Gemulator for Windows included a rather large and complex fault handler, which needed to determine the address of the fault, decode the guest instruction causing the fault, and then call the appropriate hardware emulation code to simulate that guest memory access.
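A bare-bones sketch of that trap-and-emulate setup looks something like the code below. This is not the actual Gemulator source: the protected page is hypothetical, the instruction decoding (the hard part) is elided, and I use a vectored exception handler for brevity:

#include <windows.h>

#define cbGuestMem (16*1024*1024)
static unsigned char *pb;   /* the 16-megabyte guest address space */

/* Called first by Windows on any exception. A real fault handler must
   decode the faulting guest instruction, emulate the hardware access,
   advance the instruction pointer, and resume execution. */
static LONG WINAPI GuestFaultHandler(EXCEPTION_POINTERS *ep)
{
    if (ep->ExceptionRecord->ExceptionCode == EXCEPTION_ACCESS_VIOLATION)
        {
        ULONG_PTR addr = ep->ExceptionRecord->ExceptionInformation[1];
        if (addr >= (ULONG_PTR)pb && addr < (ULONG_PTR)pb + cbGuestMem)
            {
            /* ...decode the instruction at ep->ContextRecord->Eip,
               emulate the guest hardware access, advance Eip... */
            return EXCEPTION_CONTINUE_EXECUTION;
            }
        }
    return EXCEPTION_CONTINUE_SEARCH;
}

void SetupGuestMemory(void)
{
    DWORD dwOld;
    pb = VirtualAlloc(NULL, cbGuestMem, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    /* revoke access to a hypothetical 4K page of memory-mapped hardware
       so that every guest access to it faults into the handler above */
    VirtualProtect(pb + 0xF00000, 4096, PAGE_NOACCESS, &dwOld);
    AddVectoredExceptionHandler(1, GuestFaultHandler);
}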

I found something interesting when comparing the MS-DOS version of Gemulator to an equivalent Windows version: the MS-DOS version was faster! In the mid-1990s, as many applications were being ported from MS-DOS to Windows, this was a common finding, and not just a virtual machine issue. Many people incorrectly attribute this trend to the extra CPU clock cycles spent task switching in a multi-tasking operating system. That is only one small factor though, since the slowdown occurs even when just Gemulator is running and most Windows services are disabled. Other people attribute it to VGA driver overhead, but I was able to factor that out by artificially slowing the refresh rate down to once per second.

What I realized was that the cause is the "unsafe code tax" we have met before. I had traded the up-front range check on every Atari ST memory access for a somewhat sporadic page fault penalty which, although rare, is very costly when it does occur. Back in the days of Windows 95 and the first generation of Pentium microprocessors, the cost of handling a page fault exception in Windows was on the order of 3000 to 5000 clock cycles.

So which is better? To take a 3000 to 5000 clock cycle penalty on roughly 1 in 1000 guest memory accesses? Or to take the additional "compare-and-branch" operation on each and every guest memory access? As it turned out, in those days it was roughly a wash, and so in 1998, when I split Gemulator into two distinct products - Gemulator the Atari ST emulator for Windows, and SoftMac the Apple Macintosh emulator for Windows - I kept the trap-and-emulate technique.
 


Emulation 103 - Nearly Free Byte Reversal

One of the most problematic issues for most C or C++ based virtual machines is the cross-endian byte-reversal problem: how to efficiently handle loads and stores when the guest and host CPU architectures are of different endianness. Emulating the 68000, 68030, 68040, or PowerPC architectures on Intel poses exactly this problem. Some Macintosh virtual machines use swab()-style C library calls, which result in rather horrible compiled code. Other implementations resort to the x86 BSWAP instruction, which unfortunately did not appear until the Intel 486. When I was developing Gemulator in 1991 and targeting 386-based MS-DOS machines, neither of these solutions was an option.

So I changed my mindset. Instead of attempting to perform an inefficient endian reversal on every load and store operation, why not simply byte swap the entire guest RAM? Nowhere is it written in stone that the large memory buffer allocated to represent the guest RAM needs to be in ascending order. Why not simply reverse it? The PowerPC architecture uses a variant of this trick called "Little Endian Mode", in which the hardware reverses the order of every group of 8 bytes of memory. I took this to the extreme and simply reversed the entire 16 megabytes of Atari ST or Apple Macintosh address space.

Imagine, if you will, allocating 16 megabytes of Windows address space (for the sake of simplicity, assume that malloc magically aligns the buffer to a 4K page boundary and that we later set protection on the non-writable pages):

    #define cbMem    (16*1024*1024)
    unsigned char *pb = malloc(cbMem);

Now, to write data to the 68000 address space, you start at the end of this buffer: guest address 0 maps to the last byte of the buffer, and increasing guest addresses work backward toward the start, like this:

void WriteGuestByte(unsigned long addr, unsigned char data)
{
    pb[cbMem-1-addr] = data;
}

How do you write a 32-bit integer to guest memory? If the address being written to is address 0 of the Atari ST address space, the code above would overrun the buffer. The trick is to realize that the "-1" fudge factor is actually a variable fudge factor equal to the size of the data access. So to write a 32-bit value, the code would be this:

void WriteGuestLong(unsigned long addr, unsigned long data)
{
    /* the fudge factor equals the size of the access, 4 bytes */
    *(unsigned long *)&pb[cbMem-4-addr] = data;
}

The mapping works out like this:
 
host linear address    Atari ST / Macintosh guest physical address

pb + 0                 16MB - 1
pb + 1                 16MB - 2
pb + 2                 16MB - 3
pb + 3                 16MB - 4
...                    ...
pb + 16MB - 3          2
pb + 16MB - 2          1
pb + 16MB - 1          0

The actual code that I used in Gemulator 98 was written in assembly language and looked something like this (the example here is a 16-bit load operation):

    neg ebp                     ; EBP = -addr
    add ebp,edi                 ; EBP = (16MB-1) - addr, carry set if in range
    ; jnc ReadHW1               ; uncomment to enable the bounds check
    mov bx,word ptr [edx+ebp-1] ; 16-bit load; -1 is the word-size fudge factor

This consumes all of 9 bytes of code and will generally execute in about 5 or 6 clock cycles. No BSWAP instruction or funky C macro required! The EBP register contains the 68000 effective address being accessed, EDI contains the hardcoded constant equal to 16 megabytes minus one, and EDX is the base pointer to the 16-megabyte block of memory.

Notice how I perform what in the C code was a subtraction operation. I instead use the NEG instruction to negate the 68000 effective address and then add to it the constant 0x00FFFFFF (which is 16 megabytes minus one, so as to avoid any fudge factor for byte operations). This has the property of setting the x86 EFLAGS Carry bit when the address is below 16 megabytes, and clearing the Carry bit when the address is out of bounds. By uncommenting the JNC instruction, the code transforms from trap-and-emulate mode to bounds check mode.

The technique I just described mostly worked, but caused spurious crashes due to the fact that 68000 code actually generates 32-bit effective addresses. Unless the JNC instruction was present, the MOV instruction could access memory well outside of the 16-megabyte buffer. Should the 68000 effective address happen to translate to, say, a Windows DLL or heap location, the guest could actually corrupt host memory. This is an example of a virtual machine escaping from its sandbox. This type of bug actually still ships today in commercially available virtual machines from other companies.

There are actually several distinct bugs in the code above (yikes!):

  • guest addresses above 16 megabytes will escape the sandbox and could corrupt host memory,
  • guest addresses above 16 megabytes that do fault but which alias to valid guest RAM incur a significant performance penalty,
  • this technique does not work if the guest effective addresses do not map linearly to guest physical addresses,
  • this code requires that the guest memory block be in one contiguous piece. When I added support for emulating 68030 and 68040 based Macintosh virtual machines and allowed up to 1 gigabyte of guest RAM, it was difficult to allocate one contiguous gigabyte of memory from Windows.

For the Gemulator 2000 and SoftMac 2000 round of emulators (a.k.a. the Gemulator 8.0 and SoftMac 8.0 releases), I tackled these issues by rewriting the NEG/ADD code sequence as follows:

    mov edi,ebp                 ; copy the 68040 effective address
    shr ebp,12                  ; EBP = page table index (address / 4K)
    xor edi,[pagetbl+ebp*4]     ; translate guest page to host page, invert low 12 bits
    ; js ReadHW1                ; uncomment to trap non-RAM pages
    mov bx,word ptr [edi-1]     ; 16-bit load; -1 is the word-size fudge factor

This fattens up the code to 16 bytes and adds a few cycles to the execution time, but it tackles each of the problems just listed. A second block of memory, the pagetbl array, contains one entry for each 4K page in the 68040 address space - a total of about one million entries. The "page table index" is calculated by taking the 68040 effective address in the EBP register and shifting it right by 12 bits. The original effective address is copied to the EDI register, which will serve to actually reference host memory.

The magic is in the XOR operation, whose second operand is read from the pagetbl array. Each of the million or so entries in that array contains the unique XOR value which maps the address of a particular page of guest memory to its counterpart in host memory. Why an XOR and not an ADD operation? Because although both can be used to adjust the upper 20 bits of the address, only an XOR also allows you to invert the lower 12 bits in order to perform the nearly-free byte reversal trick. Each of the million entries in the table is of the form 0xXXXXXFFF, such that the lower 12 bits of the address get inverted.
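Filling in one entry of that table might look like the sketch below (a hypothetical helper; hostOffset is the page's byte offset within the pb block, and both arguments are assumed to be 4K aligned):

unsigned long pagetbl[1 << 20];   /* one entry per 4K page of guest space */

/* Map one 4K guest page to one 4K host page. Because both addresses are
   page aligned, their XOR has zeroes in the low 12 bits; OR-ing in 0xFFF
   makes the runtime XOR invert the page offset as well, which is what
   buys the nearly-free byte reversal. */
void map_guest_page(unsigned long guestAddr, unsigned long hostOffset)
{
    pagetbl[guestAddr >> 12] = (guestAddr ^ hostOffset) | 0xFFF;
}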

The advantage of this code is that it more efficiently emulates aliased addresses in the 24-bit address space. For example, on the Apple Macintosh II running in 24-bit mode, the ROM BIOS is accessible both at the 32-bit address 0x40800000 and at the 24-bit address 0x00800000. This technique maps each of these two guest addresses to the exact same page of host memory, making address aliasing work with equal efficiency regardless of which address is accessed. There is no bounds checking any more, and there is no longer an assumption that guest RAM has to start at address zero, or that host memory is allocated in one contiguous chunk.

The C equivalent of the 16-bit load operation above would be something like this:

unsigned short ReadGuestWord(unsigned long addr)
{
    return *(unsigned short *)&pb[(addr ^ pagetbl[addr >> 12]) - 1];
}

There are still bugs in this updated code, not to mention performance slowdowns, which I did not tackle and fix until just this summer in the Gemulator 2008 beta. Do you see them all? Hint: they will be central to the discussion in next week's posting! The ultimate goal of these "ReadGuest" and "WriteGuest" memory access functions is to combine all of the following operations as efficiently as possible:

  • determine whether the guest memory access is valid for the given guest address,
  • determine whether the guest memory access is valid for the particular access type (load or store) and address alignment,
  • determine whether the guest memory access is valid for the current privilege level (user or kernel mode),
  • determine whether the guest memory access is valid for the particular data size (byte, word, long, quadword), and,
  • either map the guest effective address to a host address and perform the access, emulate the access in a hardware handler, or indicate failure and fault the guest.

The code I have shown so far does not meet all of these criteria, for reasons I will explain next week.
 


Emulation 104 - Just Avoid the Host MMU

Fast forward to late 2000 and the infamous launch of the Intel Pentium 4. The performance of my virtual machines plummeted, so much so that on a 1500 MHz Pentium 4 host, a Macintosh virtual machine ran slower than on a similar 900 MHz Intel Pentium III or Athlon host. The Pentium 4 has such significantly higher clock cycle costs for handling page faults and other operations that all my performance tuning was thrown awry. Depending on the model of Pentium 4 and the version of Windows, a page fault incurred a cost of at least 11000 to 18000 clock cycles! Even with the faster clock speed of the Pentium 4, more actual real time is spent handling the fault.

For example, comparing a 900 MHz Pentium III to a 3600 MHz Pentium 4, if the Pentium III takes only 3000 clock cycles to do what the Pentium 4 takes 18000 clock cycles to do, then the real throughput of a 1990's era Pentium III system is 300,000 page faults per second. The 2005 era Pentium 4 - 200,000 page faults per second. The Pentium 4 architecture fails to perform at exactly the operations needed to make not only virtual machines fast, but also the basic mechanisms by which operating systems provide virtual memory and do multi-tasking.

As part of the performance tuning I did for the SoftMac XP release, I tried to reduce the number of hardware page faults, and found one easy optimization. Most hardware registers on both the Atari ST and the Macintosh are 8-bit memory locations. In contrast, most memory accesses to RAM tend to be 16-bit or 32-bit operations. By simply tuning the virtual machine to use bounds checking on byte accesses and trap-and-emulate on all other accesses, I reduced the number of page faults and thus the unsafe code tax. Such a hybrid approach runs at almost the speed of the MS-DOS versions. Fortunately I had already written the conditional jump instructions described above into the source code (commented out), so it was easy to tune the emulator specifically for those Atari and Macintosh guest instructions which were known to generate hardware accesses.
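In outline, the hybrid policy amounts to something like this sketch (hypothetical names; the real code is in assembly and folds these checks into the individual opcode handlers):

/* Hybrid sandboxing: byte stores, the accesses most likely to hit an
   8-bit hardware register, pay an explicit software bounds check, while
   wider stores go straight to guest RAM and rely on the host MMU to
   fault on protected pages (trap-and-emulate). */
void WriteGuestByte(unsigned long addr, unsigned char data)
{
    if (addr >= ramLimit)                 /* hypothetical end-of-RAM bound */
        {
        write_hardware_byte(addr, data);  /* hypothetical hardware handler */
        return;
        }
    pb[cbMem - 1 - addr] = data;
}

void WriteGuestWord(unsigned long addr, unsigned short data)
{
    /* no software check: if this page is protected, the store faults into
       the exception handler, which then emulates the hardware access */
    *(unsigned short *)&pb[cbMem - 2 - addr] = data;
}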

Ironically, we've come full circle again. As the Pentium 4 was phased out and replaced by the Pentium III derivative Pentium M, Core Duo, and Core 2 microprocessors, Windows software got drastically faster. On the order of 2 to 3 times faster in some cases, and in the case of my virtual machines, almost 4 times faster (comparing a 2.66 GHz Core 2 host to a 2.66 GHz Pentium 4 host).

Yet a new unsafe code tax now looms, which is this rush to hardware virtualization and its longer latencies for handling page faults and exceptions. My results with Gemulator and SoftMac show that either a pure software memory protection scheme or a hybrid software/hardware memory protection scheme is superior to a purely hardware based scheme. This is exactly the same conclusion that engineers at VMware reached last year when they published the BT vs. VT paper that I have been mentioning so much (and here it is again in case you still did not read it: http://www.vmware.com/pdf/asplos235_adams.pdf).

After measuring the "VT tax" on my own Core 2 systems, I tried a different approach this summer in the new beta releases of Gemulator 2008 and SoftMac 2008 - I went back to completely software-based memory sandboxing using a version of the "software-TLB" that tackles some of the outstanding issues I missed in SoftMac 2000. The end result, which shipped in the beta releases I posted to this site in August, is about as efficient as the hybrid approach, yet eliminates the need for the host CPU to provide any MMU functionality at all. If all operating systems and applications always executed within a binary translation based virtual machine, AMD and Intel hardware as we know it today could be vastly simplified, shrunk, and made more secure. Many Intel errata, which are due to bugs in the complex hardware memory protection implementation, would be eliminated.

Designing future virtual machines to completely avoid the host MMU and to avoid the trap-and-emulate technique of memory sandboxing is the key to developing efficient virtual machines that can also support live migration across heterogeneous host computers. Vanderpool and Pacifica should not be utilized by future virtual machines.

The software-TLB approach is simple but does incur several clock cycles of penalty. This is why four weeks ago I recommended that as part of the 10 steps to improve the x86 platform, one of those steps needs to be to add a few simple new x86 instructions to help accelerate binary translated code.

Next week:

  • the Gemulator 2008 software-TLB benchmark results
  • the Gemulator 2008 software-TLB code sequences
  • my suggestions for new x86 instructions to accelerate virtual machines
  • progress report on optimizing Bochs 2.3.5

 


If you have comments, as always, email me directly at darek@emulators.com.

[Part 1]  [Part 2]  [Part 3]  [Part 4]  [Part 5]  [Part 6]  [Part 7]  [Next]