NO EXECUTE! Sep. 24 2007

NO EXECUTE! A weekly look at personal computer technology issues.

September 24 2007

[Home] [Part 1] [Part 2] [Part 3] [Part 4] [Next]

Previously on NO EXECUTE!...

I trust my posting last week raised enough doubt to question the "AMD64" 64-bit extensions to x86 as implemented by AMD and Intel, specifically in regard to how 64-bit integer register extensions, 64-bit addressing extensions, and the removal of useful segmentation were all bundled together in the 64-bit "long mode" of execution. I demonstrated how the design did not pay close enough attention to the needs of the virtual machine, and how it would have been useful to allow the individual enhancements of AMD64 to be enabled individually instead of using one master "switch". As AMD, Intel, Microsoft, VMware, and other companies race to force consumers to make the 64-bit upgrade to the AMD64 architecture, a new slew of post-AMD64 technologies is already on its way - SSE4.x, SSE5, LWP, hardware virtualization, multi-core processors, and hardware virtualization-based "hypervisors" such as VMware Server and Microsoft's upcoming Windows Server 2008 Virtualization.

This recent Windows Server 2008 Technical Overview paper (http://www.microsoft.com/technet/windowsserver/longhorn/evaluate/whitepaper.mspx?wt.svl=globalhighlight) indicates to me that Microsoft is betting the farm on AMD64-based hardware memory protection. One sentence on hardware-level security states that the NX / DEP feature "prevents execution of most prevalent viruses and worms". I disagree with Microsoft and wonder if they actually have any data to support this claim, given how easy NX is to circumvent by malware exploits. The fact that virtualization in general is going to require a 64-bit host microprocessor that uses either Vanderpool or Pacifica is unfortunate, since Microsoft's existing Virtual PC 2007 and Virtual Server 2005 products run just fine on 32-bit non-VT systems. XenSource's latest Xen 4.0 release and VMware's latest releases also seem to require 64-bit VT-enabled processors. The fundamental problems of design limitations of AMD64's "long mode" - vulnerability to buffer overruns in 32-bit and 64-bit flat model execution models and operating systems, the limited usefulness of NX/DEP, and doubts as to whether hardware virtualization is even faster than older software based binary translation - are not addressed by anything I'm hearing from Microsoft, VMware, or XenSource.

As I proposed last week, as the average consumer has yet to even embrace 64-bit operating systems, there is still a window of opportunity to get 64-bit "right" and fix the security and reliability issue. The world should forget AMD64 and think about how 64-bit multi-core microprocessors could have been designed differently, and thus how secure 64-bit operating systems might evolve. The nice thing is that such mental experiments can be realized and tested thanks to binary translation based emulation techniques. In fact, computer science researchers regularly perform such experiments and publish papers about possible instruction set extensions or hardware design changes which they test out by simulating their changes in software. The drawback is that generally these experiments are performed on non-x86 architectures, such as Java virtual machines or on RISC processors.

To me it seems advantageous for computer science to be researching real x86 environments, analyzing the real world code that hundreds of millions of Windows, Linux, and now Mac OS users really use, and thus being able to derive new x86 instruction set extensions, memory models, and computer languages from real data. It seems to me that the "horse goes before the cart", in that x86 operating systems and languages have evolved based on design decisions made by AMD and Intel. It should be other way around! AMD and Intel should be designing hardware changes jointly.

Therefore, after taking the first steps last week to imagine that we stripped the AMD64 hardware bare of most of its hardware memory protection and virtualization capability, the next leap is to develop a new memory model which ideally would offer the better memory protection of segmentation yet retains the easily addressable large address space of a flat memory model. In the short term, such a memory model has to be implemented and researched entirely in software using runtime libraries and virtual machines, but could then later be realized in hardware with the addition of some simple x86 instruction set extensions and relatively simple address translation hardware.

As I was preparing to post today's blog, I could not have read a more timely news piece than what I read in the Wall Street Journal this morning. As much grief as I'm giving AMD over their AMD64 design, I do applaud their support of the One Laptop Per Child effort which I mentioned in my very first paragraph of my very first posting here three weeks ago. I was very pleased to see today's announcement that starting in November it will be possible for anybody to purchase these AMD- and Linux-powered "XO" laptops as I had first seen on "60 Minutes" four months ago. This is exactly the kind of inexpensive portable device that represents the next billion PCs to be sold in the developing world; PCs that will not be powered by VMware hardware virtualization or Windows Server 2008 requiring $1000 microprocessors. By the very nature of trying to produce the OLPC hardware as inexpensively as possible and have it consume as little power as possible, it is imperative that as much unnecessary functionality be removed from hardware and put into software.

The possibilities here are quite staggering. As companies like Apple, Microsoft, and VMware are doing everything they can to court the deep pockets of IT departments, gamers, and home entertainment enthusiasts, they are also ignoring and locking themselves out of China, Africa, India, and other emerging markets that represent the next billion new PC owners. In five or ten years, a company like Google could replace Microsoft as the world's dominant supplier of new software. While software based virtual machines could bring decades of legacy MS-DOS, Windows, Atari, and Mac software to those billion new users.

I am excited by this news and have already put on my down for purchasing one of these laptops. I encourage you to do the same and consider making your next notebook computer not an overpriced Dell but rather an XO. Put your name on the mailing list at this web site and let's all wish the OLPC effort all the best:

http://www.xogiving.org/

Mistakes of the Past

Before we can unify segmentation and with flat model demand paging, let us consider the motivations of why features like protected mode segmentation and demand paging were added at all, what unintended side effects this has produced in today's software, and what the root causes of each problem (and possible solutions) are.

Problem: Real mode addressing sets a fairly small finite limit on the amount of virtual address space available to the operating system and to all the programs that run on that operating system. For example, in MS-DOS, the 20-bit physical address space of 16-bit real mode resulted in a practical limit of about 640K of total usable memory to be used by MS-DOS itself, drivers, memory resident programs, and applications.

Side effect: All sorts of schemes were devised to work around the 640K MS-DOS memory limit - from the A20 hack available on 286 and 386 microprocessors to map in an additional 64K of memory using a 21st bit of address space, to 386 memory managers such as 386MAX which similarly allowed mapping of physical memory beyond 1 megabyte, to using some of the address space reserved for video cards for RAM. By the early 1990's this resulted in the CONFIG.SYS and AUTOEXEC.BAT hell that was mocked in Apple Macintosh ads. Anyone who had an MS-DOS computer in those days played the dance of editing those startup files and crossing their fingers hoping their PC would still boot and maybe give a few extra kilobytes of memory.

Root cause: An unnecessary amount of complexity was introduced early on in MS-DOS to add a very small amount address space. Windows and x86 processors have continued to find themselves in similar traps with such gimmicks as /3GB and /PAE which only marginally increase address space but add all sorts of complexity to the operating system. For example, Intel recently published a technical paper just to explain to operating system designers the correct way to flush the translation cache (http://www.intel.com/design/processor/applnots/317080.pdf).

Problem: 16-bit protected mode segmentation was added to the 286 and used by OS/2 to vastly increase the virtual address space to well beyond 640K. The operating system is able to expose up to 16383 segments of up to 64K each, a virtual address space of about 30 bits, or 1 gigabyte, per running program. While segmentation is an effective way to stop buffer overruns in memory blocks, it requires burning a segment selector for each such protected memory block. As this marketing literature for a commercial memory manager (http://www.microquill.com/smartheap/sh_tspec.htm) quotes a figure of "under 256 bytes" per memory allocation, and I've heard figures of well under 100 bytes per allocation, this limit the practical addressable memory of an OS/2 program to about 1 or 2 megabytes of secure memory, hardly better than what is available to MS-DOS!

Side effect: To work around the limits of burning selectors, most applications use a memory heap. A heap is a large block of memory allocated by the operating system as a single segment. The application code (usually in the C runtime library) then subdivides the heap into smaller allocations. The application is then forced to do its own management of multiple heaps. The application has more address space to work with, but unfortunately within each heap the memory blocks are susceptible to buffer overrun bugs and trampoline exploits. On systems with larger addressing, such as the 24-bit addressing on the Atari ST, Commodore Amiga, and early Apple Macintosh, the operating system and all applications essentially shared one large unprotected heap.

Root cause: There are two basic flaws of segmentation: first in the way the bits were distributed: too few bits in the selector (14 of the 16 bits of the segment selector could be used as an actual index into one of 16384 segments, of which segment index 0 was always invalid), and too many bits in the offset. Second, the 14 useful bits in the selector were in the wrong bit positions, such that simple pointer math does not work. When a segment offset overflows from 0xFFFF to 0x0000, it was not possible to simply flow into the numerically next segment selector. This problem was not addressed by the 48-bit segmented addressing model brought about in the 386 and 486 microprocessors.

Quick aside... With 20 years of hindsight, I'm going to state the obvious. Had the 386 or even the AMD64 design simply extended segment selectors up to a full 32-bit width, Windows and Linux may have very well have take a more secure approach to memory management years ago. The segment descriptor tables that all segment selectors index are also just segments. All together now... D'oh!

Problem: To address the memory protection vulnerability of large heaps, Motorola and Intel responded with the 32-bit 68030 and 80386 generation of microprocessors which brought demand paged virtual memory and hardware memory protection capabilities to their respective architectures using built-in MMU ("Memory Management Unit") hardware. An application could now address up to 32 bits, or 4 gigabytes, of address space, consisting of over a million 4K-"pages" of memory, each page having control bits to prevent writing to the page or even accessing any data on that page. This has now led to what Microsoft Research calls the "Unsafe Code Tax", the performance loss due to the memory and CPU cycles wasted due to the page tables and caches related to pages - page tables, page directories, TLBs ("Translation Lookaside Buffers") - and the costs associated with flushing and refreshing those data structures. Microsoft's data in their Singularity paper indicate that the Unsafe Code Tax at well over 30% of execution time of user mode code! For example, whereas executable files used to be loaded monolithically in one disk-read operation, demand paging now causes a 10 megabyte executable program to require 2500 separate page faults and disk-read operations, greatly slowing down the boot time of operating systems and applications compared to before.

Side effect: Operating systems such as Windows XP and Windows Vista have re-discovered the concept of preloading a large block of code all at once! In Windows XP this was called "prefetching". In Windows Vista, this is now being marketed as "SuperFetch" (http://www.microsoft.com/windows/products/windowsvista/features/details/superfetch.mspx), but is not much more of an improvement over how MS-DOS worked 25 years ago!

Root cause: 4K is the wrong page size of granularity! 4K is still much larger than the average memory allocation. To effectively stop buffer overruns, the operating system would need to allocate a separate 4K page of memory to each object, then pad the address space with another 4K "guard page" which is not accessible. The guard page blocks a buffer overrun from spilling into the next allocation block. Any such small allocation would thus burn 8 kilobytes of address. Burning 8K for the typical "under 256 bytes" allocation means that an operating system such as Windows XP, which exposes 31 bits of address space to each application, would in effect be limited to at most about 2^31 / 8192 or about a quarter million allocations. This is slightly more than an order of magnitude increase over OS/2. Hard disk drives operate on 512-byte granularity, and memory caches operate on 32-, 64-, or 128-byte granularity. A better solution would have been to actually make pages smaller.

Problem: There is a problem with making pages too small and probably why 4K became popular. Large memory allocation which are paged 4K at a time consume a great deal of page table entries. For example, in Windows Vista, each 4K page of virtual memory is described by an 8-byte "page table entry", a data structure which the operating system sets up to tell the MMU how to translate the virtual address of the page to physical memory, and what memory protection to apply. Microsoft advertises that 64-bit Windows XP and Windows Vista now support 43 bits of user mode address space, or 8 terabytes, per process. However, in simple experiments which I've performed on 64-bit Windows Vista, even a single user-mode call to VirtualAlloc (http://msdn2.microsoft.com/en-us/library/aa366887.aspx) to request reserving only one terabyte of address space can bring a quad-core Core 2 system to its knees. It turns out that 1 terabyte of address space requires 1TB/8 = 2 gigabytes of page table entries. Windows Vista allocates the 2 gigabytes of page table entries up front (you can see this take place by bringing up Task Manager and displaying the working set and commit size of the test application) and thus brings itself to its knees as it tries to complete the allocation by swapping out other applications. To actually allocate the 8 terabytes of user address space as advertised by Microsoft requires 16 gigabytes of swapfile space before one single page of that address space is even touched! 64-bit version of Windows are easily susceptible to denial of service attacks that involve nothing more than a single user mode Win32 call.

Side effect: Hardware has responded by allowing larger "superpages" of granularity (http://ertos.nicta.com.au/publications/papers/Szmajda_Heiser_03.pdf) of up to 4 megabytes on today's AMD and Intel microprocessors. Parts of the Windows kernel now for example use the large superpages in order to reduce the footprint of page tables. Most Windows applications do not use superpages, and in my opinion a 4 megabyte page is overkill. Most code modules are not that large, and the larger page size does nothing to address the buffer overrun issue.

Root cause: Superpages are an example of the mentality that "bigger must be better", leading up to what are now essentially unbounded page tables and data structures on x86-based architectures. Had the AMD designers looked at how the PowerPC architecture was designed 15 years ago, they would have seen a better solution - bounded page table sizes. The PowerPC takes the approach of "hashed page tables", where instead of linearly accessing a large array of page table entries, the PowerPC hardware hashes an address down to a very small number of bits. This allows the overhead of a hash table to be on the order of hundreds of kilobytes or a few megabytes, not gigabytes as can happed on AMD64. Hashing works because of the common principal of data locality - programs tends to execute code and access data which is very localized to a small number of pages at a time. i.e. just because a program allocates one terabyte of address space does not mean it will be operating on that one terabyte of memory all the time. Therefore page tables really don't need to as "random access" as they are. Even the Core 2's TLB (and think of the TLB as really nothing more than an L1 data cache of page table entries) contains only a few hundred cached entries.

Problem: To work around these limitations of hardware protection and page tables, managed languages such as Java and C# have implemented their own software based memory protection schemes, bypassing most of the benefits of the hardware MMU. A large application heap, even under Java or .NET, will however still burn a lot of page table entries. Worse, the memory management schemes of Java or .NET may now directly conflict with the memory management schemes of the operating system itself.

Side effect: Slow performance. When I was taking a Visual Studio 2002 (a.k.a. VC7) training class many years ago, I was appalled at how much memory a simple C# test application need to run. Not having played with either C# or Java at that point, I was blown away that a Windows 2000 computer with 128 megabytes of RAM was thrashing like crazy just to boot up Visual Studio, compile a 20-line C# test program, and run that program. Prior to that I was building my Atari and Macintosh emulators on Windows using Visual Studio 98 on computers with 32 megabytes of RAM. I quickly realized that C# was not a language for me and went back to writing code in x86 assembly language and C.

Root cause: A high level language's memory allocation and "garbage collection" need to be designed in to the operating system itself, not executing as a separate and mutually exclusive layer. This is how it actually worked 20 years ago. Handle-based (classic Mac OS and 16-bit Windows) and selector-based (OS/2) memory models operate on an abstract form of a memory block. Program code does not access the memory block via direct pointer, but rather with the handle or selector. It is a level of indirection to hide the true location of the memory block from the application code. This leaves the operating system free to move the actual memory blocks around in memory in order to reduce memory fragmentation. This is known as "compacting" memory, a main component of what is known as "garbage collection" in languages such as C# and Java. The reason garbage collection conflicts with the operating system's memory management is that garbage collection requires reading through the entire heap of memory. This causes page faults and disk activity as rarely used pages of heap are loaded back in to memory to be garbage collected. Since the garbage collection usually kicks in when memory is low to being with, the additional paging can really slow down the system, as I observed when trying to use C#.

Problem: Operating systems such as Windows XP and Windows Vista have adopted an approach where all threads of the same program share the exact same address space. Linux has a similar but looser approach to this where threads share common address spaces. This means that within the typical 2GB user address space of a Windows application, multiple stacks are mapped in, each stack thus being vulnerable to bugs or attacks from other other threads. The more threads created, the more opportunity for attacks. If malware manages to inject a thread into a process, it has free reign over all the stacks of the other threads.

Side effect: For both practical reasons and security reasons this limits the amounts of threads that a Windows process should create. For example, create 100 threads in a Windows application, and you will get 100 separate 1-megabyte regions of memory getting reserved, which quickly eats into the available 2GB address. There are other per-thread memory blocks created, such as the TEB ("Thread Environment Block") which is memory that the operating system uses to store information about the particular thread. There are band-aid workarounds, such as reducing the default size of a thread's stack or using the /3GB or /USERVA boot options to increase user address space, but ultimately there are hard limits on how many threads a Windows application can have. Not a scalable or secure way to go forward to a 100-core future.

Root cause: The problem has to do with how Windows chooses to handle TLS ("thread local storage") memory. Because of the large size of page tables and the high cost of context switching between different page tables, Windows really does not have the option of using a separate page table per thread. So what it does is use segmentation to remap the base address of the FS segment to point to the starting address of a thread's TEB page. This is the one aspect of segmentation that AMD did keep alive in AMD64, in that the FS and GS segments to still work as segments. Otherwise the Windows threading model would just plain break. But a secure and scalable solution would be use separate page tables for each thread, and thus to map thread local storage and stack of different threads to the same virtual addresses. This would not only keep the threads more secure from each other, but allow a 32-bit application to make use of far more than the usual 2GB to 3GB of user address space.

Problem: Today, hypervisors such as Virtual PC and VMware add their own extra layer of memory indirection and memory management. The hypervisor acts as another layer of memory protection and memory management underneath the operating system being virtualized and can cause performance problems similar to the case of garbage collection.

Side effect: The virtualization techniques used by hypervisors such as shadow page tables, or the doubly nested page tables of AMD's Pacifica technology result in even more memory being wasted on page tables, and on more slowdown. I have personally observed about a 20% to 60% slowdown in using hypervisors. No wonder operating systems are taking longer and longer to boot. An application today can literally be using three completely different sets of memory translation and management techniques which adversely affect each other.

Root cause: As with the previous problem, this is due to layering incompatible memory management schemes on top of each other. The solution of course is to flatten managed languages, operating systems, and hypervisors to use a single memory allocation and memory compaction mechanism.

Even everyday C programs that make up the majority of Linux and Windows software are susceptible to the layering problem. The C runtime's memory allocator in most implementations uses a hybrid approach of using a large heap from which to allocate small memory blocks, and uses the operating system's allocator for large memory blocks.

I'd like to share some very simple Windows test code and results with you to illustrate this issue of the many layers of memory management. The test code is written and compiled to run as 32-bit user-mode code in Windows XP or Windows Vista. It eats up the 2-gigabyte user address space by repeatedly allocating memory blocks using either the C runtime's malloc() function or the Windows VirtualAlloc() function. I performed the tests of my quad-core Core 2-based 2.66 GHz Mac Pro which runs 64-bit Vista in 4 gigabytes of physical memory. I ran the test once natively (well, "natively" meaning it thunked through WOW64 as 32-bit applications do), and then again running the test on Windows XP virtualized by Virtual PC 2007 on top of that same 64-bit Vista. The average cost of an allocation is calculated in nanoseconds, so multiply by 2.66 to get an approximate clock cycle count on the Core 2.

Each line of the table below shows for a give allocation size and a given allocation method, the number of such blocks allocated from the 2-gigabyte address space, the approximate total amount of allocation (out of a maximum possible of about 2045MB), and the average allocation cost in nanoseconds:

allocation size / method number of blocks allocated (from the 2GB user address space) usable allocated bytes average allocation cost per block (ns) slowdown (relative to 1-byte malloc)
1 byte / malloc() about 131 million about 126MB 108 1.0 (reference)
1 byte / VirtualAlloc about 32 thousand about 32K 1455 13.5
1 byte / VirtualAlloc + one-byte write about 32 thousand about 32K 2910 27.0
100 bytes / malloc() about 18 million about 1801MB 287 2.7
100 bytes / VirtualAlloc about 32 thousand about 3MB 1424 13.2
100 bytes / VirtualAlloc + one-byte write about 32 thousand about 3MB 2879 26.6
4080 bytes / malloc() about 500 thousand about 1978MB 2270 5.6
4080 bytes / VirtualAlloc about 32 thousand about 126MB 1424 13.2
4080 bytes / VirtualAlloc + one-byte write about 32 thousand about 126MB 2910 27.0

When I then repeated the tests running Windows virtualized on Virtual PC 2007 on the same hardware, with the Hardware Virtualization option enabled, I got these results. This table lists the additional overhead of each test case compared to its non-virtualized version:

allocation size / method number of blocks allocated (from the 2GB user address space) usable allocated bytes average allocation cost per block (ns) additonal slowdown (relative to non-VPC cases above)
1 byte / malloc() about 131 million about 126MB 146 1.4
1 byte / VirtualAlloc about 32 thousand about 32K 7346 5.0
1 byte / VirtualAlloc + one-byte write about 32 thousand about 32K 15984 5.5
100 bytes / malloc() about 18 million about 1801MB 433 1.5
100 bytes / VirtualAlloc about 32 thousand about 3MB 7346 5.2
100 bytes / VirtualAlloc + one-byte write about 32 thousand about 3MB 16561 5.8
4080 bytes / malloc() about 500 thousand about 1978MB 9119 4.0
4080 bytes / VirtualAlloc about 32 thousand about 126MB 7653 5.4
4080 bytes / VirtualAlloc + one-byte write about 32 thousand about 126MB 16867 5.8

As you can see, the mixing of the C runtime memory management, the Windows kernel page based memory management, and the virtualization overhead can give more than a 100-to-1 slowdown between the best and worst cases. The best case involves small allocations which tend to come from already allocated and memory resident pages of memory. The worst case being the C runtime has to ask the Windows kernel for new pages of memory, the Windows kernel then maps in a range of pages, this causes the hypervisor (in this case Virtual PC) to get called to manage the real physical memory, and then this cycle gets repeated as the application code writes to the allocated memory block, which induces a "copy-on-write" fault. The copy-on-write mechanism is considered an optimization in Windows to lazily commit writeable physical pages of memory, but unfortunately this induces twice as much overhead as needs be.

The Windows memory model is showing its age, because the above data shows another design limitation of Windows - it allocates pages of memory in 64K blocks at a time. If a programs asks Windows for 1 byte of memory, it gets back a 64-kilobyte range of address space. This limits the "guard page" scheme of buffer overrun protection to at most about 8 such blocks per 64K (4K allocated, 4K guard page, repeated 8 times) which therefore would give an effective protected memory space of between 3 megabytes and 1 gigabyte depending on the average allocation size, consisting of at most about a quarter million such allocations. So as I said before, this a little over an order of magnitude better than OS/2 provided.

What really should trouble you though is the huge cost of hardware virtualization. My very simple test gives very similar results to the data that VMware presented in the paper I cited last week. The hardware virtualization (in this case, the "Vanderpool" technology found in the Core 2 microprocessor) seems to cause about a five-fold slowdown to the cost of allocating memory. 16 thousand nanoseconds may not sound like much, but that's over 40,000 clock cycles per memory allocation in the worst case. Imagine the extra slowdown if your computer is low on physical memory and swapping a lot, or if a managed languages is performing a lot of garbage collection.

Computer vendors tend to always focus on peak performance and best case benchmark results, such as the ideal rate at which video can be encoded, the ideal rate at which a 3-D video game can render, or the ideal rate at which a database can process transactions. But as with an automobile's gasoline mileage, consumers rarely see that peak performance. Which would you rather have:

a system with fast "best case" performance but which can quickly degrade by a factor of 100-fold slowdown or more? Or...
a system where "worst case" is bounded at say, no more than a 10-fold slowdown compared to a slighly slower "best case"?

I would go for System #2 any day (and I do because I tend to under-clock my Macs and PCs to save power). I would rather sacrifice a little bit of peak performance but know that I can deal with the worst case, than have a computer where performance can vary wildly by more than a factor of 100. That is why in my personal computer I go with slower microprocessors running at below 3.0 GHz but choose faster memory and faster hard drives because that helps limit the performance swings I will experience.

It is interesting that even the 1-byte malloc() case gets 1.4 times slow (actually closer to 1.35, or a 35% slowdown) which is exactly in line with the 33% to 37% Unsafe Code Tax. One can assume that even the 108 nanosecond reference time is itself already taxed, since the Unsafe Code Tax is a sum of all the clock cycles lost due to TLB misses and cache line misses due to context switches and other hardware memory management issues.

As was also found in the VMware paper that I mentioned last week (http://www.vmware.com/pdf/asplos235_adams.pdf), running Virtual PC 2007 with and without the hardware virtualization enabled gave significantly different results. Without hardware virtualization, the additional cost of the copy-on-write page fault appears to be about 2000 to 3000 clock cycles faster than when hardware virtualization is enabled. Errrr, why did this "feature" get released? A perfect example of Intel putting the hardware cart in front of the software horse when things were doing just fine as it was.

Hopefully you see my argument now that hardware virtualization is not a viable option for the future, as all these latencies associated with virtualization put on a strain on the system. Context switches from user mode to kernel mode to the hypervisor not only slow down the execution of individual threads, but the longer the time spent either in the operating system or in the hypervisor, the greater the chance that certain locks are being held longer. Longer held locks means greater contention on locks, more threads being blocked by locks, and thus a more serialized execution of the code. Who cares if your future computer has 100 cores, as Intel promised in 2005, if most of the execution time is spent waiting for locks to get released? That is not the path to realizing the vision of concurrency and scalability that should be possible on 100-core systems.

There was a paper published at the 2004 Virtual Machines Symposium by German researches (http://l4ka.org/publications/2004/Towards-Scalable-Multiprocessor-Virtual-Machines-VM04.pdf) which already identified this type of problem on quad-processor Pentium III Xeon web servers. What they found is that upwards of 30% of CPU cycles were wasted due to lock collisions because the operating system layer didn't co-ordinate thread scheduling with the hypervisor layer. I will elaborate on this point in a later posting, but this will lead to the reasoning behind last week's Step #3 about eliminating hardware interrupts and asynchronous exceptions. As with memory management and heap compaction / garbage collection, pre-emptive multitasking and synchronization needs to occur at one central point rather than occur at multiple conflicting layers as happens today.

Defining a Memory Model for the Virtual Execution Runtime

"The need for software security has been known for a long time, yet software exploits continue to be a problem. The root of the problem lies within the software itself. Bluntly stated, most software is not secure... The buffer-overflow bug is the most significant weakness in software today. It has been an enabler of thousands of software exploits. And, it's a bug - an accident that can be fixed." - Rootkits: Subverting the Windows Kernel by Greg Hoglund and James Butler

Linux and Windows software has been plagued by security holes for years, the most common being the simple software bug of the buffer overrun (a.k.a. buffer overflow). I've hopefully convinced you by now that the reasons for these software bugs are partly rooted in the fundamental design flaws of the AMD and Intel microprocessor architectures themselves. Marketing stunts like NX/DEP do little to address the root causes, and simply wrapping the problem inside of a VMware virtual machine or giving money to Symantec for warning bells is not a solution to security.

So far I've shown that:

Segments can provide per-object buffer overrun protection and support memory compaction, but allowing only 16383 per process is too few.
Pages can provide buffer overrun protection at an 8K granularity, which can quickly consume a 32-bit address space.
Large superpages reduce page table size and TLB misses, but do not add any buffer overrun protection.
It is slow to switch from the user mode of an application to the kernel mode of an operating system.
System calls can be another five-fold slower yet when hardware virtualization is present.
Reducing the size of a page below 4K can help with granularity at the expense of page table bloat.
The design of the 386 erred by adding extra bits to the segment size but not increasing the number of segment selectors.
Thread-local storage should truly be isolated and secure from other threads in a process.

I have a solution in mind of how to solve these problems in a virtual machine that is based on binary translation. i.e. a hypervisor that uses good old dynamic recompilation with no hardware virtualization or memory protection in place. I prefer to call this a Virtual Execution Runtime ("VERT") as unfortunately the term "hypervisor" is too synonymous with hardware virtualization based products such as Xen, VMware, and Virtual PC.

Whereas hardware virtualization based hypervisors attempt to achieve performance by using as much direct execution of the virtualized code as possible, the rules of the VERT are quite different:

All unprivileged code goes through binary translation - OSes, drivers, applications - period.
All unprivileged memory data accesses are checked and sandboxed in software by the translated code.
All memory allocation and garbage collection is provided by the VERT.
Application-level code never implements its own memory management, heap compaction, or thread scheduling.

This model is similar to something like what C#, Java, LLVM, or Singularity already do today in software on top of existing native operating systems. I am extending this concept to sandboxing of all x86 native code as well. I am merging the separate concepts of secure native code and secure managed code into one and the same thing. All languages, all bytecodes, they all use the same memory model. An object reference in C# is the same 32-bit value that can then be passed to a native C function. No separate layers of sandboxing, no marshalling of data between languages. My ideal software stack does not have different abstractions for the same thing in each language. This is a secure 32-bit flat memory model that neither 32-bit flat model Windows or Linux offers today, nor that the OS/2 segmented memory model scaled to.

This model can also scale up to using 64-bit addressing, but I believe this to be unnecessary. A quick look (using Task Manager) at the memory footprint of programs running on Windows Vista right now on my system shows the memory footprint of every running process to be below 95 megabytes. Why do programs need to burn 64 bits on a pointer when only about 28 bits is needed for real-world core? I will present a 32-bit addressing model for now, and explain how (and why) the limits of 32-bit addressing can be bypassed using multiple threads.

The first obvious change is to bite the bullet and shrink the page size. A 512-byte page would match the sector size of a hard disk, but that's still well above the typical memory allocation size. We could pick a 128-byte page, to match the size of L2 cache lines. I will select a 256-byte page, both because it falls in the range between those two numbers, and because it is an easier size to play bit manipulation tricks with in software. Thus, in a 32-bit application this results in over 16 million (2 to the power 24) possible pages. I'll refer to the upper 24 bits of the virtual address as the "page selector", and the lower 8 bits as the "page offset", similar to segmentation terminology.

An application that actually allocates anywhere near that amount of memory will thus need tens of megabytes of page tables. So the second step is to take the PowerPC route and go with a fixed size page table. Trust data locality to limit the number of active pages in use at any given moment, just as the designers of the PowerPC depended on. I will again go with the number 256, and limit the page table to 256 entries. For now this appears to limit the maximum amount of "live data" to about 64K.

I want to go one step further. What I just described merely shrinks the amount of intermediate page table information from something that is almost unbounded to something bounded. But the real TLB is still much smaller than the fixed size page table. Why not eliminate the page table entirely and just have the virtual machine feed the translation data directly into the TLB? This is a hardware optimization to consider later.

A 32-bit application with a 32-bit address space can now allocate about 8 million blocks of memory of 256 bytes of less, and by interleaving those with 8 million "guard blocks" of 256 bytes, have buffer overrun protection of up to 8 million objects. The application is free to allocate a large block of memory, say, 10 megabytes, and it will receive a contiguous run of about 40 thousands page selectors followed by one guard selector. 32-bit pointer arithmetic works just fine within all those page selectors, but trying to increment past the last byte of the last page selector will result in hitting a guard selector and thus fault the program.

The next step is to eliminate a common assumption of virtual memory, in that the byte offset of a virtual page has to match the byte offset of a physical page of memory. In other words, with 4K pages, the virtual address 0x12345678 can map to a physical address 0x33333678 or 0x98765678, but not to 0x11111111. The lowest 12 bits (the byte index into a page) has to be the same in both address spaces. I'm removing that assumption, and saying that the virtual machine could choose to map the virtual address of a block at any byte boundary alignment in the real address space. Most likely, a 256-byte block will be aligned at the smallest data cache boundary, which is typically 32 bytes or 64 bytes. There is no need to have the larger alignment since hardware address translation is not involved. Why burn a full 256 bytes (or 4K as today) of physical memory if only 40 bytes of that are actually valid, right? Allowing page granularity to map down to say 32-byte granularity in physical address space will reduce the pressure to swap pages out to disk. Using the scenario above of an average allocation size of 100 bytes per memory block, 512 bytes of virtual address space (a 256-byte block of allocated space and a 256-byte guard block) maps down to just 128 bytes of physical memory, wasting at most only 28 bytes. This is a size optimization really, and perhaps the memory allocator can expose versions of malloc() that are stricter than others as far as alignment. For example, an allocation that is intended to be used with hardware DMA may require correct alignment as well as that the physical addresses allocated are contiguous as well.

Finally, by taking advantage of either a small fixed-size page table or feeding the data into the TLB hardware directly, this model eliminates the need for hardware segment selectors (FS and GS) for implementing thread-local storage. This also eliminates the extra levels of explicit indirection that have to be compiled into code to access thread-local storage, as this indirection will be handled by the virtual machine at the same overhead as accessing any other memory. In this model, individual threads are really more like separate processes, choosing to explicitly share with other threads the data that they need to share, and keep private what needs to say private. Now a 32-bit application need not worry about the 2GB or 3GB or 4GB user mode address space limits. If an application swaps 100 threads and each thread allocates 100 megabytes of private data to work on, that's 10 gigabytes of address space accessible to a single 32-bit application.

Such a thread-local storage model addresses one last security hole which are buffer overruns on the stack itself which can cause a buffer to overrun a function's return address. So far I have only focused on buffer overrun protection for the C runtime heap and allocations which are global to the entire application. What about stack protection? I'll go against conventional wisdom and state that it is fundamental design problem to use the stack both for implicit operations (such as function calls and returns, pushing and popping temporaries, etc.) and for explicit memory operations that could be performed on the global heap, such as allocating text buffers and then passing a pointer to within the local stack frame to external functions or even to the OS. Not only does that invite buffer overrun attacks, but sloppy programmers open themselves up to bugs where a function exits while there are still dangling pointers to the freshly destroyed stack frame!

The stack should be used for pushing and popping values and for local scalar variables. Anything that is a buffer or has its address taken in any way needs to be allocated in a separate page such that it is protected. In the current C/C++ programming model, there is a function called alloca() which dynamically allocates a buffer on the stack. Think of alloca() as malloc() for the stack. The reason alloca() exists because it is much faster to allocate stack space (usually by just decrementing the stack pointer and perhaps adjusting a pointer or two) than to call the heavy-duty heap malloc() function. Static local arrays could in fact be allocated strictly using alloca() today. What we want is something that is faster than malloc() but still retains the security and thread-privacy of alloca(). The runtime therefore needs to expose something like an alloct() function which allocated thread-local data. This could be used both to replace stack locals, and for thread-local data that is today done using the funky FS prefixes and __declspec() syntax or TlsAlloc() calls that most programmers get confused by. C and C++ compilers for this new memory model need to be modified to always make the alloct() call and enforce this security discipline. And if a local buffer does need to be passed to another thread or to the operating system, then that should go through malloc(), period. If it is not truly private thread-local data, it can't be hiding in thread-local data.

To summarize, what I have done here is combined the concept of pages and segments to give the benefits of both mechanisms and imposed a slightly stricter memory allocation:

up to 8 million usable page selectors (per thread) that work with flat pointer arithmetic.
up to 16 million 256-byte pages (per thread).
the ability to allocate large contiguous blocks of linear address space as in existing flat memory models.
small page tables to allow fast context switching of threads and unique address space mappings per thread.
almost unlimited number of threads per process with private thread-local storage and private stacks for each thread.
most 64-bit applications can continue to use 32-bit pointers, making porting between 32-bit and 64-bit easier.
buffer overrun protection at a 256-byte granularity (for now!) for both heap allocations and local variables.
x86-style segment prefixes are gone! All addressing is done via 32-bit pointers, regardless of whether in 32-bit or 64-bit arithmetic mode.

Wouldn't you have loved a 64-bit Windows model where recompiling a 32-bit Windows application to use 64-bit registers was trivial and did not bloat data structures? Linux and Windows need to use this memory model. This memory model is possible.

I haven't yet described the hardware mechanism which would support this model (since I did after all say last week to strip most of the hardware!) so for now this memory model will have to be simulated in a software emulator and a software stack built and tested using emulation. That's ok, there are no shortage of existing PC emulators out there to use as a starting point.

"This emulated memory model will be slow!" you scream at me. Not so much. Remember, ordinary user-mode Windows code today suffers the Unsafe Code Tax of about 30% or more. Add the overhead of hardware virtualization, which is at least another 40% slowdown, and one quickly realizes that my proposed all-software solution already starts off with a free "get out of jail" card. The first two-fold slowdown of a binary translation scheme is free since that merely slows us down to the level that Virtual PC and VMware style solutions give us today when nesting user mode ring 3 on top of a sandboxed ring 0 on top of a hypervisor. I argue that a software approach will scale better on multi-core processors than today's layered hardware virtualization approach.

This is a call to action to compiler developers, operating system developers, chip designers, and runtime developers to start thinking about this memory model and help me fine tune it. That means you, Microsoft, AMD, and Intel. Enough with your proprietary ways! AMD and Intel must co-operate on common standards if they are truly serious about security issues. I encourage people to read the Rootkits: Subverting the Windows Kernel book, because if you study the various exploits described in the book, you will realize that the memory model I just described would have stopped those exploits in their tracks.

That's it for September. Thank you for reading. Among the things that I will be discussing in my October postings, I will:

Provide the actual code sequences which I have already tested in my Gemulator product to demonstrate that this model is already a viable optimization to using hardware based memory protection on today's microprocessors,
Suggest some simple changes to the hardware TLB as well as some simple new instructions that would help accelerate binary translation to the point of making hardware virtualization unnecessary,
Suggest an alternate set of 64-bit long mode instruction encodings which would help simplify the hardware decoding and software emulation of x86 instructions,
Take a look at past and present software emulation products and list out some of the goals that should set for future hypervisors.

Also in the near future, I will be looking for volunteers to help me implement these techniques in a few existing open source emulators. The ones I have in mind are PearPC (http://pearpc.sourceforge.net/), a PowerMac simulator that runs Mac OS X on top of Linux and Windows, and, Bochs (http://bochs.sourceforge.net/), a popular interpreter-based x86 simulator for running Windows.

I believe that Bochs could not only be used as the environment for testing new memory models on, but with the proper performance tuning that I'm quite experienced at doing, could serve as the basis of the actual open source Virtual Execution Runtime based hypervisor that secure customer-ready Linux kernels can be built on top of.

The game is afoot! Security and reliability can no longer be ignored.

As before, I would love to hear from you. Email me directly at darek@emulators.com or simply click on one of these two links to give me quick feedback:

Darek, keep writing, this is like bacon on my burger!

Darek, shut up and go back to your stupid Atari 400!

allocation size / method	number of blocks allocated (from the 2GB user address space)	usable allocated bytes	average allocation cost per block (ns)	slowdown (relative to 1-byte malloc)
1 byte / malloc()	about 131 million	about 126MB	108	1.0 (reference)
1 byte / VirtualAlloc	about 32 thousand	about 32K	1455	13.5
1 byte / VirtualAlloc + one-byte write	about 32 thousand	about 32K	2910	27.0
100 bytes / malloc()	about 18 million	about 1801MB	287	2.7
100 bytes / VirtualAlloc	about 32 thousand	about 3MB	1424	13.2
100 bytes / VirtualAlloc + one-byte write	about 32 thousand	about 3MB	2879	26.6
4080 bytes / malloc()	about 500 thousand	about 1978MB	2270	5.6
4080 bytes / VirtualAlloc	about 32 thousand	about 126MB	1424	13.2
4080 bytes / VirtualAlloc + one-byte write	about 32 thousand	about 126MB	2910	27.0