NO EXECUTE! A weekly look at personal computer technology issues.

(C) 2007 by Darek Mihocka, founder, Emulators.com.

September 17 2007


Follow Along in Your Textbooks
 
The following books and documents are my recommended prerequisite reading to follow along with this week's and next week's postings.

Books to brush up on virtual machines and microprocessor design: (Modern Processor Design)  (The Pentium Chronicles)  (Virtual Machines)
AMD proposed SSE5 instruction set extensions: (http://developer.amd.com/sse5.jsp)
AMD Lightweight Profiling proposal: (http://developer.amd.com/assets/HardwareExtensionsforLightweightProfilingPublic20070720.pdf)
AMD Programmer's Reference Manual: (http://www.amd.com/us-en/Processors/DevelopWithAMD/0,,30_2252_739_7044,00.html)
Intel Core 2 Errata document: (http://download.intel.com/design/processor/specupdt/31327914.pdf)
Intel SSE4 Programming Reference: (http://softwarecommunity.intel.com/isn/Downloads/Intel SSE4 Programming Reference.pdf)
Intel IA32 Optimization Manual: (http://www.intel.com/design/processor/manuals/248966.pdf)
Flexible Transactional Memory paper: (http://www.cs.rochester.edu/u/scott/papers/2007_ISCA_RTM.pdf)
Nirvana instruction tracing paper: (http://www.usenix.org/events/vee06/full_papers/p154-bhansali.pdf)
The Low Level Virtual Machine site: (http://www.llvm.org/)
Microsoft Research Singularity site: (http://research.microsoft.com/os/singularity/)
VMware paper comparing virtualization techniques: (http://www.vmware.com/pdf/asplos235_adams.pdf)

Note: The Intel Core 2 errata document does not seem to be online at this time. I will reference the version I cached a few weeks ago.


They're Digging In The Wrong Place!

If one charts the progress of the x86 architecture from the original 8088 to today's Core 2, there have been numerous layers of complexity added on top of what was at first a straightforward design. The 8088 processor was similar to the Motorola 68000 in that both processors offered a flat address space, accessible to any code at any time. This is the environment (called "real mode") that MS-DOS programs originally ran in, as did early graphical user interface based operating systems such as Windows 1.0 and software for the Apple Macintosh, the Atari ST, and the Commodore Amiga. It couldn't get much simpler than real mode. The two main problems of real mode on the PC were the limit of 640K of memory available to MS-DOS, and the ability for a program to easily shoot the system in the foot - either by accident due to a coding bug, or on purpose in the form of a virus or worm.

To overcome the memory limitation of MS-DOS, a hardware trick called "segment selectors" was introduced in the 80286 processor, which allowed an operating system to partition up to 16 megabytes of memory into segments and give different programs access to different segments of memory. Segments also allowed the operating system to specify which segments of memory could be written to by a particular program, which segments were "read-only" to a program, and which segments the program could execute code from. This environment was known as "protected mode". It allowed operating systems such as OS/2 1.0 (released by Microsoft and IBM in 1987 as an alternative to Windows) to more safely run multiple programs at once and protect them from each other's code bugs or malicious intents. Having used it for some time in the early 1990's, I still consider OS/2 1.3 to be one of the most stable operating systems that I've ever used.

As with the examples I gave last week, sometimes an engineering design decision can have unforeseen effects 50 years into the future, and in areas beyond the scope of the original design decision. As I've watched AMD and Intel do battle over the years, I am reminded of the scene in Raiders Of The Lost Ark where Indy realizes that due to an oversight his rivals were digging in the wrong place. I have become increasingly convinced that in their effort to one-up each other, and in a good faith effort to truly improve the security and versatility of computer software, both AMD and Intel have unfortunately been spinning their wheels on wild goose chases to develop technologies which are not needed and which may even impede future progress. Developer Alex Ionescu joked to me the other day that the battle over new versions of SSE is not unlike the race to add more blades to men's razors. AMD and Intel are engaged in meaningless fights.

As early as 1990, consumers already had their pick of Mac, Atari, Windows 3.0, and OS/2 based computers which sported usable multi-tasking graphical interfaces. All this on computer chips that were barely running at 1/100th the clock speed of today's PCs. What has the industry in fact done since then but make us paranoid, subject us to frequent crashes, and take our money every few years?

I am firmly convinced that much of the past 20 years' worth of progress in personal computers - from the extra complexity added to microprocessors to the entire "software stack" upon which the Windows operating system, its device drivers, its runtimes, and its applications are built - should be re-evaluated and redesigned from the ground up.

On the razor theme, engineer Jan Gray pointed out to me a phrase which he coined years ago called "Jan's Razor" (http://www.embeddedrelated.com/groups/fpga-cpu/show/960.php) which states: "In a chip multiprocessor design, strive to leave out all but the minimal kernel set of features from each processing element, so as to maximize processing elements per die." In other words: avoid adding extra hardware for something that can easily be done in software. Transmeta (http://www.transmeta.com/) was ahead of its time when it pursued the same concept almost a decade ago, building x86-compatible mobile processors based on binary translation software emulation. I purchased a Transmeta-based Sony VAIO subnotebook in late 2000 and loved it and loved the technology behind it.

By diving into multi-core in the same rushed fashion they've been diving into new technologies over the past 20 years, and by trying to make hardware virtualization work on multi-core processors, AMD and Intel are falling into the trap of duplicating unnecessarily complex hardware numerous times over. The chip makers should take heed of Jan's Razor, drastically simplify their hardware, and move more functionality to software.


The Ten Steps to Fixing the x86 Hardware Mess

The long-term survival of the personal computer ecosystem requires a vertical redesign of the entire hardware and software "stack". The revolution needs to begin with the design of the microprocessor hardware itself due to the long gestation period of creating new silicon and the fact that everything else downstream is intimately dependent on those design decisions. Let me therefore give you what I consider to be the ten critical steps that AMD and Intel must come together on:

  1. Eliminate hardware privilege levels - including the "rings" of supervisor and user states and the much hyped "hardware virtualization" technology - from the silicon.
  2. Eliminate hardware memory translation and memory protection - including page tables, the TLB, segments, segment descriptors, and hardware task switching.
  3. Eliminate asynchronous interrupts and most types of exceptions, as well as the hardware counters that can trigger those events.
  4. Eliminate redundant x86 instructions from hardware that can be recompiled as an efficient sequence of other instructions.
  5. Replace the TLB with a similar hardware structure which is explicitly queried and updated by software (a rough sketch of what that might look like follows this list).
  6. Reduce the latency of synchronization operations such as atomic read-modify-write instructions and "compare and swap" operations.
  7. Focus future hardware design (and add x86 instructions as necessary) to optimize the execution of binary translation based virtual machines.
  8. Expand the x86 specification to formally define the value of "undefined" results such that code execution is deterministic on any vendor's hardware.
  9. Split the x86 specification into a core subset of common functionality that all x86 processors must support and standardize on and which operating systems and applications can target, and into vendor-specific extensions which mainstream application software should avoid.
  10. Shift the implementation of deprecated hardware functionality to a software layer called the "VERT" (Virtual Execution Runtime). This is both to maintain the backward compatibility with existing x86 operating systems and applications, as well as to support new operating systems and programming models which will be based on vendor-agnostic standards.
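
To make step 5 a little more concrete, here is a minimal sketch in C of what an explicitly software-queried and software-updated TLB might look like to a virtual execution runtime. The structure, names, and sizes here are purely my own hypothetical illustration, not any shipping or proposed hardware interface:

    #include <stdint.h>
    #include <stddef.h>

    #define SOFT_TLB_ENTRIES 256    /* small and direct-mapped, to keep the sketch simple */

    typedef struct {
        uint32_t valid;             /* nonzero once the entry has been filled in */
        uint32_t virt_page;         /* virtual page number                       */
        uint32_t phys_page;         /* physical page number it maps to           */
        uint32_t perms;             /* read/write/execute permission bits        */
    } soft_tlb_entry;

    static soft_tlb_entry soft_tlb[SOFT_TLB_ENTRIES];

    /* The runtime queries the structure itself; a miss returns NULL and the
       runtime walks its own page tables (not shown) instead of taking a
       hardware page fault. */
    static soft_tlb_entry *soft_tlb_lookup(uint32_t virt_page)
    {
        soft_tlb_entry *e = &soft_tlb[virt_page % SOFT_TLB_ENTRIES];
        return (e->valid && e->virt_page == virt_page) ? e : NULL;
    }

    /* The runtime, not the hardware, decides when and what to cache. */
    static void soft_tlb_update(uint32_t virt_page, uint32_t phys_page, uint32_t perms)
    {
        soft_tlb_entry *e = &soft_tlb[virt_page % SOFT_TLB_ENTRIES];
        e->valid     = 1;
        e->virt_page = virt_page;
        e->phys_page = phys_page;
        e->perms     = perms;
    }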

Certainly do not accept this list on blind faith, although I expect you to be convinced by the time you reach the end of today's post. I am simply connecting the dots found in published research, message boards, and my own 20 years worth of experiments in virtual machines.

Before I go into the details of the logic for each of these 10 steps, understand that the end goal is to bring AMD and Intel around to the Transmeta approach of chip design - to define a standardized software programming model for which operating systems and applications are written, separate from the vendor-specific changes that AMD and Intel will still be free to pursue. You can think of the virtual execution runtime as a microkernel or hypervisor of sorts which separates the abstraction from the hardware. I envision that AMD and Intel would in fact write their own virtual execution runtimes, perhaps even provide them in flash ROM on the processor. This way AMD and Intel can compete against each other based on a common standard, not random self-serving specifications that they create as they see fit. Companies such as Microsoft and VMware could also easily adapt their hypervisors to this new approach. Not waiting around for those guys, last month I also stated my intention to develop an open source reference version of such a runtime (http://www.emulators.com/gemul8r.htm).

As with any hardware change, there is a long domino effect of consequences for operating systems, applications, compiler tools, and third-party add-ons. This ties into the other major point I want to make clear right now. The changes to the software are not gated by the hardware changes above. Steps #9 and #10 are purely software specifications that can be implemented on today's existing AMD and Intel hardware by avoiding the use of the deprecated hardware functionality. Around the world, people are already working on fixing the software stack, through research into virtual machines, root cause analysis, dynamic code instrumentation, and new operating systems.


Back To The Future

Hopefully you haven't already forgotten what I said about OS/2. Studying the past can show us the road forward, grasshopper. I was a college intern at Microsoft at the time that OS/2 (dubbed "the soul of the new machines") was unveiled to the world in the spring of 1987, on or about April Fool's Day if I am not mistaken. At the same time, IBM announced a redesigned PC based on the Intel 80286 microprocessor called the "PS/2" (http://en.wikipedia.org/wiki/IBM_Personal_System/2). PS/2 stands for "Personal System 2", not "Playstation 2", by the way.

OS/2 and the PS/2 were launched just months after Compaq (who later merged into HP) announced the "COMPAQ 386", the truly killer desktop machine of the time costing upwards of $20,000 a pop. IBM received a lot of criticism over going with the older 16-bit 80286 (the "286") instead of jumping on the bandwagon and using the 32-bit "386" microprocessor. The 386 delivered not only faster clock speeds and more efficient instruction timings which made existing MS-DOS software run much faster than on the 286, but it also introduced 32-bit registers, 32-bit addressing, demand paging (which is the technology behind virtual memory and swap files), and the ability to efficiently multitask both 16-bit and 32-bit programs.

At the time, IBM's decision seemed foolish as paying customers and software developers flocked to an ever increasing number of 386-based computers offered by Compaq and others. IBM was one of the last computer makers to make the switch to 386, by which time the PS/2 had failed and IBM faced the end of its dominance over the PC industry. Compaq, then Gateway 2000 (later just "Gateway"), then Dell, then HP, and finally Apple would go on over the years to become the successors to IBM, which has since sold off the personal computer business it created to China.

In hindsight, IBM may have lost the battle for the wrong reasons, since for a full decade after the 16-bit vs. 32-bit battle began, most PC users did in fact simply use the 386 and its successor, the 486, to do nothing more than run 16-bit software faster. All versions of MS-DOS are 16-bit. The versions of Microsoft Office that most people used well into the late 1990's were 16-bit, as were the Lotus 1-2-3 and WordPerfect applications. There was nothing fundamentally wrong with 16-bit applications or 16-bit operating systems for what users needed at the time. Even the Apple Macintosh at the time, although based on a 32-bit microprocessor, was running what was really a 16-bit operating system.

The false perception that "32-bits must be better than 16-bits" made the world 32-bit crazy, even though the technical basis for this was not really understood by the average computer user. Software developers rushed to port their applications to 32-bit operating systems based on questionable technical reasons.

One of the main arguments made against 16-bit microprocessors in general was the amount of physical memory supported. A PC based on a 286 had a limit of 16 megabytes of physical memory, the same limit as the 68000-based Atari ST, the Apple Macintosh, and the Commodore Amiga at the time. This was seen as a problem, even though ironically the average Atari ST or Apple Macintosh or PC didn't even have more than 16 megabytes of memory until well into the mid-1990's. So this argument was bogus for many years.

A similar argument came from programmers, claiming that it was too difficult to write 16-bit code due to the maximum size of 64 kilobytes for any given segment of memory. On MS-DOS, on Windows, on OS/2, and on the Mac, the operating systems required that code be broken up into small segments of 32K or 64K, and that data be similarly broken up into these small segments. The various programming models on Mac and PC generally worked by taking two integers and combining them to get a "linear address" which was then used to access memory. On Intel-based environments like Windows, OS/2, and MS-DOS the first integer was a 16-bit "segment selector", which was provided by the operating system when the program asked to allocate a block of memory (of up to 64K in size). The program then combined the selector with a 16-bit offset (or "near pointer") to generate the linear address. The combination of selector and offset is also called a "far pointer" and is written in hexadecimal notation that separates the selector from the offset, for example, "2347:0031". The Mac had a similar scheme involving the A4 and A5 registers, the details of which I will skip.

If a program asked the operating system for a block of memory larger than 64K, the operating system returned several segment selectors. The integer values of the segment selectors were not numerically contiguous - they were just a set of 16-bit integers. Far pointers using the first selector were used to access the first 64K of the allocated block, the second selector the second 64K, and so on. Although a far pointer can be thought of as a 32-bit integer, ordinary 32-bit arithmetic cannot be used on far pointers. For example, allocate 128K and get back two 16-bit segment selectors, say, 0x1234 and 0x5678. The first 64K of the 128K is addressed using the 32-bit far pointer values 1234:0000 through 1234:FFFF, while the next 64K is accessed using values 5678:0000 through 5678:FFFF. An attempt to access, say, 1235:0000 could easily crash the program.
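
For readers who never wrote 16-bit code, here is roughly how that looked in C (a sketch written in the style of the old 16-bit DOS compilers; the far keyword and the MK_FP macro were compiler extensions of that era, not standard C, so this will not build with a modern compiler):

    /* A far pointer packs a 16-bit selector and a 16-bit offset side by
       side in a 32-bit value, but it is NOT a 32-bit number you can do
       ordinary arithmetic on. */
    #define MK_FP(seg, off) ((void far *)(((unsigned long)(seg) << 16) | (unsigned)(off)))

    char far *lo = MK_FP(0x1234, 0x0000);   /* first 64K of the 128K block    */
    char far *hi = MK_FP(0x5678, 0x0000);   /* second 64K; unrelated selector */

    /* Far pointer arithmetic only changes the 16-bit offset; it never
       carries into the selector, so walking off the end of the first
       segment does not land you in the second one. */

    /* And inventing a selector with 32-bit math is fatal in protect mode:
       1235:0000 is simply not a valid descriptor. */
    char far *bogus = MK_FP(0x1235, 0x0000);    /* faults when dereferenced */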

In other words, although the operating system had access to the whole 16 megabytes of physical memory, any given program was given access only to specific portions of that memory via these little 64K windows called segments. A segment selector was used to reference such a window of memory, but the numerical value of the segment selector need not have any correlation to the actual physical address of that memory. A segment could also be set by the operating system to have a maximum size below 64K. For example, a segment could be as small as one byte, thus giving protection against the common programming bug known as the "buffer overrun" or "buffer overflow". If a program asked for 50 bytes of memory, it got a selector to a memory window 50 bytes wide. If it tried to specify an offset to the 51st byte of that segment, the 286 hardware stopped the program in its tracks. This is actually desirable, since it indicates a programming error and nips the buffer overrun in the bud before it causes any further damage.

This protection was not available in MS-DOS, only in OS/2, since the 286 processor had two modes of operation: real mode and protected mode. Only in what is called the "protected mode segmented addressing" programming model used by OS/2 is this buffer overrun protection enforced by the hardware. Protect mode segmented addressing works quite simply: the segment selector is really an integer index into a lookup table of "segment descriptors", which contain the address and width of each of the windows into physical memory. The 286 hardware verifies that the particular selector in a far pointer indexes a valid descriptor, and checks that the offset portion of the far pointer is within the maximum size of the window. The 68000 and 68020 based versions of Mac OS did this same trick, in that the Mac OS returned "memory handles" which can be thought of exactly as segment selectors. On the Mac, the conversion of handle to physical address was done manually by the compiled code, not by hardware, but the idea is almost identical. The hardware on the 286 did additional checks, such as verifying the offset and verifying the permission to access the memory window, which did not show up in Mac OS until the 68030 based Mac models.
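
Here is a conceptual model, in plain C, of the check the 286 performs in hardware on every far-pointer access. The structure layout and names are simplified illustrations of mine, not the actual descriptor format documented in the Intel manuals:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t base;      /* where the window starts in physical memory  */
        uint16_t limit;     /* size of the window minus one (64K at most)  */
        uint8_t  rights;    /* read/write/execute permission bits          */
    } descriptor;

    static descriptor descriptor_table[8192];   /* indexed by the selector */

    /* Conceptually, this is what happens on every access through a far
       pointer; in reality the 286 does it in hardware, for free. */
    static bool access_ok(uint16_t selector, uint16_t offset, uint8_t wanted)
    {
        const descriptor *d = &descriptor_table[selector >> 3];  /* low bits are flag bits */
        if (offset > d->limit)
            return false;                       /* buffer overrun: fault the program */
        if ((d->rights & wanted) != wanted)
            return false;                       /* wrong permission: fault as well   */
        return true;                            /* physical address = base + offset  */
    }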

LESSON LEARNED: the segmented memory programming model was the dominant programming model used for MS-DOS, for OS/2, for classic Mac OS, and for versions of Windows 1.x, 2.x, and 3.x. Hundreds of popular 16-bit programs were written and distributed to millions of computer users throughout the 1980's and first half of the 1990's - Lotus 1-2-3, WordPerfect, Photoshop, Word, Excel, PowerPoint, Works, PKZIP, and others. These applications had the benefit of automatic hardware-based protection against common programming errors such as the buffer overrun.

Before I continue, I hope that I've convinced you at least a tiny little teeny bit that there was really nothing fundamentally wrong with 16-bit operating systems or the 16-bit programming model. The fact that newer 32-bit microprocessors were faster was unrelated. 32-bit chips were faster because, well, they were newer and used new superscalar pipelining techniques (read the Modern Processor Design book for details). We can agree though that the segment based programming model itself was sound and offered a strong level of security against accidental programming bugs. Do we agree? Good. Carry on then.

Yet still programmers whined. Compiler vendors including Microsoft, Borland, and Watcom responded by introducing language extensions such as "based pointers" and other tricks to hide the details of segments. But that still did not soothe programmers. They wanted 32-bit addressing and 32-bit registers as provided by the 386.

And so the rift developed between IBM and Microsoft - IBM sticking to the 286 and 16-bit programming models, Microsoft pushing for the 386 and a 32-bit programming model. IBM did eventually cave in and release OS/2 2.0 in 1992 to support the 386, but it was too late. Microsoft was already developing its own 32-bit operating system, released in 1993 as "Windows NT". NT was originally called "OS/2 3.0", in a manner not unlike the SSE naming wars between AMD and Intel today. Sigh.

Unfortunately by 1996 the world was running on 32-bit 486 and Pentium based PCs and 32-bit 68040 and PowerPC based Macs, choosing Windows 95, Windows NT 4.0, OS/2 Warp, Mac OS 7.6, and the young Linux as their 32-bit operating systems of choice at the time. Out went segments, and in came the flat memory programming model and demand paged virtual memory. This was probably one of the greatest mistakes in computer science history.

Just as the world was diving head first into this 32-bit utopia, a party crasher unexpectedly shows up at our front door. It's none other than our good friend the Internet. Uh-oh!


Lessons Forgotten

Thanks again to Alex for reminding me to dive into my textbooks and brush up on my operating system theory. I dug into my personal library of books and publications accumulated over the past 30 years. One very fascinating book I revisited is called "Inside OS/2", written in 1988 by Microsoft's OS/2 chief architect Gordon Letwin. There is a terrific paragraph in there discussing one of the design goals of OS/2:

"Today, personal computers are being used as a kind of super-sophisticated desk calculator. As such, data is secured by traditional means - physical locks on office doors, computers, or file cabinets that store disks... Protection is not needed because the machine is secure and operates on data brought to in by traditional office channels. In the future, however, networked personal computers will become universal and will act both as the processors and as the source (via the network) of the data. Thus, in this role, protection is a key requirement and is indeed a prerequisite for personal computers to assume that central role." - Gordon Letwin, 1988

In other words, Gordon was pointing out that desktop computers of the mid-1980's, even in offices, were considered secure if they were locked up. Networking was not common yet (and certainly not wireless networking!) and the only means to install software on the machine was generally via floppy disk. Most floppy disks that users would have inserted into the machine were commercial store-bought software such as MS-DOS and Lotus 1-2-3 which were trusted to work as advertised and to not contain any malicious code. And hopefully you trusted that people who had access to the locked up computers did not have malicious intent.

A few early computer viruses did exist at the time. I remember the floppy disk "key virus" on the Atari ST, which resulted from my, shall we say, promiscuous exchange of floppy disks with fellow Atari users. The virus stored itself in the floppy disk's boot sector and loaded into memory if the floppy was in the disk drive at the time the computer booted. The virus only survived in memory for as long as the power was switched on, so for the virus to propagate it needed the user to remove the infected floppy disk and then insert a fresh floppy disk before rebooting. The key virus was a proof of concept that viruses could spread but was not very malignant.

Gordon Letwin had enough insight twenty years ago to realize that we needed to think ahead to the time when computers would not be locked up; a time when computers would be connected to each other in ways which allowed third parties who were not trusted to have access, and thus, to infect our computers with malicious computer code. I think Gordon's insight helped make OS/2 the great operating system of its time.

There was even a very clever compatibility mode in OS/2 which allowed unmodified MS-DOS compatible applications to run in 16-bit protect mode on OS/2. One of the limitations of MS-DOS is that 16-bit real mode does not enforce segment limit checks. Therefore when an application allocates a block of memory, even if each block of memory is given a unique segment selector, each selector can be used to access a full 64 kilobyte range of memory. Yet in real-mode MS-DOS, consecutive selectors map to physical memory only 16 bytes apart! If the application was written properly so as to not perform the funky 32-bit pointer math, that code could be run unmodified in protect mode on OS/2 where segment limits were in fact enforced. Beautiful!
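
The arithmetic behind that remark is worth spelling out, since it is the entire difference between the two modes (a small sketch in C; the numbers reuse the earlier 2347:0031 example):

    /* Real mode: the physical address is simply segment * 16 + offset,
       so consecutive segment values overlap every 16 bytes. */
    unsigned long real_mode_linear(unsigned short seg, unsigned short off)
    {
        return ((unsigned long)seg << 4) + off;     /* seg * 16 + off */
    }

    /* 2347:0031  ->  0x23470 + 0x31 = 0x234A1                           */
    /* 1234:0010 and 1235:0000 both name physical byte 0x12350 - exactly */
    /* the overlap that protect mode's descriptor lookup does away with. */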

So taking today's popular concept of virtualization, OS/2 already did an even better trick 20 years ago, using the 286 microprocessor to take an existing MS-DOS application and virtualize it in a way that made it more secure than it originally was! I returned to Microsoft in the fall of 1987 as an intern working on the OS/2 port of their Multiplan application (the original MS-DOS based predecessor to what later became Microsoft Works and Microsoft Excel). Using this virtual MS-DOS mode on OS/2 we were able to find buffer overrun bugs that would have easily been missed had we been developing and testing on MS-DOS.

LESSON LEARNED: Keep this principle in your head because I am going to keep repeating it - virtualization has the ability to perform additional checks to find bugs and to catch malicious code that would have otherwise gone undetected when executing natively.

Gordon also gave the world a different kind of warning while developing OS/2. The problem he was solving was two-fold. First, some MS-DOS applications were written to actually rely on memory segment overlap. They relied on the funky pointer math to overrun from one segment to another, and when run under OS/2 in protect mode those applications of course crashed. The solution then was to context switch the 286 over from protected mode into real mode to execute the application in a real MS-DOS type environment. Oops! Intel had not designed a way for the 286 to context switch from protect mode back to real mode short of hitting the power switch and restarting the computer! Why somebody would need to escape from protected mode had not been foreseen. Gordon persevered and discovered a bug in the 286: causing an intentional "triple fault" crash rebooted the 286 chip back into real mode, ready to run MS-DOS milliseconds later!

LESSON LEARNED: This bug of the 286 is not unlike the numerous Core 2 memory bugs that the Unix community raised alarms about earlier in 2007. It is a very real threat today that somebody could use a CPU bug to reboot an operating system on the fly and hand over control to malicious code.

Throughout the 1990's, both Intel and Microsoft continued to evangelize the wonders of 32-bit computing. After its split from IBM, Microsoft developed its OS/2 successor - Windows NT. Diving into another book on my shelf, not surprisingly called "Inside Windows NT", one can see that the design of Windows NT borrows much from OS/2 in the name of security. NT uses a model similar to OS/2's to isolate various parts of the operating system into separate address spaces and even run them in user mode. For example, in Windows NT 3.1 (the first release of NT, numbered to match the MS-DOS based Windows 3.1) video drivers originally ran in user mode.

I finally became a fan of NT in 1995 with the release of Windows NT 3.51. It was not only highly optimized to where it felt as fast and crisp as using OS/2, but it was also living up to its claim of being a secure and portable server operating system. Windows NT was being ported to the MIPS, Alpha, and PowerPC RISC architectures. It is quite unfortunate that Microsoft didn't pursue the pseudo-32-bit mode (described below) for their other Windows product line, or for that matter, just rebrand Windows NT 3.51 _as_ Windows 95.

Unfortunately Microsoft had forgotten a lot of Gordon Letwin's good advice by 1995. Just as our friend the Internet was walking up the driveway to crash the party, design decisions were already being made which would go on to plague Windows right up until today.


The World Goes Flat

In the last chapter of "Inside OS/2", Gordon talks about the 386 microprocessor and what capabilities it could bring to future versions of OS/2 or other future operating systems. The triple fault trick could be replaced on the 386 by an actual documented method of switching in and out of protect mode. A new "virtual 86 mode", or V86, allowed MS-DOS to be run in protect mode but using the segmentation model of real mode, eliminating the security hole of even switching to real mode. 48-bit segmented addresses, consisting of a 16-bit segment selector and a 32-bit offset, promised an almost unlimited amount of address space.

One design dilemma that he hints at is whether 32-bit operating systems should expose a 32-bit "flat memory model" or stick with segmentation and use 48-bit pointers. His choice was for OS/2 to stick with segments and 48-bit far pointers. The Windows NT guys went the other way, doing away with segments in the programming model and just using 32-bit offsets into one large 4-gigabyte segment. Each running program in Windows NT "sees" its own unique 4 gigabytes of address space, which the operating system configures via hardware page tables to map to different areas of physical memory. Out goes the concept of variable-sized (but no larger than 64K) segments. In comes the concept of 4K "pages" of memory, where the operating system, with the support of the 386 hardware, can map each of the million pages that a program sees.

Pages don't have size limits as segments do, so even if a program asks for 10 bytes of memory it receives a pointer to at least 4096 bytes of valid memory that it can scribble on - 4086 bytes of which it had better not touch!

In some ways, a 32-bit protected flat memory model is really not that different from 16-bit protect mode segmentation in OS/2. 32 bits are needed to describe a memory location. With OS/2, the 32 bits break down 16:16 as selector:offset. In NT, the 32 bits break down 20:12, with the upper 20 bits representing an integer index into the million or so pages in the address space, and the bottom 12 bits being the offset. So if you think of pages simply as fixed-size 4K segments, and page indexes as the equivalent of a 20-bit selector, the two memory models appear VERY similar - 16:16 in one, 20:12 in the other.
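
In C terms, the two decompositions of the same 32-bit value look like this (a sketch only; in real life the selector comes from the operating system and the page split is done by the hardware page tables):

    void split_address(unsigned long addr)
    {
        /* OS/2 16-bit protect mode: 16-bit selector, 16-bit offset */
        unsigned short selector = (unsigned short)(addr >> 16);
        unsigned short offset16 = (unsigned short)(addr & 0xFFFFu);

        /* Windows NT flat model: 20-bit page index, 12-bit offset into the 4K page */
        unsigned long page_index = addr >> 12;
        unsigned long page_off   = addr & 0xFFFu;

        (void)selector; (void)offset16; (void)page_index; (void)page_off;
    }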

Here is the catch, the critical flaw of the flat memory model. The 32-bit flat memory model is not secure for exactly the same reason that the 16:16 real mode memory model of MS-DOS is less secure than the 16:16 protect mode in OS/2. Even if you treat the 20-bit page index as a segment selector, there is no bounds checking on the offset. If byte offset 0 into a page can be accessed, then so can offset 4095. As with MS-DOS, 32-bit pointer arithmetic works such that when you overflow past the end of one page, you slide right into another page. Pages don't overlap as segments in MS-DOS do, but they do sit right next to each other in the address space. So when a serious buffer overflow happens, it can overwrite untold amounts of memory until it finally hits a page that is not accessible. For large modern applications that literally allocate tens of megabytes, even hundreds of megabytes, of contiguous address space, it is very easy to construct a valid 32-bit pointer that will overwrite something.
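
A sketch of why that matters in practice (this is the bug being described, not code to imitate):

    #include <stdlib.h>

    void scribble(void)
    {
        char *buf = malloc(10);             /* the program wanted 10 bytes...        */
        for (long i = 0; i < 1000000; i++)
            buf[i] = 'X';                   /* ...but hardware protection is per 4K
                                               page, and the pages sit back to back,
                                               so nothing faults until the scribbling
                                               finally reaches an unmapped page      */
    }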

And this very simple flaw of the memory model is exactly the kind of attack method that viruses and worms and rootkits have been using to attack Windows applications and the Windows NT kernel itself.

"But Windows NT is a microkernel architecture that isolates the kernel components from each other" you exclaim in protest! Well, not quite. NT was never a pure microkernel along the lines of say, the Mach microkernel which was developed by Rick Rashid (http://www.microsoft.com/presspass/exec/rick/default.mspx), head of Microsoft Research. In the transition from NT 3.51 to NT 4.0, Microsoft made a design decision explained here (http://www.microsoft.com/technet/archive/ntwrkstn/evaluate/featfunc/kernelwp.mspx?mfr=true) to move video drivers from the microkernel model of user mode ring 3 space and more toward the monolithic kernel model of ring 0. While I remember a lot of discussion in the press at the time about this possibly destabilizing Windows NT, Microsoft's article assures people that:

"With Windows NT 4.0, it remains true that if application code can crash the system, Windows NT has a bug, period. In other typical PC-based operating systems, because of architectural choices inherent in their designs, application code can crash the operating system even when the system code is flawless."

As Windows NT morphed into Windows 2000, then into Windows XP, and finally Windows Vista, device drivers running in kernel mode ring 0 have continued to plague people with "blue screens of death". The ability for anybody to write a device driver, which then gets inserted into the kernel's address space, is a problem not only with Windows but just as much with Linux. I ran across this 1992 email thread between well known operating systems guru Andrew Tanenbaum and the then barely known author of Linux, Linus Torvalds: (http://www.oreilly.com/catalog/opensources/book/appa.html). In the heated debate, Linus defended the position - the same position that Microsoft took in moving NT further away from a microkernel architecture - that monolithic operating systems which run the whole kernel in a single unprotected address space are easier to write and execute faster than microkernels.

LESSON LEARNED: The mentality of the 1990's appears to have been all about simplifying the job of the programmer and delivering the highest possible speed, which did well to bring software to market quickly in the Internet age and drive the "dot-com bubble". The sense of security in software was just an illusion back then.


Overlooked Solution?

Just as an aside, I think that IBM and Microsoft might have overlooked a very simple compromise. The choices in programming models so far have been 16-bit real mode, 16-bit segmented protected mode, 32-bit flat model protect mode, and the proposed 48-bit segmented protect model.

What these models all share is the common programmer misconception that "sizeof(int) == sizeof(pointer)", or in layperson's terms, that integer data and pointers to memory must be the same size. As any computer science student should know, the size of data and the size of the address space are separate. In fact, the 386 microprocessor actually DID provide separate control bits to toggle between using 16-bit and 32-bit data registers, and between using 16-bit and 32-bit addressing. Some MS-DOS programs took advantage of this to make use of 32-bit data registers (in order to perform arithmetic calculations faster) while still using the standard 16-bit real mode addressing model of MS-DOS. This was perfectly legal even in MS-DOS.

So what everyone overlooked was the very simple fact that OS/2 (and Windows 3.x for that matter) could have been extended to stay with the 16-bit protect mode segmented addressing model but provide use of the full 32-bit width of the integer registers - a "pseudo-32-bit mode" where integers would be 32 bits wide, and pointers (still the 16:16 far pointers) would be stored in memory as 32 bits but dereferenced in the 16:16 selector:offset style. I believe that this could have extended the life of OS/2 and Windows 3.x for another decade until 32-bit or even 64-bit programming models were fully worked out.
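
A trivial C program makes the point that the two sizes are independent (under an old 16-bit large-model compiler this prints 2 and 4, on a 32-bit flat target it prints 4 and 4, and under the hypothetical pseudo-32-bit model described above it would print 4 and 4 while pointers remained 16:16 far pointers):

    #include <stdio.h>

    int main(void)
    {
        /* sizeof(int) and sizeof(void *) are set by the programming model,
           not by each other. */
        printf("sizeof(int) = %u, sizeof(void *) = %u\n",
               (unsigned)sizeof(int), (unsigned)sizeof(void *));
        return 0;
    }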

AMD and Intel did later add PAE ("Physical Address Extensions"), which allows for additional bits of physical address space. This could have allowed an operating system using a 16-bit segmented memory model to access, say, 64 megabytes or even 256 megabytes of physical memory - enough for the average personal computer user's needs well into the early 2000's. In hindsight, a great deal of the misery inflicted on the world by 32-bit flat addressing memory models could have been avoided.

Am I the only one to realize this obvious blunder?


Who Do You Trust?

When put up against Gordon Letwin's design principles for OS/2, and even against Microsoft's bold claim above, the anti-microkernel folks simply don't have an argument when security and reliability matter. When we live in a world where a hacker can sit in a shopping mall parking lot with his laptop and wirelessly steal credit card numbers from cash registers, I'd say security should matter. Fundamentally the philosophy of OS/2 is that any piece of code that is not "trusted", i.e. any piece of code that is supplied by a third party such as a device driver, web browser plug-in, or application, MUST be treated as buggy and/or malicious code. Period.

The Windows 95 family (which includes Windows 98 and Windows Millennium) is the worst designed architecture of the 1990's. Unlike NT, which at least tries to protect the kernel's address space from user mode programs, Windows 95 simply maps the kernel memory into the upper 2 gigabytes of every 32-bit Windows process. Every piece of kernel data, including shared memory regions being shared privately between two programs, is actually visible to ALL programs. Windows 95 also made no effort to block user mode programs from directly accessing hardware. I can understand doing this for 16-bit programs running in MS-DOS compatibility mode, but why allow it for 32-bit Windows applications? Any Windows program running on Windows 95 is able to "see" the hardware ports - things like the serial port, the PC speaker, the network card, etc.

Is it any wonder then that by the time broadband Internet and DSL became popular during Windows 98's reign, computer viruses started spreading like crazy? As Microsoft finally killed off the 16-bit Windows line and standardized on the NT kernel for everybody (in the form of Windows XP in 2001), I posted this blog entry (http://www.emulators.com/secrets.htm#VirusMagnet) urging people to simply not use Windows 98 on the Internet. Six years later, I am afraid that security is really still not the number one priority of software developers. And certainly switching away from Windows 98 to either Windows XP or Linux is not the security blanket I hoped it would be either.

This brings us to the next piece of the trust puzzle. As Windows 95 launched and applications were being recompiled from 16-bit mode to 32-bit mode, buffer overrun bugs which would have been caught in a segmented protect mode environment were missed in 32-bit mode. The next time you happen to get your hands on a computer that is still running Windows 95 or Windows 98, run a 32-bit application and bring up a dialog box that requires text input, such as the "File Open" dialog box. When prompted for a filename, type in C:\X and then just hold down the X key for a minute such that you end up typing in a ridiculously long file name. Then press Enter. You might be surprised to know just how many applications, even Microsoft's own products, blow up at that point because the extremely long file name triggers buffer overrun bugs in the application.

A buffer overrun is dangerous in that programs internally ask the operating system for blocks of memory to hold data, such as, well, file names. Many programs ported over from MS-DOS were written to assume an "8 dot 3" file name, such as AUTOEXEC.BAT. A character of text, at least for ASCII text, requires one byte of memory per character. Plus one extra byte which, depending on the compiler, is used either to hold the size of the text (known as the "string length") or as a "null terminator" (a byte containing the value zero) which signals the end of the text. The dot is implicit, so to properly store an MS-DOS file name as a text string in memory requires 8 + 3 + 1 = 12 bytes of memory. This was true in MS-DOS, and was true in pure 16-bit versions of Windows such as Windows 3.1. Path names, the "C:\Programs\Bin"-type directory paths, also had a limited size in MS-DOS of (off the top of my head, don't hold me to this) about 64 characters. In Windows 95 that limit jumped up to 240 characters.

What happens when such a 16-bit piece of code, which allocates 64+12 bytes of memory to hold an entire path and file name, is recompiled into 32-bit code and run on Windows 95 or on NT? For as long as the user does not type in a path or file name that exceeds the MS-DOS limits, the code works. But type in that 300-character-long path name and it's usually goodbye.
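
The pattern being described looks roughly like this (an illustrative sketch of the bug, with made-up constant names; do not write code like this):

    #include <string.h>

    #define DOS_PATH_MAX  64    /* the old MS-DOS-era assumption              */
    #define DOS_NAME_MAX  12    /* "8 dot 3" file name plus its terminator    */

    static void open_file(const char *user_typed)
    {
        char path[DOS_PATH_MAX + DOS_NAME_MAX];     /* 76 bytes               */
        strcpy(path, user_typed);   /* no length check: a 300-character name
                                       overruns the buffer and tramples
                                       whatever happens to sit after it       */
        /* ... */
    }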

This would not be so bad if it was isolated to individual buggy programs. But hackers quickly realized that if they studied these crashes and figured out what they corrupted - perhaps a pointer, perhaps a function return address - they could intentionally type in a specific piece of text so as to inject exact values into the buffer they were overflowing. In fact, they could even trick YOU into causing the buffer overflow itself. For there are many programs that need to read text from external sources, not just from the keyboard as you are typing. For example, a web browser receives text input from web pages all the time. Every time you click on a web link the browser reads in a URL (an Internet path name of the form "http://www.emulators.com/") into an internal block of memory. Since early web browsers were written much like MS-DOS programs, they assumed some arbitrary limit to how long a URL would be. The numbers 80 and 256 are common constants that are used throughout computer code for sizes of text buffers. 80 is chosen sometimes because 80 used to be the number of columns of text on a screen (and surely NOBODY would ever type in a file name or URL longer than one line of the screen?) while 256 is chosen as a nice round binary number. Either one is a programming error.

And so malicious web site operators learned to hijack web browsers and web browser plug-ins. More recently hackers have realized that almost ANY program is susceptible to trust issues. Any kind of data - whether text, an MP3 file, or a JPG photo file - can be tampered with such that the programs loading them malfunction. Tamper with a JPG file, for example, so as to make the dimensions of the photo appear huge (say, a billion pixels across), and you can do anything from making an application run out of memory (least annoying) to having that application tell the video driver to switch to a billion-pixel display mode and blue screen the system (very annoying!).

Some programs are always running, always "on", whether you see them or not. For example, the TCP/IP driver in Windows is up and running whenever the computer is connected to something like a DSL connection to the Internet. TCP/IP drivers, guess what, contain buffer overrun bugs. Hackers have even learned how to infect your computer simply by "pinging" it (from across the Internet) with a bogus TCP/IP message, or "packet". The number of security holes in Linux, in Windows, and yes, even Mac OS, is just stunningly huge.

And until Windows Vista, the default Windows user in XP and earlier had full administrative privileges. So any untrusted code or data could generally do its dirty deeds with full system administrator access to the computer!

LESSON LEARNED: The security problems in personal computers are not just hardware design issues or choices of programming model. Sometimes the problem is as simple as an operating system that leaves its front door wide open!


Safety Without Pointers

Last week I discussed how engineering decisions can have long term ramifications even decades into the future. The rush into 32-bit flat model software development just as the Internet and network connectivity was picking up steam was a recipe for disaster. That was obvious by the end of the 1990's as computer science started to look at solutions to the problem and once again rediscovered some old concepts.

One rediscovered solution was the use of "managed" computer languages such as Java and C#. This was a reimplementation of a 1970's idea called "P-code" or "bytecode" (http://en.wikipedia.org/wiki/Bytecode), in which computer code is compiled not into 8086 or 68000 machine language, but into a virtual machine language known as a bytecode. The microprocessor does not directly execute bytecode. Instead, something very much like an Excel macro interpreter is used to simulate the execution of the P-code. The original release of Java used interpreted Java bytecodes. Interpreting bytecode can be very slow, even 100 times slower than native machine code, so interpreters are often replaced by dynamic binary translation compilers called "just-in-time compilers" or simply "JIT". The microprocessor natively executes that "jitted" code. When the JIT is also accompanied with a runtime library that provides services such as memory management or garbage collection, that whole package is referred to as a "virtual machine". You are probably familiar with the Java virtual machine, which is the jitted runtime environment for the Java language. In fact, the Java virtual machine, a virtual machine to run Mac OS on top of Windows, or the latest VMware Fusion virtual machine to run Windows on top of Mac OS are really all different variations on the same theme of virtual machines.
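
To make the interpreter/JIT distinction concrete, here is a toy bytecode interpreter in C. The three-opcode instruction set is entirely my own invention for illustration; it is not Java bytecode or MSIL:

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    /* An interpreter simulates each bytecode one at a time. */
    static void run(const uint8_t *code)
    {
        int32_t stack[64];
        int sp = 0;
        for (;;) {
            switch (*code++) {
            case OP_PUSH:  stack[sp++] = *code++;              break;
            case OP_ADD:   sp--; stack[sp-1] += stack[sp];     break;
            case OP_PRINT: printf("%d\n", stack[--sp]);        break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void)
    {
        const uint8_t program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(program);   /* prints 5; a JIT would instead translate these
                           bytecodes into native x86 once and execute that */
        return 0;
    }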

Microsoft C/C++ 7.0 for MS-DOS had the ability to emit P-code years ago, and P-code was used in shipping Microsoft applications. The son-of-P-code is a bytecode called "MSIL" (Microsoft Intermediate Language). MSIL is always compiled and runs on top of a runtime library called the CLR ("Common Language Runtime") in a system that is collectively known as .NET ("dot-net"). .NET supports not just a single language such as C# (the Microsoft equivalent of the Java language) but also managed versions of C++ and Visual BASIC, to name a few. The compilers for the various .NET languages all generate the same common MSIL bytecode, so by the time that bytecode is jitted by the CLR it has really become irrelevant which managed language it was compiled from. An open source version of .NET called Mono is also available.

Both Java and .NET virtual machines are portable, running exactly the same Java bytecode or MSIL bytecode whether on 32-bit Windows machines or on 64-bit Linux machines, or in the case of .NET, even on the PowerPC based Xbox 360. The common bytecode that is used by the various managed languages is completely abstracted away from the actual host microprocessor that the virtual machine runs on.

LESSON LEARNED: This is the core concept of a VM. A virtual machine is used to isolate the actual "host" microprocessor hardware from the "guest" programming model that the software developer originally targeted.

Another powerful feature of managed languages such as C# and Java is how they solve the pointer arithmetic trap by simply not exposing pointer arithmetic at all! By eliminating some syntax from C++, languages such as C# and Java prevent the ways that C and C++ programmers commonly shoot themselves in the foot. Unlike C or native C++, you cannot construct an arbitrary 16:16 or 20:12 pointer. Instead, you use "references to objects", which are still really pointers, ha! But references can only be passed around or used indirectly. So in a way, the reference to an object is similar to a segment selector or the page index. A particular data field within an object is accessed by adding a small offset (usually well under 16 bits in size) to the reference, which is done strictly by the compiler, not the programmer. This is a very close analogy to how the classic Mac OS memory handles and OS/2 segment selectors were used to access memory. All three systems - managed language references, Mac OS memory handles, and OS/2 segment selectors - even allow the operating system to move the objects around, such as during a heap compaction or garbage collection, without the running code having to care.

Another common property of virtual machines is the performance profile of the jitted code - whether it be a Java VM, or .NET, or a Macintosh emulator running on top of Windows, or a Windows hypervisor such as VMware Player. Virtual machines tend to perform a lot of those bounds checks, which means a lot of numeric comparisons and branches. Virtual machines also tend to execute very short sequences of code and then jump to another block of code. This requires a microprocessor with certain characteristics, generally one with a shorter pipeline, many integer execution units, good branch prediction, and large caches. The Intel Core 2 is exactly such a processor, with characteristics friendly to virtual machines. The AMD Opteron and the Athlon before it are also quite good. What I loved about the Athlon when it came out in 1999 was that it was essentially a faster Pentium III, slightly more friendly to my emulation code than the Pentium III of the time.
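
For a flavor of where those comparisons and branches come from, this is conceptually what jitted code does for every managed array access a[i] (a sketch in C; a real JIT emits the equivalent x86 instructions directly, and throw_index_out_of_range here stands in for whatever exception helper a particular runtime actually uses):

    #include <stdint.h>

    extern void throw_index_out_of_range(void);   /* hypothetical runtime helper */

    int32_t load_element(const int32_t *a, uint32_t length, uint32_t i)
    {
        if (i >= length)                  /* the bounds check: one compare, one branch */
            throw_index_out_of_range();
        return a[i];                      /* only reached when the index is in bounds  */
    }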

The Intel Pentium 4, as I discovered 7 years ago, was the enemy of virtual machines. Intel got a lot of heat about the Pentium 4 in more ways than one, as it had relatively long latencies for basic operations such as CALL and RET (used by computer languages for function calls), kernel context switches (2 to 3 times slower than on the Pentium III or Athlon), and synchronization primitives that were up to 5 times slower than on the other chips.

LESSON LEARNED: The Pentium 4 fiasco demonstrated that chip manufacturers have to pay attention to virtual machines and make sure to optimize CPU processing cores for jitted code sequences. This will become ever more important as wider adoption of virtual machines means that memory checking tasks move from hardware to software.
 


AMD's Blown Opportunity

As Intel was in the dog house from 2000 to 2005 redesigning its architecture, AMD was also busy enhancing its processors, pushing forward the idea of dual-core processors and coming up with a specification to extend the x86 architecture to add both 64-bit integer registers and 64-bit addressing to the programming model. In 2005, Intel jumped on the bandwagon, announcing its own support for 64-bit extensions, hundred-core processors, shorter and more VM-friendly pipelines, and support for virtual machines directly in hardware. To which AMD responded with its own version of hardware virtualization. To which Intel responded with SSE4, then AMD pinging right back with SSE5. Necessary?

In putting together the "x86-64" specification, later renamed "AMD64", AMD had an opportunity to seriously address the security issues of the 1990's. The magic bullet AMD gave the world was the "NX" or "No Execute" bit, an attribute bit which marks a 4K page of memory as non-executable data. This is mainly done on stacks and heaps so that malicious code cannot construct new code sequences and execute them from those places. Intel followed suit and copied the feature as its "XD" (Execute Disable) bit, which Windows exposes as "DEP" or "Data Execution Prevention". I have no objection to the NX bit, as this is a feature that other microprocessor families already have.

My beef with NX is how it was overly hyped, even marketed to gullible consumers as "Enhanced Virus Protection", which in my opinion is about as misleading and dumbed down as most car commercials. The type of buffer overrun error being targeted by NX requires that the buffer overrun occur on the stack (i.e. a local C/C++ buffer is overrun) and inject corrupted data in a specific place on the stack where the function return address is. That return address then has to point to some code which is itself part of the overrun data. This is one of the early tricks that hackers used in the Windows 98 era, but it is hardly the _only_ form of buffer overrun exploit. Especially since all that Windows has to do is randomize the stack base address for each thread, because position-independent exploits are harder to write.

As was disclosed within a year of the release of NX, a cleverer exploit called a "trampoline attack" points the corrupted return address at some valid and pre-existing code, such as a valid system call. By corrupting the stack with just the right data above the return address, a single buffer overrun can easily make a system call to delete all files on the hard disk, for example, and do so while completely bypassing any protection offered by NX/DEP, such as in this example (http://www.mastropaolo.com/?p=12). It was a pretty tragic waste of effort on the part of AMD, Intel, and Microsoft, which built much of the hype around the Windows XP Service Pack 2 release on this supposedly magic security fix. Worse, it may have distracted Microsoft from going after more comprehensive security efforts: (http://www.crn.com/security/18841713).

The NX/DEP "feature" can even degrade system performance. The data structures that the x86 microprocessor uses to store protection and mapping information for each page, called the PTE ("Page Table Entry"), has traditionally been a 32-bit or 4-byte structure. For each allocated page of virtual memory in each process, a 4-byte page table entry must be allocated internally by the Windows kernel. There is also a rarely used 8-byte version of the PTE, generally used on server systems with larger amounts of physical memory, which the 4-byte PTE lacks enough bits to support. Unfortunately, the 4-byte PTE also lacked the spare bit needed for NX, so AMD added the NX attribute bit only to the 8-byte PTE. This means that when Windows XP or Windows Vista enables NX/DEP on a typical home computer, the memory footprint of page tables doubles. More memory wasted for kernel page tables means less physical memory available for the user's actual applications. In Windows Vista, the NX/DEP feature is enabled in terms of booting Windows with 8-byte page table entries, but by default the feature is actually disabled on individual programs. The user must specifically opt-in to enable full coverage, something that the average Windows Vista user probably doesn't know to do.

LESSON LEARNED: The NX Enhanced Virus Protection feature of AMD64 is mostly ineffective and can hurt system performance even when not actively being used.

Let's look at the feature central to x86-64 / AMD64 - the addition of a 64-bit mode of execution. As with the NX bit, I have nothing against the widening of the integer registers to 64 bits, or the ability to address memory using 64-bit pointers. This brings AMD and Intel hardware up to parity with 64-bit PowerPC based products such as the PowerMac G5 and the Sony Playstation 3! Being able to natively perform 64-bit arithmetic, to have more registers, and to address more than 4 gigabytes of RAM directly are all positive, desirable features. I say keep those things.

However, I do take issue with the fact that AMD chose to arbitrarily dumb down the 64-bit mode in numerous ways:

  • arbitrarily removing some key x86 instructions, an omission which breaks virtual machines.
  • not providing separate control bits to toggle between only enabling 64-bit wide registers and only enabling 64-bit addressing, as the 386 did 20 years ago when 32-bit registers and addressing modes were added.
  • crippling the segmentation hardware in order to force a 64-bit flat memory model.

I can understand that each of these three design choices might have been a shortcut that allowed AMD to simplify the hardware and thus get its 64-bit microprocessors to market ahead of Intel. But this rush to 64 bits came at the detriment of the software community. And other than NX, AMD64 does not address security issues such as buffer overruns in general. I will go into detail on these three issues:

Losing arbitrary instructions: One of the most critical user mode instructions used by emulation software is the LAHF instruction. For most programmers this is an obscure instruction which copies part of the EFLAGS register to the AH register. (Geek term overload, I know, sorry!). In layperson terms, it is an instruction which copies 5 of the 6 arithmetic condition flags - Zero, Sign, Carry, Parity, and Adjust - to an integer register. This instruction has been used by C/C++ compilers to check the results of floating point operations, for example, and even more so by virtual machines and translated code which need to save away the current register state. Operating system context switching code could also use this instruction, along with its complement SAHF, which copies data back into the condition flags. Every emulator that I have written for MS-DOS and Windows uses the LAHF and/or SAHF instructions to do fast saving and restoring of the condition flags, as do other third party emulators.

When I tried to port portions of my Gemulator and SoftMac engines to 64-bit Windows a couple of years back, I was mystified as to why the speed had plummeted in 64-bit mode. I'm not talking a 30% Pentium-type slowdown, I'm talking dozens of times slower. After several hours of debugging and reading the fine print in the AMD64 manual, I finally realized what was happening - the instruction had been removed from 64-bit mode and was actually faulting. 64-bit Windows XP knew of this, and was silently emulating the instruction. The code ran, but each of these emulated round trips into the Windows XP kernel cost about 1000 clock cycles. Unfortunately the workaround is to use a code sequence based on the PUSHF and POPF instructions (likely the very sequence that Windows was using to emulate the instruction), which is many times slower than LAHF.
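
For the curious, this is roughly what the two approaches look like from C with GCC-style inline assembly (a sketch only; a real emulator would do this in hand-written assembly, and on the early 64-bit parts described here the LAHF version is exactly what faults):

    #include <stdint.h>

    /* Fast path: LAHF copies SF, ZF, AF, PF, and CF into AH in a cycle or two. */
    static inline uint8_t save_flags_lahf(void)
    {
        uint16_t ax;
        __asm__ volatile ("lahf" : "=a"(ax));
        return (uint8_t)(ax >> 8);          /* the flags were placed in AH */
    }

    /* Workaround: push the whole flags register through memory with PUSHF/POPF,
       which costs many times more cycles than LAHF. */
    static inline uint64_t save_flags_pushf(void)
    {
        uint64_t f;
        __asm__ volatile ("pushfq\n\tpopq %0" : "=r"(f));
        return f;
    }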

Unfortunately, a whole generation of AMD Opteron, AMD Athlon64, and Intel Pentium 4 processors shipped lacking this basic instruction. It seems enough software developers raised a stink, because in recent updates to AMD's documentation the instruction is once again supported, and microprocessors such as the Intel Core 2 do in fact support the LAHF instruction in 64-bit mode.

LESSON LEARNED: First with the VM-unfriendly design of the Pentium 4, and then with this silly omission in AMD64, it seems that neither AMD nor Intel cared much about the performance of virtual machines as late as 2000 to 2003. Virtual machines should be a first-tier design goal for future generations of microprocessors.

Control bits: Just as my earlier example of the "pseudo-32-bit" mode demonstrated how 32-bit microprocessors can deliver a backward compatible programming model, AMD64 could have offered the option to enable only the new 64-bit wide registers but to keep the existing 32-bit memory addressing model. This would have allowed Windows XP or Windows Vista to have easily kept most of their existing code base and application compatibility, while offering a new enhanced 64-bit mode for applications that chose to make use of 64-bit integer registers. It would have been a very easy transition.

In fact, Apple has made use of just such a trick in Mac OS X. Unlike the AMD - Intel war, in the early 1990's IBM, Apple, and Motorola got together to jointly define a specification for the new PowerPC microprocessor family. One of the beautiful things about PowerPC's design is that its designers thought years ahead. When I was a developer on Mac Office 98, I used a 1994 programmer's reference manual written jointly by IBM and Motorola which described the PowerPC register state and instruction set. Near the back of the manual is an entire chapter on 64-bit extensions to PowerPC. Even though 64-bit chips were years away, they had thought ahead and defined what happens when you extend registers to 64 bits in width and allow 64 bits of address space. The end result is a spec that, unlike x86, allows 32-bit PowerPC code to run almost unmodified, and sometimes completely unmodified, on 64-bit PowerPC microprocessors. One of the modes available, and the mode that I believe Mac OS X takes advantage of, is a programming model with 64-bit wide integer registers but a 32-bit address space. This allows Apple to ship one single version of Mac OS X for the PowerMac G5 and to seamlessly multitask 32-bit and 64-bit Mac OS applications using the same kernel. When I went to work on the Xbox 360 team a few years ago, I was actually amazed at how nearly identical the design and specification of the actual 64-bit Xbox 360 PowerPC was to that specification from over 10 years earlier.

AMD64 offered no such smooth upgrade to 64-bit mode, forcing Microsoft as well as the Linux distributions to ship completely separate kernels for 32-bit and 64-bit. The 32-bit Windows XP and the 64-bit Windows XP operating systems are actually radically different. Even Windows Vista has different feature sets on 32-bit and 64-bit, such as the 64-bit version no longer supporting MS-DOS or 16-bit Windows applications.

LESSON LEARNED: AMD should have studied the PowerPC design to learn how to smoothly extend x86 to AMD64. Instead, it has given us an architecture that requires a radical rewrite of operating systems and applications. Backward binary compatibility, such as mixing and matching 32-bit x86 code with new AMD64 code in the same process, is not possible.

Flat segments: This is the most infuriating change of AMD64. AMD decided to kill any form of usable segmented memory model in 64-bit mode. As stated directly in the AMD64 specification:

"Most modern system software does not use the segmentation features available in the legacy x86 architecture. Instead, system software typically handles program and data isolation using page-level protection. For this reason, the AMD64 architecture dispenses with multiple segments in 64-bit mode and, instead, uses a flat-memory model." - AMD64 Programmer's Manual, Volume 2, Section 1.2.1

Segment selectors still exist, but only in a form that effectively gives one address space where all segments start at memory location zero and have maximum length. Given that during the 1990's the software community jumped on 32-bit flat model addressing and then realized the security issues associated with a flat memory model, why did AMD not at least sit down with Intel, Microsoft, the Linux community, and the computer science community, and work out a joint specification along the lines of how PowerPC was developed?
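
For readers who have never looked at segmentation up close, here is a minimal sketch in C of the legacy 8-byte segment descriptor that the hardware still nominally understands; in 64-bit mode the processor simply ignores the base and limit fields below for the CS, DS, ES, and SS segments (only FS and GS keep a usable base), which is exactly the crippling I am complaining about:

    #include <stdint.h>

    /* Sketch of a legacy x86 GDT segment descriptor (8 bytes). */
    struct gdt_descriptor {
        uint16_t limit_low;        /* limit bits 0-15                              */
        uint16_t base_low;         /* base bits 0-15                               */
        uint8_t  base_mid;         /* base bits 16-23                              */
        uint8_t  access;           /* present bit, privilege level, segment type   */
        uint8_t  limit_high_flags; /* limit bits 16-19 plus granularity/size flags */
        uint8_t  base_high;        /* base bits 24-31                              */
    } __attribute__((packed));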

I find it unbelievable that AMD64 forces this 64-bit flat memory model on us, or that Intel went along with it. Ironically, the 64-bit model is now so limited that it practically required both companies to come up with hardware virtualization schemes such as Vanderpool and Pacifica just to work around the lack of protection in 64-bit mode.

Meanwhile, this 2006 IEEE Computer magazine article on the return of microkernels is seriously bringing to the forefront the realization that, even if slower, the microkernel architecture is necessary (over today's flat model monolithic kernels) if we are to have secure computers.

You can see how childish it is that AMD and Intel are now brawling over SSE5 and SSE4.1 and what have you, while the fundamental x86 architecture is broken and out of control.

I promised I'd comment on AMD's Lightweight Profiling proposal. I've read through the spec, and what it reflects is the state of the art of instruction event tracing from ten years ago. There are numerous software based frameworks for collecting and analyzing instruction traces - my favorite being Nirvana, which I helped develop at Microsoft Research, as well as Intel's Pin, DynamoRIO, Valgrind, and ReVirt - and apparently even VMware is now getting into the game. These software solutions already exist; the main problem is that developers aren't making use of them!

AMD LWP claims to be lightweight but takes the naive approach of generating 32 bytes, that's BYTES, of trace data per event. An event could be as common as an instruction retiring, a branch operation, or a cache miss. Potentially millions, if not hundreds of millions, of events might be recorded each second. I don't know of any hard disk that has a throughput of 100,000,000 * 32 = 3.2 GB/sec. This is one of the fundamental problems of gathering long traces - you end up with gigabytes of data. This forces the tracing to be confined to very short intervals, or requires triggering on much rarer events. Rare events can already be traced with low overhead (pretty much by definition!). The software based frameworks record similar events, and in the case of Nirvana we specifically addressed the problem of trace bandwidth and found ways to get the trace output down to roughly 0.5 bits per instruction, hundreds of times more compact than AMD's proposal and well within available disk bandwidth.
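
To make the bandwidth comparison concrete, here is a small sketch in C; the 100 million events per second is just an illustrative rate, while the 0.5 bits per instruction figure is from the Nirvana paper linked above:

    #include <stdio.h>

    int main(void)
    {
        const double events_per_sec        = 100e6;  /* illustrative: 100 million events/sec */
        const double lwp_bytes_per_event   = 32.0;   /* per AMD's LWP proposal */
        const double nirvana_bits_per_insn = 0.5;    /* per the Nirvana paper  */

        printf("LWP trace bandwidth:     %.2f GB/sec\n",
               events_per_sec * lwp_bytes_per_event / 1e9);
        printf("Nirvana trace bandwidth: %.4f GB/sec\n",
               events_per_sec * nirvana_bits_per_insn / 8.0 / 1e9);
        return 0;
    }

At 32 bytes (256 bits) per event versus roughly half a bit per instruction, that works out to a difference of about 500 to 1.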

So this raises the question: WHAT problem is AMD actually hoping to solve with LWP? Unless someone from AMD cares to fill me in, LWP appears to be a waste of transistors; another example of digging in the wrong place.

AMD executives should read and learn some things from this post (http://groups.google.com/group/comp.os.ms-windows.misc/msg/d710490b745d5e5e?&hl=en) by none other than Gordon Letwin. It's his take on why the computer industry needs standards and what happens when two competing entities battle over controlling the same standard instead of co-operating - whether it be OS/2, high definition movies, or the future of x86.


Connecting the Dots

To conclude this post: over the past 20 years, both hardware designers and operating system designers have repeatedly chosen the quicker, easier path instead of the right path. The successful collaboration that produced PowerPC is perhaps the one exception to this. Everywhere else, security and reliability were not treated seriously, leading to the situation today where computers continue to get infected, which puts people's personal information at risk and bogs down the Internet with zombie traffic. Companies cash in on selling anti-virus software and other bogus security fixes that don't fully work or fix the root causes of the problems. AMD and Intel continue to race each other to deliver meaningless SSE hacks instead of focusing on fixing the basic programming model for the 64-bit multi-core future. In an attempt to fix the problem in software, the software research community is rediscovering old concepts such as virtual machines, bytecodes, microkernels, and memory models that do not rely on flat addressing, but all of this still rests on top of today's broken hardware and software stack.

I trust you now understand the logic which led me to propose the 10 steps at the start of this post. The software community needs to go right to the bottom of the software stack, right to the bare metal, and rebuild a new software stack that does not rely on hardware features that fall prey to the whims of AMD or Intel. At this point, with AMD having already stuck the dagger into the heart of segmentation, and with computer languages working out their software-based alternatives to hardware-based memory protection, AMD and Intel should just go ahead and completely remove all forms of segmentation, paging, and hardware protection from their chips. These hardware "features" no longer meet the criteria of Jan's Razor, especially in the heavily multi-core future.

Simply using VMware-style hardware virtualization to create isolated virtual machines as a form of security is not the best solution or the best use of multi-core technology. As VMware's own paper shows, the latencies of thousands of clock cycles involved with hardware virtualization, much like the 1000 clock cycle latency I observed in the trivial virtualization of the LAHF instruction, can make it slower than classic binary translation. Worse, while binary translation may be slightly slower on average, it has considerably more predictable worst case times, whereas hardware virtualization latencies can be huge. So while products like Virtual Server and VMware Fusion are nice first steps toward making legacy software more stable and making use of multiple cores, they don't address the longer term security problems. And they are highly constrained to running on specific hardware, which limits their portability to other architectures or even to existing legacy hardware still in use.

Next week I will begin discussing my proposed software-based solution to the x86 crisis, which involves a microkernel containing a software binary translation based virtual machine (the "VERT"), a joint specification for a common x86 instruction set and programming model that is not tied to the actual host hardware, and a different way to think about memory protection. Thank you for having the patience to read this far!


As before, I would love to hear from you. Email me directly at darek@emulators.com or simply click on one of these two links to give me quick feedback:

 
Darek, keep writing, this is gravy on my biscuits!
 
Darek, shut up and go back to your Atari!

[Home]  [Part 1]  [Part 2]  [Part 3]  [Next]