NO EXECUTE! October 1, 2007

NO EXECUTE! A weekly look at personal computer technology issues.

October 1 2007

[Home] [Part 1] [Part 2] [Part 3] [Part 4] [Part 5] [Next]

Rebooting the 64-bit Specification

As much as I am annoyed by the limitations of AMD's 64-bit extensions, I can't blame AMD for going ahead and publishing the x86-64 (later "AMD64") specification back in 2002. Toward the late 1990's Intel hinted that the Pentium 4 would be the end of the line for, as Intel called it, the "IA32" x86 architecture. An entirely new "IA64" architecture and instruction set branded as "Itanium" was being jointly developed with Hewlett-Packard (http://www.theregister.co.uk/2000/02/08/hp_lens_focuses_itaniummerced_strategy/). Itanium promised to deliver 128 64-bit integer registers and a more RISC-like architecture. I remember as an exhibitor at the Spring 2000 COMDEX in Chicago walking past the Microsoft booth and seeing a demo of Windows running on a prototype Itanium system. Itanium was the future, and all the limitations of the x86 architecture would soon be history. Even as recently as 2002 (http://www.windowsitpro.com/Articles/ArticleID/23895/23895.html?Ad=1) it was assumed that an Itanium based future was still in the cards for us all, AMD64 had not yet been released, and it was unknown whether Intel would support such 64-bit extensions to x86. Amazing how only three years later, by spring of 2005, Intel did a complete about face!

The world owes AMD a debt of gratitude for keeping x86 alive. But here is the rub: I am disappointed that in extending x86, AMD did not redesign the 64-bit execution mode to better address security and virtual machine issues. If anything they did the opposite and merely added hacks to what was already a flawed mode. It is also disappointing Microsoft adopted the AMD64 specification, and that Intel simply did a quick about face and embraced its competitor's design.

These companies failed the consumer in many ways. By not working together to jointly deliver a better 64-bit specification, they confused the consumer, confused OEMs, and confused software developers. AMD and Intel both blew a golden opportunity to right the limitations of x86; the very limitations that prompted Intel to considering something like Itanium in the first place. To this day both companies still don't even use the same terminology. What AMD called AMD64, Intel calls "EMT64" and "Intel64". What AMD calls NX, Intel calls DEP.

Let us put this marketing nonsense aside and look at the actual technical issues that AMD tried to solve with AMD64. Prior to the publication of the AMD64 specification in 2002, the last real overhaul of x86 occurred in 1986 when Intel extended the 16-bit registers and 16-bit modes of execution of the 286 to the 32-bit modes of the 386, i.e "IA32". If you stop to analyze what Intel was trying to solve in 1986 with the IA32 extensions and then compare that with what AMD was trying to solve in 2002 with the AMD64 extensions, you may be surprised by some simple design options that they both appear to have missed.

I am about to demonstrate that much of the functionality of AMD's 64-bit extensions could have been implemented in Intel's processors as early as the 1980's with a trivial design change in the way that x86 instructions are encoded. Had Intel made this design decision twenty years ago, it would:

Give x86 the ability to use 16 general purpose registers, whether in 16-, 32-, or 64-bit modes of execution.
Make x86 instruction decoding simpler and faster, leading to faster virtual machines and better hardware instruction parallelism.
Reduce the size of 64-bit code compared to what it is today with AMD64/EMT64 extensions.
Allow mixing of 32-bit and 64-bit code modules within the same process due to binary compatibility between the two modes.
Simplify the porting of 32-bit code to 64-bit code, reducing the complexity of compilers, debuggers, and other developer tools.
Support PC-relative addressing in all modes of execution.
Avoid several CPU errata that today pose performance and reliability risks.

The PowerPC architecture was designed from its beginning to have this kind of 32-bit and 64-bit binary compatibility. This is what allows Apple to ship a single version of Mac OS X for the PowerMac G5 and seamlessly run both 32-bit and 64-bit code using the same operating system kernel. AMD blew a golden opportunity in 2002 by not adopting this same approach to the 64-bit extensions.

Here is the good news: it is not too late to redesign 64-bit extensions to x86 before the mainstream adoption of 64-bit software. Consumers and OEMs are revolting against Windows Vista. Just recently Microsoft announced that it will make available an option to "downgrade" Windows Vista to Windows XP. OEMs such as Dell have stopped preloading Windows Vista on their new computers and are giving customers the option to ask for Windows XP. Due to binary compatibility issues between even 32-bit Windows XP and 64-bit Windows XP, the vast majority of Windows users today are still running a 32-bit operating system. With XP getting an extension on its life, this will be true for some years to come (http://www.theregister.com/2007/09/28/microsoft_xp_deadline/).

It is not too late to fix the basic foundation upon with all future Windows, Mac OS, and Linux code will be built upon. AMD and Intel need to jointly make this fix before thrusting any more SSE extensions or hardware virtualization extensions upon us. As I said a few weeks ago, the foundation upon which software is built upon today is broken.

Following my comments last week about applying the ideas I'm discussing in this blog to open source emulators, one of my readers just made me aware of a company in Germany called innotek (http://www.innotek.de/index.php?option=com_content&task=view&id=34&Itemid=48) that has already produced an open source hypervisor called VirtualBox (http://www.virtualbox.org/) which is based on QEMU. I was not aware of this product but I am very encouraged to read in their documentation that VirtualBox uses binary translation, and that it analyzes the code at translation time for bugs and security issues that it patches on the fly. This is exactly the kind of proactive security push in virtual machines that companies such as Microsoft and VMware lag behind in their products. It is also encouraging to read that innotek's experience with hardware virtualization corroborates my data as well as the findings published in the Singularity and VMware papers - that Intel's Vanderpool (or "VT-x") hardware virtualization can be slower than binary translation of code. And worse, that AMD's clone of Vanderpool known as "Pacifica" or "AMD-V" is even slower:

"we can reduce the faults caused by our virtualization to a rate that performs much better than a typical recompiler, or even VT-x technology... VT-x support is not of high practical importance and we have noticed that our implementation of AMD-V is currently even slower than VT-x." - quotes from http://www.virtualbox.org/wiki/VirtualBox_architecture

The data against hardware virtualization is piling up quickly. As I said two weeks ago, AMD and Intel are clearly "digging in the wrong place" and should be devoting their efforts to issues besides this overhyped and pointless hardware virtualization. Instead, they should be solving the fundamental problems with x86 memory models, x86 performance issues, security holes in the x86 instruction set architecture, and making the hardware simplifications that I am calling for such that they can product lower cost, lower power, more scalable microprocessors.

With hypervisors such as VirtualBox already available, and with my own hypervisor project in development, it will be plausible and practical to develop secure and efficient virtual x86 extensions that do not exist in real hardware. These thought experiments which I am describing this week (and also will next week) can be implemented in real software and run on binary translation based virtual machines, with or without the support of AMD or Intel.

If you are familiar with x86 code generation and the mechanisms of Mod-R/M bytes and prefix bytes, you can skip the next two sections and jump ahead to my proposed solutions.

A Brief x86 Primer

In 1985, the PC was based on the Intel 286 microprocessor, running a 16-bit operating system MS-DOS, and programs were able to make use of eight 16-bit registers - AX BX CX DX SI DI BP and SP. Compared to what was running inside of Apple Macintosh, Atari ST, and Commodore Amiga machines, this was an embarrassment. Those machines used the Motorola 68000 microprocessor, and by 1985 had improved upon the limitations of the 68000 and offered the 68010 and 68020 microprocessors, followed soon by the 68030 and 68040. Those architectures provided sixteen 32-bit registers, eight registers which were "data registers", D0 through D7 which were used for arithmetic calculations, and eight "address registers" A0 through A7 which were used to reference memory. Not only could larger calculations be done on the 68020, but more of the calculations could be done in registers instead of in memory, resulting in faster speed. So in 1986 while MS-DOS users were stuck with black and white text screens, the rest of the world was running graphics user interfaces!

Even with the release of Windows and OS/2, PCs were still stuck with 16-bit calculations and the limitations of 8 registers. It was not until 1995 when Microsoft really pushed Windows 95 and 32-bit applications did the 386 actually get users for what it was designed to do - running 32-bit code and processing 32-bit data. By then the Macintosh had already migrated to the Power, which was an even more powerful design featuring thirty two 32-bit registers. Despite its much larger price tag, the Macintosh truly was years ahead of the average PC in the early 1990s.

So, pretend you are Intel and it is 1986, and you are trying to catch up to the 68020. What technical limitations of the 16-bit 286 microprocessor and programming model are you trying to fix or improve upon? Here is a list I have come up with that highlights the Intel weaknesses:

a maximum of only 8 general purpose registers ( or "GPRs", while 68020 had 16 GPRs).
a general purpose register width of only 16 bits (while 68020 had 32-bit registers).
a non-orthogonal general purpose register set which limited memory access to specific registers (while 68020 had eight).
a limit of accessing at most 64 kilobytes at a time of linear address space using segments (68020 could address 4 gigabytes).
various data leaks out of ring 0 kernel mode which allowed less privileged levels to access kernel state or read privileged data (issues which the 680x0 family fixed in the 68010).
a variable sized instruction set which makes instruction decoding difficult and ambiguous.

The 32-bit design that Intel delivered in the 386 addressed three of these six issues: they increased the GPR width from 16 bits to 32 bits, they introduced 32-bit flat addressing which allowed for 4 gigabyte segments, and modified the instruction encodings to support a more orthogonal use of the GPRs. Intel stopped short of fixing the other three issues: increasing the number of general purpose registers, sealing the kernel state leaks, and solving the instruction encoding mess. AMD tried to solve two of these problems two decades later with AMD64, but the final problem still remains unsolved today.

You might be asking yourself why this last issue of variable sized instructions matters since x86 has from day one always had a variable sized instruction set. So has the 68000 and 68040 family for that matter, the 6502, the Z-80, and most other non-RISC architectures. The issue is important for two reasons - performance and reliability. It is incredibly complex to decode an x86 instruction in software (believe, I have had to do this in numerous projects) and is even more difficult to decode in hardware. Most AMD and Intel microprocessors decode instructions by fetching 16 bytes of code bytes at a time into a queue. There the hardware has to determine the instruction boundaries, i.e. the starting address of each of up to 16 x86 instructions. The 16 bytes may contain 16 one-byte instructions. Or they may contain eight 2-byte instructions. Or perhaps an instruction spilled over from the previous 16 bytes, or an instruction from this block spills over into the next block. The x86 programming model places no restrictions on whether an instruction needs to start at an even address of odd address (which is a restriction on 680x0, PowerPC, Itanium, ARM, and most of today's other popular architectures), and only a restriction that an instruction cannot exceed 15 bytes.

The act of fully decoding an x86 instruction can cause multiple L1 and L2 cache misses, and multiple page faults. Operating systems need to be able to handle these page faults and distinguish between a code fetching fault and a data access fault. It's a real mess. Not to mention that 15 bytes allows for a LOT of potential byte sequences. 15 bytes is 120 bits, or over 2 to the power 120 ways to encode an x86 instruction. How many of those are actually valid? How does one validate every possible instruction code sequence? It can't be done today.

Not to mention, over the years Intel's documentation has been full of errata and optimization guidelines telling software developers of code sequences to avoid. In the latest Core 2 errata document for example, there are errata about "undefined behavior" with some code sequences (exactly which are not disclosed). This is the kind of very scary documentation that I really do not like to read. Any code that I cannot deterministically simulate or describe the side effects of is potentially a source of blue screens, crashes, or security exploits. This kind of undefined and unverifiable behavior is a gateway for things like denial of service attacks. Let's not forget the "F00F" bug of the original Pentium processor. One such "undefined" instruction, whose byte code starts with 0xF0 0x0F, can literally lock up the chip from user mode code. Intel did not and could not validate the behavior all the 2^120 instruction sequences, but hackers did manage to find some nasty ones. Therefore, variable sized instruction encodings lead to reliability and security issues.

As far as performance, since the original Pentium processor Intel was giving rules about the order that certain instruction should appear in for maximum performance. This is because until the Pentium all Intel microprocessors executed at most one instruction per clock cycle. Thus it was only necessary to decode at most one instruction per clock cycle using a single hardware decoder. The Pentium and Pentium MMX now allowed execution of two instructions per clock cycle, provided that the second instruction in the pair was a "simple" instruction that was easy to decode. That ideal occurs maybe 30% of the time in real code. Similarly, the architecture of the Pentium II, Pentium III, and Core Duo has the famous "4-1-1" rule, which allows the processor to decode three instructions per clock cycle but only if they are grouped in specific triplets of three instructions where the second and third instructions as "simple". The 4-1-1 rule exists on most Windows computers in use today, and is one common reason why all Windows software falls far short of the ideal 3-instructions-per-clock-cycle performance limit that Intel likes to advertize.

There are other bizarre little rules, such that there are penalties involved when an instruction starts at an address which is 13 bytes into a 16-byte block of code. i.e. code addresses of the form 0xnnnnnnnD may incur extra penalties! These weird little rules vary between AMD and Intel architectures, and even between different models of each branch's microprocessors. But they are there, many of them, and no AMD or Intel architecture is free of these performance hazards.

An astute reader should thus conclude that it is impossible to generate "optimal" x86 code, since the optimization rules vary from chip to chip. And even code that appears to be optimal on all models today may break tomorrow, as was the case when the Pentium 4 broke several long used Pentium MMX and Pentium III code optimizations. And, therefore for as long as these arbitrary and ever changing stalls existing in hardware, that the answer to improving x86 performance is not faster clock speed. The answer is to fix instruction encodings such that the hardware can more easily decode and execute three instructions per cycle. A dynamic binary translation approach that rewrites code on-the-fly is a solution to working around these hardware issues on legacy machines, and is the basis of projects such as DynamoRIO.

Let us first look at the reasons why it is so difficult to decode an x86 instruction, and then I will demonstrate a better instruction encoding that is suitable for the world of virtual machines. I will demonstrate below how at any time since 1986 either Intel or AMD was free to make some very simple modifications to the x86 instruction set so as to deliver:

A programming model with sixteen general purpose registers.
An orthogonal instruction set that is easier to decode and thus less subject to "4-1-1" performance issues.
An instruction set that has far fewer possible instruction permutations and is verifiable and deterministic.

I believe I have solved the above design issues, and in a cleaner manner than either Intel did with IA32 or AMD did with AMD64.

Prefixes and Mod-R/M Bytes: Unnecessary Evils

All microprocessor instruction encodings have traditionally followed a similar pattern, regardless of the architecture. An instruction consists of 8 bits which encode a primary "opcode" byte. The opcode specifies the actual operation being performed. For example, an opcode can specify an arithmetic ADD operation, it can specify a memory load operation, or a function call operation. Certain instructions also encode a secondary opcode to specifically select an exact operation. For example, on both the Pentium and 68040, there is a primary opcode that specifies bit manipulation operations, and secondary opcode to distinguish between say, a "bit set" operation and a "bit clear" operation. Sometimes additional bits are used to encode the data size, i.e. the number of bits being operated on by the instruction. For example, selecting between an 8-bit ADD or a 32-bit ADD. Depending on the operation, the size may be implicitly encoded into the primary opcode, while at other times it is a separate group of bits.

Most instructions need bits to specify the "addressing mode", which tells the microprocessor whether the opcode operates on data in memory, data in registers, or both, and which ones. Depending on the memory addressing mode, several more bytes are needed to encode a displacement, which can be a signed 8-bit integer up to a signed 32-bit integer which is used during address calculation. Finally, one or more 8-bit to 64-bit immediate constants may follow.

The x86 architecture has one more set of bits called "prefix bytes", which are one or more bytes that modify the meaning of any of the other instruction bits. For example, a prefix byte can change the primary opcode to mean something else, it can alter the data size, or it can modify the addressing mode. It is a technique that Intel created to give them a few more bits in places where they had no more bits left to encode an opcode or addressing mode. This allows some common x86 instructions to be encoded as a single byte. Contrast that with 68040 which has a minimum instruction size of two bytes, or PowerPC which has a fixed instruction size of 4 bytes.

An example of a one-byte instruction on the 16-bit 286 is the function return opcode, RET, which is the 8-bit opcode 0xC3. Or you can have a complex instruction such as ADD WORD PTR ES:[BX+0xFF37],0x1234 which requires a prefix byte 0x26 (to specify the ES segment), a opcode byte 0x81 (to specify the ADD WORD operation), what is called an "r/m byte" to specify the memory indirect addressing mode through the BP register (the r/m value of 0x87), two bytes to specify the 0xFF37 offset to BP, and 2 more bytes to specify the 0x1234 immediate operand, for a total of 7 bytes.

The actual encoding of this instruction when executing in 16-bit MS-DOS real mode is this:

ES segment prefix 16-bit ADD opcode r/m byte 16-bit address displacement 16-bit immediate operand
0x26 0x81 0x87 0x37 0xFF 0x34 0x12

The segment prefix bytes are one of 4 prefix bytes allowed on the 286, later extended to 6 prefixes on 386 and beyond. The segment prefix allows an instruction to select a memory operation on one of four 64-kilobyte segments of memory. This effectively increases the addressable memory of an MS-DOS program to 256 kilobytes at any given instruction. The 286 architecture specified several prefix bytes:

0x26 - ES (extra segment) prefix
0x2E - CS (code segment) prefix
0x36 - SS (stack segment) prefix
0x3E - DS (data segment) prefix
0xF0 - LOCK prefix, locks the memory bus to allow atomic read-modify-write operations
0xF2 - REPNE prefix, used by "string" opcodes 0xA4 through 0xAF.
0xF3 - REP/REPE prefix, similarly used by string opcodes

These prefix bytes are instruction modifiers which change the meaning of the original un-prefixed instruction. Since the prefix bytes come out of the same pool of 256 possible primary instruction opcodes, these seven prefix bytes limits the number of primary opcodes to 256 - 7 = 249. At the time of the 286, not all 249 possible primary opcodes had been assigned.

For the 386 / 486 / Pentium / Core microprocessors, Intel added several more prefixes by using up some of these unassigned primary opcodes:

0x64 - FS segment prefix
0x65 - GS segment prefix
0x66 - data prefix - this turns a 16-bit data operation into a 32-bit operation and vice versa
0x67 - address prefix - this changes the meaning of the Mod-R/M byte between the 16-bit addressing and 32-bit addressing versions.
0x0F - the extended opcode prefix

By adding the 0x0F "extended opcode prefix" it allow an "escape" from the limits of having just 256 major opcodes. A whole new set of 256 opcodes is now available, which are encoded as the byte sequences 0x0F 0x00, 0x0F 0x01, etc. MMX and SSE instructions are encoded as extended opcodes for example.

To understand the additional AMD64 extensions, one needs to first under the details of bits within the Mod-R/M byte, and how they really break down into three bitfields:

bit positions
field name
description of field
7:6 mod selects between a register-register or register-memory operation
5:3 reg register index of non-memory register operand, or a secondary opcode
2:0 r/m index of memory addressing mode

The gory details of Mod-R/M are best explained in the AMD64 Architecture Programmer's Manual, Volume 3, Appendix A. I highly recommend a good reading of this appendix. The thing to remember is that the two-bit field "mod" determines whether the instruction operate on memory and also helps to determine what the the full length of the encoded instruction is. The "mod" field can take on one of 4 values:

11 - "r/m" specifies a register index of a second register operand. an example instruction is "MOV AX,BX"

10 - "r/m" specifies a register index for an indirect memory operand which is followed by a 16-byte displacement

01 - "r/m" specifies a register index for an indirect memory operand which is followed by an 8-byte displacement

00 - "r/m" specifies a register index for an indirect memory operand, which is followed by a 16-bit displacement if the r/m field contains the value 6 (binary 110).

Got that? If the "mod" bits are 10, 2 more bytes of instruction follow. If they are 01, one byte follows. If they are 11, no more bytes are needed to determine the addressing mode, since there is no memory access. If they are 00, well, it gets trickier.

Remember how I mentioned that the 68020 divided its sixteen registers into eight data and 8 address registers. The 286 similarly divided its eight registers, with AX CX and DX being used for arithmetic operations, while BX BP SI DI and SP were more used accessing memory. Therefore, while the eight possible values in the "reg" field do in fact correspond to each of the 8 possible general purpose registers, the eight values in the "r/m" field do not. They actually correspond to select combination of addressing modes that consist of either indirecting the SI, DI, or BX registers, or indirecting the combined values of BX+SI, BX+DI, BP+SI, or BP+DI. In this context, the BX and BP registers are referred to as "base registers", which SI and DI are referred to as "index registers". Notice in fact that one cannot even explicitly access memory via the stack pointer register SP. The eight possible encoding of the "r/m" field is used to suppress any register indirection and is instead followed by an actual address.

This is the issue of orthogonality. The 68000 architecture allowed any of the eight register A0 through A7 to be used as base registers and any of the sixteen registers to be index registers. All of these instructions are valid on the 68000:

MOV.B D0,3(A0)
MOV.L A5,8(A7,A0)
ADD.L (A0,D5),#1

while the 286 is limited to a handful of register indirect addressing modes:

MOV AL,[BX+3]
MOV [BP+SI+8],BX
ADD [0x1234],1

and instructions like this are simply illegal on the 286:

MOV AL,[CX+3]
MOV [DX+AX],BX

The 68020 lifted the slight limitation of the 68000 in that any of the sixteen registers could be used base registers, and added the ability to scale index registers by a value of 2, 4 or 8, such that these instructions became valid:

MOV.B D0,5(D4,D5*2)
MOV.L A5,(A7,D7*4)

The 68000/68020 architecture was far more RISC-like and if you were a compiler writer, it was much easier to generate efficient code for the 68000 or 68020 than for the 286. The scaling of index registers not only reduced code size by eliminating the need for explicit multiply or shift instructions, but made it very efficient to access arrays of integers. Intel faced the problem of how to enhance the Mod-R/M byte such that memory addressing could be truly orthogonal and to add this scaling capability. What would you have done?

This is what Intel did for IA32 in 1986. It kept the meaning of the "mod" and "reg" fields identical, but changed the meaning of the eight encodings of the "r/m" field. 6 of the 8 encodings now allowed for a register indirection, specifically the registers EAX EBX ECX EDX ESI and EDI. I am putting an "E" in the register names, because these new encodings are only valid when 32-bit addressing is enabled, and thus the full 32-bit register is used. AX is the 16-bit lower portion of EAX, BX is the lower 16-bit portion of EBX, and so on.

The remaining two encodings are used as escape sequences. The encoding for EBP is similarly an escape code for suppressing the base register, as was true in 16-bit mode. This instruction:

MOV EAX,[EBP]

is really encoded as:

MOV EAX,[EBP+0]

Similarly, the stack register ESP by itself cannot be indirected, as was true in 16-bit mode. Instead, the encoding corresponding to ESP is an escape code which tells the decoder that there is another byte following, a byte called the S-I-B Byte, which stands for "Scale, Index, Base". The S-I-B byte similarly breaks down into three bitfields:

bit positions
field name
description of field
7:6 scale scaling factor in bits
5:3 ri register index of the index register
2:0 rb register index of base register

The S-I-B byte allows for almost full orthogonality at the expense of one extra byte of code. These now are valid 32-bit instructions:

MOV EAX,[ESP+4]
MOV [EBP+EAX*8],EBX
MOV [0x12345678+ECX*2],SI

As in the Mod-R/M byte, the S-I-B also has some escape values when ESP and EBP are involved, such that either base register can be supressed (when "rb" is the register index of EBP) or the index can be suppressed (when "ri" is the register index of ESP).

The convenience of having an almost orthogonal programming model comes at the expense of a ridiculously complex set of rules for decoding the Mod-R/M-S-I-B bytes. For example, while both AMD and Intel are pushing the virtues of MMX and SSE instructions, a typical SSE instruction can have up to 3 bytes of valid prefix bytes (and several more bogus prefix bytes), a primary opcode byte, a Mod-R/M byte, an S-I-B byte, and then several more displacement and immediate operand bytes.

This is the crazy situation we end up with now in 2007 with SSE3, SSE4, and SSE5. What started as a nice compact x86 instruction set that was much smaller and compact than RISC is now in fact far more complicated and bloated than RISC. From a performance point, the hardware decoder in some cases needs to fetch and decode up to six bytes of the instruction before it can determine the full length of the instruction. Until that value is known, the decoder can't start decoding the next instruction. This is what leads not only to the strict pairing and "4-1-1" rules, but the code bloat also increases the chance of the instruction spanning a 16-byte boundary, a cache line, or a page, which are all conditions that are documents in AMD and Intel manuals as being sub-optimal for decoding performance.

AMD64 makes this problem worse because of the way that AMD chose to extend the Mod-R/M-S-I-B bytes. Notice that each of the four register fields in those bytes - "reg", "r/m", "ri", and "rb" - contain just 3 bits. Yet AMD64 allows for 8 new general purpose registers - R8 through R15 - which can be used as orthogonally as the original 8 registers. Because of the escape codes in the "r/m" field, for any given instructions at most 3 of the 4 fields are in use. Therefore, AMD needed to find 3 bits!

Do you see the oversight yet? Do you see what both AMD and Intel missed?

AMD had several design options:

Redesign the Mod-R/M-S-I-B byte encodings.
Burn three of the remaining unused major opcodes, each representing one of those three bits.
Or, as AMD implemented, re-assign 16 existing major opcodes for use as what are called "REX prefixes".

This is my beef with AMD64, in that the REX prefixes break any chance of there being binary compatibility between 32-bit and 64-bit code, and why for example, you cannot load 32-bit DLLs into a 64-bit process.

AMD actually needed four bits - one to extend the "reg" field by one bit, one to extend the "rb" or "r/m" field by one bit, one to extend the "ri" field by one bit, and one to extend indicate that an instruction was to operate as a 64-bit data operation instead of the default 32-bit data size. So AMD combined these four bits into one single prefix called, a super-prefix byte, which is in the form:
bit positions
field name
description of field
7:4 upper nibble constant 0100 (0x4)
3 REX.W "wide" flag, indicates 64-bit data
2 REX.R extends Mod-R/M "reg" field
1 REX.X extends Mod-R/M "ri" field
0 REX.B extend Mod-R/M "r/m" or "rb" fields

This means that the REX prefix is actually 16 different prefixes, the byte values 0x40 through 0x4F. This is where 32-bit and 64-bit modes break. In 32-bit code, the primary opcodes 0x40 through 0x4F correspond to the single-byte register increment (INC) and register decrement (DEC) instructions. AMD chose to remove these instructions (and several other instructions) from 64-bit modes of execution in order to reserve the space for these and possibly future prefixes. Prefix 0x40 is even redundant since it extends nothing, leaving 15 useful REX prefix bytes.

The end result of the REX prefixes is that 64-bit code is in general almost one byte per instruction "fatter" than 32-bit code, and it is not uncommon to see 64-bit instructions with two or three prefix bytes. The new SSE4 and SSE5 instruction encodings for example, as specified by Intel and AMD respectively, will generally require 5 bytes to encode.

This code bloat and complexity has to stop. It is time to punt the AMD64/EMT64 64-bit extensions and rethink the design.

A Simpler and More Reliable 64-bit x86 Instruction Set

I am now going to describe two sets of design changes to x86 that I believe Intel should have made years ago. The first is a minor but very obvious change to add access to eight more GPRs but without resorting to a REX prefix. The second is a much larger set of changes that eliminate prefix bytes completely, resulting in the cleaner design of x86 that would allow for binary compatibility between 32-bit and 64-bit modes.

This second set of changes I will refer to as "VX64", or "virtual x64", as I believe that virtual machines should adopt a simpler, smaller, more efficient instruction encoding scheme so as to improve performance and reliability. If we truly are moving to a world where all operating systems execute within virtual machines, then nothing requires software developers to use the same byte encodings as legacy hardware. This is already the whole point of Java and .NET and LLVM is today - to abstract away the code bytes from the actual hardware being used. What nobody but myself seems to be considering is that this abstraction should also be performed on all native code as well - to design a new x86 instruction set encoding that is optimized for virtual machines.

Design Change #1 - the Mod-R-S-I-B Word and 3-operand arithmetic operations

Here is the simple design change to Mod-R/M-S-I-B that both Intel and AMD seem to have missed. Imagine if the S-I-B byte had always been required. This is not too far fetched, since as it turns out, most 64-bit instructions today require either a REX prefix, an S-I-B byte, or both. Common compiler optimizations in 64-bit code, such as the use of PC-relative (or as AMD calls it, "RIP-relative") addressing, the use of high registers R8 through R15, and the use of scaled indexing for accessing lookup tables and performing fast multiplication operations, pretty much guarantee this.

My trick is to combine the optional REX prefix and the optional S-I-B byte along with the existing Mod-R/M byte into a new and mandatory 16-bit "Mod-R-S-I-B" word. i.e. I am eliminating the "r/m" field and using those three freed bits to extend the other three register fields. The new layout of the Mod-R-S-I-B word is this, first with the Mod-R byte:

bit positions
field name
description of field
7:6 mod selects between a register-register or register-memory operation
5:4 scale scaling factor in bits
3:0 reg register index of non-memory register operand, or a secondary opcode

followed by the I-B byte:

bit positions	field name	description of field
7:4	ri	register index of the index register
3:0	rb	register index of base register

That's it, that's what Intel missed in 1986 and what AMD again missed in 2002. A trivial sleight of hand that would eliminate the need for the 16 REX prefix bytes, give 32-bit Windows applications access to 16 general purpose registers, and may reduce the average code size of 64-bit code.

This change also reduces the complexity of hardware decoding an instruction, since the Mod-R and I-B bytes can be decoded in parallel. And as with AMD64, PC-relative addressing is hacked in by specifying register R13 in the "rb" field.

Notice that there is nothing in the above change actually related to 64-bit data operations or 64-bit addressing. This is strictly a redefinition of the Mod-R/M-S-I-B bytes as they were defined in IA32 and AMD64. This scheme therefore still requires at least a single new prefix byte, a REX.W prefix byte, to specify a 64-bit operation. This would be similar to the 0x66 prefix byte which specifies a 16-bit operation. At least such a scheme would have kept the INC and DEC instructions intact and have allowed for interoperability between 32-bit and 64-bit code.

Do you see the other major advantage of this encoding scheme? When the "mod" field contains 11 to specify a register-to-register operation, we still have three register fields at our disposal. This means that by using "ri" and "rb" fields to specify two source registers, this encoding scheme would also then allow for RISC-like 3-operand operations such as these:

ADD EAX,EBX,ECX
SHL EDI,EDX,EAX
XOR EDX,EDX,ESI (this is equivalent to the 2-operand instruction XOR EDX,ESI)

The ability to read two registers and write a result to a third would eliminate a lot of register-to-register MOV operations that currently have to be done on both 32-bit and 64-bit code, which in turn would lead to additional code size savings and performance improvements.

What AMD could have done is have just two REX prefix bytes: one that indicates a 64-bit data operation and implies the use of this new Mod-R-S-I-B format, and one that indicates a 32-bit data operation but also implies the new Mod-R-S-I-B format. Primary opcodes 0x82 and 0xD6 have always been officially undefined in Intel documentation and could be used as those two prefixes.

A similar option that was proposed to me by fellow Mac emulation enthusiast Philip Cummins is to use two such prefixes in a sticky manner, where one prefix would enable the new Mod-R-S-I-B mode, and one would revert to the old Mod-R/M-S-I-B operation. Sort of like the "set" and "clear" opcodes which toggle bits in the EFLAGS register.

However you look at it, the issue of adding additional general purpose registers to the x86 state could be done much more cleanly than IA32 and AMD64 specify today.

Design Change #2 - Eliminating Prefix Bytes

Let us expand this thought experiment to tackle the issues of opcodes that straddle line boundaries, instructions that start on odd byte offsets, and the issue of the almost unlimited number of instruction encodings caused by prefix bytes. In most instruction set architectures there are no prefix bytes. The primary opcode is found and decoded from a very specific set of bits, usually in the very first byte of the instruction. To achieve this on x86 would mean removing all prefix bytes completely - punting the 0x0F extended opcode prefix, punting the 0x66 and 0x67 size prefixes, punting the six segment prefixes, punting REPNE and REPE, punting LOCK, and punting any form of 64-bit prefix.

Let me for the sake of argument set in store a rule that the first byte of an instruction is the primary opcode byte, period. Let me also set in stone a rule that the second byte of the instruction is the Mod-R byte and the third byte of the instruction is the I-B byte as described above. Let us also require that each instruction have a fourth byte such that instruction alignment always ends up being 4-byte aligned. This fourth byte is used in a manner specific to each primary opcode.

I will also add the rule that any displacements or immediate constants that follow this 4-byte block also be multiples of 4 bytes in size. This maintains the requirement that instruction alignment is always maintained on a 4-byte boundary. No x86 or AMD64 instruction today has more than 8 bytes worth of displacements or immediate operands today and this would be maintained.

I am calling this no-prefix 4-byte aligned encoding scheme "VX64" which stands for "Virtual x64". It is a byte encoding that is not supported by any of today's existing hardware and therefore would have to executed by a binary translation based virtual machine. What the VX64 instruction encoding method gives in return is that:

All valid instructions are exactly 4, 8, or 12 bytes in size and are always 4-byte aligned.
Instructions are allowed to span a page, cache line, or 16-byte boundary, but the 4 bytes that contain the opcode never do.
The decoder can fully determine the instruction size based on just decoding the first two bytes of the instruction.

To elaborate on this last point, the decoder needs only create a 12-bit index based on the major opcode (byte 0) and the upper 4 bits of byte 1 to generate a 12-bit "opcode and mode" index. Such an index can then be used to index a 8192-bit ROM table (4096 entries of 2 bits) which returns one of four values:

00 - instruction is invalid (i.e. 0 bytes in size)
01 - instruction is 4 bytes in size
10 - instruction is 8 bytes in size
11 - instruction is 12 bytes in size

The output of this lookup can then be used to tell a second decoder where the next instruction starts. You can see how these return values can easily give the instruction size - the 2 bits are simply a count of 4-byte blocks (or "dwords"), with 0 being assigned as the size of an invalid instruction.

You may be asking why 12 bits and not just 10 bits, since it should be sufficient to just decode the two "mod" bits from the second byte. The reason for using 4 bits - the "mod" bits as well as the "scale" bits, is due to the fact that the mod value of 11 (to indicate a register-to-register operation) would be incompatible with any kind of non-zero scaling value and thus I want to keep those combinations invalid for now. I will iterate on this further to close this encoding loophole, but that iteration will also require using those same 4 bits. Preferring to be conservative, I will therefore stay with the 4096-entry ROM table.

The encoding scheme I have defined so far looks like this:

byte and bit positions
field name
description of field
(byte 0) 7:0 prop the major opcode
(byte 1) 7:6 mod selects between a register-register or register-memory operation
(byte 1) 5:4 scale scaling factor in bits
(byte 1) 3:0 reg register index of non-memory register operand, or a secondary opcode
(byte 2) 7:4 ri register index of the index register
(byte 2) 3:0 rb register index of base register
(byte 3) 7:0 byte3 opcode specific (TBD)
(bytes 4..7) disp32 / imm32 optional opcode-specific displacement or immediate operand
(bytes 8..11) imm32 optional opcode-specific immediate operand

What I have presented here so far has no ambiguity as to where specific opcode and addressing fields get decoded from. The first 12 bits of the instruction fully determine the instruction size. There are no redundant prefixes since there are no prefixes at all. There are no undefined results due to using conflicting prefixes. There aren't multiple ways to encode the same instruction. Most arithmetic operations, and most loads and stores to frequently used stack variables require just a 4-byte instruction. There are no funky faults that can occur part way through instruction decode. There are no "offset $D" performance hazards. There are no size modifying bytes past the first two bytes of the instruction. New instructions can be added to the instruction set without requiring them to be fatter than other instructions. The same encoding scheme is used for 16-bit addressing, 32-bit addressing, and 64-bit addressing, allowing some level of mixing and interoperability of 32-bit and 64-bit code.

This VX64 encoding scheme makes both hardware decoding and software decoding of x86 simpler. It is an abstracted and portable form of x86, not unlike what Java, MSIL, or LLVM bytecodes are today, but a form that is very close to x86 and even potentially source code compatible with legacy x86. I seriously urge developers to consider using a virtual instruction set for future Linux and Windows applications. Virtual machines and parallel computing are here to stay and we should all start writing code with that in mind.

Next, I will attempt to define an x86-compatible instruction set that uses this VX64 bytecode. I will further refine the "mod" and "scale" fields, and further whittle this down such that 12-byte instructions are eliminated completely. I will demonstrate how the current AMD64/Intel64 instruction set - including 64-bit operations, x87, MMX, SSE, SSE2, right up through SSE3 and SSSE3 - can be encoded in VX64 using about 180 of the 256 available primary opcodes.

The homework I leave you with today is this: what is the fastest way to increment a pair of integers in memory? If you were writing a compiler or binary translation framework, which x86 instructions would you use to implement functions such as there below? Try not to cheat by looking at a commercial compiler's output quite yet!

void IncInt32(long *pl)
{
pl[0]++;
pl[1]++;
}

void IncInt64(long long *pq)
{
pq[0]++;
pq[1]++;
}

The results may surprise you, especially when I do take a look at the compilers use by most Windows applications.

I would love to hear from engineers at AMD, Intel, or other chip manufacturers. Email me directly at darek@emulators.com with your comments.

You can also give me a quick vote by clicking on one of these two links:

Darek, keep writing, this is syrup on my pancakes!

Darek, shut up and go hug your Atari!

ES segment prefix	16-bit ADD opcode	r/m byte	16-bit address displacement		16-bit immediate operand
0x26	0x81	0x87	0x37	0xFF	0x34	0x12

byte and bit positions	field name	description of field
(byte 0) 7:0	prop	the major opcode
(byte 1) 7:6	mod	selects between a register-register or register-memory operation
(byte 1) 5:4	scale	scaling factor in bits
(byte 1) 3:0	reg	register index of non-memory register operand, or a secondary opcode
(byte 2) 7:4	ri	register index of the index register
(byte 2) 3:0	rb	register index of base register
(byte 3) 7:0	byte3	opcode specific (TBD)
(bytes 4..7)	disp32 / imm32	optional opcode-specific displacement or immediate operand
(bytes 8..11)	imm32	optional opcode-specific immediate operand