(c) 2023 by Darek Mihocka, founder,

August 31 2023

Software development never changes.  You think something will just take a few more months and it ends up taking, oh, 5 years.  Yes, I am guilty of that in my enthusiasm for ARM64!  Back in Part 40 over 5 years ago, as I was gushing with joy over the release of Windows 10 on ARM tablets, I promised you folks a tutorial on how to build a native ARM64 app, planning to use the new Xformer 10 release (well, "new" in 2018) to discuss how to port, build, and debug native ARM64.

At that time the first batch of Windows 10 on ARM tablets was coming to market - the HP Envy X2, the Lenovo Miix, the ASUS Novago, and later that year the Samsung Galaxybook2 and the Lenovo Yoga - the first batch of 64-bit desktop Windows devices based on ARM64.  Visual Studio 2017 had just been released with ARM32 and ARM64 cross-compiler targeting support, with full desktop Win32 APIs being supported.  The 2018 incarnation of Windows on ARM was NOTHING like the very restrictive "Windows RT" from 2012, and that's why I was really excited to see my vision of emulation based ARM devices from my post in 2010 finally coming to life.  But this was just the beginning and it would take us Microsoft software folks and hardware partners another five years to mature the products to the point where ARM64 is a true alternative (to both consumers _and_ to developers) to the Intel dominated world we've known for over three decades.

The ARM64 ecosystem has exploded by leaps and bounds during the past 5 years, with not just Qualcomm and Microsoft releasing newer tablets and laptops such as Surface Pro X and Surface Pro 9.  There was also Amazon's development and use of ARM64 based servers lowering the cost of cloud computing, and of course Apple's big switch from Intel to ARM64 in late 2020 really set things on fire with the Apple M1 pure 64-bit ARM64 Apple Silicon running a pure native ARM64 hosted Mac OS and a new incarnation of the Rosetta emulator to run legacy Intel Mac apps.  The Windows and Visual Studio teams at Microsoft kept cranking away, adding 64-bit emulation as part of Windows 11 on ARM in 2021, developing the "ARM64EC" hybrid execution mode for mixing both Intel and ARM code in the same binary, and releasing native hosted .NET and Visual Studio compilers in 2022.  As of February 2023, Microsoft officially supports running Windows 11 on ARM on Apple M1 and M2 devices in addition to all the Qualcomm Snapdragon based devices from Surface, Dell, Lenovo, HP, ASUS, and others.  The latest Windows Insider builds (rs_prerelease build 25941 at the time of this writing) have even dropped 32-bit ARM32 support and run purely 64-bit ARM64 code, in-box apps are now all pure native ARM64, and even the x86/x64 emulation has been vastly improved and sped up since the original Windows 11 release in late 2021.

With all these new devices and different ways to run Windows 11 on bare metal or in a VM, it can be confusing to consumers.  I personally run Windows apps on my Macbook M2 Pro in 5 (five!!) different ways - CrossOver/Wine and Rosetta, Parallels virtual machines, QEMU 8.0 emulating an Intel virtual machine, QEMU 8.0 running an ARM64 virtual machine, and VMware's latest Fusion release for the Mac.  Once you have it set up it becomes so seamless that it gets hard to tell when you're running something natively, when it's running under Rosetta (Apple's Intel emulator for ARM64), or when you're running under one of Microsoft's "XTA" family of emulators.  The best emulator is the one you don't even realize is running!

The fact that most of us tech workers in Seattle were sent home in March 2020 and happily spent the next 24 months working 7 days a week coding up a storm instead of sitting in rush hour traffic I truly believe contributed to a burst in productivity at Microsoft, Apple, Parallels, VMware, other companies, and the open source community.  For me, working from home was nothing new - it was a lot like all the pre-pandemic years that I spent developing Gemulator and SoftMac at my home in the 1990's or the time I spent working remotely for Intel a decade later.  So after returning to my Microsoft office in Redmond in March 2022 and living the rush hour gridlock dream again of taking at least 45 minutes to drive the 9 miles each way to and from my home on Mercer Island, I realized, no, office life is not for me anymore.  And quite frankly the social and economic implosion of the entire west coast (including Seattle) during the pandemic had already sent me and many of my co-workers seeking vacation homes and other cities to work from.  I spent a great deal of that time working back from Canada or poolside from Las Vegas.  Maintaining a home in Seattle and watching that city decline was no longer enjoyable.  Thus, 35 years and 8 months after first setting foot on Microsoft campus as a college intern I handed in my 12th and final employee badge last September and have since sold my home and moved back east. 

But I am certainly not done with ARM64 and have been working remotely for Qualcomm (the maker of the Snapdragon ARM64 CPUs) for much of the past year.  And I still owe you readers that post about ARM64 software development and all the cool exciting things that have happened to hardware and software since 2018.  So over the next couple of months I will make a series of postings related to the latest ARM64 devices, Windows 11, Visual Studio, ARM64EC, Time Travel Debugging, and of course the state of emulation on ARM64.  But first...

The Third CPU Race - The Race for IPC Supremacy

Following the 2013 release of Intel's 4th generation Core i7 codenamed "Haswell" (which I wrote about exactly 10 years ago in Part 36 and declared at the time to be the ideal chip for emulation) Intel pretty much ruled the CPU market for the next 7 years.  As I discussed in that post, Haswell added a slew of new instruction features such as AVX2 and BMI, made 4K video standard, set a new IPC/ILP (instructions per cycle, instruction level parallelism) bar and was (I believe) the first x86 chip to market with a 4-wide ALU pipeline.  The 25% or so overnight increase in IPC and performance blew away anything that Intel had offered before or that AMD based or ARM based systems could compete with.  While I was very hopeful about ARM at the time, remember, ARM64 hadn't even hit the market yet.  Intel had seemingly won the CPU races once and for all.

As a reminder, the "first CPU race" I consider to be the raw clock speed race that began in the late 1970's with the 1 MHz clock speed of the 6502 in the Apple II.  By the start of the 1990's most Intel based PCs (as well as 68030 and 68040 based Atari and Apple Macintosh and Amiga computers) were hitting 25 and 33 MHz speeds, and the race famously accelerated during the 1990's and came to a sudden halt in 2000 or 2001 with the 3 GHz launch speed of Intel's Pentium 4.  I've blogged about that topic a lot and still am not a huge fan of raw clock speed as a means to achieve performance - it makes software developers lazier by cranking out more bloated code, and it burns power.  Neither of these side effects is in line with today's goals of reducing greenhouse emissions, getting longer battery life, or improving cloud computing density.

Thus the "second CPU race" which started right after the Pentium 4's spectacular meltdown was the multi-core race to put as many cores as possible on the same silicon die.  AMD fired the first shot in the x86 world with their Opteron CPU in 2002, and Intel fired back in 2006 with the Core 2.  Quad-core CPUs (usually with 8 hyper-threads) have become the norm on most consumer laptops and tablets since about 2008, with workstation and server CPUs hitting 16 cores and 32 threads and even higher in recent years.  While throwing more cores on the silicon solves two immediate problems - it lets you run more apps in parallel and it allows you to reduce the clock speed and thus reduce power consumption - it is not trivial to code for.  In theory, you could have thousands of cores running at super low clock speed on a tablet or laptop, but this requires software that is extremely parallelizable such that it can use the thousands of cores and threads.

Having spent many years of my career working on the Visual Studio team on compiler toolsets and of course being a software developer for over 40 years now, I can tell you that writing parallelizable code is very difficult.  I lectured about this in the summer of 2008 with my presentation The Real Crisis where I argued that parallelism is not solved by merely adding more cores.  As I concluded in the very last slide of that talk:

"IPC throughput still has a long way to go. Most current CPU cores can ideally retire at least 3 instructions per clock cycles, yet historic throughput is 0.5 to 1.0 IPC.

I am of the personal opinion that we cannot rely on programmers or static compilers to fix the problem - code that is "ideal" today may not be on next year's CPU architecture.

Legacy code is always doomed to become inefficient over time. Is the solution really to just keep rewriting it over and over again?

Take the “write once run anywhere” approach of Java, but apply it to all code."

I was obviously making the case for using emulation and dynamic translation to do on-the-fly code optimization to maximize IPC and ILP, a topic I revisited in my series of posts in 2015.  I was already raising the alarm about parallelism in 2008 at a time when quad-core was most common.  It is certainly still a big problem today as CPUs with 16 cores and more become common.

CPUs still have a long way to go to improve the IPC of their _individual_ cores - the "micro-architecture" which is the actual implementation of a given CPU core.  Haswell, with its larger caches and wider pipeline, improved raw IPC by about 25%.  It also improved code's ability to achieve ILP by adding new instructions such as the flagless shifts and rotates, and 3-operand ALU instructions which allow one ALU instruction to replace what normally requires a MOV + ALU instruction sequence.
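As a rough illustration of what that means at the source level, here is a minimal C sketch.  The function names are made up for this example, and the instruction choices mentioned in the comments are what a compiler targeting BMI1/BMI2 may emit, not a guarantee - check the disassembly from your own toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clear in 'value' every bit that is set in 'mask'.
   On older x86 targets this is typically a NOT followed by an AND, often
   with an extra MOV to preserve one of the inputs.  With BMI1 enabled a
   compiler may emit a single 3-operand ANDN instead. */
uint64_t clear_masked_bits(uint64_t value, uint64_t mask)
{
    return value & ~mask;
}

/* Hypothetical helper: shift by a variable count.  The classic SHL needs
   the count in CL and updates the flags; with BMI2 enabled a compiler may
   emit the flagless 3-operand SHLX instead. */
uint64_t shift_left_var(uint64_t value, unsigned count)
{
    assert(count < 64);
    return value << count;
}

/* Hypothetical helper: rotate right.  The "& 63" keeps the left shift
   well defined when count is 0.  Compilers generally recognize this
   idiom as a rotate; with BMI2 and a constant count RORX is a candidate. */
uint64_t rotate_right(uint64_t value, unsigned count)
{
    assert(count < 64);
    return (value >> count) | (value << ((64 - count) & 63));
}
```

The C source is identical for an old and a new CPU; only the compiler's target options change which instructions come out, which is why recompiling matters.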

I tend to think of IPC (instructions per cycle) as analogous to the number of lanes on a highway, and if you increase the number of lanes you can increase traffic flow.  So you can improve the IPC of older code simply by widening the pipeline, that's the 25% part of Haswell's improvement.  But then on top of that you can improve ILP (instruction level parallelism) by making use of the new instructions or emitting your code in such a way as to put more traffic in more lanes at once.  So Haswell's true potential is far more than 25% compared to previous Intel generations; but you have to recompile the code to take advantage of it.

So I tend to think of IPC as more of a hardware responsibility to widen the path for code to run on, and ILP as more of a developer and software responsibility to write code that takes advantage of that path.  It is a virtuous cycle called "hardware software co-design" (HSCD) when you take both into account.  The idea is not to achieve parallelism by adding more cores, but rather to extract parallelism by making the micro-architecture more efficient and giving it more headroom.

HSCD was _not_ happening back in 2008 as evidenced by the fact that Intel and AMD were delivering cores that could in theory execute 3 instructions per cycle (IPC=3.0) yet real-world code was barely achieving IPC=1.0.  With Intel's micro-architecture improvements in Sandy Bridge in 2011 and Haswell in 2013 the raw IPC of most code reached about 1.3 to 1.5 but then it mostly hit a plateau for the next 7 years.

You may have wondered why I did not post any blogs in 2015 regarding, say, the Intel 6th generation "Skylake" CPUs.  The truth is, I did buy several Skylake devices such as the Microsoft Surfacebook and had access to other ones at work but I was underwhelmed.  After 2014's 5th generation "Broadwell" and all the other generations 6, 7, 8, 9, 10 and 11 through the year 2020, my testing and benchmarking showed that raw IPC on Intel cores was barely budging from year to year.  Sure, Intel was adding virtualization improvements for cloud computing and instruction tracing for better debugging, but for existing code Intel had still not achieved a sustained IPC=2.0 by 2020.

The flip side of that, the ILP, was also not improving.  Although AVX2 (and its related features such as BMI) added a ton of new x86 and x64 instructions to help improve ILP, most developers even today in 2023 do NOT ship AVX2 optimized code.  That is ironic given that Windows 11 effectively raised the hardware bar to minimum AVX2 for both AMD and Intel based CPUs, but there are simply too many millions of older PCs out there using CPUs from before 2013 which are either only AVX or even only SSE3 or SSE4 based.  As I said in the Real Crisis presentation, legacy code and even much new code today may well be perfectly tuned and optimized for SSE3 in 2008 but it's failing to run optimally today by not making use of AVX2.  AVX2 (and vector extensions in general) are probably a black box to most of you, so I will talk about AVX and its derivatives in detail in an upcoming post.
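One common way around the "millions of older PCs" problem is to ship both code paths in the same binary and pick one at run time.  The sketch below is only an illustration under stated assumptions: it assumes GCC or Clang on x86/x64 (it relies on their __builtin_cpu_supports builtin and "target" function attribute) and quietly falls back to the baseline loop on any other compiler or CPU.  All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline version: compiled for whatever the default target is. */
static uint32_t sum_baseline(const uint32_t *data, size_t count)
{
    uint32_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += data[i];
    return total;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
/* Same loop, but the compiler is allowed to use AVX2 instructions inside
   this one function (for example when auto-vectorizing the loop). */
__attribute__((target("avx2")))
static uint32_t sum_avx2(const uint32_t *data, size_t count)
{
    uint32_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += data[i];
    return total;
}
#endif

/* Returns 1 when the running CPU reports AVX2 support, 0 otherwise
   (including on compilers or architectures this sketch does not cover). */
int cpu_has_avx2(void)
{
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;
#endif
}

/* Public entry point: same result either way, only the code path differs. */
uint32_t sum_u32(const uint32_t *data, size_t count)
{
    assert(data != NULL || count == 0);
#ifdef HAVE_X86_DISPATCH
    if (cpu_has_avx2())
        return sum_avx2(data, count);
#endif
    return sum_baseline(data, count);
}
```

The cost of this approach is carrying (and testing) two versions of every hot function, which is a big part of why so little shipping code bothers.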

What about Moore's Law?  Over the decades, reducing transistor sizes from micrometers to 10's of nanometers to today's 5 nanometers benefits the clock speed race and the core count race, but in itself does nothing to improve IPC.  Many "new" CPUs are just shrunken versions of a previous CPU or use the smaller transistor size to pack more cores into the same area of silicon.

This 7-year stagnation of IPC for individual cores finally came to an end in November 2020.  The opening shots for the "third CPU race" which I am calling the IPC supremacy race went off that month, coming at Intel from both Apple and AMD.  One can argue that improving IPC has always been happening from day one - every time Intel or AMD or Motorola or IBM ever added a new instruction they were typically doing so with the aim of improving ILP and thus IPC - but I will counter argue that the obsession with clock speeds and core counts of past decades made the IPC race more of a 30-year charity walk than a true race.  I consider November 2020 as the start of what has truly been an amazing and fun race to watch so far, as multiple hardware vendors are now sprinting at full speed ahead.  It is part of the reason I joined Qualcomm last year to help cheer on Team Snapdragon in this new race.

November 2020 changed everything

After Intel took top spot in 2013, AMD was a good 40% behind in IPC with its existing Bulldozer and Bobcat and other micro-architectures.  Their Zen2 micro-architecture, which for example powered the Xbox Series X game console, was close but no cigar at catching up to Intel.  On the ARM side, although Microsoft and Qualcomm had been shipping ARM64 based Windows 10 devices since 2018, the performance was not quite there.  Qualcomm had improved on the original Snapdragon 835 with Snapdragon 850 which was about 50% faster, then Snapdragon 8cx and 8cx Gen 2 (used in Surface Pro X) which were 50% faster yet, but were at that point roughly at the same IPC=1.5 to IPC=2.0 as Intel's various "lakes".  And with the 32-bit x86 emulation running at about 30% of native speed in 2020, emulated apps were not achieving any better IPC than PCs had in 2008.

Hype began in October 2020 when AMD declared that Zen3 would be their biggest jump in IPC in their history.  I was intrigued, since my largest development machine at the time was a 10-core Intel 10th generation Core i9 machine which decently blew away all my other devices due to sheer core count, thread count, and peak 5 GHz clock speed.  The IPC was not much better than Haswell but it had brute force (think 10-cylinder V10 running premium fuel vs. 4-cylinder engine).  So of course I had to test one!  I found a website of a gaming rig builder MysteryByte who was already taking pre-orders for the new AMD Zen3 5950X machines, and had them build one for me.

Around that same time, Apple was of course launching their new ARM64 CPU (which they brand not as "ARM64" but as "Apple Silicon"), the Apple M1.  In June 2020 they made some large claims at WWDC saying that the Macbook M1 would beat about 98% of existing Intel laptops out in the market.  That was a _very_ tall claim, but one that I suspected might be true and was hoping _was_ true.  If Apple could outperform Intel at native code at least, that would be a game changer.  If they could beat Intel at emulated code, holy cow that would validate everything I'd been saying for the previous 10 to 15 years.  So as soon as they started taking preorders in October 2020 I placed my order for a Macbook Air M1.

As it turned out both the AMD 5950X machine and the Macbook were delivered to me the same week in November so I had a most interesting time over the Thanksgiving and Christmas holiday stretch studying the respective micro-architectures.  Any monkey can run Cinebench or GeekBench and report the results on a website (and the results I was reading for both CPUs were fantastic indeed and did both beat Intel) but I wanted to know _why_ the results were fantastic.  So of course I resorted to my usual techniques of micro-benchmarks, emulators, and debuggers to look at why certain code sequences achieved the sustained IPC rates they did.  Ironically it was exactly 20 years after I had taken delivery of the two Pentium 4 machines in November 2000 which began this whole "poke at the hardware" obsession.

Apple M1

Let's start with the Apple M1.  8 cores with a peak clock speed of 3.2 GHz, so a hair faster clock than the Surface Pro X's 8cx Gen 2 which clocked at 3.15 GHz and launched around the same time.  Yet if I took the exact same C code, something like the 8queens benchmark which I've used for over a decade as one of my test benchmarks, there was almost a 2x advantage on the M1 compared to the 8cx Gen 2 (also known as the SQ2).  I ran Chrome (which was already available in a native M1 build) and its score on the M1 was about double what native Edge scored on Windows 10 on the SQ2.  I wrote up and tested various code sequences and consistently measured about a 75% to 90% native performance advantage of the M1 Air vs. the Surface Pro X.

Many (mostly naive) people at the time reached some bogus conclusions as to why M1 was beating the pants off most every other mobile CPU out there.  It's the larger caches, it's the integrated RAM and faster memory bandwidth, it must be some secret new ARM64 instructions, etc.  Factors such as those can certainly be measurable but might typically only contribute a few percent.  They alone would not explain an almost 2x performance improvement over the SQ2 or why most native benchmarks were beating the pants off mobile Intel CPUs.

One very significant difference between the M1 and all previous ARM64 CPUs in phones and tablets is that the M1 was not trying to consume minimal power (unlike Microsoft's approach with Surface Pro X to be thin and fanless and sip power).  The M1 is more analogous to a Xeon server CPU in a laptop, while Microsoft's Surface Pro X was using something analogous to a Core i3 ultra-low voltage mobile CPU.  Apple was throwing more wattage at the CPU, which you can immediately sense as you feel the fanless M1 Air heat up as it is running benchmarks, or as you hear the CPU fan spin up on the Macbook Pro M1.

That extra wattage budget buys them the higher clock speed of 3.2 GHz compared to the more typical 2.5 GHz to 3.0 GHz of most previous ARM64 tablets and laptops.  Since SQ2 vs. M1 is really only a 2% clock speed difference, when I did the math to compute the IPC of various code sequences it became clear to me that M1 delivered IPC=3.0 in a consistent sustained manner, while the SQ2 delivered IPC=2.0 at best.  And again the best Intel cores in 2020 barely sustained IPC=2.0 on typical code.
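The math itself is simple: IPC is the number of instructions retired divided by the number of clock cycles elapsed, and the cycle count is just elapsed time multiplied by clock frequency.  Here is a minimal sketch of that arithmetic; the function name is hypothetical, and the numbers in the note after it are made up to show the calculation, not actual measurements.

```c
#include <assert.h>

/* IPC = instructions / cycles, where cycles = seconds * clock_hz.
   'instructions' would come from an instruction count of the test loop,
   'seconds' from a timer, and 'clock_hz' from the measured clock speed. */
double compute_ipc(double instructions, double seconds, double clock_hz)
{
    assert(seconds > 0.0 && clock_hz > 0.0);
    return instructions / (seconds * clock_hz);
}
```

With invented numbers: 9.6 billion instructions retired in 1.0 second at 3.2 GHz works out to IPC=3.0, while 6.3 billion instructions in 1.0 second at 3.15 GHz works out to IPC=2.0.  Because the two clock speeds are within about 2% of each other, nearly all of a gap like that has to come from IPC rather than from frequency.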

Once I started digging in I realized that Apple had made some interesting improvements over existing AMD, Intel, and competing ARM64 designs:

  • it was the first ARM64 core I saw that has a 3-wide load unit.  This is significant because even Intel designs at the time in 2020 supported only 2 loads per clock cycle, which had been true since Sandy Bridge was released in 2011.   Both the AMD and Intel philosophy for years was to support 2 loads and 1 store per clock cycle (since code typically performs operations such as "z = x + y" so you tend to read data about twice as often as you write data).  Having the third load unit I immediately recognized as being beneficial to certain kinds of code such as emulators where you need an extra lane of memory reads to handle emulator overhead such as table lookups.
  • the L1 data cache loads have a latency of 3 cycles, not 4 cycles like every AMD, Snapdragon, and Intel CPU over the previous 10 years.  This certainly helps to improve the memory throughput of latency-bound scenarios like "pointer chasing" through a linked list or looking up hash tables.
  • trivial highly parallelizable code sequences were peaking at IPC=8.0 instead of IPC=4.0 as on other ARM64 chips.  This meant that the M1 could fetch and decode 8 ARM64 instructions per clock cycle, thus 32 bytes of code per clock cycle.  Most ARM32 and ARM64 cores are typically designed for power efficiency and thus only decode 4 instructions (16 bytes of code) per cycle.  16-byte wide decode has been the norm in many AMD and Intel CPUs for a long time as well, and typically decode up to 4 instructions per cycle as well.
  • I could tell there were more than 4 ALUs.  Existing ARM64 cores had either 2 or 3 ALUs which just by itself means you can't match Intel integer performance since Intel went 4-wide in Haswell.
  • trivial instructions really take 1 clock cycle.  ARM64 (just like PowerPC twenty years ago) has a robust set of bit manipulation and bitfield insert/extract instructions.  On the Xbox 360 for example (which used a 64-bit PowerPC CPU) we exploited instructions such as RLWINM for shift and rotate and bitfield extract operations all the time.  Both PowerPC and ARM64 can do things using a single 1-cycle instruction for which Intel x86 requires 2 or sometimes 3 instructions.  One way to improve ILP is to not require long data dependencies to achieve simple tasks like extracting or inserting a bitfield.  So if you can do in 1 instruction what someone else needs 2 or 3 for, that's good.  This is also true for common operations such as function calls and returns, which on Intel and AMD require 2 cycles for a predicted "CALL" instruction and 2 cycles for a predicted "RET" instruction, while on the M1 the equivalent "BL" and "RET" instructions take 1 cycle each.  Most other CPUs at the time did not pull off this 1-cycle trifecta.
  • Although I used different build environments and compilers on each (Apple's Xcode on the M1 and Visual Studio 2019 on the Pro X) the resulting compiled ARM64 native code for say 8queens was not radically different.  This ruled out the use of any secret new instructions to benefit the M1 and meant that the M1 was simply faster at running plain old regular ARM64 code.
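To show what "pointer chasing" looks like in practice, here is a minimal sketch of the kind of kernel that is sensitive to L1 load latency.  It is an illustration only, not one of the actual micro-benchmarks behind the numbers above; the type and function names are hypothetical and the timing harness is left out.

```c
#include <assert.h>
#include <stddef.h>

/* A node that holds nothing but the pointer to the next node. */
typedef struct node {
    struct node *next;
} node;

/* Link 'count' nodes into a ring where each node points 'stride' slots
   ahead.  If stride and count share no common factor, the ring visits
   every node before returning to the start. */
void build_ring(node *nodes, size_t count, size_t stride)
{
    assert(count > 0);
    for (size_t i = 0; i < count; i++)
        nodes[i].next = &nodes[(i + stride) % count];
}

/* Follow the chain for 'steps' hops.  The address of each load is the
   result of the previous load, so the loads cannot overlap: the loop
   runs at roughly one load latency per hop, however wide the core is. */
const node *chase(const node *start, size_t steps)
{
    const node *p = start;
    while (steps--)
        p = p->next;
    return p;
}
```

If the whole ring fits in the L1 data cache, timing a long chase and dividing the elapsed cycles by the number of hops gives an estimate of the L1 load-to-use latency, which is where a 3-cycle vs. 4-cycle difference would show up directly.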

More recently now with the availability of Parallels and VMware Fusion and QEMU to host Windows 11 directly on the M1, I could test identical Windows ARM64 binaries on the M1 as on the Pro X and the speed difference still held up.

One other advantage Apple had (much like when they made the switch from PowerPC to Intel in 2006) is that their initial starting point with ARM64 was at a more mature level of the instruction set; they had no legacy devices to support.  What I mean by that is when they jumped on Intel they jumped directly to SSE3 and didn't have to support running code that only had SSE2 or MMX or used x87 floating point.  Similarly with the M1 they started at a much higher ARM64 instruction set level that includes the newer ARMv8.1 and 8.2 style atomics (which more closely model Intel's compare-exchange atomics approach) and also certain ARMv8.4 extensions which reduce unaligned memory access faults.

In comparison, the two released versions of Windows 11 (build 22000 "Cobalt" from 2021 and build 22621 "Nickel" from 2022) are compiled to support ARMv8.0 (the limitation imposed by the Snapdragon 835 devices from 2018) and run just fine on those original 2018 devices such as the Lenovo Miix.  Having to support such limited legacy devices introduces the same kind of problem as the AVX2 problem on Windows 11 in that developers won't necessarily make use of the best hardware features.

That is my simplified explanation of why M1 achieved what it did.  Nothing magic, no smoke and mirrors in their claims; they focused on the fundamentals of fetch and decode throughput, ALU and memory throughput, and no doubt did a lot of hardware software co-design and took full advantage of a mature ARM64 instruction set.

Almost a year after the M1 launched a very excellent document called the M1 Explainer showed up that dug far deeper into the M1 design than I did.  If I did the equivalent of looking at the M1 through a microscope, the M1 Explainer is doing X-ray crystallography on it!  It is a fantastic dense read at 350 pages but spot on accurate to what I've observed.

In an upcoming post I will look at emulation performance and also the latest Apple M2 Pro (which has some very cool new hardware tricks over the M1).

AMD Zen3 5950X beats Intel Core i9

As I mentioned the AMD Ryzen 9 5950X arrived the same week as the M1, so I was busy poking at that machine putting it up against my existing Intel Core i9 box.  I had both machines built to similar specs - 64GB of DDR4 RAM, 4GB SSD boot drive, nVidia 2060 GPU, and identical builds of what at the time was the Windows 11 Cobalt release (a.k.a. 21H2).  I installed the same Visual Studio toolset on both, basically set them up identically other than the AMD vs. Intel CPU difference.  Even the peak clock speeds of both CPUs (as displayed by Task Manager and measured by my micro-benchmarks) landed between 5.0 and 5.1 GHz on both machines.

The first thing I noticed, as most people did, is that the 5950X blew the doors off the Intel Core i9 at Cinebench, in both single- and multi-core numbers.  The multi-core results made sense since the 5950X is 16C/32T (16 cores, 32 threads) while the 10th gen Core i9 is 10C/20T.  But what stunned me was that the single-core result beat Intel by about 20%, approximately a score of 600 vs. 500.  That's raw IPC at work and in line with AMD's claim of a very significant IPC jump over their previous Zen2 design.  Even something trivial like 8queens was showing a significant speedup on Zen3 over Zen2 and Intel.

In comparing AMD to Intel I was able to run identical test binaries on both CPUs since they effectively had the same instruction set support - AVX2 + BMI1 + BMI2 etc.  Yet based on the instruction counts and execution times the AMD 5950X delivered IPC=3.0 while the 10th gen Intel gave IPC=1.9 and Skylake an IPC=1.8 for the same 8queens binary.

Much like the M1, AMD Zen3 had a robust set of supersized caches and execution units that put it at or above where Intel was in 2020: 32-byte fetch and decode, 4-wide ALU, 3-wide load unit, 3-wide vector unit, and in some cases 1-cycle faster vector math than Intel (3-cycle floating multiply vs. 4-cycle for Intel 10th gen, for example).  But these improvements either just catch up to Intel or are not relevant to a simple integer workload so that did not explain it.  After some further testing I realized that AMD had implemented better move elision.

Let's take a detour and discuss what I mean by "move elision".  It is the concept of removing a register move ("MOV") instruction from the instruction stream at instruction decode time and not passing it through the CPU pipeline as an actual ALU micro-op.  Register-to-register MOV instructions are extremely common in all types of x86 and ARM and even PowerPC binaries.  This is due to the fact that the common C/C++ register calling convention requires function arguments as well as return values to be copied, thus MOV-ed, to specific registers.

For example, let's take a look at actual 64-bit Intel x64 code in the C runtime, you can find these code sequences yourself using the Visual Studio linker and typing "link -dump -disasm C:\Windows\System32\UcrtBase.dll":

00000001800012AA: 48 8B 4B 60 mov rcx,qword ptr [rbx+60h]
00000001800012AE: 4C 8D 44 24 58 lea r8,[rsp+58h]
00000001800012B3: 48 8B D0 mov rdx,rax
00000001800012B6: E8 51 00 00 00 call 000000018000130C
00000001800012BB: 40 38 7C 24 50 cmp byte ptr [rsp+50h],dil
00000001800012C0: 74 40 je 0000000180001302
00000001800012C2: 83 F8 01 cmp eax,1
00000001800012C5: 74 3B je 0000000180001302

Notice that in the setup for the CALL instruction (a function call, doesn't matter for this example what is being called) the registers RCX, RDX, and R8 get loaded with some value.  RCX and RDX are each loaded with a MOV instruction, while R8 is loaded with an LEA "load effective address" which is effectively a flagless ADD of the stack pointer with the constant 0x58 (it is loading the address of a stack local).  After the CALL returns, the return value in EAX is checked and branched on.

The reason these specific registers are used is that in 64-bit Windows the calling convention requires that RCX be loaded with the first function argument, RDX with the second argument, and R8 with the third argument, and that the return value be returned in EAX (32-bit) or RAX (64-bit), which means the called function likely performed a MOV to EAX.
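To connect this back to source code, here is a hypothetical C fragment with the same shape as the call above.  It is NOT the actual C runtime source (every name in it is invented for illustration); it simply has a first argument loaded from a structure field, a second argument that is already sitting in a register, and a third argument that is the address of a stack local.  The register names in the comments assume the 64-bit Windows calling convention, and an optimizing compiler is free to inline the call and emit something quite different.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type and callee, invented purely for illustration. */
typedef struct context {
    void *handle;
} context;

int process_item(void *handle, size_t amount, int *status_out)
{
    assert(status_out != NULL);
    *status_out = (handle != NULL);   /* report success through the pointer */
    return (int)amount;               /* return value travels back in EAX   */
}

int caller(const context *ctx, size_t amount)
{
    int status;                            /* stack local                        */
    int result = process_item(ctx->handle, /* 1st arg: load from memory -> RCX   */
                              amount,      /* 2nd arg: register MOV     -> RDX   */
                              &status);    /* 3rd arg: address of local -> R8    */
    if (!status || result == 1)            /* test the local, then test EAX      */
        return 0;
    return result;
}
```

Every call site like this needs its arguments shuffled into those fixed registers first, which is why register-to-register MOVs (and LEAs of stack locals) are so common in compiled code.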

Older generations (say 10 years ago) of AMD, Intel, and ARM cores essentially treated MOV and LEA as any other integer instruction, meaning that each of those MOV and LEA operations would have a 1-cycle latency and burn an ALU lane.  Starting in 2014 with Intel's 5th generation "Broadwell" the cost of most register moves went to 0.  That's because the decoder converts the MOV instruction into a register rename operation and "elides" or eliminates that instruction.  AMD implements move elision in both Zen2 and Zen3, but goes one step further than Intel in that it also eliminates some forms of LEA as well as MOV-ing a register to itself.

Huh?  Why would you move a register to itself?  Well, almost every single compiled function in Windows 7, 8, 10, and 11 starts with some kind of a padding instruction; usually a dummy 2-byte instruction that does nothing but is useful for OS hotpatching and tracing.  You can force this at compile time using the /hotpatch compiler switch.  It is not correct for the compiler to simply emit two 1-byte NOP instructions because the CPU could be in the process of decoding the second NOP.  This makes hotpatching of a 2-NOP padding unsafe.  You have to instead emit a single 2-byte opcode, which register-to-register MOV is.

Compiled x86 functions usually start with this 3-instruction sequence.  Pay attention to the very first instruction:

100C1430: 8B FF mov edi,edi
100C1432: 55 push ebp
100C1433: 8B EC mov ebp,esp

It is a MOV of the EDI register to itself!  AMD Zen3 elides this, even in 64-bit mode where the 32-bit MOV is actually a zero extended operation (i.e. it is equivalent to MOVZX RDI,EDI).  That tells me that AMD has an additional layer of hardware optimization which tracks whether the upper 32 bits of a register are known to be zero to avoid the actual zero extending operations.

End of move elision detour.  The Zen3 also appears to be able to handle 6 LEA instructions per cycle compared to 4 on Intel, which hints to me that AMD's design can make use of either the memory address generation unit or the ALU's integer unit to compute effective addresses.  Other older cores have been known to treat LEA as a memory unit operation, not ALU.

The M1 also implements register move elision and as I mentioned also has the 3-wide load unit that Zen3 has, so in many respects the M1 and the 5950X are evenly matched.  No surprise then that both M1 and 5950X deliver IPC=3.0 on the 8queens benchmark.  Since Intel was still at 2-wide load in 2020, AMD had a leg up on pointer chasing and memory intensive workloads.

A pleasant surprise of the Zen3 micro-architecture turned out to be something neither Apple M1 nor Intel did: store-forwarding elision.  This is the concept of eliminating MOV operations from memory when the value in memory can be predicted based on an earlier store to that memory.  This type of operation occurs frequently in unoptimized C code, where the compiler will write a register to memory only to immediately load it right back.  But this also occurs in perfectly legitimate code, sometimes by design!

Intentional store-forward is generated by C compilers (x86, x64, even ARM64) when they need to transfer data between different classes of registers, e.g. integer-to-vector or vector-to-floating-point.  This has been around forever in Windows especially when dealing with floating point conversions.  Take a look at this actual function in the C runtime C:\Windows\SysWow64\UcrtBase.dll:

1005A67B: 83 EC 08 sub esp,8
1005A67E: F3 0F 5A C0 cvtss2sd xmm0,xmm0
1005A682: 66 0F D6 04 24 movq mmword ptr [esp],xmm0
1005A687: E8 24 CC FE FF call 100472B0
1005A68C: D9 1C 24 fstp dword ptr [esp]
1005A68F: F3 0F 10 04 24 movss xmm0,dword ptr [esp]
1005A694: 83 C4 08 add esp,8
1005A697: C3 ret

This is a __vectorcall SSE wrapper function for a conversion function that is implemented in older x87 floating point.  The only way that x87 registers can be loaded or stored is via memory; not from integer or vector registers.  So you can see above, the incoming value in XMM0 is first widened to double precision floating point then stored to stack memory to pass that argument to the wrapped function.  That function returns its value in an x87 register which then has to be written to the stack and re-read back in to the XMM0 return register.
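The same store-then-reload shape is easy to write in C.  Below is a minimal sketch, with hypothetical function names, of moving a value between the floating point and integer register classes by way of memory.  Whether the compiler really emits a store followed by a load is an assumption: an optimizing build will often use a direct register-to-register move when the target has one, and the memory round trip shows up when it does not (as with x87 above) or in unoptimized builds.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit integer.  Semantically this
   is "store as float, reload as integer". */
uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* The reverse direction: store as integer, reload as float. */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, float_to_bits(1.0f) gives 0x3F800000 on any IEEE-754 target.  If you disassemble a debug build of these two functions you will likely see exactly the kind of back-to-back store and load that a store-forwarding stall punishes.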

The problem with store-forwarding is that most CPU designs store data to the L1 data cache and read data from the L1 data cache, which incurs the 4-cycle L1 load cost.  Most CPUs do not implement a bypass to feed a store directly back to a load.  Trust me folks, this kind of storing and loading of the same data happens ALL over the place in 32-bit and 64-bit code (a little more in 32-bit code but 64-bit is not immune).  Especially in older 32-bit code which pushes function arguments to the stack only to read them back in the function, the cycles lost to store-forwarding stalls can really add up.

AMD Zen3 is the first hardware design I've seen where there is a bypass to elide store-forwarding latency.  But, and this is a big but, only for accesses to the stack using the ESP or RSP stack pointer as the base register.  Which means if the caller writes to the stack using say RAX because they loaded a stack address into RAX, Zen3 will not elide the following read away.  But they do handle the common case of spilling a register to the stack to then refill another register in the next instruction.  AMD Zen3 behaves as if the load is really just a register MOV and costs 0 cycles.  The x87-to-XMM data transfer from the code sequence above is still an open problem not solved by AMD yet.

So in summary - 3-wide load unit, faster SSE floating point, and limited store-elision easily put AMD on top of Intel back in 2020.  And while the M1 certainly wins the award for most improved micro-architecture, the raw speed winner in 2020 was still the AMD 5950X since it could beat Apple M1 on clock speed (5.0 GHz vs. 3.2 GHz) and core count (16/32 vs. 8).

2020 ended in spectacular fashion as two chip giants raised the bar and practically overnight put Intel into 2nd and sometimes 3rd place in benchmark results.

Intel Alder Lake strikes back!

Within 12 months of the one-two Apple AMD punch, Intel countered with its own major Haswell-like jump in IPC for the first time in over 7 years.  This came in the form of the 12th generation Intel "Alder Lake" micro-architecture.  Much like Haswell, Alder Lake throws a whole lot of features in at once such as AVX-512, large "Performance" and small "Efficiency" cores, as well as true IPC improvements.  For today I will focus mainly on the IPC improvements of the Performance cores and cover AVX-512 issues in a future post about AVX.

Alder Lake matched both Apple and AMD at delivering IPC=3.0 on integer workloads such as 8queens, matched their 3-wide load unit width, matched AMD's LEA elision, widened the vector unit throughput for certain operations, and cut some SSE latencies in half (for example floating point addition and shuffle operations).

Alder Lake's most significant improvement is that Intel appears to have solved the general store-forwarding problem for integer registers.  That is, if memory is written say through the RAX register and read back say through the RCX register, and if RAX and RCX are holding the same address, then the load through RCX is essentially free.  It does not look like they solved it for floating point loads and stores yet either, but this is still ahead of AMD's implementation and well ahead of any ARM64 implementation.
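Here is a minimal sketch, assuming hypothetical function names, of the two store-then-reload shapes being compared: a spill to a local stack slot (the case Zen3 handles) and a store through one pointer followed by a load through a different pointer holding the same address (the general case Alder Lake appears to handle).  Which base registers the compiler actually picks is an assumption you would need to confirm in the disassembly, and timing these loops would require wrapping them with a cycle counter, which is left out here.

```c
#include <stdint.h>

/* Case 1: spill to a local and reload it immediately.  volatile forces a
   real store and a real load on every iteration. */
uint64_t chain_through_stack(uint64_t seed, unsigned iters)
{
    volatile uint64_t slot;
    uint64_t v = seed;
    for (unsigned i = 0; i < iters; i++) {
        slot = v + 1;               /* store */
        v = slot;                   /* dependent load of the same address */
    }
    return v;
}

/* Case 2: store through one pointer, reload through another one.  The
   caller passes the same address twice, so the two pointers alias. */
uint64_t chain_through_alias(volatile uint64_t *wr, volatile uint64_t *rd,
                             uint64_t seed, unsigned iters)
{
    uint64_t v = seed;
    for (unsigned i = 0; i < iters; i++) {
        *wr = v + 1;                /* store via the first pointer */
        v = *rd;                    /* load via the second pointer */
    }
    return v;
}
```

Both functions return seed + iters, so the interesting part is purely the cycles per iteration.  On a core with no bypass each iteration pays the L1 load latency on top of the add; on a core that elides the reload the chain should collapse toward the cost of the add alone.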

I am also noticing that Alder Lake appears to have new macro-op fusion kicking in for certain bitfield-related sequences and compare-branch sequences.  That's a possible sign that Intel is doing additional hardware/software co-design on modern Windows 11 workloads and solving multi-instruction data dependency stalls with instruction fusion.

One other interesting note about Alder Lake is that the Performance cores are power hogs as has been well documented.  Like Apple, Intel is throwing wattage at the IPC problem for now and seems to be temporarily backing off optimizing for performance-per-watt (which was a big selling point of earlier Core generations).

Snapdragon 8280 boosts Windows 11 on ARM

Qualcomm has also not sat still, releasing the Snapdragon 8cx 3rd gen CPU (a.k.a. the Snapdragon 8280) in 2022 shortly before I joined them.  8280 powers the Lenovo x13s and the ARM version of the Surface Pro 9 tablet as well as Microsoft's very excellent Volterra Windows Dev Kit 2023.  I have to pause and rave about my Volterra dev kit, which at $699 USD and loaded with 32GB of RAM is a great way for developers to enter the ARM64 world and explore development on ARM64.

The 8280 is what I consider to be the first of the Snapdragon processors that is well designed for emulation and also suitable as a "NUC" or Mac Mini style Windows 11 desktop CPU (especially in the Volterra form factor with all the ports in the back).  Like the M1, 8280 implements the ARMv8.4 extensions which vastly improve the speed of misaligned memory accesses, and adds the Intel style "compare exchange" atomics support.  At 3.0 GHz it is not clocked as fast as an M1 and overall performance is generally slower than the M1, but it's a vast improvement over the past 8cx generations as well as the 835 and 850.  With beefier "small cores" the overall multi-core performance of the Surface Pro 9 and Volterra is by my measurements at least 40% faster than the Surface Pro X.  I will discuss the 8280 more in my upcoming Windows 11 on ARM post.
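As a quick illustration of what "compare exchange" means at the source level, here is a minimal C11 sketch with a hypothetical function name.  On x86 and x64 this kind of loop maps to LOCK CMPXCHG; whether an ARM64 compiler emits a single compare-and-swap instruction or falls back to a load-exclusive/store-exclusive loop depends on the target options you build with, so treat the code generation as an assumption to verify on your own toolchain.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Atomically raise *target to at least candidate.
   Returns true if this call stored the new value. */
bool atomic_store_max(_Atomic uint32_t *target, uint32_t candidate)
{
    uint32_t seen = atomic_load(target);
    while (seen < candidate) {
        /* On failure, seen is refreshed with the current value and the
           loop re-checks whether an update is still needed. */
        if (atomic_compare_exchange_weak(target, &seen, candidate))
            return true;
    }
    return false;
}
```

An emulator has to reproduce exactly this "succeed only if nobody else changed it" behavior for every translated x86 LOCK CMPXCHG, which is why having a native instruction for it matters so much.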

The Great Unification to 64-bit (and user mode emulation!)

So that is where we are today in August 2023.  The IPC race is still going strong, with recent announcements from several CPU vendors this summer talking of things to come in the next year or two.

What's interesting to me reading the recent reports is that AMD, Intel, and Apple seem to be converging toward similar 64-bit only CPU micro-architectures.  They all now fetch and decode 32 bytes of code at a time, they have similar widths of ALU and memory units, they all have fantastic branch predictors, and have mostly eliminated historical pipeline stalls due to partial-register updates, partial-flags updates, and misaligned memory accesses.  These are the very issues that plagued the Intel Pentium 4 twenty years ago yet today are solved problems.

The next great leap in design appears to be to dump old legacy 32-bit and 16-bit support in order to free up what is otherwise rarely used silicon.  ARM has already deprecated Thumb2 and ARM32 modes of execution for future cores, with the Apple M1 already dropping ARM32 support from silicon.  This makes perfect sense to me, since Mac OS had no legacy ARM32 code to support and most apps on iPad and iPhone are generally rebuilt as ARM64 for new OS releases.  Even Android is pretty much pure 64-bit ARM64 these days.  So it is not surprising then that Microsoft recently also punted all 32-bit ARM support from Windows Insider builds starting with last month's build 25905 which very clearly states "any installed Arm32 applications will no longer launch on your device".  A bonus of this elimination of ARM32 is that after the upgrade to these new builds your ARM64 device's C: drive has about 1GB more space due to the elimination of about 1GB of ARM32 binaries in Windows 11.

Windows has always been a different beast when it comes to old x86 code though, jumping through hoops under the hood to make sure old Windows 95 and NT and XP software mostly runs even today on Windows 11, even under emulation on ARM64.  That legacy support is now a burden on AMD and Intel CPUs, taking up space for what is rarely used or even dead silicon.  As the x86 emulation efforts have shown, you can achieve great x86 app compat and relatively fast performance through pure user mode emulation and skip hardware support for such legacy code.  We all knew this was possible since x86 game emulation worked quite well on the Xbox 360 way back in 2006.

In 2007 I therefore did a thought experiment in Part 5 to think about how x86 opcodes could be re-designed to get rid of some of the horrible gotchas of legacy x86 such as prefix bytes, segment limits, and long instructions.  I called this hypothetical design "VX64" and one suggestion I made was to generalize all opcodes to support 3-operand forms (the way PowerPC and ARM always have) which would reduce instruction counts and code sizes overall.  In my posts from 2010 to 2015 I further made the stronger argument to _only_ implement 64-bit mode and run all legacy code in emulation (just like the Xbox 360 did, and what Apple and Microsoft do today in running older Intel binaries on their ARM64 devices).  We definitely know this works in 2023; the concept is no longer hypothetical.

To my delight it seems Intel is finally coming around to this way of thinking based on two recent announcements they made just months and weeks ago:

- Intel Publishes "X86-S" Specification For 64-bit Only Architecture - Phoronix

- Intel AVX10: Taking AVX-512 With More Features & Supporting It Across P/E Cores - Phoronix

If you read through the PDF files that Intel posted, you'll see that X86-S is a proposal to simplify the x86/x64 instruction set to eliminate 16-bit and 32-bit cruft from silicon, while the AVX10 extensions then add additional architected integer registers and the kind of 3-operand operations common to RISC processors today.  This is my thought experiment from 2007 come to life!

The fact that AVX10 exposes the ability for legacy ALU operations such as ADD and SUB to be flagless very much mirrors the "flags or no flags" approach of most RISC instruction sets such as PowerPC and ARM64.  The ability to translate an x86 flag-setting instruction into a flagless RISC instruction is specifically useful for improving the ILP of both native code (where EFLAGS updates are frequently unused) and of translated code and thus the emulated IPC.  Flagless operations coupled with a dynamic optimizer would potentially achieve full legacy hardware 32-bit x86 performance.  Intel would have the advantage that emulating things like arithmetic and floating point flags would be a no-brainer since most instructions would "emulate themselves" with 100% accurate side effects.  Emulators such as Rosetta and XTA have a little tougher job in that department and have to emit additional instructions to emulate EFLAGS.
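To show why flags cost extra work in a translator, here is a minimal C sketch of a 32-bit ADD with and without flag materialization.  The Flags struct and function names are hypothetical and cover only four of the arithmetic flags; this is an illustration of the bookkeeping involved, not how Rosetta or XTA actually generate code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of EFLAGS: carry, zero, sign, overflow. */
typedef struct {
    bool cf, zf, sf, of;
} Flags;

/* Flagless add: one operation and nothing else to track. */
uint32_t add32(uint32_t a, uint32_t b)
{
    return a + b;
}

/* x86-style ADD: the same sum plus the work to compute four flags. */
uint32_t add32_flags(uint32_t a, uint32_t b, Flags *f)
{
    uint32_t r = a + b;
    f->cf = r < a;                              /* unsigned carry out */
    f->zf = r == 0;
    f->sf = (r >> 31) != 0;                     /* sign bit of the result */
    f->of = (((a ^ r) & (b ^ r)) >> 31) != 0;   /* signed overflow */
    return r;
}
```

Adding 0x7FFFFFFF and 1 with add32_flags yields 0x80000000 with SF and OF set and CF and ZF clear.  When the translated code never reads those flags before the next flag-setting instruction, all four extra computations are dead weight, which is exactly the case a flagless form lets you skip.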

These Intel proposals are not mutually exclusive and have to be implemented in a specific order.  Hardware first requires the AVX10 changes for exposing 32 registers in order to even be able to drop 32-bit mode and implement it using emulation.  Emulating 32-bit mode in 64-bit mode requires more than just 16 integer registers (trust me on this, having worked on both the x86-to-PowerPC emulator for Xbox 360 and the x86-to-ARM64 emulator for Windows 11, you need the extra registers).  Once such hardware exists (but possibly still supporting legacy 16-bit and 32-bit in silicon) then something like a future Windows build could choose to use either that legacy silicon or invoke an x86-on-AVX10 emulator.  As I said before, the best emulator is the one you don't even realize is running.  Seamlessly being able to run an app or game in either mode without the end user noticing the difference would be the litmus test which kills off 32-bit x86 support in hardware.

In summary, it is interesting to consider that not only are the CPU vendors converging on similar designs at the micro-architectural (implementation) level, but that Intel is leaning closer and closer to the RISC style of exposed architectural instruction set and talking emulation as a required first class citizen.  This is truly turning into a Decade Of Transformation for CPU designs.

Let's check back on the IPC Supremacy race next year.  Next topic for next month, setting up an up-to-date Windows 11 on ARM64 development machine.
