NO EXECUTE!

(c) 2008 by Darek Mihocka, founder, Emulators.com.

December 1 2008


[Part 1]  [Part 2]  [Part 3]  [Part 4]  [Part 5]  [Part 6]  [Part 7]  [Part 8]  [Part 9]  [Part 10]  [Part 11]  [Part 12]  [Part 13]  [Part 14]  [Part 15]  [Part 16]  [Part 17]  [Part 18]  [Part 19]  [Part 20]  [Part 21]  [Part 22]  [Part 23]  [Part 24]  [Part 25]  [Part 26]  [Part 27]  [Next]  [Return to Emulators.com]

Centrino 2 and Gateway's Killer Penryn Notebook

Recent Intel processor releases are of relevance to software developers, to virtual machine users, and to anyone computer shopping this economically challenged Christmas. The processors which I will discuss today are the mobile Core 2 Penryn branded as Intel "Centrino 2", the much hyped and freshly released Intel Core i7 formerly known by the codename "Nehalem", and the biggest little surprise of the year, the dual-core Intel Atom. And for you Atari 800 and Atari ST fans out there: Gemulator 9.0 is completed and posted as open source!

In late September I was invited to give a presentation at Microsoft to talk about performance optimizations. I brought my Acer Aspire and my Macbook, and proceeded to recycle some of my PowerPoint presentations from the summer. About an hour into my presentation I absent-mindedly knocked my cup of coffee over and, horror of horrors, watched several ounces of coffee ooze into my Macbook's keyboard. About two seconds later the Macbook shut off and died. That night I tried to revive the poor dead Macbook. I poured distilled water through it, baked it in the oven to dry it out (a little too long, oops), and still no dice. RIP Macbook 2006-2008.

My dead Macbook, being a 2006 model, used the original 65nm Core 2 processor, which supported the SSE3 and SSSE3 instruction set extensions. About a year ago Intel released the 45nm die shrink version of the Core 2 codenamed "Penryn", which also added the SSE 4.1 instruction set and a larger L2 cache. As luck would have it, a wave of Penryn-based Centrino 2 notebook computers hit the market this summer. The way that you can tell a Penryn processor from the older Core 2 is by the model number. Models in the 5000, 6000, and 7000 range are older Core 2. Models in the 8000 or 9000 range are Penryn. You want Penryn. Beware the glut of discounted desktop and notebook computers on store shelves right now bearing the 5000 series parts. They are older, less efficient models.
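
For anyone comparing a long list of store listings, the model-number rule of thumb can be written down in a few lines. This is only a minimal Python sketch of that rough rule, assuming model names such as "P8400" or "T5750" where the first four-digit number is the series; the function name core2_generation is made up for illustration and this is not an official Intel decoding scheme.

```python
import re

def core2_generation(model: str) -> str:
    """Classify a Core 2 model name using the rough series rule of thumb.

    Hypothetical helper: looks only at the first four-digit number in the name.
    """
    match = re.search(r"\d{4}", model)
    if match is None:
        return "unknown"
    series = int(match.group()) // 1000 * 1000
    if series in (5000, 6000, 7000):
        return "older 65nm Core 2"
    if series in (8000, 9000):
        return "45nm Penryn"
    return "unknown"

for name in ("P8400", "T5750", "Q9450", "E6600"):
    print(name, "->", core2_generation(name))
```

Treat the result as a first filter only and check the actual processor specs before buying.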

On my way to the Apple store to buy one of the new Macbooks, I stopped at Best Buy to check out what other brands offered. While most decent notebooks from Dell and Sony were in the $2000 and up range, I was pleasantly surprised to run across a model from Gateway that seemed to have all the specs of a Macbook Pro:

  • 17-inch 1920x1200 LCD screen,
  • 4GB of DDR3 memory,
  • built-in nVidia 9800 video chipset with its own 512MB of memory,
  • 200 GB SATA hard disk with a second hard disk bay,
  • preloaded with 64-bit Windows Vista SP1,
  • built-in DVD burner, 802.11n wi-fi, VGA and HDMI video output!

The clincher is that the Gateway, the 7811FX notebook, was priced at $1250, compared to what I knew would be about $3000 for a comparable Macbook Pro. How they pack all that technology in for that price and have profit left over to pay Microsoft the fee for Vista, I don't know, because every spec of this Gateway machine blows away any Dell, Sony, HP, or Apple notebook in this price range.

The Gateway notebook was pre-installed with 64-bit Windows Vista Service Pack 1 all ready to go, and I have been using it as a true desktop replacement for the past month or so. The notebook is based on the 2.26 GHz Core 2 P8400 Penryn processor, but pretty much keeps up with 2.66 GHz desktop parts as you will see in the benchmark shortly. It easily keeps up with and beats the AMD Phenom desktop processor.

The one huge drawback of this notebook is that it is heavy, a good 10 pounds. It is heavier than my old Dell D800 notebook and the dead Macbook, and certainly not something I can zip under my jacket on the motorcycle. I have the ASUS EEE and Acer Aspire for that.

This is one of two new Gateway computers that I have purchased in the last few weeks. Gateway of course is not the old South Dakota based "cow box" company from the 1990's. They are now part of Acer, the Taiwanese computer maker, and from what I've seen of Acer's Aspire One notebook earlier this summer, this 7811FX notebook, and the next machine that I am about to describe, I am very impressed by the latest Acer-Gateway offerings. I had written the old Gateway off years ago.

Except for the slightly slower CPU clock speed and the notebook hard disk, the Gateway 7811FX will toast most desktop machines at CPU and graphics operations:

Gateway 7811FX notebook speed rating in Windows Vista


Core i7 - Nehalem Arrives Early

Impressed by the mobile Penryn, I figured I should get cracking and build myself a quad-core Penryn desktop. I had been seeing the online prices dropping to about the $300 price point for the 2.8 GHz chips with a whopping 12 megabytes of cache. Ironically, the very day that I set out to the computer store to buy one of these Penryn chips, Monday November 17, is the day "Nehalem" came out. Nehalem is the much hyped "Core i7" chip that replaces the Core 2 (although from my measurements, it is essentially an enhanced Penryn). I was not expecting Nehalem until CES or Macworld time frame in January and certainly not to be in stock already. Best Buy had fresh Core i7 based Gateway FX desktops. I purchased the Gateway FX6800-01 desktop for the same $1250, which similarly to the FX notebook came pre-installed with 64-bit Vista, the latest ATI 4850 video card, and super fast hard disk. Not surprisingly this new Gateway is the first computer I have seen score all 5.9 in Vista's performance tests:

  Gateway FX6800 desktop speed rating in Windows Vista

What I wonder now is: does Vista use Olympic figure skating scores and treat 6.0 as some ideal that no machine can achieve, or will we soon see hardware that scores 6.0 and higher?

Nehalem / Core i7 has several feature enhancements over Penryn - it brings back the Pentium 4 feature of Hyper-Threading, giving Windows the illusion of having an 8-way machine (as you see in the Task Manager screen shot below), it adds the missing SSE4.2 instructions, and it comes standard with an L3 cache right from day one:

A more subtle change in Core i7 is the integration of the memory controller into the processor itself, something that AMD did a few years ago with the Opteron processor. As that did for AMD, this change results in some pretty stunning memory performance improvements. Core i7 has a vastly higher ceiling on memory throughput - my measurements show that with all 8 threads executing REP MOVSD memory copy operations it can achieve an aggregate throughput of over 250 GB/s, that's 250 gigabytes per second, whereas the original Core 2 Quad seems to peak at about 100 GB/s.

Using my CPU_TEST utility which I described in September, the most noticeable micro-architectural improvement of Core i7 over the previous Core 2 models is the much lower latency of unaligned spanning accesses. I will get into those numbers after I tell you about the third pleasantly surprising CPU release of late, the dual-core Atom.


Atom 2.0 - The Best $100 You Will Ever Spend

When I left Seattle on my road trip to Dallas this past August, I filled up the tank with gasoline costing me about $4.50 a gallon, ouch. Just this weekend I filled up for under $2 a gallon. That kind of price drop raises eyebrows and brings about talk of deflation. I mean, nothing gets that cheap that fast, delivering more than twice the "bang for the buck" in four months. Yet this kind of increasing value, either in the form of faster product, lower costing product, or both, is what the hardware industry has been giving us for over 30 years. A prime example of this is the dual-core Atom processor, or as I am calling it, the Atom 2.0.

While shopping for the Nehalem, I spotted an Intel Atom motherboard for $98, apparently with Atom processor included. Cool I thought, I will build a desktop Atom system after I finish playing with the Nehalem. The Atom made its appearance this summer in "netbook" machines such as the ASUS EEE and the Acer Aspire One, and in my opinion is one of the most technically amazing processors out there.

When I unpacked the box, I realized that this was one of those embedded CPU/motherboard combinations, nothing to assemble, it was all ready to drop into a case. So I took an old case from a dead Pentium 4, screwed in the Atom board (tiny little thing!), added a 2GB DDR2 memory DIMM, and booted up.

Hmmmm, the BIOS claimed this chip was EM64T compatible, naw, can't be. And how can there be two L2 caches for a hyper-threaded processor? Was this a 64-bit dual-core processor?

Apparently I did not hear the news, because the desktop Atom is in fact a dual-core hyper-threaded 64-bit CPU!!!!!! I repeat... !!!!!! To make sure, I booted a 64-bit Windows Vista setup DVD and started setup. Much to my surprise, 64-bit Windows Vista installed just fine. I couldn't believe this, 100 dollars for a 64-bit upgrade. Just to double check, I ran the CPUZ utility and compared the output of my Acer Aspire notebook from August (on the left running on Windows XP) with the desktop Atom board (on the right running on 64-bit Windows Vista):

I am a little confused by this because the stepping information would indicate that they are the exact same processor, implying that I should be able to boot 64-bit Windows on my Acer Aspire One! (???) Regardless, the dual-core Atom board is the least expensive 64-bit multi-core upgrade ever. I have since added a WinTV tuner card to the PCI slot, put in a fresh SATA hard disk, and for a grand total of a little over 300 dollars put together a Windows Media Center machine which has been running very quietly ever since. New record for least expensive desktop 64-bit computer. New record for Windows Vista power consumption too I think. I measured the total power consumption of this machine to be, are you ready... 55 watts. By comparison, the dual-processor hyper-threaded Dell 650 workstation which represented the top-of-the-line desktop PC about 5 or 6 years ago consumes over 300 watts.

There are limitations of the Atom board that will not appeal to die-hard gamers, namely the lack of an AGP or PCIe slot to install a fast graphics card, and the lack of more memory sockets. As a home server, Media Center machine, email machine, or software development machine, the Atom is more than suitable. A six-fold power savings and ten-fold cost reduction over comparable desktop PCs of just 4 or 5 years ago is certainly a very decent improvement in "bang for the buck". This screen shot shows the Windows Vista performance rating of the Atom system:

Built-in VGA video is the weakest performer of the Atom motherboard, but sufficient to enable "Aero" mode in Vista


Gemulator 9.0 Released!

One of my stated goals for Macworld was to finish Gemulator 9.0 and release it as open source. As of last night, it is finished and posted and available for download in binary and source form from this web site. For the past few weeks I've been furiously cleaning up code and deleting old obsolete code to prepare for this release. The previous release of Gemulator which I made way back, you guessed it, almost 8 years ago, predated Windows XP, yikes! It was optimized for Pentium III and hard coded for Windows structured exception handling, using exceptions to handle guest memory-mapped I/O emulation. Fine for the Pentium III, but a ridiculously stupid approach for Pentium 4 and multi-core processors where exception latencies severely hurt overall performance. Much of my time over the past 18 months has been spent ripping out those Windows-centric techniques and replacing them with the software-TLB and software-pipelined dispatch approaches I've described in this blog and in the ISCA workshop paper.

At the same time, I've been enhancing the debugging capabilities of Gemulator to have it function as a 68000/68040 training tool. When I visited my old friend Ignac Kolenko at Conestoga College and lectured to his class this summer, I was challenged to produce a 68040 emulator which can replace the hardware "Tutor" boards that the class had been using in the past. And so with a lot of input from Ignac, I've made some changes to Gemulator 9.0 to facilitate easier debugging, tracing, and loading of custom ROM images.

First of all, download the source code to Gemulator 9.0 here. I am a command line kind of guy, so you build the product from the Windows command prompt. You will need the Visual Studio 98 (yes, 98 as in VC 6.0) tools, as I am not a fan of the latest Visual Studio 2008. You will also need MASM 6.15 in your path, which I believe is included in one of the Visual Studio 98 service packs.

When you unpack GEMCE900.ZIP, you will find a root directory with some make files and build scripts, a BUILD directory where the product is built, and a SRC directory. If you are in a VC6 or VC7 command prompt window, type MKALL.BAT to build the product. This will build the 6502 interpreter, the 68000/68040 interpreter, the Atari 800 virtual machine, and the Atari ST virtual machine. The final EXE will be found in the BUILD\SHIP directory as ATARIST.EXE.

If you are unfortunate enough to be using Visual Studio 2005 (a.k.a. VC8) or Visual Studio 2008 (a.k.a. VC9), you will need to run the MKALLASM.BAT script first to build some of the libraries, then open the ATARIST.VCPROJ file in your IDE to complete the build.

If you have no Microsoft build tools, fear not. Earlier this year Microsoft released their open source Singularity operating system (http://research.microsoft.com/os/Singularity/). When you extract the project, included in the SINGULARITY\BASE\BUILD directory are the complete VC8 build tools, including MASM 8, the 32-bit C/C++ compiler, and the old 16-bit DOS C/C++ compiler!

As a bonus for you PC Xformer fans, if you go into the SRC\ATARI8.VM directory of the Gemulator sources, and type NMAKE from a 16-bit build prompt, it will build XF.EXE, the 16-bit Atari 800 emulator for MS-DOS.

Some tips for using Gemulator 9.0:

  • Gemulator 9.0 is compatible with Windows 98 or higher, but really you want to run it on Windows XP or later. Trust me, if only so that you can resize your debugger window to as many columns as possible.
  • If you plan to do debugging, go into the Advanced menu and select Debug Mode.
  • Debug Mode launches a debugger window, and breaks into the debugger each time the guest is rebooted. This allows you to set breakpoints before booting a guest OS.
  • You can launch Gemulator 9.0 from the Windows shell by clicking on the ATARIST.EXE icon, or launch it from a command prompt. In the command prompt case, the debugger will use your existing command prompt.
  • For best results, when launching from the command prompt, use this command line to launch Gemulator: START /WAIT ATARIST.EXE
  • Gemulator 9.0 is designed for debugging and so uses a software-pipelined interpreter dispatch loop instead of a more efficient branch predicting dispatch loop. Worst case you will get slightly slower performance than Gemulator 8.0x, but the code is much more portable than the previous Gemulator 8.0x code (trust me, be glad you never had to see it).
  • The code is 32-bit x86. It is left to the reader to make the changes to build for 64-bit Windows, I ran out of time. :-)

Since Gemulator 9.0 is now public and open source, I will use it as the basis of processor benchmarks in place of the older Gemulator and SoftMac releases which do not run correctly on Windows Vista.


Measuring Atom Performance

A lot of reviewers have beat up the Atom for being too slow, and based on my testing I feel that it is unfair to make such a blanket statement. My testing shows that the Atom, particularly this most recent dual-core 64-bit hyper-threaded Atom, is more than adequate for running the latest 64-bit Windows Vista, for running Visual Studio development tools, and for serving as an inexpensive Windows Media Center TV tuner and home server. When factoring in price and power consumption, it delivers more "bang for the buck" than other processors. If it can reduce the power consumption and noise of an existing desktop machine from over 300 watts to just 55 watts, that is worth considering spending the relatively small 100 dollars on.

But just how "slow" is the Atom? Compared to the latest 8-thread Core i7, it is about five times slower at raw single-threaded throughput. This is due to the fact that Atom is an in-order processor that does not do the fancy instruction re-ordering that Pentium and Core architectures do. Nor does it run at the higher clock speeds of other architectures. The two Atom systems that I have now purchased both run at 1.60 GHz, and interestingly no matter what setting I put Windows Vista at (Balanced, High Performance or Power Saver) it stays at 1.60 GHz. Which I guess makes sense for a core that consumes two watts of power! The combination of running at almost half the clock speed of other processors, the smaller on-chip cache, and the restrictions of an in-order pipeline result in the 5x speed difference. Real-world performance will not be quite as bad when factoring in memory and disk bottlenecks.

To put that in perspective, at identical clock speeds the Pentium 4 architecture is 2x to 3x slower than the Core 2 architecture. Factor in clock speed, and it turns out that not that long ago, many people were spending thousands of dollars on "top of the line" 1.5 GHz to 2.0 GHz Pentium 4 systems which were really no faster than the hundred dollar Atom is today. I tested this with two of my own legacy machines: a dual-processor Pentium III system running Windows XP and a dual-processor Pentium 4 Xeon system (my Dell 650 workstation) running Windows Vista. Both of these systems are equipped with 1.5 GB of RAM, and in the case of the Xeon system, fast SCSI hard disks. I performed a simple test yesterday - build the Gemulator 9.0 source code. I repeated the build several times on each system to allow for disk caching to reduce disk I/O wait time and keep it a mostly CPU-bound test, and recorded the build time once it had stabilized after a few builds. The results for the three systems, plus a fourth system which contains a more recent Pentium "D" (dual-core 64-bit Pentium 4) under-clocked to 1.5 GHz, are shown here:
 
Desktop computer specs                                          Gemulator 9 build time
(clock speed, CPU, RAM, disk, OS)                               (seconds, lower is better)
1600 MHz dual-core Atom, 2GB RAM, SATA disk, Vista              70.6
2000 MHz dual-processor P4 Xeon, 1.5 GB RAM, SCSI disk, Vista   65.9
1000 MHz dual-processor Pentium III, 1.5 GB RAM, IDE disk, XP   85.6
1500 MHz dual-core Pentium D, 2.0 GB RAM, SATA disk, Vista      93.3

Expanding this to include the actual execution speed of Gemulator, I have arranged the screen shots below in the same order as the four systems were just listed. The first screen shot in each row is the CPUZ output showing the low-level specs of the processor being tested, and the second screen shot shows that processor's Gemulator 9 benchmark result running the Quick Index Atari ST benchmark program. Larger percentages mean faster speed.
 
1600 MHz dual-core Atom, 2GB RAM, SATA disk, Vista
2000 MHz dual-processor P4 Xeon, 1.5 GB RAM, SCSI disk, Vista
1000 MHz dual-processor Pentium III, 1.5 GB RAM, IDE disk, XP
1500 MHz dual-core Pentium D, 2.0 GB RAM, SATA disk, Vista


The Atom outperforms the Pentium III on all counts, both at the Visual Studio build time and the execution speed of the Atari ST emulator. The faster clock speed of the Atom and larger on-chip caches more than make up for the Pentium III's more clever out-of-order pipeline. I think it is safe to peg the performance of the Atom at about 20% above a 1 GHz Pentium III, or thus about the performance level of the 1.2 GHz Pentium III. As somebody who obviously still owns and uses Pentium III machines, that's a performance level that I'm satisfied with.

Factoring for clock speed, the Atom is almost on par with the 2.0 GHz Pentium 4 Xeon, which is interesting. With almost identical bus and clock speeds, and similar cache sizes, the efficiency of the in-order Atom core appears to be about the same as that of the out-of-order Pentium 4 core! However, throw in the much larger L2 cache of the Pentium D and the Atom is slightly slower. Therefore we can now bound the performance of the Atom at somewhere between that of a 1.2 GHz Pentium III and a 1.5 GHz Pentium 4, which in Core 2 numbers is somewhere around a 600 MHz Core 2. And that is how one arrives at the 5x speed difference between an Atom and the latest Core 2 chips.
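
The arithmetic behind those bounds can be checked in a few lines. This is a back-of-the-envelope Python sketch, not a new measurement: the build times come from the build-time table earlier in this section, while the 2.5x Pentium 4 to Core 2 conversion factor (the midpoint of the "2x to 3x" range) and the 3.0 GHz clock for a "latest" Core 2 chip are assumptions chosen for illustration.

```python
# Build times in seconds, taken from the Gemulator 9 build-time table.
ATOM_BUILD_S = 70.6        # 1600 MHz dual-core Atom
PIII_BUILD_S = 85.6        # 1000 MHz dual-processor Pentium III

# Lower bound: how much faster the Atom builds than the 1 GHz Pentium III.
speedup_vs_piii = PIII_BUILD_S / ATOM_BUILD_S

# Upper bound: a 1.5 GHz Pentium 4, converted to Core 2 terms using the
# "2x to 3x slower at identical clock speeds" figure.
P4_BOUND_GHZ = 1.5
P4_TO_CORE2_FACTOR = 2.5   # assumption: midpoint of the 2x-3x range
core2_equivalent_ghz = P4_BOUND_GHZ / P4_TO_CORE2_FACTOR

# Assumption: a "latest" Core 2 part running at about 3.0 GHz.
LATEST_CORE2_GHZ = 3.0
slowdown = LATEST_CORE2_GHZ / core2_equivalent_ghz

print(f"Atom vs 1 GHz Pentium III: {speedup_vs_piii:.2f}x")
print(f"Core 2 equivalent clock:   {core2_equivalent_ghz * 1000:.0f} MHz")
print(f"Latest Core 2 vs Atom:     {slowdown:.1f}x")
```

Changing the conversion factor anywhere within the 2x to 3x range moves the Core 2 equivalent between 500 MHz and 750 MHz, so the 5x figure is a rough estimate rather than a precise ratio.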


Core i7 vs. Centrino 2 vs. Core 2

Now to analyze the improvements in Core i7 over previous processors. Since my Mac Pro runs at a fixed 2.66 GHz and my new Gateway i7 machine is also fixed at 2.66 GHz, I took my 2.4 GHz AMD Phenom machine (which I described back in June) and my other 2.4 GHz quad Core 2 machine and over-clocked them both to 2.66 GHz by tweaking the bus speeds up by 11%. I have actually been running the Core 2 over-clocked to over 3.3 GHz with no problems, so 2.66 GHz was a breeze. As I did with the Atom, I compared the Gemulator 9.0 build times and run times of these systems, and for reference also threw in my Core Duo iMac and the Gateway Penryn notebook. The results of the build times are shown here, with the addition of a column to show total clock cycles:
 
Desktop computer specs                        Gemulator 9 build time      Total clock cycles
(clock speed, CPU, RAM, disk, OS)             (seconds, lower is better)  (billions)
2666 MHz Core i7, 3GB, SATA, Vista            20.0                        53.32
2666 MHz Core 2 Quad, 8GB, SATA, Vista        21.8                        58.12
2666 MHz Core 2 (Mac Pro), 4GB, SATA, Vista   22.1                        58.92
2260 MHz Centrino 2 Penryn, 4GB, SATA, Vista  23.7                        53.56
2666 MHz AMD Phenom, 8GB, SATA, Vista         24.8                        66.12
2000 MHz Core Duo, 2GB, SATA, XP              28.3                        56.6

Despite larger and more than ample RAM in the other systems to cache the whole build process, Core i7 and Penryn both do about 9% better in total clock cycles than the original Core 2 or Core 2 Quad, and 24% better than AMD Phenom. In fact the Gateway notebook does amazingly well considering its slower clock speed and slower hard disk. I similarly measured at most about a 10% speedup of Core i7 over original Core 2 at 2.66 GHz on some other tests, while other tests such as Gemulator 9 were practically identical. I won't bore you with the screen shots because the results are too similar. Core 2, Centrino 2, and Core i7 appear to have very very similar architectures. Core i7 is not quite as drastic a departure from Core 2 as the advertising hype would have you believe.
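
The total clock cycles column is simply the clock speed multiplied by the build time, which is what lets machines running at different clock speeds be compared. Here is a small Python sketch that recomputes that column and the percentages quoted above from the table; the variable and function names are my own for illustration.

```python
# (system, clock in MHz, build time in seconds), copied from the table.
RESULTS = [
    ("Core i7", 2666, 20.0),
    ("Core 2 Quad", 2666, 21.8),
    ("Core 2 (Mac Pro)", 2666, 22.1),
    ("Centrino 2 Penryn", 2260, 23.7),
    ("AMD Phenom", 2666, 24.8),
    ("Core Duo", 2000, 28.3),
]

def total_cycles_billions(mhz: int, seconds: float) -> float:
    """Clock cycles consumed by the build, in billions."""
    return mhz * 1e6 * seconds / 1e9

cycles = {name: total_cycles_billions(mhz, s) for name, mhz, s in RESULTS}
baseline = cycles["Core i7"]

for name, value in cycles.items():
    extra = (value / baseline - 1) * 100
    print(f"{name:18s} {value:6.2f} billion cycles ({extra:+5.1f}% vs Core i7)")
```

Measured this way the 2.26 GHz Penryn notebook lands within half a percent of the Core i7, even though its wall clock build time is almost four seconds longer.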

To probe this I used my CPU_TEST utility. The following links are to the output files of running CPU_TEST /MHz 2xxx /ALL (where 2xxx is the specific clock speed of the system such as 2666) on the various systems:

atom1600cpu.txt (Atom dual-core)

core2penryn2260cpu.txt (Core 2 Penryn "Centrino 2" dual-core)

corei72666cpu.txt (Core i7 quad-core)

macpro2666cpu.txt (Core 2 Xeon quad-core in Mac Pro)

p4d3000cpu.txt (Pentium D dual-core under-clocked to 1500 MHz)

All five systems were tested while running 64-bit Windows Vista SP1, so the kernel results are comparable. You can "diff" these files against each other to see how one architecture fares against another at specific micro-benchmarks. I will walk you through some lines of output which show significant differences:

  • test 1 memory sr - this tests the address generation interlock delay, i.e. the penalty between loading an address into a register and dereferencing the memory location. This penalty dropped from 4 cycles in Pentium 4 to 3 cycles in Core 2, but is now back up to 4 cycles in Penryn and Core i7!
  • test 6 and/or mem - Atom is the only architecture that does not optimize AND mem,0 and OR mem,FF to plain store operations, a common compiler space optimization trick.
  • test 7 divide - Penryn and Core i7 both have much faster integer dividers than the other chips.
  • test imul r1r1 - Pentium 4 has a very slow integer multiply.
  • test 3 os pg flt - Core i7 has the lowest latency for operating system ring transitions such as page fault exceptions, Atom the worst.
  • test 3 xchg/mvsx - the test for "partial register stalls", AMD Phenom does best as AMD architectures do not have partial register dependencies, Core processors do a little worse, Pentium 4 and Atom do worst.
  • test 17 read rtc - Penryn and Core i7 both have much faster RDTSC instructions (for reading clock cycle counts in user mode programs)
  • test 23/24 shld - Pentium 4's awful bit shifting performance, Atom also poor.
  • test 29 call mis - penalty of a mispredicted indirect call. Pentium 4 is awful. Penryn and Core i7 both 1 clock cycle faster than Core 2. AMD fastest.
  • test 42 w32 r8 - Core i7 is the only processor with no store-forwarding penalty for "write long read byte" operation.
  • test 43 w8 r32 - Core i7 has higher penalty for "write byte read long" store-forwarding, but not as bad as Pentium 4
  • test 83 lockxchg - cost of an interlocked "compare exchange" instruction used for thread-safe locking operations. Pentium 4 close to 100 cycles, Penryn and Core i7 better than Core 2, AMD better, Atom best!
  • test A15 span 32 - all architectures have no penalty for unaligned accesses that cross a 32-byte boundary (but not a 64-byte boundary).
  • test A15 span 64 - Core i7 has much lower penalty than other Intel architectures for cache-line spanning access. AMD no cost.
  • test A15 span 4K - Penryn, Core 2, and Pentium 4 have an additional penalty for spanning a 4K page boundary, while Atom, Core i7, and AMD have no such penalty.
  • test A18 CLFLUSH - Penryn and Core i7 have lowest cost for cache line flush.
  • test A19c FXSAVE - Penryn and Core i7 have lower cost for FXSAVE/FXRSTOR instructions (used by thread context switching) than Core 2, AMD lowest.
  • test A19d STMXCS - Core i7 has much lower cost for writes to SSE control register (which is used to adjust rounding modes in floating point operations and in context switching).
  • SSE4 POPCNT - only Core i7 and AMD Phenom support the POPCNT (population count) instruction in SSE4.
  • SSE4 CRC32 - only Core i7 supports the SSE4 CRC32 instruction.
  • SSSE3 PSHUFB - only Atom and the Core architectures support the packed byte shuffle instruction.
  • SSE2 MOVD - a surprising one, AMD Phenom and Pentium 4 have a high cost for the basic SSE2 move operation!
  • Atom MOVBE - Atom is still the only processor that supports the "move big endian" instruction, Core i7 does not.
  • dispatch n nops - similar to call misprediction test, measures the cost of a misprediction and the usable size of the pipeline bubble. Core i7 has lowest cost.

There are other subtle differences, but those are the major ones.
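
To make the three "span" cases above concrete, here is a minimal Python sketch of how one could decide which boundary a given memory access straddles. It is only an illustration of the address arithmetic, not code from CPU_TEST; the helper names are hypothetical, and the 64-byte cache line and 4096-byte page sizes are the values assumed for the processors discussed here.

```python
CACHE_LINE = 64   # bytes; cache line size assumed for these processors
PAGE = 4096       # bytes; the 4K page size

def crosses(addr: int, size: int, boundary: int) -> bool:
    """True if the access [addr, addr + size) straddles a multiple of boundary."""
    return addr // boundary != (addr + size - 1) // boundary

def classify_access(addr: int, size: int) -> str:
    """Name the largest boundary that an access spans (hypothetical helper)."""
    if crosses(addr, size, PAGE):
        return "span 4K"
    if crosses(addr, size, CACHE_LINE):
        return "span 64"
    if crosses(addr, size, 32):
        return "span 32"
    return "no span"

print(classify_access(0x1000, 4))   # aligned 4-byte access
print(classify_access(0x101E, 4))   # straddles a 32-byte boundary only
print(classify_access(0x103E, 4))   # straddles a 64-byte cache line
print(classify_access(0x1FFE, 4))   # straddles a 4K page boundary
```

An emulator or allocator that keeps guest data aligned so that accesses never reach the "span 64" or "span 4K" cases avoids the penalties listed above on the older Intel architectures.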

One other thing I measured by playing with thread affinity in Windows Task Manager on different multi-threaded tests is that the hyper-threading in Core i7 appears to be much more efficient than the old hyper-threading in the Pentium 4, although still not near the ideal of being equivalent to true extra cores.


Conclusions and Recommendations:

  • As you can see, last year's Core 2 Penryn and the new Core i7 share similar improvements over Core 2. Core i7 is essentially a derivative of the Penryn, which in turn is a derivative of the Core 2. Not at all the drastic change as from Pentium 4 to Core 2.
  • In general, optimizing performance for Core 2 will lead to speedups on Penryn and Core i7 as well since most micro-benchmarks are identical on the three processors.
  • The one "regression" in Penryn and Core i7 is the added clock cycle for address generation stalls. Code needs to be careful to space out loads of addresses from the dereferencing of those addresses.
  • The Pentium 4, the Atom, and the AMD Phenom are all very clearly very different architectures from the Core series or each other. Pentium 4 is clearly the least efficient.
  • Atom is generally slower than Core series, but has some nice surprises, such as the very low cost of interlocked atomic memory operations, useful for multi-threaded code.
  • Core i7 eliminates significant misaligned memory access and store-forwarding stalls from Core 2 and more closely behaves like AMD architectures in that respect.
  • Core i7 has very fast memory copying capability, almost triple that of the Core 2. The REP MOVSD instruction appears to be better optimized for aligned 16-byte copies.
  • Core i7's memory and context switching optimizations mean better performance when running multiple threads.
  • Aggregate performance on Core i7 over Core 2 or Penryn is anywhere from about parity to about a 10% speedup.
  • I do recommend the $100 dual-core Atom desktop board to anyone looking to upgrade an older Pentium III or Pentium 4 system with an eye toward also saving power and reducing noise.
  • I highly recommend the Gateway 7811FX notebook computer. It is very inexpensive for what it provides, easily half the price or less than a comparable 17-inch Macbook Pro.
  • For gaming systems and high-end desktop workstations, I am lukewarm on Nehalem / Core i7. Hyper-threading and SSE4.2 are of interest to me to test and make use of as a software developer, but to the average person, the Core i7 will behave far too similarly to the existing Core 2 Penryn and Centrino 2 systems out there.
  • If shopping for Core 2, I suggest buying a Penryn based system, that being the Core 2 8000 or 9000 series processors. I would avoid any of the Core 2 5000, 6000, or 7000 series processors, as these are the original 65nm Core 2 chips that are now being liquidated.

Inside Intel

A little over 18 months ago I walked away from a very comfortable and well paying day job at Microsoft where I'd worked as a developer since the late 1980's. In May 2007 I chose to sit at home unemployed and dive right back into my old hobby of Atari and Mac emulators which I'd previously abandoned. Crazy, huh? Well, as I have blogged about in 26 previous postings to this blog, the longer I worked at Microsoft and the deeper I got into analyzing and understanding the workings of Microsoft's various products, the more I came to realize that problems which plague software today have root causes in the hardware itself. As I've said before, in my opinion the Atari ST and OS/2 computers I was using 20 years ago were more stable than my Windows Vista PCs today, and certainly far more efficient given the hardware of the time.

One of the main reasons I've given for this degradation (besides the obvious one - too many dime-a-dozen C++ programmers polluting this world with garbage code) is that software developers have been too quick to jump on any "new and improved" hardware technology that AMD and Intel and IBM throw at them - in-order, out-of-order, MMX, SSE, Altivec, 64-bit, hardware virtualization, many-core, hyper-threading, etc. the list goes on. In some cases the "new" features did not add performance over the past hardware, and when in the hands of less than proficient software developers, just complicated and destabilized matters. I have spent the past 18 months digging into that premise, using both my own Gemulator software and the open source Bochs emulator to put ideas into practice, debunking myths about virtualization and about x86 processors, and having fun re-learning everything I thought I knew about computers.

I have said previously that I would throw away a lot of the "advances" from the past 20 years and mostly go back to where CPU and OS designs were 20 years ago. Hardware engineers and software developers have not been talking to each other as much as they should have over the past 20 years. Software guys wait for new hardware, then they adapt their software to make use of that hardware. But what drives the hardware designs? As was described in the book Pentium Chronicles, in the old days the hardware engineers just went by gut feel! Eventually they started gathering program execution traces and tuning the hardware for those traces. The trouble there is that it takes many years for software to adapt to new hardware, and then many more years to feed those new traces back into the hardware. Microsoft's Visual Studio compiler and the GCC compiler finally know how to emit decent code for Pentium 4, years after it matters! The feedback loop of hardware feeding software feeding hardware is on the order of 5 to 10 years, and that is far too long.

What masked this issue for a long time was the steady and reliable increase in CPU clock speeds over the past 30 years. If a new CPU architecture wasn't as efficient as the one it replaced, it was usually hidden by the sheer clock speed increase of that new architecture. The Pentium 4 was actually designed to be less efficient (in terms of instruction throughput per clock cycle) because it relied on there being much faster clock speeds to run at, well beyond 3 GHz. Oops. The clock speed plateau of the past 7 years killed the Pentium 4 and made efficiency important again. Two processors (say, the Pentium 4 and the Core 2) running at the same clock speed using the same memory, same hard disks, even the same motherboards, can give a two-fold or three-fold difference in performance. Physical constraints such as power consumption, heat, and die size put a limit on clock speed and thus make it more important than ever for software and hardware people to work together.

After months of informally talking to recruiters and engineers at both AMD and Intel, bitching at them about how I think they should fix these issues, my months of research using both Bochs and Gemulator and other virtual machines to gather the data to prove my arguments, and blogging about the issues of course, I was thrilled to receive a full time job offer from Intel about two weeks ago. I accepted a few days ago, and in just a few hours I start work as an engineer at Intel. Wish me luck!

That concludes my blogging for the year. I have a new job and a much anticipated Metallica concert to go to tonight. I wish everyone a Merry Christmas and happy holidays and hope to see you come out to Macworld next month.

