NO EXECUTE!
(c) 2008 by Darek Mihocka, founder, Emulators.com.
December 1 2008
Centrino 2 and Gateway's Killer Penryn Notebook
Recent Intel processor releases are of relevance to software developers, to virtual machine users, and to anyone computer shopping this economically challenged Christmas. The processors which I will discuss today are the mobile Core 2 Penryn branded as Intel "Centrino 2", the much hyped and freshly released Intel Core i7 formerly known by the codename "Nehalem", and the biggest little surprise of the year, the dual-core Intel Atom. And for you Atari 800 and Atari ST fans out there: Gemulator 9.0 is completed and posted as open source!
In late September I was invited to give a presentation at Microsoft to talk about performance optimizations. I brought my Acer Aspire and my Macbook, and proceeded to recycle some of my PowerPoint presentations from the summer. About an hour into my presentation I absent-mindedly knocked my cup of coffee over and, horror of horrors, watched several ounces of coffee ooze into my Macbook's keyboard. About two seconds later the Macbook shut off and died. That night I tried to revive the poor dead Macbook. I poured distilled water through it, baked it in the oven to dry it out (a little too long, oops), and still no dice. RIP Macbook 2006-2008.
My dead Macbook, being a 2006 model, was based on the original Core 2 processor based on 65nm technology which supported the SSE3 and SSSE3 instruction set extensions. About a year ago Intel released the 45nm die shrink version of the Core 2 codenamed "Penryn", which also added the SSE 4.1 instruction set and larger L2 cache. As luck would have it, a wave of Penryn-based Centrino 2 notebook computers hit the market this summer. The way that you can tell a Penryn processor from the older Core 2 is by the model number. Models in the 5000, 6000, 7000 range are older Core 2. Models in the 8000 or 9000 range are Penryn. You want Penryn. Beware the glut of discounted desktop and notebook computers on store shelves right now bearing the 5000 series parts. They are older less efficient models.
On my way to the Apple store to buy one of the new Macbooks, I stopped at Best Buy to check out what other brands offered. While most decent notebooks from Dell and Sony were in the $2000 and up range, I was pleasantly surprised to run across a model from Gateway that seemed to have all the specs of a Macbook Pro:
The clincher is that the Gateway, the 7811FX notebook, was priced at $1250, compared to what I knew would be about $3000 for a comparable Macbook Pro. How they pack all that technology in for that price and have profit left over to pay Microsoft the fee for Vista, I don't know, because every spec of this Gateway machine blows away any Dell, Sony, HP, or Apple notebook in this price range.
The Gateway notebook was pre-installed with 64-bit Windows Vista Service Pack 1 all ready to go, and I have been using it as a true desktop replacement for the past month or so. The notebook is based on the 2.26 GHz Core 2 P8400 Penryn processor, but pretty much keeps up with 2.66 GHz desktop parts as you will see in the benchmark shortly. It easily keeps up with and beats the AMD Phenom desktop processor.
The one huge drawback of this notebook is that it is heavy, a good 10 pounds. It is heavier than my old Dell D800 notebook and the dead Macbook, and certainly not something I can zip under my jacket on the motorcycle. I have the ASUS EEE and Acer Aspire for that.
This is one of two new Gateway computers that I have purchased in the last few weeks. Gateway of course is not the old South Dakota based "cow box" company from the 1990's. They are a Chinese company now, part of Acer, and from what I've seen of Acer's Aspire One notebook earlier this summer, this 7811FX notebook, and the next machine that I am about to describe, I am very impressed by the latest Acer-Gateway offerings. I had written the old Gateway off years ago.
Except for the slightly slower CPU clock speed and the notebook hard disk, the Gateway 7811FX will toast most desktop machines at CPU and graphics operations:
Gateway 7811FX notebook speed rating in Windows Vista
Core i7 - Nehalem Arrives Early
Impressed by the mobile Penryn, I figured I should get cracking and build myself a quad-core Penryn desktop. I had been seeing the online prices dropping to about the $300 price point for the 2.8 GHz chips with a whopping 12 megabytes of cache. Ironically, the very day that I set out to the computer store to buy one of these Penryn chips, Monday November 17, is the day "Nehalem" came out. Nehalem is the much hyped "Core i7" chip that replaces the Core 2 (although from my measurements, it is essentially an enhanced Penryn). I was not expecting Nehalem until the CES or Macworld time frame in January and certainly not to be in stock already. Best Buy had fresh Core i7 based Gateway FX desktops. I purchased the Gateway FX6800-01 desktop for the same $1250, which similarly to the FX notebook came pre-installed with 64-bit Vista, the latest ATI 4850 video card, and a super fast hard disk. Not surprisingly this new Gateway is the first computer I have seen score all 5.9 in Vista's performance tests:
Gateway FX6800 desktop speed rating in Windows Vista
What I wonder now is, does Vista use Olympic figure skating scores and treat 6.0 as some ideal that no machine can achieve, or will we soon see hardware that scores 6.0 and higher?
Nehalem / Core i7 has several feature enhancements over Penryn: it brings back the Pentium 4 feature of Hyper-Threading, giving Windows the illusion of having an 8-way machine (as you see in the Task Manager screen shot below), it adds the missing SSE4.2 instructions, and it comes standard with an L3 cache right from day one:
A more subtle change in Core i7 is the integration of the memory controller into the processor itself, something that AMD did a few years ago with the Opteron processor. As that did for AMD, this change results in some pretty stunning memory performance improvements. Core i7 has a vastly higher ceiling on memory throughput: my measurements show that all 8 threads executing REP MOVSD memory copy operations can achieve an aggregate throughput of over 250 GB/s (that's 250 gigabytes per second), whereas the original Core 2 Quad seems to peak out at about 100 GB/s.
Using my CPU_TEST utility which I described in September, the most noticeable micro-architectural improvement of Core i7 over the previous Core 2 models is the much lower latency of unaligned spanning accesses. I will get into those numbers after I tell you about the third pleasantly surprising CPU release of late, the dual-core Atom.
Atom 2.0 - The Best $100 You Will Ever Spend
When I left Seattle on my road trip to Dallas this past August, I filled up the tank with gasoline costing me about $4.50 a gallon, ouch. Just this weekend I filled up for under $2 a gallon. That kind of price drop raises eyebrows and brings about talk of deflation. I mean, nothing gets that cheap that fast, delivering more than twice the "bang for the buck" in four months. Yet this kind of increasing value, either in the form of faster product, lower costing product, or both, is what the hardware industry has been giving us for over 30 years. A prime example of this is the dual-core Atom processor, or as I am calling it, the Atom 2.0.
While shopping for the Nehalem, I spotted an Intel Atom motherboard for $98, apparently with Atom processor included. Cool I thought, I will build a desktop Atom system after I finish playing with the Nehalem. The Atom made its appearance this summer in "netbook" machines such as the ASUS EEE and the Acer Aspire One, and in my opinion is one of the most technically amazing processors out there.
When I unpacked the box, I realized that this was one of those embedded CPU/motherboard combinations, nothing to assemble, it was all ready to drop into a case. So I took an old case from a dead Pentium 4, screwed in the Atom board (tiny little thing!), added a 2GB DDR2 memory DIMM, and booted up.
Hmmmm, the BIOS claimed this chip was EM64T compatible, naw, can't be. And how can there be two L2 caches for a hyper-threaded processor? Was this a 64-bit dual-core processor?
Apparently I did not hear the news, because the desktop Atom is in fact a dual-core hyper-threaded 64-bit CPU!!!!!! I repeat... !!!!!! To make sure, I booted a 64-bit Windows Vista setup DVD and started setup. Much to my surprise, 64-bit Windows Vista installed just fine. I couldn't believe this, 100 dollars for a 64-bit upgrade. Just to double check, I ran the CPUZ utility and compared the output of my Acer Aspire notebook from August (on the left running on Windows XP) with the desktop Atom board (on the right running on 64-bit Windows Vista):
I am a little confused by this because the stepping information would indicate that they are the exact same processor, implying that I should be able to boot 64-bit Windows on my Acer Aspire One! (???) Regardless, the dual-core Atom board is the least expensive 64-bit multi-core upgrade ever. I have since added a WinTV tuner card to the PCI slot, put in a fresh SATA hard disk, and for a grand total of a little over 300 dollars put together a Windows Media Center machine which has been running very quietly ever since. New record for least expensive desktop 64-bit computer. New record for Windows Vista power consumption too I think. I measured the total power consumption of this machine to be, are you ready... 55 watts. By comparison, the dual-processor hyper-threaded Dell 650 workstation which represented the top-of-the-line desktop PC about 5 or 6 years ago consumes over 300 watts.
There are limitations of the Atom board that will not appeal to die-hard gamers, namely the lack of an AGP or PCIe slot to install a fast graphics card, and the lack of more memory sockets. As a home server, Media Center machine, email machine, or software development machine, the Atom is more than suitable. A six-fold power savings and ten-fold cost reduction over comparable desktop PCs of just 4 or 5 years ago is certainly a very decent improvement in "bang for the buck". This screen shot shows the Windows Vista performance rating of the Atom system:
Built-in VGA video is the weakest performer of the Atom motherboard, but sufficient to enable "Aero" mode in Vista
Gemulator 9.0 Released!
One of my stated goals for Macworld was to finish Gemulator 9.0 and release it as open source. As of last night, it is finished and posted and available for download in binary and source form from this web site. For the past few weeks I've been furiously cleaning up code and deleting old obsolete code to prepare for this release. The previous release of Gemulator which I made way back, you guessed it, almost 8 years ago, predated Windows XP, yikes! It was optimized for Pentium III and hard coded for Windows structured exception handling, using exceptions to handle guest memory-mapped I/O emulation. Fine for the Pentium III, but a ridiculously stupid approach for Pentium 4 and multi-core processors where exception latencies severely hurt overall performance. Much of my time over the past 18 months has been spent ripping out those Windows-centric techniques and replacing them with the software-TLB and software-pipelined dispatch approaches I've described in this blog and in the ISCA workshop paper.
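To make the contrast with exception-based I/O trapping concrete, here is a minimal sketch of a direct-mapped software TLB. It is not code from the Gemulator sources: the class name, the table size, the 4 KB page size, and the MMIO_BASE address are all assumptions chosen for illustration, and Python is used only for readability.

```python
# Illustrative sketch only. Names, sizes and addresses are hypothetical
# and are not taken from the Gemulator source.
PAGE_BITS = 12            # assumed 4 KB guest pages
TLB_ENTRIES = 256         # power of two so a mask can pick the slot
MMIO_BASE = 0xFF8000      # assumed start of the guest's memory-mapped I/O


class SoftTLB:
    def __init__(self, ram_size):
        self.ram = bytearray(ram_size)
        self.tags = [None] * TLB_ENTRIES   # guest page number cached per slot
        self.misses = 0

    def read_byte(self, addr):
        page = addr >> PAGE_BITS
        slot = page & (TLB_ENTRIES - 1)
        if self.tags[slot] == page:
            # Fast path: a table lookup and a compare, no host exception.
            return self.ram[addr]
        return self._slow_read(addr, page, slot)

    def _slow_read(self, addr, page, slot):
        self.misses += 1
        if addr >= MMIO_BASE:
            # I/O pages are never entered in the table, so every access
            # to them is routed here and on to the device model.
            return self.read_io(addr)
        if addr >= len(self.ram):
            raise ValueError("guest bus error at 0x%X" % addr)
        self.tags[slot] = page             # ordinary RAM: cache the page
        return self.ram[addr]

    def read_io(self, addr):
        return 0xFF                        # placeholder device register
```

The point of the design is that a guest access to a hardware register costs one failed compare plus an ordinary function call, rather than a host exception that has to be raised, dispatched, and unwound by the operating system.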
At the same time, I've been enhancing the debugging capabilities of Gemulator to have it function as a 68000/68040 training tool. When I visited my old friend Ignac Kolenko at Conestoga College and lectured to his class this summer, I was challenged to produce a 68040 emulator which can replace the hardware "Tutor" boards that the class had been using in the past. And so with a lot of input from Ignac, I've made some changes to Gemulator 9.0 to facilitate easier debugging, tracing, and loading of custom ROM images.
First of all, download the source code to Gemulator 9.0 here. Being a command line kind of guy, I build the product from the Windows command prompt. You will need the Visual Studio 98 (yes, 98 as in VC 6.0) tools, as I am not a fan of the latest Visual Studio 2008. You will also need MASM 6.15 in your path, which I believe is included in one of the Visual Studio 98 service packs.
When you unpack GEMCE900.ZIP, you will find a root directory with some make files and build scripts, a BUILD directory where the product is built, and a SRC directory. If you are in a VC6 or VC7 command prompt window, type MKALL.BAT to build the product. This will build the 6502 interpreter, the 68000/68040 interpreter, the Atari 800 virtual machine, and the Atari ST virtual machine. The final EXE will be found in the BUILD\SHIP directory as ATARIST.EXE.
If you are unfortunate enough to be using Visual Studio 2005 (a.k.a. VC8) or Visual Studio 2008 (a.k.a. VC9), you will need to run the MKALLASM.BAT script first to build some of the libraries, then open the ATARIST.VCPROJ file in your IDE to complete the build.
If you have no Microsoft build tools, fear not. Earlier this year Microsoft released their open source Singularity operating system (http://research.microsoft.com/os/Singularity/). When you extract the project, included in the SINGULARITY\BASE\BUILD directory are the complete VC8 build tools, including MASM 8, the 32-bit C/C++ compiler, and the old 16-bit DOS C/C++ compiler!
As a bonus for you PC Xformer fans, if you go into the SRC\ATARI8.VM directory of the Gemulator sources, and type NMAKE from a 16-bit build prompt, it will build XF.EXE, the 16-bit Atari 800 emulator for MS-DOS.
Some tips for using Gemulator 9.0:
Since Gemulator 9.0 is now public and open source, I will use it as the basis of processor benchmarks in place of the older Gemulator and SoftMac releases which do not run correctly on Windows Vista.
Measuring Atom Performance
A lot of reviewers have beaten up the Atom for being too slow, and based on my testing I feel it is unfair to make such a blanket statement. My testing shows that the Atom, particularly this most recent dual-core 64-bit hyper-threaded Atom, is more than adequate for running the latest 64-bit Windows Vista, for running Visual Studio development tools, and for serving as an inexpensive Windows Media Center TV tuner and home server. When factoring in price and power consumption, it delivers more "bang for the buck" than other processors. If it can reduce the power consumption and noise of an existing desktop machine from over 300 watts to just 55 watts, that is worth spending the relatively small 100 dollars on.
But just how "slow" is the Atom? Compared to the latest 8-thread Core i7, it is about five times slower at raw single-threaded throughput. This is due to the fact that Atom is an in-order processor that does not do the fancy instruction re-ordering that Pentium and Core architectures do. Nor does it run at the higher clock speeds of other architectures. The two Atom systems that I have now purchased both run at 1.60 GHz, and interestingly no matter what setting I put Windows Vista at (Balanced, High Performance or Power Saver) it stays at 1.60 GHz. Which I guess makes sense for a core that consumes two watts of power! The combination of running at almost half the clock speed of other processors, the smaller on-chip cache, and the restrictions of an in-order pipeline result in the 5x speed difference. Real-world performance will not be quite as bad when factoring in memory and disk bottlenecks.
To put that in perspective, at identical clock speeds the Pentium 4 architecture is 2x to 3x slower than the Core 2 architecture. Factor in clock speed, and it turns out that not that long ago, many people were spending thousands of dollars on "top of the line" 1.5 GHz to 2.0 GHz Pentium 4 systems which were really no faster than the hundred dollar Atom is today. I tested this with two of my own legacy machines: a dual-processor Pentium III system running Windows XP and a dual-processor Pentium 4 Xeon system (my Dell 650 workstation) running Windows Vista. Both of these systems are equipped with 1.5 GB of RAM, and in the case of the Xeon system, fast SCSI hard disks. I performed a simple test yesterday: build the Gemulator 9.0 source code. I repeated the build several times on each system to allow disk caching to reduce disk I/O wait time and keep it a mostly CPU-bound test, and recorded the build time once it had stabilized after a few builds. The results for the three systems, plus a fourth system which contains a more recent Pentium "D" (dual-core 64-bit Pentium 4) under-clocked to 1.5 GHz, are shown here:
Desktop computer specs (clock speed, CPU, RAM, disk, OS) | Gemulator 9 build time (seconds, lower is better) |
1600 MHz dual-core Atom, 2GB RAM, SATA disk, Vista | 70.6 |
2000 MHz dual-processor P4 Xeon, 1.5 GB RAM, SCSI disk, Vista | 65.9 |
1000 MHz dual-processor Pentium III, 1.5 GB RAM, IDE disk, XP | 85.6 |
1500 MHz dual-core Pentium D, 2.0 GB RAM, SATA disk, Vista | 93.3 |
Expanding this to include the actual execution speed of Gemulator, I have arranged the screen shots below in the same order as the four systems were just listed. The first screen shot in each row is the CPUZ output showing the low-level specs of the processor being tested, and the second screen shot shows that processor's Gemulator 9 benchmark result running the Quick Index Atari ST benchmark program. Larger percentages mean faster speed.
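[Screen shots: CPUZ output and Gemulator 9 Quick Index result for 1600 MHz dual-core Atom, 2GB RAM, SATA disk, Vista]
[Screen shots: CPUZ output and Gemulator 9 Quick Index result for 2000 MHz dual-processor P4 Xeon, 1.5 GB RAM, SCSI disk, Vista]
[Screen shots: CPUZ output and Gemulator 9 Quick Index result for 1000 MHz dual-processor Pentium III, 1.5 GB RAM, IDE disk, XP]
[Screen shots: CPUZ output and Gemulator 9 Quick Index result for 1500 MHz dual-core Pentium D, 2.0 GB RAM, SATA disk, Vista]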
The Atom outperforms the Pentium III on all counts, both at the Visual Studio build time and the execution speed of the Atari ST emulator. The faster clock speed of the Atom and larger on-chip caches more than make up for the Pentium III's more clever out-of-order pipeline. I think it is safe to peg the performance of the Atom at about 20% above a 1 GHz Pentium III, or thus about the performance level of the 1.2 GHz Pentium III. As somebody who obviously still owns and uses Pentium III machines, that's a performance level that I'm satisfied with.
Factoring for clock speed, the Atom is almost on par with the 2.0 GHz Pentium 4 Xeon, which is interesting. With almost identical bus and clock speeds, and similar cache sizes, the efficiency of the in-order Atom core appears to be about the same as that of the out-of-order Pentium 4 core! However, throw in the much larger L2 cache of the Pentium D and the Atom is slightly slower. So we can now bound the performance of the Atom at somewhere between that of a 1.2 GHz Pentium III and a 1.5 GHz Pentium 4, which in Core 2 numbers is roughly a 600 MHz Core 2. That is how one arrives at the 5x speed difference between an Atom and the latest Core 2 chips.
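The arithmetic behind those bounds can be retraced in a few lines. This is only a back-of-the-envelope sketch: the build times and the 2x to 3x Pentium 4 penalty come from the text above, while the 2.5x midpoint and the 3000 MHz clock used for a "latest Core 2" are assumptions made for the illustration.

```python
# Build times (seconds) from the table above.
ATOM_BUILD_S = 70.6
P3_BUILD_S = 85.6
P3_MHZ = 1000


def p3_equivalent_mhz():
    """Pentium III clock that would match the Atom's build time,
    assuming build time scales inversely with clock speed."""
    return P3_MHZ * P3_BUILD_S / ATOM_BUILD_S


def core2_equivalent_mhz(p4_mhz=1500, p4_per_clock_penalty=2.5):
    """Core 2 clock matching a Pentium 4 at p4_mhz. The 2.5 default is an
    assumed midpoint of the 2x to 3x per-clock gap quoted in the text."""
    return p4_mhz / p4_per_clock_penalty


def slowdown_vs_core2(core2_mhz=3000):
    """Speed ratio between a Core 2 at core2_mhz (assumed clock) and the Atom."""
    return core2_mhz / core2_equivalent_mhz()


if __name__ == "__main__":
    print("Atom ~ %.0f MHz Pentium III" % p3_equivalent_mhz())
    print("Atom ~ %.0f MHz Core 2" % core2_equivalent_mhz())
    print("Core 2 at 3 GHz is ~%.1fx faster" % slowdown_vs_core2())
```

With the 2x and 3x extremes instead of the midpoint, the Core 2 equivalent ranges from 750 MHz down to 500 MHz, so the 5x figure is best read as a rough estimate rather than a measurement.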
Core i7 vs. Centrino 2 vs. Core 2
Now to analyze the improvements in Core i7 over previous processors. Since my Mac Pro runs at a fixed 2.66 GHz and my new Gateway i7 machine is also fixed at 2.66 GHz, I took my 2.4 GHz AMD Phenom machine (which I described back in June) and my other 2.4 GHz quad Core 2 machine and over-clocked them both to 2.66 GHz by tweaking the bus speeds up by 11%. I have actually been running the Core 2 over-clocked to over 3.3 GHz with no problems, so 2.66 GHz was a breeze. As I did with the Atom, I compared the Gemulator 9.0 build times and run times of these systems, and for reference also threw in my Core Duo iMac and the Gateway Penryn notebook. The results of the build times are shown here, with the addition of a column to show total clock cycles:
Desktop computer specs (clock speed, CPU, RAM, disk, OS) | Gemulator 9 build time (seconds, lower is better) | Total clock cycles (billions) |
2666 MHz Core i7, 3GB, SATA, Vista | 20.0 | 53.32 |
2666 MHz Core 2 Quad, 8GB, SATA, Vista | 21.8 | 58.12 |
2666 MHz Core 2 (Mac Pro), 4GB, SATA, Vista | 22.1 | 58.92 |
2260 MHz Centrino 2 Penryn, 4GB, SATA, Vista | 23.7 | 53.56 |
2666 MHz AMD Phenom, 8GB, SATA, Vista | 24.8 | 66.12 |
2000 MHz Core Duo, 2GB, SATA, XP | 28.3 | 56.6 |
Despite larger and more than ample RAM in the other systems to cache the whole build process, Core i7 and Penryn both do about 9% better than the original Core 2 or Core 2 Quad, and 24% better than AMD Phenom. In fact the Gateway notebook does amazingly well considering its slower clock speed and slower hard disk. I similarly measured at most about a 10% speedup of Core i7 over original Core 2 at 2.66 GHz on some other tests, while other tests such as Gemulator 9 were practically identical. I won't bore you with the screen shots because the results are too similar. Core 2, Centrino 2, and Core i7 appear to have very very similar architectures. Core i7 is not quite as drastic a departure from Core 2 as the advertising hype would have you believe.
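For anyone who wants to check the numbers, the "total clock cycles" column is simply build time multiplied by clock speed, and the percentages quoted above fall out of comparing those cycle counts. The short sketch below reproduces them from the table; the function and variable names are made up for this example.

```python
# (name, clock in MHz, build time in seconds), copied from the table above.
SYSTEMS = [
    ("Core i7", 2666, 20.0),
    ("Core 2 Quad", 2666, 21.8),
    ("Core 2 (Mac Pro)", 2666, 22.1),
    ("Centrino 2 Penryn", 2260, 23.7),
    ("AMD Phenom", 2666, 24.8),
    ("Core Duo", 2000, 28.3),
]


def total_cycles_billions(mhz, seconds):
    """Clock cycles spent on the build, in billions (MHz x seconds / 1000)."""
    return mhz * seconds / 1000.0


def percent_more_cycles(name, baseline="Core i7"):
    """How many percent more cycles `name` needs than `baseline`."""
    cycles = {n: total_cycles_billions(m, s) for n, m, s in SYSTEMS}
    return (cycles[name] / cycles[baseline] - 1.0) * 100.0


if __name__ == "__main__":
    for name, mhz, seconds in SYSTEMS:
        print("%-18s %6.2f billion cycles, %+5.1f%% vs Core i7" % (
            name, total_cycles_billions(mhz, seconds),
            percent_more_cycles(name)))
```

Comparing cycles rather than seconds is what lets the 2.26 GHz notebook be judged fairly against the 2.66 GHz desktops.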
To probe this I used my CPU_TEST utility. The following links are to the output files of running CPU_TEST /MHz 2xxx /ALL (where 2xxx is the specific clock speed of the system such as 2666) on the various systems:
atom1600cpu.txt (Atom dual-core)
core2penryn2260cpu.txt (Core 2 Penryn "Centrino 2" dual-core)
corei72666cpu.txt (Core i7 quad-core)
macpro2666cpu.txt (Core 2 Xeon quad-core in Mac Pro)
p4d3000cpu.txt (Pentium D dual-core under-clocked to 1500 MHz)
All five systems were tested while running 64-bit Windows Vista SP1, so the kernel results are comparable. You can "diff" these files against each other to see how one architecture fares against another at specific micro-benchmarks. I will walk you through some lines of output which show significant differences:
There are other subtle differences, but those are the major ones.
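If you do not have a "diff" tool handy, a few lines of Python will do the same job on any two of the output files listed above. This is a generic text comparison, not something tied to the CPU_TEST output format (which is not assumed here); the function name is made up for the example.

```python
import difflib


def diff_cpu_test(text_a, text_b, name_a="a.txt", name_b="b.txt"):
    """Return only the lines that differ between two text outputs,
    prefixed with '-' (only in the first) or '+' (only in the second)."""
    lines = list(difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(),
        fromfile=name_a, tofile=name_b, lineterm="", n=0))
    # The first two lines are the '---' / '+++' file headers; skip them,
    # then drop the '@@' hunk markers and keep the changed lines.
    return [line for line in lines[2:] if line.startswith(("+", "-"))]


# Example usage, assuming both files were saved to the current directory:
#
#   with open("macpro2666cpu.txt") as fa, open("corei72666cpu.txt") as fb:
#       for line in diff_cpu_test(fa.read(), fb.read()):
#           print(line)
```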
One other thing I measured by playing with thread affinity in Windows Task Manager on different multi-threaded tests is that the hyper-threading in Core i7 appears to be much more efficient than the old hyper-threading in the Pentium 4, although still not near the ideal of being equivalent to true extra cores.
Conclusions and Recommendations:
Inside Intel
A little over 18 months ago I walked away from a very comfortable and well paying day job at Microsoft where I'd worked as a developer since the late 1980's. In May 2007 I chose to sit at home unemployed and dive right back into my old hobby of Atari and Mac emulators which I'd previously abandoned. Crazy, huh? Well, as I have blogged about in 26 previous postings to this blog, the longer I worked at Microsoft and the deeper I got into analyzing and understanding the workings of Microsoft's various products, the more I came to realize that problems which plague software today have root causes in the hardware itself. As I've said before, in my opinion the Atari ST and OS/2 computers I was using 20 years ago were more stable than my Windows Vista PCs today, and certainly far more efficient given the hardware of the time.
One of the main reasons I've given for this degradation (besides the obvious one - too many dime-a-dozen C++ programmers polluting this world with garbage code) is that software developers have been too quick to jump on any "new and improved" hardware technology that AMD and Intel and IBM throw at them: in-order, out-of-order, MMX, SSE, Altivec, 64-bit, hardware virtualization, many-core, hyper-threading, the list goes on. In some cases the "new" features did not add performance over the past hardware, and when in the hands of less than proficient software developers, just complicated and destabilized matters. I have spent the past 18 months digging into that premise, using both my own Gemulator software and the open source Bochs emulator to put ideas into practice, debunking myths about virtualization and about x86 processors, and having fun re-learning everything I thought I knew about computers.
I have said previously that I would throw away a lot of the "advances" from the past 20 years and mostly go back to where CPU and OS designs were 20 years ago. Hardware engineers and software developers have not been talking to each other as much as they should have over the past 20 years. Software guys wait for new hardware, then they adapt their software to make use of that hardware. But what drives the hardware designs? As was described in the book Pentium Chronicles, in the old days the hardware engineers just went by gut feel! Eventually they started gathering program execution traces and tuning the hardware for those traces. The trouble there is that it takes many years for software to adapt to new hardware, and then many more years to feed those new traces back into the hardware. Microsoft's Visual Studio compiler and the GCC compiler finally know how to emit decent code for Pentium 4, years after it matters! The feedback loop of hardware feeding software feeding hardware is on the order of 5 to 10 years, and that is far too long.
What masked this issue for a long time was the steady and reliable increase in CPU clock speeds over the past 30 years. If a new CPU architecture wasn't as efficient as the one it replaced, it was usually hidden by the sheer clock speed increase of that new architecture. The Pentium 4 was actually designed to be less efficient (in terms of instruction throughput per clock cycle) because it relied on there being much faster clock speeds to run at, well beyond 3 GHz. Oops. The clock speed plateau of the past 7 years killed the Pentium 4 and made efficiency important again. Two processors (say, the Pentium 4 and the Core 2) running at the same clock speed using the same memory, same hard disks, even the same motherboards, can give a two-fold or three-fold difference in performance. Physical constraints such as power consumption, heat, and die size put a limit on clock speed and thus make it more important than ever for software and hardware people to work together.
After months of informally talking to recruiters and engineers at both AMD and Intel, bitching at them about how I think they should fix these issues, my months of research using both Bochs and Gemulator and other virtual machines to gather the data to prove my arguments, and blogging about the issues of course, I was thrilled to receive a full time job offer from Intel about two weeks ago. I accepted a few days ago, and in just a few hours I start work as an engineer at Intel. Wish me luck!
That concludes my blogging for the year. I have a new job and a very anticipated Metallica concert to go to tonight. I wish everyone a Merry Christmas and happy holidays and hope to see you come out to Macworld next month.