Today, NVIDIA launched their new graphics processor, the GeForce 8800GTX. The 8800 is getting rave reviews (see the links below), and on many benchmarks it even outperforms two ATI graphics cards in a Crossfire rig. In fact, the 8800GTX (575MHz core clock, 768MB of GDDR3 at 900MHz) spanks the competition handily in spite of the fact that drivers for the GPU are in a pretty green state at the moment. One look at the benchmarks will show you that the 8800 represents a major advance in graphics performance, so you’re probably wondering how NVIDIA pulled it off. In short, the 8800 derives its performance gains from a brand new architecture that’s different in some fundamental ways from previous GPU architectures.
The G80 architecture that powers NVIDIA’s new GeForce 8800 graphics cards has been four years in development, and its new-from-the-ground-up design marks a significant departure from previous PC GPUs. The G80 is the first in a coming wave of DirectX 10-compatible parts that offer a unified shader model of the kind that has previously been available to gamers only in the form of the Xbox 360’s ATI-designed Xenos GPU.
I’m still poring over NVIDIA’s presentation materials and trying to get a good handle on exactly what the company has done with the G80, so I’ll only briefly review the major advances of the design from a “big picture” perspective.
Some time ago, NVIDIA hosted a conference call in response to a pair of announcements from ATI and PeakStream in the area of stream processing. The company was coy about what they were up to, and all they would say was that their project wasn’t like what either PeakStream or AMD/ATI is doing now. The PeakStream and AMD/ATI approaches were, in the words of the NVIDIA spokesperson, “like putting lipstick on a pig”: they take the existing GPU architecture and bolt a software layer of abstraction onto it, one that’s intended to hide the hardware’s graphics-specific nature while exposing its stream processing functionality.
NVIDIA countered that they had an announcement in the pipeline about a new type of product that’s different from both a hardware and a software perspective, but they wouldn’t say much more than that.
The G80 as a stream processor
The 8800 is clearly the product they were talking about; it’s actually built from the ground up as a highly multithreaded, general-purpose stream processor, with the GPU functionality layered over it in software. This is the reverse of existing general-purpose GPU (GPGPU) approaches. So with the G80, a programmer can write a stream program in a regular high-level language (HLL) that compiles directly to the stream processor, without the additional overhead that goes along with translating HLL programs into a graphics-specific language like OpenGL’s GLSL.
(If you’re not sure what a “stream processor” is, or if what I just said confuses you, be sure to stop right now and go read this article for background on the concepts I’m covering here.)
Ideally, a program for the G80 would consist of hundreds of stream processing threads running simultaneously on the GPU’s many arrays of tiny, scalar stream processors. These threads could do anything from graphics and physics calculations to medical imaging or data visualization.
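For a concrete (and strictly illustrative) picture of what that means, here’s a minimal sketch of such a stream program written against NVIDIA’s new C-based GPU toolchain, CUDA, which the company is introducing alongside the G80. The kernel, the array sizes, and the launch parameters below are my own assumptions rather than NVIDIA sample code; the point is simply that every one of the thousands of threads runs ordinary C on one element of the stream, with no graphics API in the loop.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* A minimal element-wise stream kernel: each thread scales and
   accumulates exactly one element of the input stream, in plain C. */
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global element index */
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;                     /* ~1M stream elements */
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);
    float *h_y = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    /* Launch thousands of threads; the hardware spreads them across
       the chip's scalar stream processors. */
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h_y[0]);             /* expect 4.0 */

    cudaFree(d_x);
    cudaFree(d_y);
    free(h_x);
    free(h_y);
    return 0;
}
```

The <<<blocks, threads>>> launch syntax is simply how the toolchain expresses “run this function as N parallel threads”; the hardware’s thread scheduler takes it from there.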
Making a fully generalizable stream processor like the G80 required NVIDIA to include a feature that graphics programmers have desired for as long as they’ve wanted a unified shader model: a hardware virtual memory implementation for the GPU that enables seamless access to main memory for GPU-based programs.
Virtual register files and virtual memory
From what I can tell, the G80 has a multilevel parallel data cache (L1 and L2 caches, shared among stream processor clusters) that pages data in and out from main memory, just like a regular processor’s virtual memory implementation. However, the L1 cache level isn’t like a typical CPU’s L1 cache. In fact, it may be more like the local store of Cell’s SPUs. Here’s how it seems to work.
At a high level, the G80’s execution hardware is made up of eight of what I’ll call “tiles,” for lack of a handy, NVIDIA-supplied word. Each tile consists of two groups of eight stream processors, for a total of 16 stream processors per tile. Each tile also has some thread-fetching hardware for managing instruction flow through the different stream processors, and a chunk of scratch memory that NVIDIA has labeled “L1 cache” in their diagrams.
This “L1 cache” is a “parallel data cache,” and their diagrams suggest that this cache is carved up into sixteen blocks, one for each stream processor.
(Note to the people who do these illustrations for NVIDIA: settle on one color for functional units, one for memory, one for groups of functional blocks, etc. Then, keep the same color in all the diagrams throughout the presentation. So all of the orange in the diagram above should actually be green, and all of the L1 should be orange instead of gray.)
Each of these 16 blocks acts as a giant, 4,000-entry virtual register file for a stream processor in the tile. Each SP can read from and write to this virtual register file as it executes a thread, with the result that the G80 can perform in one pass the kind of inter-element vector arithmetic operations that required multiple passes on previous GPUs.
The hardware also provides support for load/store access between main memory and this virtual register file, which is why it’s a bit like the Cell processor’s local store. Finally, this virtual register file is automatically backed by an L2 cache and, ultimately, by DRAM. Because the register file contents can be paged out to DRAM, the CPU can access the results of a stream computation with a simple memory read.
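To show how a programmer might actually use that per-tile scratch memory, here’s a rough sketch in the same C-based toolchain. I’m assuming that the programming model exposes the per-tile block as what CUDA calls “shared memory,” i.e., the same structure the diagrams label the L1 parallel data cache; the kernel and its names are my own illustration, not NVIDIA code. Each thread block loads its slice of the input into the on-chip memory, does all of its intermediate arithmetic there, and stores only a single result back to DRAM, where the CPU can then read it.

```c
/* Launch with 256 threads per block. Each block stages its slice of the
   input in the fast on-chip memory, sums it there, and writes a single
   partial sum back to DRAM, where the host CPU can read it directly. */
__global__ void partial_sums(const float *in, float *out, int n)
{
    __shared__ float tile[256];              /* per-tile scratch memory */
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;      /* load: DRAM -> on-chip */
    __syncthreads();

    /* Tree reduction: all intermediate traffic stays on-chip. */
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = tile[0];           /* store: on-chip -> DRAM */
}
```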
Separating out the display hardware
It’s worth noting that the G80 chip, by itself, doesn’t have the hardware necessary to output an image to the display. Instead, it sends output to a separate chip on the graphics card, which has the hardware needed to drive a DVI or analog VGA display. In the G80’s case, this hardware was probably put on a separate chip because the G80 die itself is already huge. It could also be the case, though, that NVIDIA eventually plans to gang together multiple G80 chips on a single graphics card (after a die shrink to 65nm, probably) and have all of the chips feed data into the same display driver hardware.
This is the kind of thing that AMD/ATI will do when they eventually drop a GPU into a cHT socket. There will be a DVI display driver chip somewhere else on the motherboard, and the GPU will write to that so that its output can be displayed on-screen.
Take a look at the benchmarks in the “Further Reading” section below, and you’ll see that the 8800 has debuted today at the very top of the graphics market. AMD/ATI will be playing catch-up to this (and they will catch up), but this holiday season hardcore gamers will go to sleep on Christmas Eve hoping to wake up to an 8800 under the tree. Hopefully, Santa has $599 lying around for the top-end GTX model.
The 8800 not only offers a leap in graphics performance and image quality, but it also promises to give more indigestion to a certain dedicated physics processing unit company. Programmers will be able to write gameplay-affecting physics code in C, and have it run on the G80.
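As a purely hypothetical sketch (mine, not NVIDIA’s), a gameplay physics step on the G80 could be as simple as a C function that each of thousands of threads applies to one particle:

```c
/* One thread advances one particle by a single Euler step:
   ordinary C arithmetic, no shader language involved. */
struct Particle { float x, y, z, vx, vy, vz; };

__global__ void step_particles(struct Particle *p, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    p[i].vy += -9.81f * dt;        /* gravity */
    p[i].x  += p[i].vx * dt;       /* integrate position */
    p[i].y  += p[i].vy * dt;
    p[i].z  += p[i].vz * dt;

    if (p[i].y < 0.0f) {           /* crude ground-plane bounce */
        p[i].y  = 0.0f;
        p[i].vy = -0.5f * p[i].vy;
    }
}
```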
NVIDIA has more than just gamers in mind for the 8800, though. They’re targeting the same high-performance computing (HPC) markets as the other stream processing solutions from AMD/ATI and PeakStream. It’s worth noting that Intel isn’t resting on its laurels in this regard, either. I eagerly await news of what they’re up to in the area of graphics, CPU/GPU integration, and stream processing.
Further Reading

Beyond3D: NVIDIA G80: Architecture and GPU Analysis
Anandtech: NVIDIA’s GeForce 8800 (G80): GPUs Re-architected for DirectX 10
Tom’s Hardware: GeForce 8800: Here Comes the DX10 Boom
AMD/ATI and NVIDIA tout new uses for the GPU
PeakStream unveils multicore and CPU/GPU programming solution