All the pyrotechnics in the world couldn't match the gasconade with which GPU & CPU vendors announce their new architectures. You'd halfway expect this promulgation of multipliers and gains and reductions (but only where smaller is better) to mark the end-times for humankind; surely, if some device were crafted to the standards by which it were announced, The Aliens would descend upon us.
But, every now and then, those bombastic announcements have something behind them – there's substance there, and potential for an adequately exciting piece of technology. NVidia's debut of consumer-grade Pascal architecture initializes with GP104, the first of its non-Accelerator cards to host the new 16nm FinFET process node from TSMC. That GPU lands on the GTX 1080 Founders Edition video card first, later to be disseminated through AIB partners with custom cooling or PCB solutions. If the Founders Edition nomenclature confuses you, don't let it – it's a replacement for nVidia's old “Reference” card naming, as we described here.
Anticipation is high for GP104's improvements over Maxwell, particularly in the area of asynchronous compute and command queuing. As the industry pushes ever into DirectX 12 and Vulkan, compute preemption and dynamic task management become the gatekeepers to performance advancements in these new APIs. It also means that LDA & AFR start getting pushed out as frames become more interdependent with post-FX, and so suddenly there are implications for multi-card configurations that point toward increasingly less optimization support going forward.
Our nVidia GeForce GTX 1080 Founders Edition review benchmarks the card's FPS performance, thermals, noise levels, and overclocking vs. the 980 Ti, 980, Fury X, and 390X. This nearing-10,000-word review lays-out the architecture from an SM level, talks asynchronous compute changes in Pascal / GTX 1080, provides a quick “how to” primer for overclocking the GTX 1080, and talks simultaneous multi-projection. We've got thermal throttle analysis that's new, too, and we're excited to show it.
The Founders Edition version of the GTX 1080 costs $700, though MSRP for AIBs starts at $600. We expect to see that market fill-in over the next few months. Public availability begins on May 27.
First, the embedded video review and specs table:
NVIDIA GeForce GTX 1080 vs. GTX 980 Ti, GTX 980, & Fury X [Video]
NVIDIA GTX 1080 Specs, 1070 Specs, 980 & 980 Ti Specs
|NVIDIA Pascal vs. Maxwell Specs Comparison|
|Tesla P100||GTX 1080||GTX 1070||GTX 980 Ti||GTX 980||GTX 970|
|GPU||GP100 Cut-Down Pascal||GP104 Pascal||GP104 (?)||GM200 Maxwell||GM204 Maxwell||GM204|
|Fab Process||16nm FinFET||16nm FinFET||16nm FinFET||28nm||28nm||28nm|
|TPCs||28 TPCs||20 TPCs||Unknown||-||-||-|
|Memory Clock||?||10Gbps GDDR5X||Unknown||7Gbps GDDR5||7Gbps GDDR5||7Gbps|
|Power Connectors||?||1x 8-pin||Unknown||1x 8-pin
|2x 6-pin||2x 6-pin|
Architecture – A Quick Primer on Pascal & GP100
GP100 is part of “Big Pascal,” and the Tesla P100 Accelerator card – not meant for gaming – used a truncated version of Big Pascal. GP100 hosts 15.3 billion transistors and is sized at 610mm^2, marking the chip as nVidia's largest ever produced. Performance is rated for a staggering 5.3TFLOPs FP64 (double-precision) COMPUTE, or 10.6TFLOPs of FP32 and 21.2TFLOPs of FP16 (viable for deep learning). The use cases for GP100 and the Tesla P100 Accelerator are entirely relegated to scientific computations, deep learning, data mining and analysis, and simulation demanding of double-precision accuracy.
The GP100 architecture looks like this:
Before explaining GP104 for the GTX 1080, we'll recap some of our multi-page architecture deep-dive for GP100.
This GP100 GPU hosts 3584 FP32 CUDA cores (2:1 FP32:FP64) divided into six Graphics Processing Clusters (GPCs), each of which contains a set of ten Simultaneous Multiprocessors (SMs). Each set of two SMs shares a single Texture Processing Cluster (TPC).
Drilling into SM architecture on GP100 reveals that each SM is partitioned into two blocks of 32 FP32 CUDA cores, which are independently allocated a dedicated warp scheduler and instruction buffer, two dedicated dispatch units, and a register file (32,768 x 32-bit size). The partitioned blocks total 64 FP32 CUDA cores on GP100, and share a single instruction cache. Interestingly, GP100's SM blocks also share a unified Texture / L1 Cache, filling a somewhat flexible role on the SM. GP100 hosts four TMUs per SM; multiply that by the P100's 56 SMs (of a total 60 for Big Pascal, if a full version should ever be released), that's 4*56=224 TMUs (full potential of 240 TMUs for the uncut GP100). This architecture largely carries over to GP104 and remains relevant, as you'll see momentarily.
Finally, each SM on GP100 hosts an independent 64KB Shared Memory block, indicated in the SM architecture diagram toward the bottom.
Pascal's debut architecture actually contains fewer FP32 CUDA cores per SM than the Maxwell GM200 architecture (128 FP32 CUDA cores per SM; Kepler ran 192 FP32 CUDA cores per SM), but still manages to increase overall CUDA core count across the die. The reduced CUDA core count per SM, with each SM being partitioned into two blocks of 32 cores, streamlines the datapath organization by granting an independent instruction buffer, warp scheduler, and dispatch units to each “partition” in the SM, with four total dispatch units per SM. This datapath optimization falls in-step with GP100's move to 16nm FinFET – also true for GP104 – and helps reduce inefficiency on a per-core level. Overall power consumption is reduced as a result, along with the somewhat “native” reduction offered by a smaller process with FinFET transistor architecture.
(Source: Wikimedia Commons)
FinFET transistors use a three-dimensional design that extrudes a fin to form the drain and source; the transistor's fins are encircled by the gate, reducing power leakage and improving overall energy efficiency per transistor. When scaled up to billions of transistors and coupled with optimized datapath organization (which now takes less physical die space), nVidia is able to create the 300W GP100 and ~180W TDP GP104 GTX 1080 GPU.
|GPU||Kepler GK110||Maxwell GM200||Pascal GP100|
|Threads / Warp||32||32||32|
|Max Warps / Multiprocessor||64||64||64|
|Max Threads / Multiprocessor||2048||2048||2048|
|Max Thread Blocks / Multiprocessor||16||32||32|
|Max 32-bit Registers / SM||65536||65536||65536|
|Max Registers / Block||65536||32768||65536|
|Max Registers / Thread||255||255||255|
|Max Thread Block Size||1024||1024||1024|
|Shared Memory Size / SM||16KB / 32KB / 48KB||96KB||64KB|
Here's an excerpt from our GP100 architecture deep-dive:
“So, even though Pascal has half the cores-per-SM as Maxwell, it's got the same register size and comparable warp and thread count elements; the elements comprising the GPU are more independent in this regard, and better able to divvy workloads efficiently. GP100 is therefore capable of sustaining more active ('in-flight') threads, warps, and blocks, partially resultant of its increased register access presented to the threads.
The new architecture yields an overall greater core count (by nature of hosting more SMs) and also manages to increase its processing efficiency by changing the datapath configuration. The move to FinFET also greatly increases per-core performance-per-watt, a change consistent with previous fabrication changes.
Because the cores-per-SM have been halved and total SM count has increased (to 10, in this instance), the per-SM memory registers and warps improve efficiency of code execution. This gain in execution efficiency is also supported by increased aggregate shared memory bandwidth (an effective doubling) because the per-SM shared memory access increases.
Pascal's datapath structure requires less power for data transfer management. Pascal schedules tasks with greater efficiency (and less die space consumed by datapath organization) than Maxwell by dispatching two warp instructions per clock, with one warp scheduled per block (or 'partition,' as above).”
That primed, let's move to GP104 and explore its changes from GP100.
Architecture – GP104 Simultaneous Multiprocessors, GPCs, TPCs
GP104 is a gaming-grade GPU with no real focus on some of the more scientific applications of GP100. GP104 (and its host GTX 1080) is outfitted with 7.2B transistors, a marked growth over the GTX 980's 5.2B transistors (though fewer than the GTX 980 Ti's 8B – but transistor count doesn't mean much as a standalone metric; architecture matters).
GP104 hosts 20SMs. SM architecture is familiar in some ways to GM204: An instruction cache is shared between two effective “partitions,” with each of those owning dedicated instruction buffers (one each), warp schedulers (one each), and dispatch units (two each). The register file is sized at 16,384 x 32-bit, one per “partition” of the SM. The new PolyMorph Engine 4.0 sits on top of this, but we'll talk about that more within the simultaneous multi-projection section.
Each GP104 SM hosts 128 CUDA cores – a stark contrast from the layout of the GP100, which accommodates for FP64 and FP16 where GP104 does not (because neither is particularly useful for gamers). In total, the 20 SMs and 128 core-per-SM count spits out a total of 2560 CUDA cores. GP104 contains 20 geometry units, 64 ROPs (depicted as horizontally flanking the L2 Cache), and 160 TMUs (20 SMs * 8 TMUs = 160 TMUs).
Further, SMs each possess 256KB of register file capacity, 1x 96KB shared memory unit, 48KB of L1 Cache, and the 8 TMUs we already discussed. There are four total dedicated raster engines on Pascal GP104 (one per GPC). There are 8 Special Function Units (SFUs) per SM partition – 16 total per SM – and 8 Load/Store (LD/ST) units per SM partition. SFUs are utilized for low-level execution of mathematical instruction, e.g. trigonometric sin/cos math. LD/ST units transact data between cache and DRAM.
The new PolyMorph Engine 4.0 (PME - originating on Fermi) has been updated to support a new Simultaneous Multi-Projection (SMP) function, which we'll explain in greater depth below. On an architecture level, each TPC contains one SM and one PolyMorph Engine (10 total PolyMorph engines); drilling down further, each PME contains a unit specifically dedicated to SMP tasks.
GP104 has a native operating frequency of 1.61GHz (~1607MHz), deploying the new GPU Boost 3.0 to reach 1.73GHz frequencies (1733MHz) on the Founders Edition card. AIB partner cards will, as usual, offer a full range of pre-overclocks or user-accessible overclocks. NVidia successfully demonstrated a 2114MHz clock-rate (that's an overclock of ~381MHz over Boost) on its demo rig, though we've validated overclocking with our own testing on a later page. NVidia noted that overclocks should routinely be capable of exceeding 2GHz (2000MHz).
Architecture – GP104 Pascal Asynchronous Compute & Preemption
NVidia's vision for asynchronous compute in Pascal is dead-set on advancing physics, post-processing, and VR applications. In gaming workloads, asynchronous command queuing allows overlapping tasks for improved game performance (particularly with Dx12 and Vulkan). In VR, timewarps and spatial audio processing are aided by asynchronous processing.
There are three major changes to asynchronous compute in Pascal, affecting these items: Overlapping workloads, real-time workloads, and compute preemption.
For gaming, GPU resources are often split into graphics and compute segments – e.g. a selection of the cores, cache, and other elements are assigned to graphics, while the remainder is assigned to compute. With the resources partitioned to, for example, rendering and post-FX, it may be the case that one of those “partitioned” clusters of resources completes its workload prior to its partner. That compute allocation might still be crunching a particularly complex problem when the render allocation completes its job, leaving the units allocated to rendering idle – that's wasted resources.
Asynchronous command structures allow for resource adjustments on-the-fly, enabling more concurrent, in-flight jobs to reach completion. As enabled by modern, low-level APIs that are “closer to the metal,” as they say, asynchronous compute reduces dead idles and keeps all resources tasked even when waiting for other tasks to complete. Not everything can be calculated asynchronously – there are always dependencies to some level – but gaming tasks do benefit heavily from asynchronous compute. If a selection of cores has completed its graphics tasks, but physics processing is ongoing in a neighboring selection, the render resources can be reallocated until that process concludes.
This is called “Dynamic Load Balancing,” and allows workloads to scale as resources become available or busy. Maybe 50% of resources are allocated to compute, the other 50% to rendering; if one job completes, the other job can consume idle resources to speed-up completion and reduce latency between frames. The counter is “Static Load Balancing,” where fixed resources are assigned to tasks and left idle when jobs are completed out-of-step.
Compute Preemption is a major feature of GP100 “Big Pascal.” For gaming applications, GP104's Compute Preemption has a major impact on fluidity of VR playback and promises prioritization of timewarp operations appropriately; this reduces the chance of frame drops and sudden motion sickness or “vertigo” of the wearer. It's not all VR benefit, though. Pascal works at the pixel level to preempt and prioritize tasks deemed more critical to avoiding a potential frame delay or increased latency event (see: frametimes).
NVidia has provided some images that make this easier to explain in a visual fashion:
In the above image, the command pushbuffer is responsible for storing triangle and pixel data, and queues work to produce thousands of triangles and pixels as its routine operation. This graphic shows the command pushbuffer as having completed three draw calls and as working on its fourth. A draw call, as many of you know, is the process of creating triangles for use in geometry. The fourth draw call is about half-way through completion when a task with greater priority steps-in and demands resources – like commandeering a vehicle. Because of pixel-level preemption, the GPU is able to store its progress on that triangle, pause the draw call, and assign its now-free resources to the more important task.
A more important task might be that timewarp calculation, because we know that missing a timewarp operation can cause a VR user to become physically ill (this, obviously, takes priority over everything else in the system).
Once that timewarp (or other) operation is complete, the resource is released and may load its progress, then continue crunching. All the way down to a single pixel, that progress is stored and may be picked-up where left off; the draw call does not have to start over. Rasterization and pixel shading have already been completed on these pixels, so the only “loss” is the time for switching tasks, not the work itself.
The pixel shading work completes and allows the resources to be switched to the more demanding workload.
Task switching takes approximately 100 microseconds, from what we're told.
DirectX 12 benefits from another version of Compute Preemption, called “thread-level compute preemption.” This is also on Pascal GP104.
Threads are grouped into grids of thread blocks. Let's re-use our same example: A higher-priority command comes down the pipe and demands resources, instantiating the preemption request. Threads which are in-flight on the engaged SMs will complete their run, future threads and in-progress threads are saved, and the GPU switches tasks. As with pixel-level preemption, the switching time is less than 100 microseconds.
This approach to asynchronous compute also means that there's no more risk to over-filling the graphics queue and blocking real-time items from emergency entrance. Preemption is exposable through the DirectX API and will benefit GP104 in Dx12 titles. Pascal can also perform low, instruction-level compute for general compute workloads. This would help, as an example, GPU-bound production programs by freeing resources for brief moments when a user may need them elsewhere – like video playback or window dragging (eliminate that 'window lag' when moving windows during GPU-heavy renders).
That is the heart of compute preemption on GP104, and fuels future performance gains in Dx12 and Vulkan APIs.
Architecture – GP104 Pascal Memory Subsystem
The most immediate change to GP104 is its introduction of Micron's new GDDR5X memory, which nVidia has taken to calling “G5X.” We will stick with the traditional nomenclature – GDDR5X is the new memory, and GDDR5 is the “old” on-card DRAM.
The GTX 1080 is the first video card to host GDDR5X memory. GP100 hosts HBM2, Fiji hosts HBM, and previous Maxwell, Kepler, and effectively all other cards made within the last several generations – that's all GDDR5.
GDDR5X offers a theoretical throughput of 10-14Gbps (Micron will continue optimizing its design and process), significantly bolstering memory bandwidth over the ~8Gbps limitation of GDDR5. This is, of course, exceptionally slower than HBM2 – but HBM2 yields aren't here yet, and the memory is so expensive and process-refined as to be poor value for gamers today. That may change in a year, but we're really not sure what the roadmap looks like right now. AMD is slated to ship HBM2 on Vega. Fiji remains the only active architecture with any form of consumer-available high-bandwidth memory.
High-bandwidth memory on GP100 is architected as a collection of HBM dies atop an interposer, which sits on the same substrate that hosts the GPU proper. Each stack is a composite of five dies (4-high HBM2 stack atop a base die, atop the interposer and substrate). Here is a look at HBM2 on GP100:
The HBM2 accompanying GP100 has an 8Gb DRAM density (versus 2Gb density of HBM1), can be stacked up to 8 dies (per stack), and operates at a 180GB/s bandwidth (versus 125GB/s of HBM1). This is enormously faster than the ~28GB/s/chip of GDDR5.
Micron's GDDR5X is found on GP104 and operates at 10Gbps (an effective 10GHz memory clock). Keeping with tradition, the new memory architecture improves performance-per-watt while still managing a speed increase, mostly by designing a new IO circuit architecture and reducing voltage demands from GDDR5 – GDDR5X eats 1.35v pre-overclock.
NVidia engineered the bus for signal integrity, performing path analysis to look for impedance/cross-talk (z). Each path was traced from the GPU to the DRAM pads, then analyzed for signal integrity. Light travels approximately one inch in the 10GHz signal time of GDDR5X, leaving a half-inch of “light opportunity” to grab each transferred bit before it's lost. With PCIe buses, differential signaling is used to check for the bit – effectively the transmission of two opposing signals (the source signal and its opposite), which assists in isolating bits that might otherwise look like noise. GDDR5X does not take this differential signaling approach and instead sends a single signal down the wire, moving from the chip, through a via, to the DRAM.
As for other components of the memory subsystem, the GTX 1080 hosts eight memory controllers comprising its 256-bit wide memory interface (32-bit controllers * 8 = 256-bit interface); each controller manages eight ROPs – 64 total – and hosts 256KB L2 Cache. There is a total of 2048KB L2 Cache.
Memory compression has also been advanced from its third generation into a new, (surprise!) fourth generation. Fourth Generation Delta Color Compression is an extension on its predecessor, which we described in-depth in the GTX 980 review. The fourth generation of delta color compression reduces the total amount of data written to memory and improves TMU and L2 cache efficiency.
GP104 compresses colors (eventually drawn to screen) on the chip, further reducing power consumption by eliminating a need to perform remote compression. Delta color compression works by analyzing similar colors and compressing them into a single value. On a frame-by-frame basis, all colors “painted” to the screen are reduced with a 8:1 compression ratio; the GPU locates comparable colors (values which are near enough to be stored into a single color block – e.g. green, dark red, yellow, etc. – then recalled later). The compression is performed on-chip and reduces bandwidth saturation, then later unpacked for delivery to the user.
The compression ratio is the most impressive aspect of this – 8:1. If you've got 8 different shades of yellow in a frame – maybe a car hood that's got some light gradient going on – those 8 shades can be compressed into a single “yellow composite,” dispatched to wherever it's going, and unpacked later. The 2:1 compression is still available for instances where 8:1 is too aggressive, and data can be delivered with no compression (though delta color compression is lossless, not all scenes are fully compressible).
Here is an example of Third Generation Delta Color Compression, which debuted on Maxwell:
The pink coloration represents color data which was compressed; that is, every spot of pink is comparable to another color in the scene, and will use a delta value to reach that color rather than an absolute value. This significantly speeds-up transactions occurring in memory by lessening the amount of data traveling down the bus, but still retains color quality and accuracy. The more pink in the image, the more compression is going on – that's good, as the total memory bandwidth consumption is reduced. We want lossless compression wherever possible.
The above is a 2:1 compression. Let's simplify this by imagining just two colors, though: Blue and light blue. The algorithm will analyze the colors, select a neutral value between the two of them, and then store the difference (hence, delta color compression) for later recall. Instead of storing 32 bits of color information to memory per color – 64 total bits, in this case – the GPU can store just one 32-bit color value, then the marginal details for the delta.
Sticking with our example of blue colors, but expanding to Pascal's 8:1 color compression capabilities, immediate gains can be seen within the skybox of a game. Look up in Just Cause 3, for instance, and you'll be met with wide-open, blue skies. Other than some clouds, almost the entire palette is blue – and it's all similar mixes of light and dark blues. Without compression, 8 color samples would demand a full 256-bits of storage alone. With the compression, because a neutral value can be stored with the difference values separating the blues, that 256-bit requirement is reduced to just ~32 bits. Major savings.
Let's revisit those Maxwell compression shots, but add Pascal to the mix. The order of images is: Final frame (presented to the user), then Maxwell 2:1 DCCv3, then Pascal DCCv4.
Pascal has compressed almost the entire scene, sans a few colors that are too specific and escape the compression algorithm's sweep. These triangles may have too much variation, as is the case in the greens for Maxwell's DCCv3. For reference, Maxwell's DCC is capable of reducing bandwidth saturation by 17-18% on average, freeing up memory bandwidth for other tasks.
The GTX 1080 can run 2:1, 4:1, and 8:1 compression algorithms, where the 8:1 compression mode combines 4:1 color compression of 2x2 pixel blocks, followed by a 2:1 compression of the deltas between the blocks.
Pascal's memory compression and GDDR5X bandwidth work together to grant a ~1.7x effective memory bandwidth increase over the GTX 980. Some games benefit more heavily. In Assassin's Creed: Syndicate, for instance, the GTX 1080 sees a 28% bandwidth reduction versus the GTX 980. In Star Wars: Battlefront, that reduction is only 11%. The mean seems to rest around ~20%.
Changes to SLI Bridges & Retirement of 3- & 4-Way SLI
The new SLI bridges (HB Bridge) only run two slot configurations maximally, rather than three or four of previous SLI bridges. It is possible to unlock 3-way or 4-way SLI, but we'll talk about that momentarily.
The high-bandwidth SLI bridges allow for faster copies that dramatically increase available bandwidth for interleaved transfers – e.g. VR positional sound and multi-projection. The bridge is comprised of two lanes that connect dual-graphics SLI configurations. Pascal's MIO display bridge has faster IO and reduces latency and micro-stutter in SLI, reflected in this frametime graph:
Above, blue represents the HB bridge with 2x GTX 1080s vs. an old bridge with 2x GTX 1080. Latency is more controlled versus the black line (old bridge).
NVidia describes its move away from 3-way SLI as a composite of game developer support and Microsoft decisions. A move to MDA (as in Dx12, which we tested and explained here) shifts responsibility for performance to the application as GPUs act more independently. The bridge is not exposed in MDA mode, and so goes unused. LDA (Linked Display Adapter) couples GPUs together, but is being moved away from by Microsoft and DirectX. The driver is relied upon to present abstractions to the application for LDA mode, and so the implication is that nVidia and AMD are more in control of their performance (though game-level optimizations must still be made).
Games are now more interdependent with their frame output. Heavy post-processing (play any game which is intensely post processed – like Doom, ACS, or Just Cause 3) lends itself to greater interdependence from frame-to-frame, which also means that AFR (alternate frame rendering) loses some of its efficacy. That's why we saw such abysmal multi-GPU performance in Just Cause 3 and ACS.
This in mind, nVidia is moving away from multi-GPU above 2-way SLI. NVidia informed the press that they're planning to allow 3-way (or more?) SLI for users who specifically demand the function. A special “key” will be provided – likely a digital key – to unlock > 2-way SLI capabilities, and must be requested manually from the manufacturer.
Continue to the next page for our testing methodology and the lead-in to our FPS, thermal, noise, & power benchmarks.