NVIDIA Pascal GP100 Architecture Deep-Dive

By Steve Burke, published May 06, 2016 at 7:53 am
Pascal Cache – L1 & L2 Cache, Unified Memory

Unified Memory and memory sharing are driving features of modern GPU architectures. With GP100, this extends beyond system resources and reaches into the L1/L2 and texture caches. On Pascal specifically, the L2 Cache is unified into a single 4096KB pool (up from GM200's 3072KB L2 Cache) that further reduces reliance upon DRAM and cuts power draw, thereby garnering additional performance gains.

Pascal dedicates a single 64KB pool of shared memory to each SM, eliminating the previous need to split capacity between L1 and shared memory. GP100 also pairs Unified Memory with NVLink (for scientific compute applications) to improve concurrent read-modify-write parallelism in atomic memory operations.
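
As a toy illustration of the kind of read-modify-write being described – a sketch of our own, not NVIDIA sample code – the CUDA kernel below has 65,536 threads atomically increment a single counter in managed memory, and the hardware guarantees every increment lands exactly once:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Every thread performs an atomic read-modify-write on the same counter.
__global__ void count(unsigned int *counter) {
    atomicAdd(counter, 1u);
}

int main() {
    unsigned int *counter = nullptr;
    cudaMallocManaged(&counter, sizeof(unsigned int)); // visible to CPU & GPU
    *counter = 0;

    count<<<256, 256>>>(counter);  // 256 blocks x 256 threads = 65,536 adds
    cudaDeviceSynchronize();

    printf("%u\n", *counter);      // prints 65536: no increment was lost
    cudaFree(counter);
    return 0;
}
```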

Unified Memory reduces programmer effort by streamlining application development with the provision of a unified virtual address space that “houses” both system RAM and GPU VRAM. Programmers can write less manually-intensive code and focus on improving parallelism, thanks in part to a reduction in explicit memory copy calls to move data between system and local GPU memory. This single pool of memory is accessible by both the CPU and GPU in relevant applications and use cases, and CUDA (the language and its runtime) will manage memory between the CPU and GPU without direct programmer oversight. Unified Memory, with some thanks to Pascal's 49-bit virtual memory addressing, is able to support working sets that exceed total physical memory size by performing operations out-of-core, paging data in on demand.
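
For a concrete sense of what this saves, here's a minimal sketch (ours, not NVIDIA's) using CUDA's cudaMallocManaged: one allocation is touched by both the CPU and a kernel, with no explicit cudaMemcpy calls. On Pascal, page faulting migrates the data on demand, which is also the mechanism behind the out-of-core oversubscription described above.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: increment every element of an array.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *data = nullptr;

    // One allocation visible to both CPU and GPU; the runtime migrates
    // pages on demand instead of requiring explicit cudaMemcpy calls.
    cudaMallocManaged(&data, n * sizeof(int));

    for (int i = 0; i < n; i++) data[i] = i;      // CPU writes directly

    increment<<<(n + 255) / 256, 256>>>(data, n); // GPU uses the same pointer
    cudaDeviceSynchronize();                      // wait before the CPU reads

    printf("data[42] = %d\n", data[42]);          // prints 43
    cudaFree(data);
    return 0;
}
```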

(Image: GP100 Pascal Unified Memory diagram)

Direct Memory Access for GPUs

Comparable in concept to DMA, GPUDirect and RDMA allow an interchange of data between multiple GPUs. These GPUs can be installed in a single system or across a network (given a sufficiently high-speed link), and are able to directly access the memory of other in-system GPUs for more efficient management of memory. GP100 doubles RDMA bandwidth, with some specific gains for deep learning. RDMA and GPUDirect begin to exit the scope of this content, so we'll leave the topic here for today and explore it in greater depth in a forthcoming piece.
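
For in-system GPU-to-GPU access specifically (the peer-to-peer flavor of GPUDirect, not network RDMA), a minimal CUDA sketch looks like the following; the device indices and buffer size here are arbitrary choices of ours:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0, canAccess = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 2) { printf("Need two GPUs\n"); return 0; }

    // Check whether GPU 0 can directly read/write GPU 1's memory.
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) { printf("P2P unsupported between these GPUs\n"); return 0; }

    // Allocate a buffer on GPU 1.
    cudaSetDevice(1);
    float *remote = nullptr;
    cudaMalloc(&remote, 1024 * sizeof(float));

    // From GPU 0, map GPU 1's memory so copies (and kernels) on device 0
    // can touch it directly, without staging through system RAM.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);

    float *local = nullptr;
    cudaMalloc(&local, 1024 * sizeof(float));

    // Direct device-to-device copy; no host bounce buffer.
    cudaMemcpyPeer(remote, 1, local, 0, 1024 * sizeof(float));

    cudaFree(local);
    cudaSetDevice(1);
    cudaFree(remote);
    return 0;
}
```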

Pascal Memory Subsystem – HBM 1 vs. HBM 2

(Image: Cross-section of GP100 & HBM2)

AMD's Fiji (Fury X) was the first GPU to use HBM, or high-bandwidth memory. Rather than rely upon the tremendously slower, PCB-bound GDDR5 memory of traditional GPUs, Fiji hosted its memory atop an interposer, which sat atop the substrate.

HBM in Fiji – and in GP100, though we'll get there soon – is stacked vertically. Physical distance between the graphics processor and memory is reduced, the bus is far wider, and the electrical and thermal requirements are better controlled. HBM is better in every performance metric and falls behind only in cost and yield (for HBM2 especially). GDDR5 pushes a maximum theoretical throughput of about 8Gbps per pin, with Micron's upcoming GDDR5X improving per-pin speeds by roughly 40-50%, into the 10-14Gbps range.

Each stack of HBM has a 1024-bit wide interface, good for 128GB/s of throughput per stack. HBM1 supports a maximum of 2Gb per die. A GDDR5 chip, by comparison, operates at ~28GB/s. Due to yield restrictions, Fiji was unable to offer more than 4GB of HBM – a severe limitation as games continue to push more textures and assets into VRAM, and seem to care more about capacity than speed.
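
Those throughput figures fall straight out of interface width times per-pin speed (assuming HBM1's 1Gbps effective pin rate, and the ~7Gbps at which shipping GDDR5 actually ran, versus the ~8Gbps theoretical ceiling above):

1024 bits per stack × 1Gbps per pin ÷ 8 bits per byte = 128GB/s per stack (HBM1)
32 bits per chip × 7Gbps per pin ÷ 8 bits per byte = 28GB/s per chip (GDDR5)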

HBM2 makes significant changes from HBM “1.” With HBM2, relevant Pascal devices like GP100 will be able to host a maximum of 16GB of VRAM, pursuant to the increased die density of the new version. HBM2 densities run upwards of 8Gb per die with 4-8 dies per stack, enabling the 16GB maximum capacity across GP100's four stacks (though initial P100 Accelerator shipments will be 8GB). Here is a simple table:

               HBM1                HBM2
Density        2Gb per DRAM die    8Gb per DRAM die
Stack height   4 dies per stack    4-8 dies per stack
Bandwidth      125GB/s per stack   180GB/s per stack
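
Capacity follows from the same sort of arithmetic. With the four-DRAM-die stacks GP100 ships with (the five-die figure below includes the base logic die) and four stacks per GPU:

8Gb per die × 4 dies per stack = 32Gb = 4GB per stack
4GB per stack × 4 stacks = 16GB maximum on GP100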

Pascal GP100 currently builds each HBM2 stack as a composition of five dies – the base die with the stacked DRAM dies above it – but presents a coplanar top surface so that heatsink mounting and hotspot cooling remain relatively simple. Each stack sits atop a passive silicon interposer, as does the GP100 GPU itself, and all of that sits on the substrate.

(Image: GP100 HBM2 stack and interposer diagram)

GP100 has a 4096-bit wide memory interface for its HBM2 memory, composed of eight 512-bit memory controllers – one pair per stack. HBM2 additionally offers native ECC (Error Checking & Correction) for parity-bit checking and on-the-fly recovery of erroneous data, useful for performance applications and cluster computing. This functions differently on Pascal than on Kepler GK110, where 6.25% of GDDR5 capacity was reserved for ECC bit storage. With HBM2 and GP100, the function is offered natively, without a capacity “reserve.”

Industry Trends Forward & Gaming Needs

(Images: GP100 board, front and back)

We're still awaiting GeForce announcements, but this early look at Pascal's existing GP100 GPU – although it is intended entirely for use in the scientific community – provides us with an architectural understanding of the processor. It is likely that not all Pascal GPUs will include the full feature set defined herein, but the SM architecture and basics should remain the same. HBM2 is likely to be a “swappable” aspect for lower-end devices, which may mean a mix of GDDR5, GDDR5X, and HBM2 graphics cards. Compute Preemption is also less useful for gaming, as are some of the other error checking features, and these may go unadvertised on – or be excluded from – GeForce-class chips.

The industry is trending toward faster DRAM positioned closer to the GPU. HBM was the start, and HBM2 carries that torch. Yields and cost are what hold back proliferation of high-bandwidth memory, but Micron's GDDR5X (if it catches on) will offer an improvement over GDDR5 at a lower cost than HBM.

The rest of the major improvements come in the form of performance-per-watt gains through TSMC's die shrink (16nm FinFET vs. 28nm planar). AMD is making its own shrink to 14nm FinFET with Polaris, and will see its own performance-per-watt gains as a result.

The industry, especially the mobile industry, dictates reduced power consumption overall, but demands that reduction without significant performance compromise. It looks as if Pascal will be able to achieve that – and more, for the scientific community. We have a few more pieces planned on Pascal and neighboring NVIDIA technologies, so be sure to add us to your RSS feed, Facebook feed, or Twitter / YouTube lists.

Editorial: Steve “Lelldorianx” Burke
Forthcoming Video Production: Keegan “HornetSting” Gallick


