AMD Vega Architecture: HB Cache & the NCU | CES 2017

Posted on January 5, 2017

AMD has revealed cursory details of its Vega GPU architecture, covering high-bandwidth caching, an iterative step from CUs to NCUs, and a unified-but-not-unified memory configuration.

Going into this, note that we’re still not 100% briefed on Vega. We’ve worked with AMD to try and better understand the architecture, but the details aren’t fully organized for press just yet; we’re also not privy to product details at this time, which would be those more closely associated with shader counts, memory capacity, and individual SKUs. Instead, we have some high-level architecture discussion. It’s enough for a start.

High-Bandwidth Cache 

“High-bandwidth cache” is the new phrase effectively replacing the common usage of “VRAM” as it pertains to AMD’s Vega. This holds even for Vega GPUs that could use GDDR memory (a possibility), provided the memory remains fast enough for AMD to attribute the label. We have no details on the requirements for the HBC label – i.e., at what point memory is considered fast enough to qualify – but expect more closer to launch.

The idea, though, is that HBM and the HB cache controller jointly function somewhat like a tertiary layer of cache. The speed is significantly higher – bandwidth upwards of 1TB/s – and the cache controller can manage incoming and outgoing data streams with greater efficiency. Those buzzwords and adjectives are all great, of course, but we don’t have much in the way of hard numbers just yet. As we saw with Fiji, HBM alone isn’t enough to solve the problem of high-performance gaming – powerful ALUs are still needed to accompany the memory.

Back to the controller, though: the HB cache controller breaks data up into smaller pages, making memory more manageable, and – using AMD’s words – is “more intelligent” in its prefetching routine. Again, no real detail behind that, but that’s the plan. We don’t know the cache latency or much else, but we have this basic block diagram:

[Image: AMD Vega architecture block diagram]
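AMD hasn’t detailed how the controller’s paging actually works, but the general concept – tracking residency of a large address space at small-page granularity and evicting stale pages from fast memory – can be sketched in a few lines. Everything here (the 4KB page size, the LRU policy, the class itself) is invented for illustration, not AMD’s design:

```python
# Hypothetical sketch of page-granular residency tracking, loosely
# illustrating a cache controller managing memory in small pages.
# Page size and eviction policy are invented for illustration only.
from collections import OrderedDict

PAGE_SIZE = 4096  # bytes; illustrative, not AMD's actual page size


class PageCache:
    """Tracks which pages of a large address space are resident in
    fast local memory, evicting the least-recently-used page."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.resident = OrderedDict()  # page_number -> True, in LRU order

    def access(self, address):
        page = address // PAGE_SIZE
        hit = page in self.resident
        if hit:
            self.resident.move_to_end(page)  # refresh recency on a hit
        else:
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)  # evict the LRU page
            self.resident[page] = True  # fault the requested page in
        return hit


cache = PageCache(capacity_pages=2)
print(cache.access(0))    # -> False (page 0 faulted in)
print(cache.access(100))  # -> True  (same 4KB page, so it hits)
```

Finer pages mean less data moved per fault, which is presumably part of what AMD means by memory becoming “more manageable.”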

In theory, Vega can support upwards of a 512TB virtual address space – that’s system memory and HBM (probably on an interposer) – though AMD is trying to avoid calling this “unified memory.” That’s to reduce confusion, given the unified memory efforts of previous APUs and plenty of other architectures. We’re curious to see whether this will be fully integrated with Intel CPUs, as we’re not sure if Vega will have access to the Intel memory bus in order to create the larger virtual address space. There might be a requirement for an abstraction layer in there somewhere. We’ll hopefully get more information from AMD engineers after CES, if not while here.
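As a quick sanity check on the quoted figure, a 512TB virtual address space corresponds to 49 address bits, since 512TB is exactly 2^49 bytes:

```python
# Arithmetic check on the quoted 512TB virtual address space figure.
TB = 2 ** 40            # one tebibyte in bytes
space = 512 * TB        # 512TB total
bits = space.bit_length() - 1  # exact power of two, so this is log2
print(bits)             # -> 49
print(2 ** bits == space)  # -> True
```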

Vega also won’t be just a gaming GPU. Its applications stretch aggressively toward deep learning and neural net computing, and the promises of precision switching mostly benefit non-gaming applications. With AMD’s “Rapid Packed Math,” Vega will be able to switch from 32-bit to 16-bit operations (single-precision to half-precision), effectively doubling FLOPS at the cost of precision. For some applications, like bulk data analysis (deep learning) where precision gives way to sheer quantity, this makes perfect sense: two 16-bit values can be packed into the same register space as a single 32-bit value, speeding up calculations by roughly 2x. Joining this, the new Vega NCU and its higher per-clock instruction throughput, assisted by deeper buffering, should better leverage HBM than the preceding Fiji.
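The storage side of that packing claim is easy to demonstrate with Python’s `struct` module, which supports IEEE-754 half precision via the `'e'` format. This only shows the register-space argument – two halves fit in one 32-bit word – not AMD’s actual packed-math hardware:

```python
# Storage-level illustration of packed half precision: two float16
# values occupy the same 32 bits as one float32 value.
import struct

two_halves = struct.pack('<2e', 1.5, -2.25)  # two 16-bit floats
one_single = struct.pack('<f', 1.5)          # one 32-bit float

print(len(two_halves))  # -> 4 (bytes: two 16-bit values)
print(len(one_single))  # -> 4 (bytes: one 32-bit value)

# Both values are exactly representable in half precision, so the
# round trip is lossless here; most reals would lose precision.
print(struct.unpack('<2e', two_halves))  # -> (1.5, -2.25)
```

With two operands per register, a SIMD unit can issue two FP16 operations in the slot of one FP32 operation – hence the “effectively 2x” figure.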

For gaming, this precision-switching focus is an unlikely play. Vega is split between targeting gaming and targeting deep learning, where AMD is trying to make up ground. Half precision won’t cooperate with all data types that games juggle (you’d likely end up with some artifacting), and developers would have to more explicitly support the precision switching Vega is capable of performing. That’s not to mention that game developers generally don’t retool in-development games to support incoming technologies, and have a somewhat shaky track record of supporting even semi-established technologies in upcoming games.

This isn’t to downplay precision switching, but it’s to hopefully remind the inevitable Hype Train Express passengers that not all emergent architecture details are targeted at gaming.

A few more rapid bullet points before closing out this cursory look:

  • More focused data movement should speed up the pipeline when searching for and streaming large textures or other relevant files.
  • “More than double” the geometry engine peak throughput per clock.
  • Faster primitive culling and overdraw prevention/reduction by better discarding useless polygons before they are rasterized into pixels.

Once we have a better understanding of the changes, we’ll revisit the topic of Vega. For now, this is what we’ve got.

Editorial: Steve “Lelldorianx” Burke
Video: Keegan “HornetSting” Gallick