AMD's RX 480 launch introduces the Polaris architecture to the world, arranging an alliterative architecture assortment from both GPU vendors (Pascal, if you're curious, is the other). This is AMD's answer to the largest market segment, shipping in 4GB and 8GB variants that are priced at $200 and $240, respectively.
During the RX 480 press briefing, AMD strongly defended its stance on maturing and tuning its architectures to extract the maximum possible performance prior to an architectural shift. “We don't have a billion dollars to spend on a single architecture,” said AMD SVP & Chief Architect Raja Koduri, clearly referencing nVidia's boastful Order of 10 unveil. Koduri went on to praise his team for doing an “amazing job with existing products,” but welcomed the arrival of a new 14nm FinFET process node to usurp the long-standing ubiquity of the 28nm planar process.
The AMD RX 480 8GB is on the bench for review today. In this RX 480 8GB review, we benchmark framerate (FPS) & frametime performance, overclocking, thermals, clockrate vs. time endurance, fan RPMs, and noise levels.
AMD RX 480 vs. GTX 1070, GTX 970, 960, & R9 390X [Video Review]
AMD RX 460, RX 470, & RX 480 Specs
|AMD RX 480||AMD RX 470||AMD RX 460|
|Architecture||Polaris 10||Polaris 10||Polaris 11|
|Compute Units (CUs)||36||32||14|
|Base / Boost Clock||1120MHz / 1266MHz||? / ?||? / ?|
|COMPUTE Performance||>5 TFLOPS||>4 TFLOPS||>2 TFLOPS|
|Graphics Command Processor (GCP)||1||1||1|
|Pixels Output / Clock||32||?||16|
|VRAM Capacity||4GB GDDR5 @ 7Gbps / 8GB GDDR5 @ 8Gbps||4GB GDDR5||2GB GDDR5|
|Memory Speed||7Gbps (4GB model) / 8Gbps (8GB model)|
|Memory Bandwidth||224GB/s (4GB model) / 256GB/s (8GB model)|
|DisplayPort||1.3 HBR / 1.4 HDR||1.3/1.4 HDR||1.3/1.4 HDR|
|Release Date||June 29||Mid-July||End of July|
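The memory bandwidth figures in the RX 480 column follow from the per-pin data rate and the width of the memory bus. A minimal sketch of that arithmetic, assuming a 256-bit bus (the table does not state the bus width; 256 bits is what 224GB/s at 7Gbps implies). The function name is ours, for illustration only:

```python
def memory_bandwidth_gbs(data_rate_gbps: float, bus_width_bits: int = 256) -> float:
    """Peak bandwidth in GB/s: per-pin rate (Gbps) x bus width (bits) / 8 bits per byte."""
    return data_rate_gbps * bus_width_bits / 8


# Assumption: both RX 480 variants share the same 256-bit bus.
for label, rate in (("4GB model", 7.0), ("8GB model", 8.0)):
    print(f"{label}: {memory_bandwidth_gbs(rate):.0f} GB/s")
```

These are theoretical peaks; sustained bandwidth in games will be lower.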
Polaris 10 vs. Polaris 11 Specs & Architecture
|Polaris 10||Polaris 11|
|Compute Units (CUs)||36||16|
|COMPUTE Performance||“>5 TFLOPS”||“>2 TFLOPS”|
|Architecture||Gen 4 GCN||Gen 4 GCN|
|Playback Support||4K encode/decode||4K encode/decode|
|Output Standard||DP1.3/1.4 HDR||DP1.3/1.4 HDR|
Of most immediate (and brief) note, our above table reveals some new information: The RX 470's release date is set for “mid-July,” with the RX 460 release date scheduled for the end of July.
Let's take a look at the block diagram.
Architecture – Exploring Polaris & Ellesmere Block Diagrams
Above: AMD Polaris 10 block diagram.
Above: AMD Polaris 11 block diagram.
This is Polaris.
Polaris runs on the new 14nm FinFET process, arriving at almost the same time as nVidia's new 16nm FinFET node from TSMC. Both companies have reduced their process size from 28nm, where the industry resided for a number of years – more than typical – while waiting for something more efficient to come along. The jump to 20/22nm planar nodes wasn't worth it for either company. Tooling-up a factory and building the chip for marginal gains made less sense than maturing development on the existing 28nm process. A long waiting period ensued.
FinFET, as we described in our highly-detailed GTX 1080 review, reduces power leakage and voltage requirements. From our previous content:
“FinFET transistors use a three-dimensional design that extrudes a fin to form the drain and source; the transistor's fins are encircled by the gate, reducing power leakage and improving overall energy efficiency per transistor.”
AMD has coupled its FinFET process with datapath organization improvements and improved data compression, both of which reduce overall power consumption. Memory alone has seen an energy reduction upwards of 40% per bit transacted versus Hawaii and previous generations. This leaves more of the power budget for the cores, of course, but also reduces total consumption. Changes to boosting functions have also improved power utilization, mainly by introducing 7 DPM states (DPM1=sleep, DPM7=fully unlocked for high-end production/gaming).
Clock gating and power gating for under-utilized circuits further the perf/watt argument, as does the introduction of heuristic pre-fetch routines that keep cycles occupied with instructions.
But that's getting ahead of the architecture discussion.
Packed into the RX 480 Polaris 10 chip is a grouping of 36 CUs, over which rests a single GCP (Graphics Command Processor), flanked by two Hardware Schedulers (HWS) and four Asynchronous Compute Engines (ACEs). Polaris 10 and Polaris 11 both operate on a single GCP and have expanded reliance upon the HWS over what was found in Gen 3 GCN. The HWS block was first introduced on Gen 3, and owners of Fury- and 390-class GPUs will be happy to know that microcode updates to firmware will enable some of the Polaris-class HWS enhancements. One of those is the introduction of QRQs, which aid in hardware power reductions when using the Oculus Rift HMD. The HWS is controlled by microcode and can be updated through drivers, beneficial as hardware and APIs mature.
Above: A render of Polaris 10.
The back-end of the render pipeline (GCP → Setup Engine → Scheduler / CEs) begins tasking incoming resources appropriately to low-level GPU components (some virtualized, some physical – ACEs, for instance, are physical compute resources on the silicon). The Graphics Command Processor takes instruction from the CPU and sends it to the scheduler, which is a GPU component. The scheduler then begins the process of managing a familiar graphics pipeline (discussed here), e.g. drawing primitives and geometry, performing light/shading passes, eventually fetching textures, applying transforms, and preparing to rasterize the output. Post-processing, as always, happens at the end of the pipeline.
But none of that is news – just a refresher for our upcoming discussion on asynchronous compute within Polaris 10.
Looking back at the block diagram, we see that there are four Shader Engines containing the 36 CUs on the RX 480. Here's a reminder of what a CU looked like in 2013:
GCN 4.0 has arrived with Polaris (and the above CU architecture is from a 2013 AMD presentation), but the diagram is still useful. In fact, here's the most recent version of a CU block diagram:
Not much has changed at this low level. Under modern GCN architecture, each CU possesses four Vector Units (SIMD-16), four Vector Registers, a local data share, L1 Cache, four TMUs (sometimes called Texture Filter Units), and sixteen Texture Fetch / Load / Store units.
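The “>5 TFLOPS” compute figure from the spec table can be sanity-checked against this CU layout. A rough sketch, assuming each SIMD lane retires one fused multiply-add per clock and that the FMA is counted as two floating-point operations (the usual convention for peak-throughput figures, not something AMD's slides spell out here). The function and constant names are ours:

```python
SIMDS_PER_CU = 4      # four vector units per CU (from the text)
LANES_PER_SIMD = 16   # each vector unit is SIMD-16 (from the text)
FLOPS_PER_LANE = 2    # assumption: one FMA per lane per clock, counted as 2 FLOPs


def peak_tflops(compute_units: int, clock_mhz: float) -> float:
    """Theoretical peak single-precision throughput in TFLOPS."""
    lanes = compute_units * SIMDS_PER_CU * LANES_PER_SIMD
    return lanes * FLOPS_PER_LANE * clock_mhz * 1e6 / 1e12


# RX 480: 36 CUs at the listed 1120MHz base and 1266MHz boost clocks.
print(f"base:  {peak_tflops(36, 1120):.2f} TFLOPS")
print(f"boost: {peak_tflops(36, 1266):.2f} TFLOPS")
```

Under those assumptions, both clocks land above 5 TFLOPS, which is consistent with AMD's stated figure for the RX 480.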
Doubling L2 Cache, Power Savings, & Delta Color Compression
One of the RX 480's biggest changes is its doubling of L2 Cache. With more capacity in cache for data storage, texture references and color compression remain resident for longer (reducing computational workload). This improves processing efficiency and cuts unnecessary bandwidth consumption. There's no reason to transact the same data back-and-forth if it can be stored into a local, nearby cache.
Critically, this also has a side effect which is perhaps overlooked: Energy savings. Along with the power reduction native to smaller FinFET process nodes – moving away from planar helps tremendously – the caching system reduces power consumed by GPU memory. Delta Color Compression (DCC) and 2MB of L2 Cache work in conjunction to minimize VRAM activity, and while it is impossible for us to test something this low-level at GN, AMD tells us that power savings are upwards of 40% on memory transactions alone.
AMD's version of DCC can compress colors up to 8:1, offering 4:1 and 2:1 compression as fall-backs in instances that cannot be fully compressed. DCC functions similarly to what we've shown in our Pascal reviews: The scene is analyzed for similar colors, and those colors are then compressed into as few blocks as possible. An example makes this easy: Imagine looking at a game's skybox. These often consist almost entirely of blues – maybe a few whites are tossed in. The blues might be compressible 8:1; in such an instance, eight blue values may be stored as one blue value. Delta values can then be used to create the rest as needed, rather than absolute color values – which require greater bandwidth.
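To make the delta idea concrete, here is a toy sketch of the general technique. This is not AMD's hardware format, which is not described here; the block size of eight 8-bit values, the choice of the block minimum as the base, and the rule for picking a ratio are all our assumptions for illustration:

```python
def delta_encode(block):
    """Store one base value plus each value's (small) offset from that base."""
    base = min(block)
    return base, [v - base for v in block]


def delta_decode(base, deltas):
    """Rebuild the original values from the base and the stored offsets."""
    return [base + d for d in deltas]


def compression_ratio(block, bits_per_value=8):
    """Toy rule: pick the best of 8:1, 4:1, 2:1 that the encoded block fits into."""
    base, deltas = delta_encode(block)
    delta_bits = max(deltas).bit_length()          # bits needed per offset
    encoded_bits = bits_per_value + delta_bits * len(deltas)
    raw_bits = bits_per_value * len(block)
    for ratio in (8, 4, 2):
        if encoded_bits * ratio <= raw_bits:
            return ratio
    return 1  # fall back to storing the block uncompressed


sky = [200] * 8                                    # uniform blue channel
haze = [200, 201, 200, 201, 200, 200, 201, 200]    # nearly uniform
noise = [0, 255, 17, 90, 203, 44, 128, 9]          # no useful similarity

for name, block in (("sky", sky), ("haze", haze), ("noise", noise)):
    base, deltas = delta_encode(block)
    assert delta_decode(base, deltas) == block     # lossless round trip
    print(f"{name}: {compression_ratio(block)}:1")
```

The point of the sketch is only that near-identical values need very few bits once they are expressed relative to a shared base, while dissimilar values gain nothing and fall back to a lower ratio.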
Pipeline Improvements to Geometry Culling (Tackling Tessellation)
At AMD's press briefing preceding Computex, one of the company's presentations took clear jabs at competitor nVidia: “There's some people who really like to tessellate and tessellate and tessellate. That can create lots of triangles that don't really contribute to the scene.”
AMD is likely referencing nVidia's utilization of tessellation to leverage its own architecture, showcased most recently and heavily with HairWorks in The Witcher.
Polaris updates the geometry engines on the silicon. A primitive discard accelerator culls primitives (triangles, pieces of larger geometry) sooner in the pipeline, targeting primitives that are sort of “orphaned” without sample points. This is different from usual scene z-culling, where obscured geometry is culled prior to being drawn as a means to reduce workload. AMD's discard accelerator specifically looks for geometry it deems valueless to the greater scene, then axes it from the render pipeline. This should improve AMD's performance with heavily tessellated scenes, though it does introduce the question of reviving “frame quality” analysis alongside usual framerate and frametime analysis.
There is also a new index cache for instanced geometry, which is useful in games that re-use the same object multiple times. Skyrim is a good example of a game that relies heavily upon object and model instancing. By using an index cache for small, repeated geometry, memory transactions are reduced and bandwidth is freed-up for other tasks. The primitive is maintained in the index cache and re-used as scenes call for it, so that the GPU does not need to communicate with memory for every fetch.
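One way to picture a primitive “without sample points” is a triangle so small that it slips between pixel centers. The sketch below is our own illustration of that idea, not AMD's implementation (whose actual criteria are not documented here). It assumes one sample per pixel located at the pixel center, and uses a conservative bounding-box test:

```python
import math


def span_has_center(lo: float, hi: float) -> bool:
    """True if some pixel center (i + 0.5, integer i) lies within [lo, hi]."""
    first_center = math.ceil(lo - 0.5) + 0.5   # smallest center >= lo
    return first_center <= hi


def may_cover_sample(tri) -> bool:
    """Conservative test: does the triangle's bounding box contain a pixel center?"""
    xs = [x for x, _ in tri]
    ys = [y for _, y in tri]
    return span_has_center(min(xs), max(xs)) and span_has_center(min(ys), max(ys))


def discard_primitives(triangles):
    """Keep only triangles that could still touch at least one sample point."""
    return [t for t in triangles if may_cover_sample(t)]


tiny = ((10.6, 10.6), (10.9, 10.6), (10.6, 10.9))   # sits between pixel centers
large = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))        # spans several pixels
kept = discard_primitives([tiny, large])
print(f"kept {len(kept)} of 2 triangles")
```

The test only ever rejects triangles that cannot produce a pixel; a triangle whose box contains a center may still miss it, and the rasterizer would sort that out later. Heavy tessellation tends to generate many sub-pixel triangles of the first kind, which is why discarding them early saves work.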
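The benefit of caching instanced geometry can be sketched with a simple fetch counter. Everything here is hypothetical: `IndexCache`, `load_indices`, and the mesh names are stand-ins of our own, and the cache is unbounded, whereas a real on-chip cache is small and must evict entries:

```python
class IndexCache:
    """Toy cache keyed by mesh ID that counts how often 'memory' is touched."""

    def __init__(self, fetch_from_memory):
        self._fetch = fetch_from_memory
        self._entries = {}
        self.memory_fetches = 0

    def get(self, mesh_id):
        if mesh_id not in self._entries:           # miss: go out to memory once
            self._entries[mesh_id] = self._fetch(mesh_id)
            self.memory_fetches += 1
        return self._entries[mesh_id]              # hit: served locally


def load_indices(mesh_id):
    # Hypothetical stand-in for reading a mesh's index buffer from VRAM.
    return (0, 1, 2)


cache = IndexCache(load_indices)
scene = ["barrel"] * 50 + ["crate"] * 30           # 80 instances, 2 unique meshes
for mesh_id in scene:
    cache.get(mesh_id)

print(cache.memory_fetches, "memory fetches for", len(scene), "instances")
```

Without the cache, each of the 80 instances would trigger its own fetch; with it, only the first use of each unique mesh does.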
A new instruction prefetch also improves pipeline efficiency. The prefetch is a heuristic circuit that capitalizes on cycles left under-utilized by pre-fetching data and avoiding stalls in the pipeline. 16-bit registers also lower the power requirement.
A Push to Notebooks (& a New Driver Strategy)
Polaris 11 is targeted heavily at notebooks – though it will find its way into the RX 460.
The Polaris 11 GPU reduces power consumption and favors a more modest approach to graphics. The company's marketing language – “console-class performance” – feels poorly chosen, but has good intentions. AMD hopes to re-enter the mobile computing market with Polaris, pushing first into notebooks.
Power is obviously the big argument, but AMD also makes an argument for height. The smallest Polaris chip (Polaris 11) measures out to 24.5x24.5, with a 1.5mm height. That small z-height is a noteworthy reduction over Bonaire, which sat at 1.9mm, and theoretically enables thinner notebooks.
We do not yet know of any officially announced notebooks.
As for the new driver strategy, AMD will now be releasing game-specific drivers on an as-needed basis, and will push 6 full WHQL drivers per year. This allows enthusiast and gaming users to pull updates for new games as desired, without flooding less concerned users with driver update notifications. AMD claims to have been the target of driver criticism from two camps: Critics of drivers being updated too frequently (think: mainstream and business users), and critics of drivers not being updated frequently enough (gaming users – we were part of that group).
Last year, AMD released 3 WHQL drivers. The company had a period of 180+ days without any official driver releases. This followed the Omega announcement, wherein GamersNexus was told to expect a major driver update on a monthly basis.
That never happened – but AMD hopes to set things right. In a press briefing, the company highlighted its 4 WHQL driver launches thus far in 2016, further underscoring game-specific driver updates in between those launches.
It's a good habit to get into, and one that AMD has promised to better keep up with.
Continue to Page 2 for test methodology.