The theoretical final entry in AMD's Polaris desktop GPU line, the RX 460, has just begun shipping. Back at the pre-Computex press event, AMD informed us that the Polaris line would primarily consist of two GPUs on the Polaris architecture – Polaris 10 & 11 – and that three cards would ship on this platform. Two of the three have already shipped and been reviewed: the ~$240 RX 480 8GB cards (review here) and the ~$180-$200 RX 470 cards (review here). The next architecture will be Vega, which is positioned to potentially be the first consumer GPU architecture to use HBM2.
Today, we're looking at Polaris 11 in the RX 460. The review sample received is Sapphire's RX 460 Nitro 4GB card, pre-overclocked to 1250MHz. The RX 460, like the 470, is a “partner card,” which means that AMD will not sell a reference model for its partners to rebrand. AMD has set the MSRP to $110 for the RX 460, but partner pricing will vary widely depending on VRAM capacity (2GB or 4GB), cooler design, pre-overclocks, and component selection. At time of writing, we did not have a list of AIB partner prices and cards available.
As always, we'll be reviewing the Sapphire RX 460 4GB with extensive thermal testing, FPS testing in Overwatch, DOTA2, GTA V, and more, and overclock testing. Be sure to check page 1 for our new PCB analysis and cooler discussion, alongside the in-depth architecture information.
The RX 460 GPU is stripped down compared to the RX 470 and RX 480, which both run the same Polaris 10 GPU with different CU counts. We'll recap architecture in a moment.
The RX 460 runs 896 Stream Processors (14 CUs * 64 SPs per CU), clocked at the core to 1250MHz with Sapphire's pre-OC applied. The reference specification by AMD calls for a 1200MHz boost, leaving AIB partners room north of this clock-rate for pre-OC models.
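For readers who want to check the shader math, here is a minimal Python sketch of the arithmetic. The CU count, SPs per CU, and clocks come from this article; the factor of 2 FLOPs per SP per clock (one fused multiply-add) is our assumption for how the peak compute figure is derived, and the function names are made up for illustration.

```python
# Figures from the article: 14 CUs, 64 SPs per CU,
# 1200MHz reference boost, 1250MHz Sapphire pre-overclock.
SPS_PER_CU = 64

# Assumption (not stated in the article): each SP retires one fused
# multiply-add per clock, counted as 2 floating-point operations.
FLOPS_PER_SP_PER_CLOCK = 2


def stream_processors(compute_units: int) -> int:
    """Total Stream Processors for a given CU count."""
    return compute_units * SPS_PER_CU


def peak_tflops(compute_units: int, clock_mhz: float) -> float:
    """Theoretical peak FP32 throughput in TFLOPS at a given core clock."""
    flops = stream_processors(compute_units) * FLOPS_PER_SP_PER_CLOCK * clock_mhz * 1e6
    return flops / 1e12


if __name__ == "__main__":
    print("Stream Processors:", stream_processors(14))
    print("Peak TFLOPS at 1200MHz:", round(peak_tflops(14, 1200), 2))
    print("Peak TFLOPS at 1250MHz:", round(peak_tflops(14, 1250), 2))
```

This is a theoretical ceiling only; real-world throughput depends on how long the card actually holds its boost clock.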
AMD is using GDDR5 memory for the RX 460, operating at 7Gbps in this particular card (1750MHz actual memory clock, *2 for DDR and *2 again for GDDR5, for 7Gbps effective). With the 128-bit memory interface, the RX 460 ends up with a theoretical maximum memory bandwidth of 112GB/s; by way of comparison, the RX 470 pushes ~211GB/s, with the RX 480 at ~224GB/s (4GB model) or ~256GB/s (8GB model).
The RX 460 ships in both 2GB and 4GB models. The $110 MSRP is likely the floor for potential 2GB models – though that price point may not be populated at launch, as both GPU vendors have shown this cycle – with 4GB cards likely to retail higher.
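The bandwidth figure falls out of two multiplications, sketched below in Python. The 1750MHz clock, the two doublings, and the 128-bit bus are from the text above; the function names are ours, and the 256-bit bus used for the RX 480 comparison is an assumption not stated in this article.

```python
def effective_rate_gbps(memory_clock_mhz: float) -> float:
    """Effective per-pin data rate: x2 for DDR, x2 again for GDDR5."""
    return memory_clock_mhz * 2 * 2 / 1000


def bandwidth_gb_per_s(rate_gbps: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth: per-pin rate times bus width, in bytes."""
    return rate_gbps * bus_width_bits / 8


if __name__ == "__main__":
    rate = effective_rate_gbps(1750)
    print("RX 460 effective rate (Gbps):", rate)
    print("RX 460 bandwidth (GB/s):", bandwidth_gb_per_s(rate, 128))
    # Assumed 256-bit bus for the RX 480 (not stated in this article).
    print("RX 480 8GB bandwidth (GB/s):", bandwidth_gb_per_s(8, 256))
```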
We'll tear Sapphire's RX 460 Nitro down in a moment and show the PCB and internals. First, a refresher on architecture.
AMD RX 460, RX 470, & RX 480 Specs
| ||AMD RX 480||AMD RX 470||AMD RX 460|
|Architecture||Polaris 10||Polaris 10||Polaris 11|
|Compute Units (CUs)||36||32||14|
|Base / Boost Clock||1120MHz / 1266MHz||926MHz / 1206MHz||1090MHz / 1200MHz|
|Compute Performance||>5 TFLOPS||Up to 4.9 TFLOPS||Up to 2.2 TFLOPS|
|Graphics Command Processor (GCP)||1||1||1|
|Peak Texture Filter Rate||182.3GT/s||154.4GT/s||57.6GT/s|
|Peak Pixel Fill Rate||40.5GP/s||38.6GP/s||19.2GP/s|
|L2 Cache||2MB||2MB (?)||1MB|
|VRAM Capacity||4GB GDDR5 @ 7Gbps / 8GB GDDR5 @ 8Gbps||4GB GDDR5||2GB GDDR5|
|Memory Speed||7Gbps (4GB model) / 8Gbps (8GB model)|
|Memory Bandwidth||224GB/s (4GB model) / 256GB/s (8GB model)|
|DisplayPort||1.3 HBR / 1.4 HDR||1.3/1.4 HDR||1.3/1.4 HDR|
|Release Date||June 29||August 4||August 8|
Polaris 10 vs. Polaris 11 Specs & Architecture
| ||Polaris 10||Polaris 11|
|Compute Units (CUs)||36||16|
|Compute Performance||“>5 TFLOPS”||“>2 TFLOPS”|
|Architecture||Gen 4 GCN||Gen 4 GCN|
|Playback Support||4K encode/decode||4K encode/decode|
|Output Standard||DP1.3/1.4 HDR||DP1.3/1.4 HDR|
Refresh on Polaris 11 GPU Architecture
Here's the Polaris 11 block diagram:
And Polaris 10:
As with Polaris 10, Polaris 11 sees focus on power efficiency improvements – a trend for this generation of Pascal & Polaris – alongside the process shrink. The 14nm FinFET fabrication process moves AMD away from planar transistors: the channel is raised into a fin and the gate wraps around it, which reduces power leakage at a low level. Atop this platform, AMD has also moved to optimize its data path and to compress its data transactions more heavily. Memory receives a good deal of this optimization treatment, reducing energy spent per bit transacted by upwards of 40%.
For gaming and other supported use cases, delta color compression can help in this regard – the act of comparing nearby, numerically adjacent colors in a scene, then compressing them into 2:1, 4:1, or 8:1 ratios. As in our past coverage, a skybox is the easiest example: The average skybox will largely be composed of a few different types of blue (imagine: Just Cause 3), and those colors are close enough that they can be condensed into a single (or a few) “color squares.” The compressed “square” will be an intermediary value between all colors within the selected region; this could be up to 8 colors compressed to one, though will more commonly be 2:1 or 4:1. The result is fewer transactions across the bus, reducing power draw roughly in proportion and freeing some room on the bus for other transactions (though additional instructions are required and do take some of that width).
On the receiving end, those colors get unpacked to their original values. The colors are restored by using a delta value (ergo, “delta” color compression), or the difference between the original and the intermediary value. Applying that delta to the intermediary color is computationally cheap. That's color compression. The concept is fairly simple and has existed for many GPU generations, but was recently improved (by both AMD and nVidia) to support up to 8:1 compression. Compression is lossless.
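To make the pack/unpack idea concrete, here is a toy Python sketch of delta encoding on a small block of sky-like colors. This is not AMD's hardware algorithm: the first-pixel anchor, the bit-width estimate, and every name in the code are our simplifying assumptions, chosen only to show that storing small deltas takes fewer bits and that the round trip is lossless.

```python
from typing import List, Tuple

Color = Tuple[int, ...]  # 8-bit R, G, B components


def compress_block(pixels: List[Color]) -> Tuple[Color, List[Color]]:
    """Store one anchor color plus each pixel's delta from that anchor.

    Simplification: the anchor is just the first pixel of the block.
    """
    anchor = pixels[0]
    deltas = [tuple(c - a for c, a in zip(p, anchor)) for p in pixels]
    return anchor, deltas


def decompress_block(anchor: Color, deltas: List[Color]) -> List[Color]:
    """Restore the original colors by adding each delta back to the anchor."""
    return [tuple(a + d for a, d in zip(anchor, delta)) for delta in deltas]


def bits_per_delta_component(deltas: List[Color]) -> int:
    """Signed bit width large enough to hold every delta component."""
    largest = max(abs(v) for delta in deltas for v in delta)
    return largest.bit_length() + 1 if largest else 0


def compression_ratio(pixels: List[Color]) -> float:
    """Raw bits (24 per pixel) divided by anchor bits plus packed delta bits."""
    anchor, deltas = compress_block(pixels)
    raw_bits = len(pixels) * 24
    packed_bits = 24 + len(pixels) * 3 * bits_per_delta_component(deltas)
    return raw_bits / packed_bits


if __name__ == "__main__":
    # Hypothetical block of eight near-identical sky blues.
    sky = [(90, 150, 230), (91, 150, 231), (90, 151, 230), (92, 150, 229),
           (90, 150, 230), (91, 151, 231), (90, 150, 229), (92, 151, 230)]
    anchor, deltas = compress_block(sky)
    assert decompress_block(anchor, deltas) == sky  # lossless round trip
    print("Approximate ratio for this block:", compression_ratio(sky))
```

A block of wildly different colors would produce large deltas and little or no savings, which is why skyboxes and other smooth regions benefit most.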
The non-trivial reduction in energy “spent” on memory is part of a greater picture of efficiency gains. The Polaris architecture also institutes 7 DPM states to create a volt-frequency curve, effectively an updated version of “boost.” Boost clocks have now become tremendously complex in comparison to their more rudimentary early implementations, and both vendors have some level of volt-frequency scaling that modulates its states based upon system demand. In the lower DPM states, closer to GPU sleep, the voltage is dropped to its lowest stable value and clock-rate is reduced in step (e.g. 300MHz clock to a ~700mV voltage, or similar). There's no reason to drive a 1200MHz clock-rate at all times of operation, such as during desktop use or most (not all) web browsing. This is also why we see some clock-rate swings in the new reference cards from both nVidia and AMD, as they'll often run a more energy-conservative volt-frequency curve that results in a more turbulent core clock. The process is generally intelligent enough to not impact average FPS, but certainly can hit the 0.1% low values (see: discussion on GTX 1080, where we had this issue with overclocking power availability).
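As a rough illustration of why stepping down the volt-frequency curve saves so much power, here is a Python sketch of a 7-state DPM table. Only the 300MHz / ~700mV floor and the 1200MHz top clock come from this article; every other clock and voltage below is a hypothetical placeholder, and the selection logic is our own simplification rather than AMD's actual power-management behavior. The power estimate uses the standard approximation that dynamic power scales with voltage squared times frequency.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DpmState:
    clock_mhz: int
    voltage_mv: int


# HYPOTHETICAL table: only the first entry (300MHz, ~700mV) and the
# 1200MHz top clock appear in the article. All other values are invented.
DPM_STATES: List[DpmState] = [
    DpmState(300, 700),
    DpmState(450, 750),
    DpmState(600, 800),
    DpmState(750, 850),
    DpmState(900, 925),
    DpmState(1050, 1000),
    DpmState(1200, 1075),
]


def select_state(required_mhz: float) -> DpmState:
    """Pick the lowest state whose clock satisfies the demand."""
    for state in DPM_STATES:
        if state.clock_mhz >= required_mhz:
            return state
    return DPM_STATES[-1]  # demand exceeds the table: run the top state


def relative_dynamic_power(state: DpmState) -> float:
    """Dynamic power relative to the top state, using P ~ V^2 * f."""
    top = DPM_STATES[-1]
    return (state.voltage_mv ** 2 * state.clock_mhz) / (top.voltage_mv ** 2 * top.clock_mhz)


if __name__ == "__main__":
    idle = select_state(200)
    print("Idle state:", idle)
    print("Idle dynamic power vs. top state:", round(relative_dynamic_power(idle), 3))
```

Because voltage enters as a square, dropping voltage and clock together cuts dynamic power far faster than the clock reduction alone would suggest.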
Additional improvements have been made with clock-gating and power-gating on under-utilized circuits. AMD's Polaris architecture uses a heuristic approach to pre-fetch routines and occupy all available cycles. This more evenly disperses instructions across GPU cycles, so if a cycle looks like it may be idle (no work in the queue), but work is forthcoming, the GPU will work on a pre-fetch.
As for Polaris 11, the block diagram shows the CU layout within the Shader Engines; the RX 460 enables 14 of the die's 16 CUs, each hosting 64 Stream Processors. SP architecture is the same as GCN 3.0, more or less, and the GCN 4.0 SP block is shown below:
As before, we're looking at 4 TMUs per CU, for a total of 56 texture mapping units on the RX 460. Adjacent to these are the 16 texture load/store units (L/S), along with local L1 Cache (16KB). 256x32b L/S units are present on the RX 460.
Zooming out and looking back at the block diagram as a whole, we also see that the Polaris 11 GPU sticks with the 4x ACE layout. A single GCP sits at the “top” of the GPU and handles CPU instructions, with ACEs assisting in asynchronous compute processing at a hardware level. AMD is also capable of leveraging compute preemption where applicable. Read more about compute preemption here.
Hardware Schedulers (HWS) have been around for a little while, but are still somewhat new. The HWS block saw introduction in Gen3, and is relied upon more heavily in Gen4 GCN. Owners of Fury and 390-class GPUs may be happy to hear that HWS can be updated via microcode through drivers, meaning that some level of firmware patching can continue to provide additional hardware-level optimizations to those cards. That functionality is also native to the Polaris GPUs. Polaris now supports Quick Response Queues (QRQs) for hardware power reductions when using a VR headset, allowing for prioritization in the queue for timewarp commands. The idea here, as with nVidia's compute preemption, is that a current process can be interrupted to prioritize the more important not-make-the-user-vomit process.
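The queue prioritization idea can be sketched in software with an ordinary priority queue. The Python below is purely conceptual: the class, its priority levels, and the command names are hypothetical, and it models only "urgent work jumps the line," not the hardware scheduler or true mid-task preemption.

```python
import heapq
import itertools


class CommandQueue:
    """Toy command queue where quick-response work is served first."""

    QUICK_RESPONSE = 0  # lower number is dequeued first
    NORMAL = 1

    def __init__(self) -> None:
        self._heap = []
        self._order = itertools.count()  # preserves FIFO order within a priority

    def submit(self, name: str, priority: int = NORMAL) -> None:
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def next_command(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    queue = CommandQueue()
    queue.submit("draw_scene_left_eye")
    queue.submit("draw_scene_right_eye")
    # A hypothetical timewarp job arrives late but is dequeued first.
    queue.submit("timewarp", CommandQueue.QUICK_RESPONSE)
    while len(queue):
        print(queue.next_command())
```

The counter in each heap entry keeps same-priority commands in submission order, so normal rendering work is delayed by the urgent job but never reordered.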
Sapphire RX 460 Nitro PCB Shot & VRM Components
Here's a look at the Sapphire RX 460 Nitro PCB. Note that we've got a tear-down video forthcoming on the YouTube channel (and accompanying article), as well as a PCB / VRM analysis.
The bare PCB is above. Cooler discussion is further down. The PCB is fairly simple, comparatively. Let's mark it up:
Above: PCB components as outlined by Libor “BuildZoid” Sadilek, who is working on our upcoming RX 460 PCB analysis video.
The Sapphire RX 460 Nitro is using a 4-phase power design for its Vcore VRM, with a +1-phase for the memory VRM. The NCP81022 OnSemi voltage controller makes an appearance on the RX 460, accompanied by Magna MDU1514 and 1517 high- and low-side FETs (respectively). We'll discuss this in more depth (and talk about hard mods and volt mods) in our upcoming PCB analysis. Subscribe to the channel for that.
Sapphire RX 460 Nitro Cooler
The cooler is simple, as one might expect: There's a dual-push fan configuration atop a widely spaced aluminum fin heatsink, using a large aluminum slug with embedded copper coldplate for contact to the board and VRAM. The copper coldplate isn't the highest quality we've seen, and surface roughness contributes to some of the thermal results we later show, but it's also only responsible for sinking the Polaris 11 GPU – the smallest chip this generation.
As a point toward Sapphire, the push fans are removable by two easily accessed screws. This is to accelerate the RMA process, allowing users to RMA individual fans rather than an entire unit. In a well thought-out touch, the fans connect to the shroud using metal contacts and pins, rather than the more common fan cables. These pins, of course, eventually terminate in the usual fan cables that connect to the board – but taking the cabling out of the user's hands does reduce the risk of accidentally leaving something unplugged.
Two heatpipes are present in the Nitro, both 6mm. The heatpipes feed through the VRM and the GPU core, as one might expect.
Continue to page 2 for testing methodology.