Fixing Vega FE: Undervolting to Improve Performance & Power Draw

Posted on July 17, 2017

This feature benchmark dives into one of the top requests we received from our Patreon backers: undervolt Vega: Frontier Edition and determine its peak power/performance configuration. The test roped us in immediately, yielding performance uplift largely across the board from preliminary settings tuning. As we dug deeper, and once past all the anomalous software issues, we managed to increase the power available to the Vega: FE Air’s core, reduce power consumption relative to that higher power target, and improve performance in non-trivial ways.

Although power target and core voltage are somewhat joined at the hip, both being tools for overclocking, they don’t govern one another. Power target offset dictates how much additional power budget we’re willing to provide the GPU core (from the power supply) in order to stabilize its clock. GPU Vcore governs the voltage supplied, and will generally range from 900 to 1250mv on Vega: FE cards.

Vega’s native DPM configuration runs its final three P-states at 1440MHz, 1528MHz, and 1600MHz, with DPM7 at 1600MHz/1200mv. This configuration is unsustainable in stock settings, as the core is both power-starved and thermally throttled (we’ll show this in a moment). The thermal limiter on Vega: FE is ~85C, at which point the power and clock will fluctuate hard to try and maintain control of the core temperature. The result is (1) spiky frequencies and frametime latencies, worsening perceived performance, and (2) reduced overall performance as frequency struggles to maintain even 1528MHz (let alone the advertised 1600MHz). To resolve the thermal issue, we can either configure a more intelligent fan curve than AMD’s stock configuration or create a Hybrid card; unfortunately, we’re still left with a new problem – a power limit.

The power limit can be resolved in large part by offsetting power target by +50%. Making this modification is easy and “fixes” the issue of clock-dropping, but introduces (1) new thermal issues – resolvable by configuring a higher fan RPM, of course, and (2) absurdly high power consumption for a non-linear scaling in performance. In order to truly get value out of this approach, undervolting seems the next appropriate measure. AMD’s native core voltage is far higher than necessary for the card to operate at its 1600MHz target, and so lowering voltage improves performance from the out-of-box config. This is for thermal and power reasons alike. We ultimately see significantly reduced power consumption, to the tune of ~90W in some cases, a more stable core clock and thereby higher performance, and lower temperature – and thereby controllable noise.

 

Items to Consider & Methods

Prior to diving into this headlong, there are a few important disclaimers to make:

Not all cards are the same. Ours may undervolt better or worse than yours. If yours is better, that’s fantastic – run with it, but ours couldn’t handle anything better than the 1090mv number we’re primarily publishing. If it’s worse, just increase voltage until stability is met. We’d advise slowly stepping down voltage in increments rather than just copying numbers, as it’s likely that your GPU core will respond differently than ours.
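As a rough illustration of the step-down approach, here is a minimal Python sketch. It does not talk to WattMan, Wattool, or the GPU at all: `is_stable` is a hypothetical callback standing in for whatever stress test you run by hand at each voltage, and the 10mv step and 900mv floor are illustrative assumptions, not recommendations.

```python
from typing import Callable


def find_lowest_stable_mv(
    start_mv: int,
    floor_mv: int,
    step_mv: int,
    is_stable: Callable[[int], bool],
) -> int:
    """Step voltage down from start_mv until a test fails.

    Returns the last voltage that passed. start_mv itself is assumed
    stable (it is the stock voltage) and is never tested.
    """
    if step_mv <= 0:
        raise ValueError("step_mv must be positive")
    last_stable = start_mv
    candidate = start_mv - step_mv
    while candidate >= floor_mv:
        if not is_stable(candidate):
            break  # first failure: stop and keep the previous value
        last_stable = candidate
        candidate -= step_mv
    return last_stable


# Hypothetical stand-in for a real stress test: pretend this particular
# card is stable at 1090mv and above, like the sample in this article.
def fake_stability_test(mv: int) -> bool:
    return mv >= 1090


lowest = find_lowest_stable_mv(
    start_mv=1200, floor_mv=900, step_mv=10, is_stable=fake_stability_test
)
print(lowest)  # 1090
```

In practice the "test" is a long benchmark or game session per step, so a coarser step first and a finer step near the failure point saves time.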

Next, note that the software is still buggy. Conflicting reports abound, but at the end of the day, WattMan and Wattool are both imperfect solutions to this problem. At present, WattMan and Wattool both have an HBM2 underclocking bug that rears its head only under certain conditions. We are not 100% positive when this bug emerges, but we think we’ve pinpointed it to manual configuration of all 7 DPM states and corresponding voltages on the volt-frequency curve. This seems particularly true when attempting to set all 7 to be equal or close in values. That’s not to say it’s impossible to make these tools work; we’re just pointing it out because you could inadvertently cut off a huge amount of performance from an HBM2 downclock that wasn’t user-initiated. Keep an eye on HBM2 frequency while performing these tweaks. Note also that the fan will spin about 200RPM higher than the RPM target you set.
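One way to keep an eye on HBM2 frequency is to log the memory clock during a run and scan the log afterward. The sketch below assumes you already have the samples as a list of MHz values from whatever monitoring tool you use; the expected clock, the tolerance, and the sample log are all made-up illustrative values, and `find_memory_downclocks` is a hypothetical helper, not part of any AMD tool.

```python
from typing import Iterable, List, Tuple


def find_memory_downclocks(
    samples_mhz: Iterable[float],
    expected_mhz: float,
    tolerance_mhz: float = 10.0,
) -> List[Tuple[int, float]]:
    """Return (sample_index, clock) for every sample that sits more than
    tolerance_mhz below the expected memory clock."""
    threshold = expected_mhz - tolerance_mhz
    return [(i, mhz) for i, mhz in enumerate(samples_mhz) if mhz < threshold]


# Made-up log: the memory clock drops partway through the run.
# expected_mhz=945 is an assumed value for illustration only.
log = [945, 945, 944, 500, 500, 945]
drops = find_memory_downclocks(log, expected_mhz=945)
print(drops)  # [(3, 500), (4, 500)]
```

An empty result means no sample fell below the threshold; any hits are worth investigating before trusting benchmark numbers from that run.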

Finally, for our testing, we care about constraining variables between our two main configurations. We’ve got three total configurations: Stock Auto, with no changes whatsoever; Stock +50% PWR, with no voltage or DPM changes; and Stock +50% PWR & undervolted. This is all done with the air card (hence “stock”), not the Hybrid mod. We’ve additionally configured our +50% PWR and +50% PWR & undervolt configurations to use a fixed fan speed of 3400RPM, which is a bit aggressive – but the point is to ensure we’re not thermally throttling. Thus, the latter two configurations are directly thermally comparable, but the Stock – Auto configuration (which runs a fan curve that’s way too limp) will not run the same fan RPM, and is thus not 100% comparable.

Game Bench 

GN Test Bench 2017 | Name | Courtesy Of | Cost
--- | --- | --- | ---
Video Card | This is what we're testing | - | -
CPU | Intel i7-7700K 4.5GHz locked | GamersNexus | $330
Memory | GSkill Trident Z 3200MHz C14 | Gskill | -
Motherboard | Gigabyte Aorus Gaming 7 Z270X | Gigabyte | $240
Power Supply | NZXT 1200W HALE90 V2 | NZXT | $300
SSD | Plextor M7V / Crucial 1TB | GamersNexus | -
Case | Top Deck Tech Station | GamersNexus | $250
CPU Cooler | Asetek 570LC | Asetek | -

Vega: Frontier Edition Undervolting Power Consumption

[Chart: Vega FE undervolt, PCIe cable current draw]

Starting with current draw at the PCIe cables only, the completely stock card starts off drawing about 268W, but as we approach the 400-second mark, the card starts spiking hard between 17.7A and 23A. This behavior correlates with clock throttling – which we’ll show in a moment – and is precisely why we’ve been saying that Vega: FE Air can’t hold its advertised 1600MHz boost clock out of the box. Its power limit and cooler are simply insufficient. The cooler can keep up if you abandon the stock fan profile and accept much higher noise levels, but this is where the card sits out of the box.

The next move is to get the frequency to hit 1600MHz constantly, so we increase the power target by 50% and set a fixed fan speed to solve for the thermal limit, absolving Vega of both its limitations at once. The red line is the result. A new problem emerges: Thermals and frequency are now under control, but PCIe cable power draw is hitting 30A at times, averaging about 28-29A. That’s about 344-370W down the PCIe cables, and is going to start generating a lot more heat as a result.

Finally, our undervolted line emerges: The blue line represents an undervolt of -110mv, dropping us from 1200mv to 1090mv. Current is now 23A, for a power consumption of about 283W at the PCIe cables. That’s about 15W more than the stock setup that struggled to maintain 1600MHz, about 87W lower than the power-offset setup that sustains 1600MHz, and should lower thermals as well. Let’s go to that chart.
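The watt figures above follow from the measured current on the PCIe cables. The rail voltage is not stated anywhere in this article, so the Python sketch below backs it out from the undervolted numbers (283W at 23A) and treats the result as an assumption; the function name is ours and purely illustrative.

```python
# Figures quoted in the text, measured at the PCIe cables.
STOCK_W = 268          # stock card, before it starts throttling
POWER_OFFSET_W = 370   # upper end of the +50% power target range
UNDERVOLT_W = 283      # +50% power target with -110mv
UNDERVOLT_A = 23       # current quoted for the undervolted run


def implied_rail_voltage(power_w: float, current_a: float) -> float:
    """P = V * I, rearranged to V = P / I."""
    return power_w / current_a


# Assumption: the rail voltage is whatever makes 23A equal 283W.
rail_v = round(implied_rail_voltage(UNDERVOLT_W, UNDERVOLT_A), 1)
print(rail_v)  # 12.3

# Cost of the undervolted config over stock, and savings versus the
# power offset alone.
print(UNDERVOLT_W - STOCK_W)         # 15
print(POWER_OFFSET_W - UNDERVOLT_W)  # 87

# Sanity check: 28A and 30A at the implied voltage should land close to
# the quoted 344-370W range.
low_w = 28 * rail_v
high_w = 30 * rail_v
print(round(low_w), round(high_w))
```

The 30A case comes out a watt or so under 370W, which is within rounding for a voltage we inferred rather than measured.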

Undervolting Impact on Thermals – Vega: Frontier Edition

As a note, read the “Items to Consider & Methods” section for clarity on when the red, blue, and orange lines are comparable. The fan speed differences make temperature between the ‘auto’ configuration and other two configurations an indirect comparison; we’re mostly interested in red and blue for this one.

[Chart: Vega FE undervolt, thermals]

Our orange line again represents the Stock – Auto configuration, which runs a fan curve that isn’t aggressive enough, a voltage that’s too high, and a power budget that’s too low. It’s the worst of all options. The result is constantly hitting the thermal limit and throttling, observed at the 85C mark – though we sometimes observe spikes to nearly 90C.

With a 50% power offset applied and the frequency holding at 1600MHz, temperatures are about 73C – but the fan is at 70% to control the thermal variable for undervolting. The result is a noise level at a somewhat unbearable 60dBA versus the auto noise level of roughly 50dBA. There’s room to drop the fan speed with the lower voltage, though, because less heat is being generated as less power is consumed. Our point was just to eliminate the thermal concern for our A/B undervolting test; if you were to do this on your own, we’d suggest min-maxing the fan curve to reduce noise. Ours was unnecessarily high, but was a safety to control the thermal variable.

The more appropriate comparison would be our blue line versus our red line, as these two were tested with the same settings aside from just one variable: Voltage to the core. With the exact same fan speed, the same +50% power offset, and with voltage lowered by 110mv, the Vega undervolted card performs at around 63-66C, for about a 7-10C reduction from the card operating at 1200mv.

Pretty good so far. The last question is of frequency.

Vega FE’s Struggle to Maintain Frequency

[Chart: Vega FE undervolt, core frequency]

Plotting frequency, the orange line shows the stock, out of box configuration for the Vega: Frontier Edition air-cooled card. We’re throttling hard, and only rarely achieving 1600MHz; the regularity with which 1600MHz is achieved diminishes significantly as time goes on, largely due to thermal constraints with the default fan curve. We tend to be operating at DPM power state 5-6 rather than state 7, which would give us full performance.

The red and blue lines converge on this chart, as increasing the power target and removing the thermal limit gives us a perfectly flat 1600MHz frequency – closer to what’s advertised on the box. That said, the red line is pulling 344-370W through the PCIe cables, so that’s a little aggressive and may not be worth the power and thermal load over stock. Undervolting, however, permits 1600MHz and draws 87W less power than the red line, but 15W more than the orange line. That’s a damn good trade.
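If you log core clock during your own runs, one simple way to put a number on "only rarely achieving 1600MHz" is the fraction of samples at or near the target. This is a minimal sketch with invented sample data; `residency_at_target` and the 5MHz tolerance are our own illustrative choices, not something our logging tools produce.

```python
from typing import Iterable


def residency_at_target(
    samples_mhz: Iterable[float],
    target_mhz: float = 1600,
    tolerance_mhz: float = 5,
) -> float:
    """Fraction of clock samples within tolerance_mhz of (or above) the target."""
    samples = list(samples_mhz)
    if not samples:
        raise ValueError("need at least one sample")
    hits = sum(1 for mhz in samples if mhz >= target_mhz - tolerance_mhz)
    return hits / len(samples)


# Invented samples: a throttling run bouncing between the top three
# DPM states, and a flat run pinned at the top state.
throttling_run = [1600, 1528, 1440, 1528, 1528, 1600, 1440, 1528]
flat_run = [1600] * 8

print(residency_at_target(throttling_run))  # 0.25
print(residency_at_target(flat_run))        # 1.0
```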

That shows the theory and proves that all of this works well. Our data shows that this undervolting is working, once you learn to work with the applications, and so the next challenge is to determine whether this impacts actual performance.

We’re keeping these tests limited, as 1600MHz sustained will clearly perform better than 1440 to 1528MHz at DPM5-6. 3DMark starts us off, then we’ll look at two gaming workloads. We’ll leave SPECviewperf out for today, as we sort of already showcased that performance ceiling in our Vega Hybrid results.

FireStrike Ultra – Vega Undervolted Benchmark

[Chart: Vega FE undervolt, FireStrike Ultra]

FireStrike Ultra starts us out. The Vega FE Air card when completely stock ran a graphics score of 4906, with our 50% power offset cards both operating at around 5370 graphics score. This includes the undervolted card, which manages about a 9-10% performance uplift over the stock card. Here’s the crazy thing: Again, we’re not overclocking to achieve this. All we’re doing is making more power available while reducing the voltage, which costs a marginal increase in power consumption in exchange for more consistent and faster frametimes. That’s a pretty good trade for 15W over stock, and is far better than spending a further 87W on the power offset without undervolting.

For point of reference, our Hybrid FE overclock performed at 5774, which is about 7.5% faster than the undervolted card. Kind of puts into perspective just how far undervolting and over-powering will get you.
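For anyone checking the percentages, the uplift math on the FireStrike Ultra graphics scores is just a ratio. A short Python sketch using the scores quoted above (the function name is ours):

```python
def uplift_pct(baseline: float, result: float) -> float:
    """Percentage gain of result over baseline."""
    return (result / baseline - 1) * 100


STOCK_SCORE = 4906       # Vega FE Air, completely stock
UNDERVOLT_SCORE = 5370   # +50% power target, undervolted
HYBRID_OC_SCORE = 5774   # Hybrid FE overclock

print(round(uplift_pct(STOCK_SCORE, UNDERVOLT_SCORE), 1))      # 9.5
print(round(uplift_pct(UNDERVOLT_SCORE, HYBRID_OC_SCORE), 1))  # 7.5
```

The same function works for FPS averages when comparing your own before and after runs.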

[Chart: Vega FE undervolt, TimeSpy]

TimeSpy gives us a gain of about 7.6% from the undervolted card over the stock card, with our Hybrid OC gaining another 9.6% on top of that – though drawing significantly more power at around 33A.

Here’s FireStrike Extreme:

[Chart: Vega FE undervolt, FireStrike Extreme]

As for games, some experienced instability at 1090mv and had to be moved up to 1100mv; For Honor was particularly unstable, and required a core voltage of about 1120mv.

Ghost Recon: Wildlands – Vega FE Undervolted

Let’s look at Ghost Recon first.

[Chart: Vega FE undervolt, Ghost Recon: Wildlands at 4K]

At 4K and with VH settings, the undervolted AMD Vega Air card performed at 41FPS AVG, with lows close by at 37 and 36. The stock card with no modifications operated at 37.7FPS AVG, resulting in a performance uplift of 8.8% from the stock card. This uplift is because the stock card cannot maintain 1600MHz without a power offset – but again, a power offset without undervolting increases your power consumption by 80-90W, thereby increasing thermals that the card deals with. This undervolting and over-powering appears to be the best approach to extracting more performance.

DOOM – Vega FE Undervolted

[Chart: Vega FE undervolt, DOOM at 4K]

With DOOM using Vulkan, Async Compute, and rendered at 4K, the Vega undervolted card operates at 71.6FPS AVG, with low-end frametimes also improved over the stock card. Our AVG FPS improvement is about 11.5% in DOOM, following the trend of DOOM being a somewhat best-case scenario for AMD on a routine basis. The performance uplift is tremendous when considering our minimal power consumption increase and better overall control on the card.

Conclusion: Overpower, Undervoltage Far Better than Stock Config

But not without their caveats.

The trouble with this solution is that it is imperfect by nature. First, not every chip is made the same; ours may undervolt better or worse than others out there, and that means there’s no easy “use these numbers” method. You’ll ultimately have to guess and check at stability to find the numbers that work, which means more work is involved in getting this solution to be rock-steady. That’s not to say it’s difficult work, but it’s certainly not as easy as plugging a card in and using it. We found that some games required 1120mv to remain stable, while others were fine at 1090mv. Ideally, you’d make a profile for each application – but that’s a bit annoying, and becomes difficult to maintain. The next option would be to choose the lowest voltage that is stable across all applications (in our case, that might be 1120mv). You lose some of the efficiency argument when doing this, as the bottom-end is cut off, but still gain overall.

A straight +50% overpower configuration is a huge waste of power down the PCIe cables, which results in running hotter than necessary and thereby louder.

Software is also buggy and frustrating. No, not everyone sees the same issues – that’s the nature of buggy software. It is difficult to precisely pinpoint the issue causing HBM2’s brutal downclocking of -445MHz, but we have seen it happen routinely and on multiple systems with multiple environments. We think that this has to do with manually configuring all 7 DPM states and their corresponding voltage states; when we only configured DPMs 4-7, the downclocking issue did not occur. Fan speed targets are also inaccurate, with the fan running about 200RPM higher than what the user requests. Wattool has the same bugs as WattMan, and Afterburner can’t adjust voltage (yet). The point isn’t to say that it’s impossible to undervolt like we did; it’s just to say that you should really be aware of all the different variables when tweaking. It’s possible to inadvertently hinder performance (in major ways) if HBM2 underclocks without the user’s knowledge. Keep an eye on it.

As for the task at hand, it seems the best possible configuration is to overpower the core (+50%), undervolt the core (roughly -110 to -90mv), and run a fan RPM that keeps temperatures at or below 80C. That’ll depend on your cooling solution, case, and case/room ambient temperatures.

This yields a decent boost to application performance (professional and gaming) without costing the insane +90W draw of a straight +50% overpower configuration.

Editorial: Steve Burke
Video: Andrew Coleman