Recapping the Issue
EVGA admitted to GamersNexus that the lack of thermal pads between the VRAM and VRM components and the heatsink was an "oversight," but thought the issue had been somewhat blown out of proportion by the internet. We agreed, to some extent, but warned that a high ambient or high in-case temperature could exacerbate the issue to a point of concern for owners. Internal temperatures can easily differ by 5-7C from one case to the next, and that difference could be enough to impact more than just the GPU clock-rate through the usual Boost functionality.
Above: GPU diode temperatures tested in different cases. These are perfectly acceptable cases, but the swing in temperature demonstrates that poorer ventilation -- even where otherwise acceptable -- could pose issues for the EVGA VRMs that lack thermal pads.
The ACX coolers have fallen prone to a number of design flaws over the past few years, but this one is one of the easiest to prevent. EVGA's lack of a thermal interface between the PCB baseplate and the ACX heatsink means that the cooler could actually be doing the opposite of cooling. Heat build-up on the silicon looks for an interface to transfer to the copper or alloy sink. That heatsink doesn't do anything unless there's a way for the heated device to transfer its heat, and could end up trapping the heat in an area of effectively zero airflow (see some of our tear-downs to better understand this). There's no airflow in that part of the card -- the air is designed to hit the heatsink and dissipate heat through the greater surface area provided by fins, not somehow ethereally phase through the heatsink and cool those hot zones.
This lack of thermal pads means air is trapped between the heatsink and the devices, and air isn't a great thermal conductor. It's one of the worst, actually, with a thermal conductivity of roughly 0.026W/mK at 25C. This is exactly why we explain that a thermal interface is ideal between an IHS and heatsink -- hot spots form in those imperfections.
This flaw is more than the usual manufacturing surface imperfection of a coldplate, though. This is a design flaw.
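For a rough sense of how much an air gap hurts, here is a minimal sketch of one-dimensional conduction through a flat layer, R = t / (kA). The 1mm gap, the 5mm x 5mm contact patch, and the 3W/mK pad are our assumptions for illustration, not measurements from an EVGA card, and the still-air figure of about 0.026W/mK is a textbook value. The sketch ignores convection and contact resistance entirely.

```python
# Conduction through a flat layer: R = thickness / (k * area), in K/W.
# Every dimension below, and the pad conductivity, is assumed for illustration.


def layer_resistance(thickness_m, conductivity_w_mk, area_m2):
    """Thermal resistance (K/W) of a flat layer under 1D conduction."""
    return thickness_m / (conductivity_w_mk * area_m2)


GAP_M = 1.0e-3               # assumed 1 mm gap between component and heatsink
AREA_M2 = 5.0e-3 * 5.0e-3    # assumed 5 mm x 5 mm contact patch
K_AIR = 0.026                # still air near 25C, W/mK (textbook value)
K_PAD = 3.0                  # assumed mid-range thermal pad, W/mK

r_air = layer_resistance(GAP_M, K_AIR, AREA_M2)
r_pad = layer_resistance(GAP_M, K_PAD, AREA_M2)

print(f"air gap:     {r_air:.0f} K/W")
print(f"thermal pad: {r_pad:.1f} K/W")
print(f"ratio:       {r_air / r_pad:.0f}x")
```

Because the geometry is identical in both cases, the ratio reduces to the ratio of the two conductivities, so the assumed gap and area only affect the absolute numbers.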
The good news is that EVGA has already promised free provision of thermal pads for DIY modding by users. In our discussion with EVGA's Jacob Freeman on the phone, we also learned that EVGA would replace users' cards at no charge to the user. An owner would have to send the device in to EVGA in order to receive a new device with the thermal pads pre-applied. EVGA's thermal pad mod brings operating temperature of the VRM closer to 75C, which is actually pretty damn good. That allows more than enough amperage to the GPU, and ensures more efficient operation. The company deserves credit for this rapid response and free solution to an issue that we are still trying to understand.
We have also been informed that the factory is now making changes to products as they ship, ensuring that new batches of the GTX 1080 FTW and 1070 FTW products will include thermal pads pre-installed.
The Failure
Above: EVGA MOSFET failure, source.
MOSFET power stage failure could happen at 125C. GN contributor and veteran overclocker "Buildzoid" tells us:
"VRMs do not exhibit any signs of approaching a breaking point that can be measured without monitoring the temperature of said VRM. When the VRM hits the critical amount of stress, it will thermal runaway and burn out with no warning."
Above: A power stage's rated output is normally a flat line that drops off once above 125C. The data sheet we have for EVGA's power stages suggests a safe operating range of up to 100C. We have requested more technical specifications on the OnSemi MOSFETs used by EVGA.
"The powerstages for example have a safety shutdown at 180C, however at that point, 1A would probably kill them. Basically, to do a good VRM safety implementation, you would want a current protection that lowers the amount of current allowed based on VRM temperature."
We have been working with Buildzoid to produce a video on this for GamersNexus. Thus far, we've learned that the maximum operating temperature of the VRM is 125C (according to the datasheet). Considering that Tom's Hardware is reporting ~107C for the VRM on its test bench, it would not be unreasonable for a card in a poorly ventilated case with a higher ambient temperature to reach 125C.
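Buildzoid's suggestion of a current limit that falls as VRM temperature rises could look something like the sketch below. This is purely illustrative: the function name, the linear ramp, and both breakpoints are our assumptions, loosely based on the 100C and 125C figures discussed in this article and on the 35A per-stage rating Buildzoid cites for these parts. It does not describe EVGA's firmware or the actual derating curve of the power stages.

```python
# Hypothetical temperature-based current limit for a single power stage.
# Breakpoints and the linear ramp are assumptions, not datasheet values.

FULL_CURRENT_A = 35.0    # per-stage rating cited at 100C
DERATE_START_C = 100.0   # assumed: start reducing the limit above this
CUTOFF_C = 125.0         # assumed: allow no current at or above this


def allowed_current(temp_c):
    """Return the per-stage current limit (A) for a given VRM temperature."""
    if temp_c <= DERATE_START_C:
        return FULL_CURRENT_A
    if temp_c >= CUTOFF_C:
        return 0.0
    # Linear ramp from full current down to zero between the breakpoints.
    fraction = (CUTOFF_C - temp_c) / (CUTOFF_C - DERATE_START_C)
    return FULL_CURRENT_A * fraction


for temp_c in (80.0, 100.0, 110.0, 120.0, 125.0):
    print(f"{temp_c:.0f}C -> {allowed_current(temp_c):.1f}A per stage")
```

The point of the design is that the limit is applied before the stage reaches its maximum temperature, rather than relying on a shutdown that triggers after the damage is done.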
In fact, some reddit users have already posted "RIP" photos of their cards. In the picture we've taken from this post, the card looks like it's undergone textbook MOSFET failure.
Research
The GTX 1080 draws 180W+ at 1.05V, with the GTX 1070 drawing 150W+ at 1.05V. With factory OC'd cards running at 1900MHz or greater, you'd be looking at something like 170-190A draw on the average GPU core, supplied by the VRM. EVGA is shipping its FTWs with a 215W BIOS, which we discussed previously in our GTX 1080 Classified tear-down.
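As a back-of-the-envelope check on those current figures, the sketch below divides board power by core voltage. It assumes, generously, that all of the board power is delivered to the core at 1.05V. In reality some of it goes to memory, fans, and conversion losses, so treat the results as upper bounds. The function name is ours.

```python
# Rough core current estimate: I = P / V.
# Assumes all board power reaches the GPU core at the stated voltage.

CORE_VOLTAGE_V = 1.05


def core_current_a(power_w, voltage_v=CORE_VOLTAGE_V):
    """Upper-bound core current (A) for a given board power (W)."""
    return power_w / voltage_v


for label, power_w in (
    ("GTX 1070 at 150W", 150.0),
    ("GTX 1080 at 180W", 180.0),
    ("FTW BIOS limit at 215W", 215.0),
):
    print(f"{label}: ~{core_current_a(power_w):.0f}A")
```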
The GTX 1070 and 1080 FTW cards use NCP81382 power stages for each phase, of which there are 10 for the core voltage VRM. Working with Buildzoid, we have learned the following:
"These are rated to do an average 35A at 100C (so 350A for the core at 100C). What this translates to at higher temperatures like 125C, I don't know. The data sheet doesn't specify but other MOSFETs will lose 40-50% of their capability going from 100C to 125C. Assuming the same is true for the NCP81382, that would leave them with only 17-21A of average current capability (170A-210A at 125C). Toms Hardware managed to take thermal images of the VRM reaching 106C and measure 114C with a thermal probe. This was in bench test scenario with 22C ambient. In a case where ambient temperatures range between 30 and 40C depending on airflow, that 114C can easily end up 125C+. A VRM trying to provide too much current at too high a temperature will quickly end up like [the above from reddit]."
What to Do
We still think, as of this writing, that destruction of FETs should be a lower risk for users with well-ventilated cases and ambient temperatures closer to 20C-25C. We are still investigating this issue, however, and will be posting a more in-depth piece that analyzes EVGA's VRM capabilities and shortcomings.
For owners of GTX 1080 FTW or GTX 1070 FTW devices, we would strongly suggest getting the thermal pad mod from EVGA as this should immediately remedy any issues. If you feel uncomfortable performing this mod, contact EVGA to get the card replaced.
There isn't really a "mid-step" to failure, here. Your card has either already been damaged and you know it -- probably because of smoke and a cessation of function -- or it hasn't, and it's probably fine. There may be some life reduction in the FETs, but if there's not been a catastrophic failure, we'd recommend just not stressing the card until you've put thermal pads between the VRM / VRAM and the heatsink.
If you are having difficulty getting support from EVGA for any reason, please contact GN via Twitter for back-up.
From Buildzoid, here are additional suggestions:
"To try prevent this [VRM burn-out], you need to either improve VRM cooling capabilities with some kind of physical modification (like the thermal pads that EVGA is sending out, or you can try buy your own and fit them), [or] you can try to provide better airflow over the back of the card. You can try to feed the card cooler air (opening the side of the case is an option for this). [You could also] lower the card's power limit below stock levels. The GTX 1080 FTW ships with a 215W BIOS. Lowering the power limit to 80% will result in a 172W power limit lowering the current draw by a good 30 to 40A."
We think the most likely causes for failure would be:
- Hot internal case temp (poor ventilation) combined with low (auto) fan speed
- Maybe CPU cooler exhausting hot air onto back of card (or intake fans on top of case)
Update: GN is planning to acquire a GTX 1080 FTW card for testing. We will be testing the card in restricted airflow test scenarios and in open air / stock (out of box) scenarios. The plan is to place thermocouples on the 7th power stage, which is about dead center of the card, and see whether we reproduce the temperatures reported by other outlets. The next step would be to stress test the card for several hours and see if a failure is provoked.
Update: Some more specs straight from EVGA: TjMax on MOSFET is 150C, ambient 125C. You obviously lose efficiency / potentially derate as temperature increases. Tcase on power stages is 125C. We will need to test the card ourselves at this point to see which scenarios would cause the failures reported by some users, but we stick to our original stance that failure is only going to happen under specific and unfavorable thermal conditions. Still, if you've got something like the 1080 FTW, it is worth doing the thermal pad mod. There's no reason not to improve thermal performance on something like this.
We will keep you posted as we continue to investigate. We're hoping to have a video live this weekend from overclocker Buildzoid.
Editorial: Steve "Lelldorianx" Burke
Contributing Expertise: "Buildzoid" from Actually Hardcore Overclocking