Bench Theory: How Reliable Are Benchmarks? Error Margins & Standard Deviation

Published February 16, 2018 at 2:09 pm

As part of our new and ongoing “Bench Theory” series, we are publishing a year’s worth of internal-only data that we’ve used to drive our 2018 GPU test methodology. We haven’t yet implemented the 2018 test suite, but will be doing so soon. The goal of this series is to help viewers and readers understand what goes into test design, and we aim to underscore the level of accuracy that GN demands for its publication. Our first information dump focused on benchmark duration, addressing when it’s appropriate to use 30-second runs, 60-second runs, and more. As we stated in the first piece, we ask that any content creators leveraging this research in their own testing properly credit GamersNexus for its findings.

Today, we’re looking at standard deviation and run-to-run variance in tested games. The games on our bench cycle regularly, so the purpose is less about game-specific standard deviation (something we’re now addressing at each game’s launch) and more about an overall understanding of how games deviate run-to-run. This is why conducting multiple, shorter test passes (see: benchmark duration) is often preferable to conducting fewer, longer passes; after all, we are all bound by the laws of time.
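
For a rough sense of why that trade-off works, here’s a minimal sketch in Python (illustrative only, with made-up pass values rather than anything from the bench): the game’s run-to-run spread is whatever it is, but the uncertainty on the averaged result shrinks roughly with the square root of the pass count.

```
# Minimal sketch with hypothetical pass values: averaging several short passes
# tightens the reported number even when run-to-run spread stays the same.
import statistics

passes = [64.2, 65.1, 63.8, 64.9, 64.5]  # AVG FPS from five hypothetical passes

run_to_run_stdev = statistics.stdev(passes)                 # the game's own variance
std_error_of_mean = run_to_run_stdev / len(passes) ** 0.5   # uncertainty on the average

print(f"mean: {statistics.mean(passes):.1f} FPS")
print(f"run-to-run stdev: {run_to_run_stdev:.2f} FPS")
print(f"standard error of the mean: {std_error_of_mean:.2f} FPS")
```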

Looking at statistical dispersion helps us understand whether a game is itself consistent enough to use for hardware benchmarks. If a game is inaccurate or varies wildly from one run to the next, we have to look at whether that variance is driver-, hardware-, or software-related. If it’s just the game, we must then ask the philosophical question of whether it’s the game we’re testing or the hardware. Sometimes, testing a game with highly variable performance can still be valuable, primarily if it’s a game people want to play, like PUBG, despite its questionable performance. Other times, the game should be tossed. If the goal is a hardware benchmark and a game behaves like an outlier while also going largely unplayed, it becomes suspect as a test platform.

We use standard deviation to help build the error margins on our bar charts, which help establish the difference between a margin that is statistically insignificant (within error and not measurably different), one that is measurably different but not appreciably so, and one that is both measurably and appreciably different.
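
As a simplified illustration of how those margins can be derived, the Python sketch below computes each device’s mean and standard deviation across test passes and uses the deviation as the error bar. The card names and pass values are placeholders, and the plus-or-minus one standard deviation margin is one reasonable choice rather than necessarily the exact margin drawn on our charts.

```
# Simplified sketch: deriving per-device error margins from test passes.
# Card names and pass values are placeholders, not real results.
import statistics

results = {
    "Card A": [97.8, 98.4, 97.5, 98.1],     # AVG FPS per test pass
    "Card B": [101.2, 99.0, 102.5, 100.1],
}

for card, passes in results.items():
    avg = statistics.mean(passes)
    dev = statistics.stdev(passes)          # sample standard deviation
    # Error bar drawn on the chart: here, +/- one standard deviation.
    print(f"{card}: {avg:.1f} FPS, error bar +/- {dev:.2f} FPS")
```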

A reminder before we start: this isn’t about vendor A vs. B. The goal is to establish a confidence interval for each game, so that we may then establish whether differences in framerate are functionally identical or legitimately different. In Metro: Last Light, where we see nearly 0 deviation run-to-run, a 64FPS vs. 66FPS difference across all test passes may establish one device as technically superior. It would not be appreciably superior to the human eye, but it would be measurably better. In a game like Overwatch, where deviation can be quite high, we might say that a 64FPS vs. 66FPS difference falls outside our confidence in calling it a significant difference, and that the cards are functionally the same.
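
Reduced to code, that judgment looks roughly like the sketch below. This is a simplified illustration rather than the exact rule we apply, with a one-standard-deviation-per-card margin chosen purely for demonstration.

```
# Simplified sketch of the "functionally identical vs. measurably different" call.
# The margin rule and the numbers are illustrative only.
def verdict(avg_a, dev_a, avg_b, dev_b):
    gap = abs(avg_a - avg_b)
    margin = dev_a + dev_b              # combined error margin, one stdev per card
    return "measurably different" if gap > margin else "functionally the same"

# Metro: Last Light-like case: near-zero deviation, so a 2FPS gap is real.
print(verdict(64.0, 0.1, 66.0, 0.1))    # -> measurably different

# Overwatch-like case: high deviation swallows the same 2FPS gap.
print(verdict(64.0, 2.5, 66.0, 2.5))    # -> functionally the same
```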

We often rerun tests after collecting these initial data points, depending on data consistency and integrity. Those reruns aren’t included here, as we’re just focusing on initial test consistency for this piece.

Test Platform

GN Test Bench 2017 | Name | Courtesy Of | Cost
Video Card | This is what we're testing | - | -
CPU | Intel i7-7700K 4.5GHz locked | GamersNexus | $330
Memory | GSkill Trident Z 3200MHz C14 | GSkill | -
Motherboard | Gigabyte Aorus Gaming 7 Z270X | Gigabyte | $240
Power Supply | NZXT 1200W HALE90 V2 | NZXT | $300
SSD | Plextor M7V, Crucial 1TB | GamersNexus | -
Case | Top Deck Tech Station | GamersNexus | $250
CPU Cooler | Asetek 570LC | Asetek | -

BIOS settings include C-states completely disabled with the CPU locked to 4.5GHz at 1.32 vCore. Memory is at XMP1.

Sniper Elite 4 – 4K/High/Async/Dx12 – Standard Deviation

[Chart: Sniper Elite 4, 4K, AVG FPS standard deviation]

For Sniper Elite’s standard deviation in AVG FPS, we primarily see devices around the 0.1 to 0.8FPS mark, with a few jutting out noticeably from the pack. Of those that spike in standard deviation, we see the heavily overclocked Titan V and the overclocked Vega 56 cards, both of which are testing the stability limits of the core. The Titan V remains a bit more volatile even at stock, at a 1.91FPS standard deviation.

Overall, it’s fair to say that this game is relatively consistent. There are some spikes outward, primarily trending toward higher-end devices that push higher framerates and are thus more susceptible to variance in performance. In sum, though, this game has proven a great tool for testing asynchronous compute and low-level API performance, and has demonstrated relative accuracy in its FPS numbers. When deviation runs high, we typically just run more test passes to determine whether things smooth out. If not, we make a note of it in the content.

[Chart: Sniper Elite 4, 4K, 1% low standard deviation]

As for 1% lows, we see bigger spikes here, which we always account for with error bars or margin-of-error discussion in the content. The nature of a 99th-percentile metric is that its accuracy is reduced, as you’re averaging from far fewer data points. Still, overall, our standard deviation is within about 1FPS for most devices, with a good portion of devices staying below 2.5FPS deviation. The Titan Xp Hybrid sticks out here at 4FPS of 1% low deviation, but this is an instance where we’d probably retest and analyze the reported clock data. It’s possible that this sort of deviation spike comes from a change in the test path, which would essentially be technician error.
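
To make the sample-size point concrete, here’s a small sketch that derives a 1% low the common way, by averaging the slowest 1% of frames; the frame times are synthetic and the exact derivation used in any given tool may differ. The point is the ratio of samples: thousands of frames feed the average, while only a few dozen feed the 1% low.

```
# Sketch of why 1% lows are noisier than AVG FPS: far fewer frames feed the number.
# Frame times are synthetic; this follows the common "average of the slowest 1%
# of frames" approach, which may not match any particular tool exactly.
import random

random.seed(0)
frametimes_ms = [random.gauss(16.7, 1.5) for _ in range(5000)]  # ~80 seconds near 60FPS

fps_per_frame = sorted(1000.0 / t for t in frametimes_ms)
avg_fps = sum(fps_per_frame) / len(fps_per_frame)

slowest_1pct = fps_per_frame[: max(1, len(fps_per_frame) // 100)]  # only ~50 frames
low_1pct = sum(slowest_1pct) / len(slowest_1pct)

print(f"AVG FPS uses {len(fps_per_frame)} samples: {avg_fps:.1f}")
print(f"1% low uses {len(slowest_1pct)} samples: {low_1pct:.1f}")
```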

Destiny 2 – 1080p/Highest/FXAA – Standard Deviation

[Chart: Destiny 2, 1080p, standard deviation]

Here’s how Destiny 2 looks. In terms of averages, represented by the blue bar, we’re seeing a standard deviation of about 1FPS AVG, with the GTX 1080 FTW spiking outward and asking for a few more test passes.

1% lows also remain relatively consistent for a 99th-percentile metric. We’re seeing a standard deviation of about 0.7FPS for 1% lows and about 1.6FPS for 0.1% lows. The 0.1% low metric has a large spike from the GTX 1070 Ti, which we later reran and smoothed out. In this particular instance, our higher standard deviation was a result of how the test software was instantiated: we didn’t have a long enough delay between button press and recording start, resulting in a stutter when the software was called. This data allows us to correct for that error. In the example of our 4.2 standard deviation here, the data points included a high of 103FPS and a low of 95FPS, the latter of which was a result of the program start time. Catching this prior to publishing our data is what allows us to improve data accuracy.
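
A check along these lines can be as simple as the sketch below, which flags any pass that lands well outside the spread of the others so it can be rerun. The pass values and the two-standard-deviation threshold are illustrative and don’t reproduce the exact dataset described above.

```
# Illustrative outlier check across per-pass results; values are placeholders.
import statistics

passes = [103, 102, 101, 103, 95]   # last pass caught the recording-start stutter

median = statistics.median(passes)
spread = statistics.stdev(passes)

for i, value in enumerate(passes, start=1):
    if abs(value - median) > 2 * spread:
        print(f"Pass {i} ({value} FPS) flagged for rerun")
```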

DOOM – 4K/Ultra/0xAA/Vulkan – Standard Deviation

[Chart: DOOM, 4K, AVG FPS standard deviation]

DOOM at 4K/Ultra is next. Note that, for this game, the engine caps performance at 200FPS, hence testing at 4K.

Overall, the game and its Vulkan API integration are exceptionally consistent. Ignoring the outlier products, like the Titan V and one Vega entry, AVG FPS deviates by about 0.6FPS. Standard deviation spikes for the Titan V, where we had a performance range of 10FPS AVG from top to bottom.

[Chart: DOOM, 4K, 1% low standard deviation]

This next 1% low standard deviation chart gives an example of when we decide to rerun a test. The Titan V card here has a 1% low standard deviation of 16FPS, and against a lower overall 1% low value, that is a proportionally large swing. This is from the same dataset as the AVG FPS outlier, which was later rerun. Aside from this point, most of the numbers run a standard deviation of 1FPS or below, which is very consistent for a 99th-percentile metric. DOOM, like Sniper, proves reliable in its run-to-run variance and accuracy.

Ashes of the Singularity – 4K/High/Dx12 – Standard Deviation

[Chart: Ashes of the Singularity, 4K, AVG FPS standard deviation]

We mostly test Ashes of the Singularity at 4K, High, and using Dx12. This game proves to have more variance than most, something we’ve noted time and again as we’ve used it more. For instance, we’ve noticed that performance will sometimes improve across subsequent launches, up to a point of memory saturation. For this game, we see an average standard deviation of about 1-1.5FPS, with several line items exceeding a standard deviation of 2FPS. This is getting a bit disparate, but as long as we adjust our margins of error and variance, we can still use the benchmark as a tool for an optimized DirectX 12 title.

[Chart: Ashes of the Singularity, 4K, 1% low standard deviation]

For 1% lows, it’s pretty consistent. The standard deviation here is completely within reason, with only one card jutting out past 2FPS.

Ghost Recon: Wildlands Standard Deviation

[Chart: Ghost Recon: Wildlands, 1080p, AVG FPS standard deviation]

Ghost Recon: Wildlands, a Dx11 game, proves to be one of the most consistent and reliable. Our AVG FPS manages a standard deviation of 0 to 0.6FPS, a result of the built-in benchmark’s profound consistency. This is similar to Metro: Last Light in this regard, where we see nearly 0 run-to-run deviation.

Metro: Last Light Standard Deviation

[Chart: Metro: Last Light, 1440p, standard deviation]

Speaking of, here’s that long-in-the-tooth game now, albeit only with the 1070 and 580 that we retested last week. For these, we saw a standard deviation that was functionally 0. We observed no real change in performance run-to-run, except for one dip for the 0.1% lows.

Overwatch Standard Deviation

[Chart: Overwatch standard deviation]

Overwatch is a bit trickier. The reason we run our Overwatch tests for 5 minutes at a time is the run-to-run variability that can exist when testing shorter intervals. Here’s an example of that: repeatably and consistently, we observe larger FPS deviation than usual, a result of the dynamic nature of a multiplayer bot match. With a 5-minute test duration, this deviation is squashed and results become more consistent.

Hellblade Standard Deviation

[Chart: Hellblade, 4K, standard deviation]

Hellblade is one of the most intensive games we’ve tested in the past year, and it runs on DirectX 11. At 4K and Very High settings, we measured AVG FPS standard deviation to be moderately variable, ranging from 0 to 2.6FPS. The variance doesn’t appear to be tied to one type of device more than others. To account for this, we run additional test passes on this particular game.

0.1% lows are highly variable with this game, ranging from a 0FPS standard deviation to an 8FPS deviation. The Titan V is a great example of a card that required multiple additional passes to determine whether the variance was caused by Titan V driver or hardware issues, or by Hellblade performance issues.

Conclusion

We have data for about a dozen other games that we tested in the past year, but we’ll cap it here. Starting with Destiny 2, we began publishing run-to-run deviation for the games we test, and we will carry this forward for new game launches in 2018. Early last year, we also added margin-of-error bars to our FPS charts, and will continue with that. The next step is to detail the standard deviation of some of the most common CPU benchmarking tools. We also have more detailed plans for GPU testing, but we’ll keep those quiet until they’re ready.

Editorial, Testing: Steve Burke
Video: Andrew Coleman

Last modified on February 16, 2018 at 2:09 pm
