Bench Theory: Does Benchmark Duration Matter? One Year of Testing

By Steve Burke. Published February 01, 2018 at 2:27 pm

The short answer to the headline is “sometimes,” but it’s more complicated than just FPS over time. To really address this question, we first have to explain the oddity of FPS as a metric: Frames per second is inherently an average. If we tell you something is operating at a variable framerate, but is presently at 60FPS, what does that really mean? Because framerate is an average over a period of time, any spot-measurement of the framerate at a given millisecond is flawed by definition. All this stated, the industry has accepted frames per second as the standard measure of game performance, and it is one of the most user-friendly means of conveying the actual, underlying metric: Frametime, or the frame-to-frame interval, measured in milliseconds.
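To make the frametime-versus-FPS relationship concrete, here is a minimal Python sketch. The function names and the sample frametimes are hypothetical and chosen only for illustration; they are not taken from our logs or tooling.

```python
from typing import Sequence


def frametime_ms_to_fps(frametime_ms: float) -> float:
    """Instantaneous framerate implied by a single frame-to-frame interval."""
    return 1000.0 / frametime_ms


def average_fps(frametimes_ms: Sequence[float]) -> float:
    """Average FPS over a window: frames rendered divided by elapsed seconds."""
    total_seconds = sum(frametimes_ms) / 1000.0
    return len(frametimes_ms) / total_seconds


# Hypothetical capture: five smooth 16ms frames and one 50ms stutter.
frametimes = [16.0, 16.0, 16.0, 50.0, 16.0, 16.0]

print(f"Average FPS over the window: {average_fps(frametimes):.1f}")
print(f"Slowest single frame as FPS: {frametime_ms_to_fps(max(frametimes)):.1f}")
```

The average hides the single slow frame, which is why frametime-derived lows are reported alongside average FPS.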

Today, we’re publicly releasing some internal data that we’ve collected for benchmark validation. This data looks specifically at benchmark duration optimization: tests that min-max for maximum accuracy and card count against the minimum test time required to retain that accuracy.

Before we publish any data for a benchmark – whether that’s gaming, thermals, or power – we run internal-only testing to validate our methods and thought process. This is often where we discover flaws in methods, which allows us to refine them prior to publishing any review data. There are a few things we traditionally research for each game: Benchmark duration requirements, load level of a particular area of the game, the best- and worst-case performance scenarios in the game, and then the average expected performance for the user. We also regularly find shortcomings in test design – that’s the nature of working on a test suite for a year at a time. As with most things in life, the goal is to develop something good, then iterate on it as we learn from the process.

A Recap of Validation History

Destiny 2 beta below:

GTX 1080 Ti SC2, 4K, Highest

First 20 Minutes of Destiny 2 Beta Campaign

Test                      AVG FPS   1% LOW   0.1% LOW
Spot-Check #1             51.0      45.0     44.0
Spot-Check #2             52.0      46.0     40.0
Spot-Check #3             53.0      48.0     46.0
Spot-Check #4             53.0      47.0     43.0
Spot-Check #5             51.0      44.0     39.0
Spot-Check #6             55.0      48.0     43.0
Spot-Check #7             51.0      46.0     39.0
Spot-Check #8             51.0      47.0     44.0
Spot-Check #9             44.0      39.0     33.0
Spot-Check #10            51.0      42.0     42.0
5-Minute Campaign Intro   58.0      49.0     45.0
Final Bench Scene         55.0      48.0     47.5
Standard Deviation        3.4       2.9      3.9

GTX 1080 Ti SC2, 4K, Highest

Multiple Competitive Matches

Test                      AVG FPS   1% LOW   0.1% LOW
Match #1, Spot-Check #1   53.0      45.0     43.0
Match #1, Spot-Check #2   53.0      42.0     40.0
Match #2, Spot-Check #1   57.0      50.0     41.0
Match #2, Spot-Check #2   55.0      48.0     46.0
Final Bench Scene         55.0      48.0     47.5
Standard Deviation        1.7       3.1      3.2

Let’s get some examples on the screen of times we’ve published internal research: With Destiny 2’s beta, we tested various parts of the game, with test durations spanning 30 seconds to 20 minutes. This allowed us to determine that most parts of the intro campaign performed equivalently, while a select few sections were highly demanding of the system. This testing included both multiplayer and singleplayer benchmarking of various durations.

[Chart: For Honor, real gameplay vs. built-in benchmarks]

Above: For Honor beta initial testing (before full launch)

We also did this for games like For Honor, where we determined that the built-in benchmark wasn’t at all representative of real-world gameplay, something that pushed us away from using the built-in option.

We did this again for Mass Effect: Andromeda, where we discovered that, with early drivers on AMD cards, the game would stutter on the first pass through the test area. The result was that we needed to include more test passes than normal, then present data both with and without the stutter included. This is something that was later resolved by AMD.

[Chart: Mass Effect: Andromeda, Fury X]

RX 480 Test Passes - 1080p/Ultra
Pass     AVG FPS   1% LOW   0.1% LOW
Pass 1   72        56       6
Pass 2   74        57       53
Pass 3   75        60       55
Pass 4   75        59       56
Pass 5   74        59       54

Excerpt from MEA content: "In this one, our first test pass shows 72FPS AVG, with 56FPS 1% low and 6FPS 0.1% low. This shows itself in stutters during the first pass, but smooths out in subsequent passes. We improve from roughly 6FPS 0.1% lowest performance to 53FPS in the second pass."
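As an illustration of why the first pass has to be handled separately, the sketch below averages the RX 480 passes from the table above both with and without pass 1. Averaging the passes this way is an assumption made for the example, not a statement of exactly how the published charts were built.

```python
from statistics import mean

# RX 480 test passes at 1080p/Ultra: (AVG FPS, 1% LOW, 0.1% LOW), from the table above.
rx480_passes = [
    (72, 56, 6),   # Pass 1 (stutters on first run through the test area)
    (74, 57, 53),  # Pass 2
    (75, 60, 55),  # Pass 3
    (75, 59, 56),  # Pass 4
    (74, 59, 54),  # Pass 5
]


def column_means(passes):
    """Mean of each metric column across the given passes."""
    return tuple(mean(column) for column in zip(*passes))


with_stutter = column_means(rx480_passes)
without_stutter = column_means(rx480_passes[1:])  # drop the first pass

labels = ("AVG FPS", "1% LOW", "0.1% LOW")
for label, a, b in zip(labels, with_stutter, without_stutter):
    print(f"{label}: {a:.1f} with pass 1, {b:.1f} without")
```

Average FPS barely moves, while the 0.1% low shifts by roughly 10FPS, which is the reason both views are worth presenting.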

GTX 1060 Test Passes - 1080p/Ultra
Pass     AVG FPS   1% LOW   0.1% LOW
Pass 1   88        66       60
Pass 2   92        70       65
Pass 3   90        69       65
Pass 4   91        70       65

The point is that we do this for each game, and often discover anomalous behaviors for each GPU vendor, or for particular regions in the game, or with specific graphics settings.

Another one of our discoveries was that dynamic reflections had the most significant performance impact in Overwatch, for which we generated frametime charts plotting the difference: a jump from ~10ms to ~16ms on the tested device. We often also test graphics scaling on a particular set of hardware, giving us an understanding of where devices may gain an unexpected lead over competing devices.

[Chart: Overwatch graphics settings, frametimes]

[Chart: Overwatch graphics settings, framerate]

With Destiny 2’s beta, this allowed us to determine that nVidia had a significant advantage only when under “Highest” settings, but that its advantage faded away under “High” settings.

[Chart: Destiny 2, GTX 1070 vs. Vega 56 frametimes, 1080p/Highest]

Above: Launch frametimes

[Chart: Destiny 2, GTX 1070 vs. Vega 56 frametimes]

Above: Beta frametimes at 1440p (later resolved)

[Chart: Destiny 2, 1440p/Highest, beta vs. launch]

AMD fixed this upon launch of the game, something that nVidia also later leveraged to improve its own performance – again, specifically under highest settings. The point here is that there is significant performance impact between these two settings, but visual impact may not be significant to the user, potentially meaning that lower-end devices could do just fine on High, but not Highest. This is important for determining which settings should be used for reviews and benchmarks.

[Chart: Watch Dogs 2, CPU scaling]

We studied this again in Watch Dogs 2, where we demonstrated a CPU settings scalability chart for framerate.

All of that is to say that we work hard to understand what we’re testing, and harder to create charts demonstrating why we test the scenarios we do. The next big concern is reliability and repeatability. With a benchmark, you’ve really got two options: Repeatable and reliable, or realistic. You can’t have both, but you can study the game’s behaviors to best simulate realism with repeated tests in areas representative of the whole game.

Our approach to benchmarking theory is to collect large datasets with accurate, repeatable numbers. Ultimately, what we care about is device scalability, not hard FPS; we care about hard FPS in per-game benchmarks, where we test each settings configuration for that particular game, and do so on a wide range of devices. This means that we’re really looking at percentages to determine the best relative to other devices, but not necessarily whether a framerate is considered ideal for a particular game – that analysis is served separately, often in standalone content.

We’ll discuss repetition and standard deviation of test results in another content piece this week, but we first need to talk about optimal test duration. This becomes a balancing act between fitting in more repetitions and gaining more accuracy, depending on the game and its behavior. Some games have a great level of variance, like multiplayer games, and are often best tested with fewer, longer tests. Other games are best tested with numerous short tests. When working on as many devices as we do, the math pretty easily shows that it would be physically impossible to run every test for long durations while still retaining accuracy. For this reason, we optimize test duration on a game-by-game basis.
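The time-budget math can be sketched in a few lines of Python. The device and game counts here are hypothetical placeholders, not our actual test matrix; only the four-pass minimum and the 30-second and 5-minute durations come from this article.

```python
def total_bench_hours(devices, games, passes, seconds_per_pass, overhead_seconds=0):
    """Hours of pure test time for a full device x game x pass matrix.

    overhead_seconds is a hypothetical per-pass allowance for loading and setup.
    """
    total_seconds = devices * games * passes * (seconds_per_pass + overhead_seconds)
    return total_seconds / 3600


# Hypothetical matrix: 20 devices, 8 games, 4 passes each.
short_hours = total_bench_hours(devices=20, games=8, passes=4, seconds_per_pass=30)
long_hours = total_bench_hours(devices=20, games=8, passes=4, seconds_per_pass=300)

print(f"30-second passes: {short_hours:.1f} hours")
print(f"5-minute passes:  {long_hours:.1f} hours")
```

Moving every pass from 30 seconds to 5 minutes multiplies the time cost by ten before any setup overhead, so long passes only make sense where a game's variance calls for them.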

We normally keep this information private, as it is core to our business and ability to compete. That said, as we are again revising our methodology for 2018, we thought now would be a good time to reveal some of last year’s test research. If you are a content creator and use this information in testing, mention GamersNexus in related coverage.

All of these tests were conducted a minimum of 4 times and averaged. Test durations ranged from 30 seconds to 5 minutes, depending on the game. Error bars are present to display standard deviation between all test runs.
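For readers who want to reproduce the standard deviation figures, here is a minimal Python sketch using the rows of the Destiny 2 "Multiple Competitive Matches" table from earlier in this article. The article does not state which formula was used; the sample standard deviation (n - 1 denominator) is an assumption that happens to reproduce that table's published row after rounding.

```python
from statistics import stdev

# Rows copied from the "Multiple Competitive Matches" table:
# (AVG FPS, 1% LOW, 0.1% LOW)
match_rows = [
    (53.0, 45.0, 43.0),  # Match #1, Spot-Check #1
    (53.0, 42.0, 40.0),  # Match #1, Spot-Check #2
    (57.0, 50.0, 41.0),  # Match #2, Spot-Check #1
    (55.0, 48.0, 46.0),  # Match #2, Spot-Check #2
    (55.0, 48.0, 47.5),  # Final Bench Scene
]


def column_stdevs(rows):
    """Sample standard deviation (n - 1 denominator) of each metric column."""
    return [stdev(column) for column in zip(*rows)]


sds = column_stdevs(match_rows)
for label, sd in zip(("AVG FPS", "1% LOW", "0.1% LOW"), sds):
    print(f"{label}: {sd:.1f}")
```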

Test Platform

GN Test Bench 2017   Name                            Courtesy Of   Cost
Video Card           This is what we're testing      -             -
CPU                  Intel i7-7700K 4.5GHz locked    GamersNexus   $330
Memory               GSkill Trident Z 3200MHz C14    GSkill        -
Motherboard          Gigabyte Aorus Gaming 7 Z270X   Gigabyte      $240
Power Supply         NZXT 1200W HALE90 V2            NZXT          $300
SSD                  Plextor M7V, Crucial 1TB        GamersNexus   -
Case                 Top Deck Tech Station           GamersNexus   $250
CPU Cooler           Asetek 570LC                    Asetek        -

BIOS settings include C-states completely disabled with the CPU locked to 4.5GHz at 1.32 vCore. Memory is at XMP1.

Metro: Last Light Performance Scaling

Settings: Very High quality, High tessellation, 1440p, rest default

We’re starting with the oldest benchmark title, as it is the easiest to configure for multiple test durations. We have historically proven Metro to also be the single most consistent benchmark title, under the right settings. All test methodology and components are in the article linked in the description below.

[Chart: Metro: Last Light, GTX 1070]

Starting with only the GTX 1070 Gaming X, we see that the average FPS sits at 84 for a set of 4x 30-second test passes, 85FPS AVG for 4x 60-second test passes (this is within test variance and error), and 87.8FPS AVG for 90 seconds of testing, which exits margin of error and becomes a performance increase of 4.5%. What’s relevant here is how this compares relative to the RX 580 we’ll show next: if both cards scale equivalently across test durations, and all we care about is relative performance between devices, then the difference is irrelevant.

Average FPS hovers at 87 for a 120-second run and 88 for a 150-second run. Overall, this is exceptionally consistent. Our total range is 4FPS AVG, for a total bottom-to-top increase of 4.8%. For frametimes, the 1% and 0.1% lows are also relatively equal, and are largely within test-to-test variance.
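The percentage figures quoted for the GTX 1070 follow from a simple percent-change calculation against the 30-second baseline. A minimal Python sketch, with a hypothetical helper name, using the averages stated above:

```python
def percent_change(baseline: float, value: float) -> float:
    """Percent change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


# GTX 1070 average FPS by test duration in seconds, as quoted in the text.
gtx1070_avg_fps = {30: 84.0, 60: 85.0, 90: 87.8, 120: 87.0, 150: 88.0}

baseline = gtx1070_avg_fps[30]
for duration, fps in gtx1070_avg_fps.items():
    print(f"{duration:>3}s: {fps:5.1f} FPS ({percent_change(baseline, fps):+.1f}% vs. 30s)")
```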

[Chart: Metro: Last Light, RX 580]

The RX 580 showed performance between 59 and 61FPS throughout all tests, generally sitting around 60FPS – Vsync, of course, is disabled. The card is taxed enough that the small performance swings exhibited by the GTX 1070 are not shown here.

[Chart: Metro: Last Light, relative performance]

Here’s a chart of relative performance to one another, using average FPS at each duration. The RX 580 is roughly equal to 68-69% of the GTX 1070’s performance when tested at 60-, 90-, 120-, and 150-second durations. The RX 580 is equal to 72% of the GTX 1070 when tested for a shorter duration, a result of operating 0.8FPS faster for the 580, and operating a few percent slower on the 1070. Some of this is within variance, but minor differences do begin to emerge. Of course, this doesn’t apply to all games or devices, but it gives us a starting point. Thus far, no major differences have appeared.

GTA V Benchmark Duration

Settings: All VH/Ultra (where possible), FXAA, 1440p, no advanced graphics

We next tested GTA V, for which we use scripted automation to complete the final plane scene a minimum of 4 times per test.

[Chart: GTA V, GTX 1070]

With the GTX 1070, we observed average framerates ranging from 104 to 112, a wider range than the previous test. Going from 30-second to 90-second passes, we track a 7.5% performance uplift. We observed this in 2015, back when we started testing GTA V, and made an active decision to limit our test passes to 30 seconds for this title. If you look at the GTA benchmark scene, once the plane nears the town, framerate climbs and load on the devices is no longer as high. Because we also tracked GTA V’s unique performance behavior upon hitting 187.5FPS, discussed in two previous videos, we wanted to limit testing to a more stressful and consistent part of the benchmark.

As for lows, those remain relatively consistent between the two longer test passes. The shorter test pass exhibits better 0.1% low performance, but this difference is largely within test variance.

[Chart: GTA V, RX 580]

The RX 580 exhibits almost identical scaling behavior to the GTX 1070: We’re at 73.4FPS for the shorter test, and 77FPS for the two longer tests. Lows also exhibit similar behavior: General consistency, with some variance dictating emergent differences.

[Chart: GTA V, relative performance]

In terms of percentages, the RX 580 maintains almost precisely 70% of the performance of the GTX 1070 across all three tests. In this regard, any of the three test patterns would be valid for comparing these two devices; relative performance is in lockstep with the GTX 1070, meaning that we derive the same conclusion of relative value at any of the three durations. The only difference is the absolute FPS number, which our publication considers to be of lesser value for purposes of review, despite considering it of highest value for standalone game benchmarking.
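Relative performance is just one card's average FPS expressed as a percentage of the other's at the same test duration. The sketch below uses the rounded GTA V figures quoted in the text; the 60-second GTX 1070 average is not stated, so it is omitted, and pairing 112FPS with the 90-second pass is an inference from the quoted 104-to-112 range. Because the inputs are rounded, the output only approximates the charted values.

```python
def relative_percent(device_fps: float, reference_fps: float) -> float:
    """Device performance as a percentage of the reference device."""
    return device_fps / reference_fps * 100.0


# (RX 580 AVG FPS, GTX 1070 AVG FPS) by test duration in seconds.
# Rounded figures from the text; the 90-second GTX 1070 value is inferred.
gta_v_avg_fps = {
    30: (73.4, 104.0),
    90: (77.0, 112.0),
}

for duration, (rx580, gtx1070) in gta_v_avg_fps.items():
    print(f"{duration}s: RX 580 at {relative_percent(rx580, gtx1070):.1f}% of GTX 1070")
```

Both durations land close to 70%, so the relative conclusion holds regardless of which test length is used.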

Overwatch Benchmark Duration

Settings: Ultra, 100% resolution, 1440p, bot match

[Chart: Overwatch, human vs. bot match benchmark]

Overwatch is next. We wrote an entire, in-depth graphics optimization guide for this game, where we studied various performance behaviors versus graphics settings. That testing is also where we decided to move all testing, including benchmarks for 2017, over to a 5-minute test duration. For this game, we collect ten times as much data per pass as our more controlled built-in benchmarks, and this is strictly due to the huge variation that multiplayer games are subject to. We also benchmark singleplayer bot matches, something we previously found to be equal in performance to online multiplayer matches, but with far greater reliability and consistency.

[Chart: Overwatch, GTX 1070]

Now, our reasoning for going to 5-minute passes won’t be revealed until our standard deviation video, as the averages wash over those differences. At 30 seconds on the GTX 1070, we observed a 74FPS average, with 60FPS 1% and 53FPS 0.1% lows. We track marginally lower performance at 60 seconds, but our confidence is lower than average due to variance, so we can’t state whether the differences are significant. At 5 minutes, our confidence is high, and our data looks good: 72FPS AVG, with 58 and 53 for the lows.

[Chart: Overwatch, RX 580]

The RX 580 outputs similar scaling performance across durations. The 30-second tests, as also shown on the 1070, tend to output slightly higher performance metrics due to a more even split between non-combat and combat, with non-combat rising higher. We believe this is unrealistic to Overwatch, as the game’s most important moments revolve around combat; for this reason, our 5-minute tests are conducted from the time the doors open through combat, and we remain alive and in combat for the entire 5-minute duration. This gives the most important data.

[Chart: Overwatch, relative performance]

Relatively, the RX 580 maintains about 65% to 67% of the GTX 1070’s performance in this test. Remember, this content isn’t about the 1070 versus the 580 – that’s irrelevant. The two devices are just being used to illustrate scalability over duration. Although performance is ultimately similar, our confidence is significantly lower for the shorter test passes in Overwatch, so we opt for the 5-minute tests.

Ashes of the Singularity Duration Scaling

Settings: Extreme, Dx12, 4K

[Chart: Ashes of the Singularity, GTX 1070]

Ashes of the Singularity is next. For this one, we observe higher framerates over 30 seconds than 60 and 90 seconds, resulting in a performance disparity of about 10.7%. This is the greatest we’ve yet observed; however, once again, we need to determine whether this has significance when looking at GPUs in a relative fashion, rather than looking at absolute FPS. Charted alone, the RX 580 looks about the same – 38FPS for the shorter test, 34FPS for the longer tests, with lows deadly accurate.

[Chart: Ashes of the Singularity, RX 580]

[Chart: Ashes of the Singularity, relative performance]

Relatively, however, the GPUs scale the same. We see that the RX 580 maintains 70% of the GTX 1070’s performance in all three tests. From the perspective of relative performance, which is what we mostly care about in reviews, these three results would yield the same conclusion.

Sniper Elite 4 Benchmark Duration Scaling

Settings: High, Dx12, Async, 4K

[Chart: Sniper Elite 4, GTX 1070]

[Chart: Sniper Elite 4, RX 580]

[Chart: Sniper Elite 4, relative performance]

Sniper Elite 4 is the next one. In this game, we see the GTX 1070 operating at about 50 to 53FPS on average, with the higher value stemming from our combat test during the 90-second run. The other two tests were conducted by running around the village, a geometrically complex area, without any combat. The RX 580 exhibited more consistent performance as it is more pinned for resources, but the relative performance gives us values of 74% to 78% of the GTX 1070’s performance. This is one of the wider ranges. When we started testing Sniper a year ago, we chose to focus on walking through the more geometrically complex scenes, rather than introducing the variance of combat.

Some Extras: For Honor

Settings: 1440p, Extreme

[Chart: For Honor, GTX 1070]

[Chart: For Honor, RX 580]

[Chart: For Honor, relative performance]

For Honor is a bit unique in that it, like other Ubisoft titles, routinely exhibits anomalous or poor performance behavior. This is a game that can't sustain an overclock at the same level as its peers. At some point, you encounter the question of whether you're benchmarking the game or benchmarking the hardware -- but Ubisoft makes a lot of games, often in a similar fashion, and so we chalk this up to a "Ubisoft optimization benchmark."

Regardless, when testing the heavier load, geometrically complex structures in the game, we see lower FPS on the GTX 1070, and see performance closer between the two (81% of the 1070 for the 580). When testing in combat, which is less complex and involves more skybox or ground view, the 580 sits at 72% of 1070 performance. This is a game where we must make a judgment call. For our testing, we always favor the heavier load benchmark, as it creates a worse- or worst-case scenario, better ensuring that we're reporting conservative numbers.

Conclusion: Confidence Interval & Standard Deviation At Play

Of course, the nature of this kind of work is that computers are highly complex, and anyone serious about testing will find ways to improve current testing methods. We’re constantly improving these testing procedures.

In this content, we show that shorter test durations of 30 seconds can be used reliably; however, where run-to-run standard deviation climbs, we can tighten the confidence interval by testing for longer durations. Where standard deviation is nearly 0 – like Metro: Last Light, which outputs almost precisely the same results run-to-run – we can maintain shorter test passes while still keeping accuracy. This is further the case when absolute framerate deviation over time is minimal, although we typically test for relative performance in reviews, not absolute performance. We test for absolute performance in game-specific benchmarks, where we likely test all the different settings to min-max framerate/graphics on a particular set of hardware. The objectives are different between the two – relative is ideal for reviews, we think.
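The link between run-to-run standard deviation and confidence can be illustrated with a small Python sketch. It uses a standard Student's t interval for the mean, which is an assumption made for this example rather than a description of our internal tooling, and both sets of run data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student's t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def ci95_half_width(samples):
    """Half-width of the 95% confidence interval for the mean of the runs."""
    n = len(samples)
    return T_CRIT_95[n - 1] * stdev(samples) / sqrt(n)


# Hypothetical AVG FPS from four passes each (not measured data).
consistent_runs = [84.0, 84.1, 83.9, 84.0]  # Metro-like: near-zero deviation
variable_runs = [74.0, 70.0, 77.0, 71.0]    # multiplayer-like: high deviation

for name, runs in (("consistent", consistent_runs), ("variable", variable_runs)):
    print(f"{name}: {mean(runs):.1f} FPS +/- {ci95_half_width(runs):.1f} (95% CI)")
```

With the same pass count, the high-deviation title produces a far wider interval, which is the case where longer passes or more repetitions earn their time cost.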

Overwatch is a good example of a game with excessive deviation run-to-run. To properly test this game, as we’ve done for the past year, we find it optimal to run tests in a bot match (which equates to multiplayer match performance) over a 5-minute period. The dynamism of Overwatch and multiplayer games means, naturally, more variables and deviation are introduced. Testing for 5-minute periods helps to reduce that concern by smoothing out the averages over a longer period, reducing run-to-run deviation.

It all depends on the game, ultimately. We test each game internally before releasing data, as this enables us to determine the best mode of benchmarking for that title.

More in this series to come.

Editorial, Testing: Steve Burke
Video: Andrew Coleman

Last modified on February 01, 2018 at 2:27 pm
