A Recap of Validation History
Destiny 2 beta test data below:
| First 20 Minutes of Destiny 2 Beta Campaign (GTX 1080 Ti SC2, 4K, Highest) | AVG FPS | 1% LOW | 0.1% LOW |
| --- | --- | --- | --- |
| Spot-Check #1 | 51.0 | 45.0 | 44.0 |
| Spot-Check #2 | 52.0 | 46.0 | 40.0 |
| Spot-Check #3 | 53.0 | 48.0 | 46.0 |
| Spot-Check #4 | 53.0 | 47.0 | 43.0 |
| Spot-Check #5 | 51.0 | 44.0 | 39.0 |
| Spot-Check #6 | 55.0 | 48.0 | 43.0 |
| Spot-Check #7 | 51.0 | 46.0 | 39.0 |
| Spot-Check #8 | 51.0 | 47.0 | 44.0 |
| Spot-Check #9 | 44.0 | 39.0 | 33.0 |
| Spot-Check #10 | 51.0 | 42.0 | 42.0 |
| 5-Minute Campaign Intro | 58.0 | 49.0 | 45.0 |
| Final Bench Scene | 55.0 | 48.0 | 47.5 |
| Standard Deviation | 3.4 | 2.9 | 3.9 |
| Multiple Competitive Matches (GTX 1080 Ti SC2, 4K, Highest) | AVG FPS | 1% LOW | 0.1% LOW |
| --- | --- | --- | --- |
| Match #1, Spot-Check #1 | 53.0 | 45.0 | 43.0 |
| Match #1, Spot-Check #2 | 53.0 | 42.0 | 40.0 |
| Match #2, Spot-Check #1 | 57.0 | 50.0 | 41.0 |
| Match #2, Spot-Check #2 | 55.0 | 48.0 | 46.0 |
| Final Bench Scene | 55.0 | 48.0 | 47.5 |
| Standard Deviation | 1.7 | 3.1 | 3.2 |
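For reference on the metrics in these tables, below is a minimal sketch of how average FPS and 1% / 0.1% lows can be derived from a frametime log, assuming frame-to-frame intervals in milliseconds as produced by tools like FRAPS or PresentMon. The function name and the "average the slowest frames" definition of lows are illustrative assumptions, not necessarily our exact pipeline:

```python
# Sketch: deriving AVG FPS and 1% / 0.1% lows from a frametime log.
# Input is a list of frame-to-frame intervals in milliseconds, as
# logged by tools like FRAPS or PresentMon. "Lows" here average the
# slowest 1% (or 0.1%) of frames; exact definitions vary by outlet.

def fps_metrics(frametimes_ms):
    """Return (avg_fps, 1% low, 0.1% low) from frametimes in ms."""
    n = len(frametimes_ms)
    avg_fps = 1000.0 * n / sum(frametimes_ms)
    slowest = sorted(frametimes_ms, reverse=True)  # worst frames first

    def low(fraction):
        count = max(1, int(n * fraction))
        return 1000.0 * count / sum(slowest[:count])

    return avg_fps, low(0.01), low(0.001)

# Invented example: a ~30-second pass at ~60FPS with a few heavy frames.
frametimes = [16.7] * 1780 + [33.0] * 15 + [50.0] * 5
avg, low1, low01 = fps_metrics(frametimes)
print(f"AVG {avg:.1f} | 1% LOW {low1:.1f} | 0.1% LOW {low01:.1f}")
```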
Let’s get some examples on the screen of times we’ve published internal research: With Destiny 2’s beta, we tested various parts of the game, with test durations spanning 30 seconds to 20 minutes. This allowed us to determine that most parts of the intro campaign performed equivalently, while a select few sequences were highly demanding of the system. The testing included both multiplayer and singleplayer benchmarking at various durations.
Above: For Honor beta initial testing (before full launch)
We also did this for games like For Honor, where we determined that the built-in benchmark wasn’t at all representative of real-world gameplay, something that pushed us away from using the built-in option.
We did this again for Mass Effect: Andromeda, where we discovered that, with early drivers on AMD cards, the game would stutter on the first pass through the test area. The result was that we needed to include more test passes than normal, then present data both with and without the stutter included. AMD later resolved the issue.
| RX 480 Test Passes, 1080p/Ultra | AVG FPS | 1% LOW | 0.1% LOW |
| --- | --- | --- | --- |
| Pass 1 | 72 | 56 | 6 |
| Pass 2 | 74 | 57 | 53 |
| Pass 3 | 75 | 60 | 55 |
| Pass 4 | 75 | 59 | 56 |
| Pass 5 | 74 | 59 | 54 |
Excerpt from MEA content: "In this one, our first test pass shows 72FPS AVG, with 56FPS 1% low and 6FPS 0.1% low. This shows itself in stutters during the first pass, but smooths out in subsequent passes. We improve from roughly 6FPS 0.1% lowest performance to 53FPS in the second pass."
| GTX 1060 Test Passes, 1080p/Ultra | AVG FPS | 1% LOW | 0.1% LOW |
| --- | --- | --- | --- |
| Pass 1 | 88 | 66 | 60 |
| Pass 2 | 92 | 70 | 65 |
| Pass 3 | 90 | 69 | 65 |
| Pass 4 | 91 | 70 | 65 |
The point is that we do this for each game, and often discover anomalous behaviors specific to a GPU vendor, to particular regions of the game, or to specific graphics settings. A simple screen for such anomalies is sketched below.
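As a hypothetical example of how such anomalies can be caught programmatically once per-pass metrics exist, the sketch below flags any pass whose 0.1% low collapses relative to its sibling passes, as with the Mass Effect: Andromeda first-pass stutter above. The function and its 0.5x threshold are illustrative, not our production tooling:

```python
import statistics

# Hypothetical anomaly screen: flag any pass whose 0.1% low falls far
# below the median of the remaining passes, as with Mass Effect:
# Andromeda's first-pass stutter. The 0.5x threshold is illustrative.

def flag_anomalous_passes(point1_lows, max_ratio=0.5):
    flagged = []
    for i, value in enumerate(point1_lows):
        others = point1_lows[:i] + point1_lows[i + 1:]
        if value < max_ratio * statistics.median(others):
            flagged.append(i + 1)  # report 1-indexed pass numbers
    return flagged

# 0.1% lows from the RX 480 table above:
print(flag_anomalous_passes([6, 53, 55, 56, 54]))  # -> [1]: pass 1 stutters
```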
Another of our discoveries was that dynamic reflections had the most significant impact among Overwatch’s settings, for which we generated frametime charts plotting the difference: a jump between ~10ms and ~16ms (roughly 100FPS versus 60FPS) on the tested device. We also often test graphics scaling on a particular set of hardware, giving us an understanding of where devices may gain an unexpected lead over competing devices.
With Destiny 2’s beta, this allowed us to determine that nVidia had a significant advantage only under “Highest” settings, and that its advantage faded away under “High” settings.
Above: Launch frametimes
Above: Beta frametimes at 1440p (later resolved)
AMD fixed this upon the game’s launch, something that nVidia also later leveraged to improve its own performance, again specifically under Highest settings. The point here is that there is a significant performance difference between these two settings, while the visual difference may not be significant to the user, potentially meaning that lower-end devices could do just fine on High, but not Highest. This is important for determining which settings should be used for reviews and benchmarks.
We studied this again in Watch Dogs 2, where we demonstrated a CPU settings scalability chart for framerate.
All of that is to say that we work hard to understand what we’re testing, and harder to create charts demonstrating why we test the scenarios we do. The next big concern is reliability and repeatability. With a benchmark, you’ve really got two options: Repeatable and reliable, or realistic. You can’t have both, but you can study the game’s behaviors to best simulate realism with repeated tests in areas representative of the whole game.
Our approach to benchmarking theory is to collect large datasets with accurate, repeatable numbers. Ultimately, what we care about is device scalability, not hard FPS; we care about hard FPS in per-game benchmarks, where we test each settings configuration for that particular game, and do so on a wide range of devices. This means that we’re really looking at percentages to determine how a device performs relative to other devices, not necessarily whether a framerate is ideal for a particular game; that analysis is served separately, often in standalone content.
We’ll discuss repetition and standard deviation of test results in another content piece this week, but we first need to talk about optimal test duration. This becomes a balancing act between fitting in more repetitions and maintaining accuracy, depending on the game and its behavior. Some games have a great deal of variance, like multiplayer games, and are often best tested with fewer, longer tests. Other games are best tested with numerous short tests. When working on as many devices as we do, the math pretty easily shows that it would be physically impossible to run every test for long durations while still retaining accuracy, as the rough calculation below illustrates. For this reason, we optimize test duration on a game-by-game basis.
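To make that time constraint concrete, here is a back-of-the-envelope calculation. Every count in it is an assumption for illustration (our real matrix varies per review), but the scaling behavior is the point:

```python
# Back-of-the-envelope test-time budget. Every count below is assumed
# for illustration; the real matrix varies per review.
cards      = 15     # devices on the bench
games      = 8      # titles in the suite
configs    = 3      # resolution/settings combos per title
passes     = 4      # minimum repetitions per configuration
overhead_s = 120    # per-pass setup, load, and logging time (seconds)

def bench_hours(pass_seconds):
    total_s = cards * games * configs * passes * (pass_seconds + overhead_s)
    return total_s / 3600.0

print(f"30-second passes: ~{bench_hours(30):.0f} hours of testing")   # ~60
print(f"5-minute passes:  ~{bench_hours(300):.0f} hours of testing")  # ~168
```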
We normally keep this information private, as it is core to our business and ability to compete. That said, as we are again revising our methodology for 2018, we thought now would be a good time to reveal some of last year’s test research. If you are a content creator and use this information in testing, mention GamersNexus in related coverage.
All of these tests were conducted a minimum of 4 times and averaged. Test durations ranged from 30 seconds to 5 minutes, depending on the game. Error bars are present to display standard deviation between all test runs.
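As a quick sketch of that charting process, with invented FPS values: the chart point is the mean of the runs, and the error bar half-width is the sample standard deviation.

```python
import statistics

# Sketch of the charting math: each configuration runs at least 4 times;
# the chart point is the mean, and the error bar half-width is the
# sample standard deviation across runs. FPS values are invented.
runs = [84.2, 83.8, 84.5, 83.5]        # AVG FPS from four passes
mean = statistics.mean(runs)
dev  = statistics.stdev(runs)          # sample (n-1) standard deviation
print(f"chart point: {mean:.1f}FPS, error bar: +/-{dev:.2f}")
```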
Test Platform
| GN Test Bench 2017 | Name | Courtesy Of | Cost |
| --- | --- | --- | --- |
| Video Card | This is what we're testing | - | - |
| CPU | Intel i7-7700K 4.5GHz locked | GamersNexus | $330 |
| Memory | GSkill Trident Z 3200MHz C14 | GSkill | - |
| Motherboard | Gigabyte Aorus Gaming 7 Z270X | Gigabyte | $240 |
| Power Supply | NZXT 1200W HALE90 V2 | NZXT | $300 |
| SSD | Plextor M7V, Crucial 1TB | GamersNexus | - |
| Case | Top Deck Tech Station | GamersNexus | $250 |
| CPU Cooler | Asetek 570LC | Asetek | - |
BIOS settings include C-states completely disabled with the CPU locked to 4.5GHz at 1.32 vCore. Memory is at XMP1.
Metro: Last Light Performance Scaling
Settings: Very High quality, High tessellation, 1440p, rest default
We’re starting with the oldest benchmark title, as it is the easiest to configure for multiple test durations. We have historically proven Metro to also be the single most consistent benchmark title, under the right settings. All test methodology and components are in the article linked in the description below.
Starting with only the GTX 1070 Gaming X, we see the average FPS sit at 84 for a set of 4x 30-second test passes, 85FPS AVG for 4x 60-second test passes (within test variance and error), and 87.8FPS AVG for 90 seconds of testing, which exits margin of error and becomes a performance increase of 4.5%. What’s relevant here is how this compares relative to the RX 580 we’ll show next: if both cards scale equivalently across test durations, and all we care about is relative performance between devices, then the difference is irrelevant.
Average FPS hovers at 87 for a 120-second run and 88 for a 150-second run. Overall, this is exceptionally consistent. Our total range is 4FPS AVG, for a total bottom-to-top increase of 4.8%. For frametimes, the 1% and 0.1% lows are also relatively equal, and are largely within test-to-test variance.
The RX 580 showed performance between 59 and 61FPS throughout all tests, generally sitting around 60FPS – Vsync, of course, is disabled. The card is taxed enough that the small performance swings exhibited by the GTX 1070 are not shown here.
Here’s a chart of performance relative to one another, using average FPS at each duration. The RX 580 is roughly equal to 68-69% of the GTX 1070’s performance when tested at 60-, 90-, 120-, and 150-second durations. The RX 580 is equal to 72% of the GTX 1070 when tested for the shortest duration, a result of the 580 running 0.8FPS faster and the 1070 running a few percent slower in that test. Some of this is within variance, but minor differences do begin to emerge. Of course, this doesn’t apply to all games or devices, but it gives us a starting point. Thus far, no major differences have appeared.
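For clarity on how those percentages are computed, here is a short sketch. The GTX 1070 values come from the prose above; the RX 580 values are approximations of "hovering around 60FPS," included only to illustrate the ratio math:

```python
# Relative performance of the RX 580 vs. the GTX 1070 at each Metro
# test duration. GTX 1070 values come from the prose above; RX 580
# values approximate "hovering around 60FPS."
durations = [30, 60, 90, 120, 150]           # seconds per pass
gtx1070   = [84.0, 85.0, 87.8, 87.0, 88.0]   # AVG FPS
rx580     = [60.5, 59.0, 60.0, 60.0, 60.5]   # AVG FPS (approximate)

for d, nv, amd in zip(durations, gtx1070, rx580):
    print(f"{d:>3}s: RX 580 at {100 * amd / nv:.0f}% of GTX 1070")
```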
GTA V Benchmark Duration
Settings: All VH/Ultra (where possible), FXAA, 1440p, no advanced graphics
We next tested GTA V, for which we use scripted automation to complete the final plane scene a minimum of 4 times per test.
With the GTX 1070, we observed average framerates ranging from 104 to 112, a wider range than in the previous test. From 30 seconds to 90 seconds, we are tracking a 7.5% performance uplift at 90 seconds. We observed this in 2015, back when we first started testing GTA V, and made an active decision to limit our test passes to 30 seconds for this title. If you look at the GTA benchmark scene, once the plane nears the town, framerate climbs and load on the devices is no longer as high. Because we also tracked GTA V’s unique performance behavior upon hitting 187.5FPS, discussed in two previous videos, we wanted to limit testing to a more stressful and consistent part of the benchmark.
As for lows, those remain relatively consistent between the two longer test passes. The shorter test pass exhibits better 0.1% low performance, but this difference is largely within test variance.
The RX 580 exhibits almost identical scaling behavior to the GTX 1070: We’re at 73.4FPS for the shorter test, and 77FPS for the two longer tests. Lows also exhibit similar behavior: general consistency, with some variance dictating emergent differences.
In terms of percentages, the RX 580 maintains almost precisely 70% of the performance of the GTX 1070 across all three tests. In this regard, any of the three test patterns would be valid for comparing these two devices; relative performance is in lockstep with the GTX 1070, meaning that we derive the same conclusion of relative value at any of the three durations. The only difference is the absolute FPS number, which our publication considers to be of lesser value for purposes of review, despite considering it of highest value for standalone game benchmarking.
Overwatch Benchmark Duration
Settings: Ultra, 100% resolution, 1440p, bot match
Overwatch is next. We wrote an entire, in-depth graphics optimization guide for this game, where we studied various performance behaviors against graphics settings. That testing is also where we decided to move all testing, including benchmarks for 2017, over to a 5-minute test duration. For this game, we collect ten times as much data per pass as we do for more controlled, built-in benchmarks, strictly due to the huge variation to which multiplayer games are subject. We also benchmark singleplayer bot matches, something we previously found to be equal in performance to online multiplayer matches, but with far greater reliability and consistency.
Now, our reasoning for going to 5-minute passes won’t be revealed until our standard deviation video, as the averages wash over those differences. At 30 seconds on the GTX 1070, we observed a 74FPS average, with 60FPS 1% and 53FPS 0.1% lows. We track marginally lower performance at 60 seconds, but our confidence is lower than usual due to variance, so we can’t confidently state whether the differences are significant. At 5 minutes, our confidence is high, and our data looks good: 72FPS AVG, with 58 and 53 for the lows.
The RX 580 shows similar scaling across durations. The 30-second tests, as also shown on the 1070, tend to output slightly higher performance metrics due to a more even split between non-combat and combat time, with non-combat framerates rising higher. We believe this is unrepresentative of Overwatch, as the game’s most important moments revolve around combat; for this reason, our 5-minute tests are conducted from the time the doors open, and we remain alive and in combat for the entire 5-minute duration. This gives the most important data.
Relatively, the RX 580 maintains about 65% to 67% of the GTX 1070’s performance in this test. Remember, this content isn’t about the 1070 versus the 580; that comparison is irrelevant here. The two devices are just being used to illustrate scalability over duration. Although performance is ultimately similar across durations, our confidence interval is significantly wider for the shorter test passes in Overwatch, so we opt for the 5-minute tests.
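To sketch the confidence math at play: with four passes, a 95% confidence interval on the mean uses a t-critical value of about 3.182 (df = 3). The FPS values below are invented purely to illustrate how higher run-to-run variance in short passes widens the interval:

```python
import statistics

# 95% confidence interval on mean FPS across four passes. The t-critical
# value for df = n - 1 = 3 at 95% two-sided confidence is ~3.182.
# All FPS values are invented for illustration.
T_CRIT_DF3 = 3.182

def ci95_half_width(runs):
    sem = statistics.stdev(runs) / len(runs) ** 0.5  # standard error
    return T_CRIT_DF3 * sem

short_passes = [74.0, 77.5, 71.0, 73.5]   # 30-second runs: more variance
long_passes  = [72.0, 72.5, 71.5, 72.0]   # 5-minute runs: smoother

for label, runs in (("30-second", short_passes), ("5-minute", long_passes)):
    print(f"{label}: {statistics.mean(runs):.1f}FPS "
          f"+/- {ci95_half_width(runs):.1f}")
```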
Ashes of the Singularity Duration Scaling
Settings: Extreme, Dx12, 4K
Ashes of the Singularity is next. For this one, we observe higher framerates over 30 seconds than over 60 and 90 seconds, resulting in a performance disparity of about 10.7%. This is the greatest duration-driven disparity we’ve yet observed; however, once again, we need to determine whether it has significance when looking at GPUs in a relative fashion, rather than looking at absolute FPS. Charted alone, the RX 580 looks about the same: 38FPS for the shorter test, 34FPS for the longer tests, with lows nearly identical across runs.
In relative terms, however, the GPUs behave the same: the RX 580 maintains 70% of the GTX 1070’s performance in all three tests. From the perspective of relative performance, which is what we mostly care about in reviews, these three results would yield the same conclusion.
Sniper Elite 4 Benchmark Duration Scaling
Settings: High, Dx12, Async, 4K
Sniper Elite 4 is next. In this game, we see the GTX 1070 operating at about 50 to 53FPS on average, with the higher value stemming from our combat test during the 90-second run. The other two tests were conducted by running around the village, a geometrically complex area, without any combat. The RX 580 exhibited more consistent performance, as it is more resource-constrained, but the relative results give us values of 74% to 78% of the GTX 1070’s performance. This is one of the wider ranges. When we started testing Sniper a year ago, we chose to focus on walking through the more geometrically complex scenes, rather than introducing the variance of combat.
Some Extras: For Honor
Settings: 1440p, Extreme
For Honor is a bit unique in that it, like other Ubisoft titles, routinely exhibits anomalous or poor performance behavior. This is a game that can't sustain an overclock at the same level as its peers. At some point, you encounter the question of whether you're benchmarking the game or benchmarking the hardware -- but Ubisoft makes a lot of games, often in a similar fashion, and so we chalk this up to a "Ubisoft optimization benchmark."
Regardless, when testing the heavier-load, geometrically complex structures in the game, we see lower FPS on the GTX 1070 and performance closer between the two cards (the 580 at 81% of the 1070). When testing in combat, which is less complex and involves more skybox or ground view, the 580 sits at 72% of 1070 performance. This is a game where we must make a judgment call. For our testing, we always favor the heavier-load benchmark, as it creates a worse- or worst-case scenario, better ensuring that we're reporting conservative numbers.
Conclusion: Confidence Interval & Standard Deviation At Play
Of course, the nature of this kind of work is that computers are highly complex, and anyone serious about testing will find ways to improve current testing methods. We’re constantly improving these testing procedures.
In this content, we show that shorter test durations of 30 seconds can be used reliably; however, where run-to-run standard deviation climbs, we can improve our confidence interval by testing for longer durations. Where standard deviation is nearly zero, as with Metro: Last Light, which outputs almost precisely the same results run-to-run, we can maintain shorter test passes while still keeping accuracy. This is further the case when absolute framerate deviation over time is minimal, although we typically test for relative performance in reviews, not absolute performance. We test for absolute performance in game-specific benchmarks, where we likely test all the different settings to min-max framerate and graphics quality on a particular set of hardware. The objectives are different between the two: relative is ideal for reviews, we think.
Overwatch is a good example of a game with excessive run-to-run deviation. To properly test this game, as we’ve done for the past year, we find it optimal to run tests in a bot match (which equates to multiplayer match performance) over a 5-minute period. The dynamism of Overwatch and other multiplayer games naturally introduces more variables and deviation. Testing for 5-minute periods helps reduce that concern by smoothing out the averages over a longer period, reducing run-to-run deviation.
It all depends on the game, ultimately. We test each game internally before releasing data, as this enables us to determine the best mode of benchmarking for that title.
More in this series to come.
Editorial, Testing: Steve Burke
Video: Andrew Coleman