As part of our new and ongoing “Bench Theory” series, we are publishing a year’s worth of internal-only data that we’ve used to drive our 2018 GPU test methodology. We haven’t yet implemented the 2018 test suite, but will be soon. The goal of this series is to help viewers and readers understand what goes into test design, and we aim to underscore the level of accuracy that GN demands for its publication. Our first information dump focused on benchmark duration, addressing when it’s appropriate to use 30-second runs, 60-second runs, and more. As we stated in the first piece, we ask that any content creators leveraging this research in their own testing properly credit GamersNexus for its findings.
Today, we’re looking at standard deviation and run-to-run variance in tested games. Games on bench cycle regularly, so the purpose is less for game-specific standard deviation (something we’re now addressing at game launch) and more for an overall understanding of how games deviate run-to-run. This is why conducting multiple, shorter test passes (see: benchmark duration) is often preferable to conducting fewer, longer passes; after all, we are all bound by the laws of time.
Looking at statistical dispersion can help us understand whether a game itself is consistent enough for hardware benchmarks. If a game is inconsistent or varies wildly from one run to the next, we have to look at whether that variance is driver-, hardware-, or software-related. If it’s just the game, we must then ask the philosophical question of whether it’s the game we’re testing, or if it’s the hardware we’re testing. Sometimes, testing a game that has highly variable performance can still be valuable – primarily if it’s a game people want to play, like PUBG, despite having questionable performance. Other times, the game should be tossed. If the goal is a hardware benchmark and a game is both behaving in outlier fashion and largely unplayed, then it becomes suspect as a test platform.
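As a rough illustration of what run-to-run dispersion means in practice, the sketch below computes the mean, sample standard deviation, and coefficient of variation (standard deviation as a percentage of the mean) across several test passes. The function name and the FPS values are hypothetical and chosen only for illustration; they are not GN data, and this is not necessarily how our internal tooling works.

```python
from statistics import mean, stdev


def run_to_run_dispersion(avg_fps_per_run):
    """Return (mean, sample standard deviation, coefficient of variation in %).

    Each entry is the average FPS of one complete test pass.
    """
    if len(avg_fps_per_run) < 2:
        raise ValueError("need at least two runs to measure dispersion")
    m = mean(avg_fps_per_run)
    s = stdev(avg_fps_per_run)  # sample standard deviation (n - 1 denominator)
    return m, s, 100.0 * s / m


# Hypothetical per-run averages, for illustration only (not measured data).
consistent_game = [100.0, 101.0, 99.0, 100.0]
variable_game = [100.0, 110.0, 90.0, 100.0]

for name, runs in (("consistent", consistent_game), ("variable", variable_game)):
    m, s, cv = run_to_run_dispersion(runs)
    print(f"{name}: mean={m:.1f} FPS, stdev={s:.2f} FPS, CV={cv:.2f}%")
```

Both hypothetical games average the same framerate, but the second spreads roughly ten times wider from pass to pass. Expressing the spread as a percentage of the mean makes it easier to compare games that run at very different framerates.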
The short answer to the headline is “sometimes,” but it’s more complicated than just FPS over time. To really address this question, we have to first explain the oddity of FPS as a metric: Frames per second is inherently an average. If we tell you something is operating at a variable framerate, but is presently at 60FPS, what does that really mean? Because framerate is an average over a period of time, any spot measurement of frames per second at a given millisecond is flawed. All this stated, the industry has accepted frames per second as the measure of performance for games, and it is one of the most user-friendly means to convey the actual, underlying metric: frametime, or the frame-to-frame interval, measured in milliseconds.
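To make the relationship between frametime and FPS concrete, here is a minimal sketch that derives an average framerate from a list of frame-to-frame intervals and compares it against the “spot” framerate implied by individual frames. The function names and the frametime values are hypothetical, made up purely to show the arithmetic.

```python
def average_fps(frametimes_ms):
    """Average FPS over a window: frames delivered divided by elapsed seconds."""
    total_seconds = sum(frametimes_ms) / 1000.0
    return len(frametimes_ms) / total_seconds


def spot_fps(frametime_ms):
    """Framerate implied by a single frame-to-frame interval."""
    return 1000.0 / frametime_ms


# Hypothetical one-second window: 59 frames at 16 ms plus one 56 ms hitch.
frametimes_ms = [16.0] * 59 + [56.0]

print(f"average over window: {average_fps(frametimes_ms):.1f} FPS")
print(f"spot FPS on a normal frame: {spot_fps(16.0):.1f}")
print(f"spot FPS on the hitch frame: {spot_fps(56.0):.1f}")
```

In this made-up window, the average works out to 60 FPS even though no individual frame was actually delivered at a 60 FPS pace, and one frame took more than three times as long as the rest. That is the sense in which FPS hides the underlying frametime behavior.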
Today, we’re publicly releasing some internal data that we’ve collected for benchmark validation. This data looks specifically at benchmark duration: optimization tests that balance maximum accuracy and card count against the minimum time required to retain said accuracy.
Before we publish any data for a benchmark – whether that’s gaming, thermals, or power – we run internal-only testing to validate our methods and thought process. This is often where we discover flaws in methods, which allows us to refine them prior to publishing any review data. There are a few things we traditionally research for each game: benchmark duration requirements, load level of a particular area of the game, the best- and worst-case performance scenarios in the game, and the average expected performance for the user. We also regularly find shortcomings in test design – that’s the nature of working on a test suite for a year at a time. As with most things in life, the goal is to develop something good, then iterate on it as we learn from the process.