As everyone begins running the Final Fantasy XV PC benchmark, we’d like to notify the userbase that, on our test platform, we have observed some run-to-run variance in frame-to-frame intervals from one pass to the next. This seems to stem entirely from the first pass of the benchmark, where the game is likely still loading all of the assets into memory. After the first pass, we’ve routinely observed improved performance on runs two, three, and onward. We attribute this to first-time launcher initialization of all the game assets.
The short answer to the headline is “sometimes,” but it’s more complicated than just FPS over time. To really address this question, we have to first explain the oddity of FPS as a metric: Frames per second is inherently an average – if we tell you something is operating at a variable framerate, but is presently 60FPS, what does that really mean? If we look at the framerate at any given millisecond, given that framerate is inherently an average of a period of time, we must acknowledge that deriving spot-measurements in frames per second is inherently flawed. All this stated, the industry has accepted frames per second as a rating measure of performance for games, and it is one of the most user-friendly means to convey what the actual, underlying metric is: Frametime, or the frame-to-frame interval, measured in milliseconds.
Today, we’re releasing public some internal data that we’ve collected for benchmark validation. This data looks specifically at benchmark duration or optimization tests to min-max for maximum accuracy and card count against the minimum time required to retain said accuracy.
Before we publish any data for a benchmark – whether that’s gaming, thermals, or power – we run internal-only testing to validate our methods and thought process. This is often where we discover flaws in methods, which allow us to then refine them prior to publishing any review data. There are a few things we traditionally research for each game: Benchmark duration requirements, load level of a particular area of the game, the best- and worst-case performance scenarios in the game, and then the average expected performance for the user. We also regularly find shortcomings in test design – that’s the nature of working on a test suite for a year at a time. As with most things in life, the goal is to develop something good, then iterate on it as we learn from the process.
This content marks the beginning of our in-depth VR testing efforts, part of an ongoing test pattern that hopes to determine distinct advantages and disadvantages on today’s hardware. VR hasn’t been a high-performance content topic for us, but we believe it’s an important one for this release of Kaby Lake & Ryzen CPUs: Both brands have boasted high VR performance, “VR Ready” tags, and other marketing that hasn’t been validated – mostly because it’s hard to do so. We’re leveraging a hardware capture rig to intercept frames to the headsets, FCAT VR, and a suite of five games across the Oculus Rift & HTC Vive to benchmark the R7 1700 vs. i7-7700K. This testing includes benchmarks at stock and overclocked configurations, totaling four devices under test (DUT) across two headsets and five games. Although this is “just” 20 total tests (with multiple passes), the process takes significantly longer than testing our entire suite of GPUs. Executing 20 of these VR benchmarks, ignoring parity tests, takes several days. We could do the same count for a GPU suite and have it done in a day.
VR benchmarking is hard, as it turns out, and there are a number of imperfections in any existing test methodology for VR. We’ve got a good solution to testing that has proven reliable, but in no way do we claim that perfect. Fortunately, by combining hardware and software capture, we’re able to validate numbers for each test pass. Using multiple test passes over the past five months of working with FCAT VR, we’ve also been able to build-up a database that gives us a clear margin of error; to this end, we’ve added error bars to the bar graphs to help illustrate when results are within usual variance.
Not long ago, we opened discussion about AMD’s new OCAT tool, a software overhaul of PresentMon that we had beta tested for AMD pre-launch. In the interim, and for the past five or so months, we’ve also been silently testing a new version of FCAT that adds functionality for VR benchmarking. This benchmark suite tackles the significant challenges of intercepting VR performance data, further offering new means of analyzing warp misses and drop frames. Finally, after several months of testing, we can talk about the new FCAT VR hardware and software capture utilities.
This tool functions in two pieces: Software and hardware capture.
This is a test that's been put through the paces for just about every generation of PCI Express, and it's worth refreshing now that the newest line of high-end GPUs has hit the market. The curiosity is this: Will a GPU be bottlenecked by PCI-e 3.0 x8, and how much impact does PCI-e 3.0 x16 have on performance?
We decided to test that question for internal research, but ended up putting together a small report for publication.
Futuremark has pushed an update to its popular 3DMark benchmarking software, now adding a proper "stress test" mode to the tool. Previously, the closest option 3DMark offered to a stress test was a looped FireStrike run, which has two issues we've pointed-out in some methodology discussion: (1) The loop-back is interrupted by a brief black screen and restart of the bench, and (2) the test run does not equally load the GPU, and so power draw fluctuates (we saw a range of ~40W on the 1080). Neither of these is ideal for real burn-in testing.
The new Stress Test benchmark runs uninterrupted (up to 40 hours in pro version, 10 minutes in the free version) and should more evenly load the GPU and CPU. To us, the best feature is a frame-rate stability check which issues a pass/fail based upon FPS performance during OC stability benchmarking. In theory, the tool should analyze for FPS consistency, which will give users an idea of potential OC limitations (like voltage or TDP).
We moderate comments on a ~24~48 hour cycle. There will be some delay after submitting a comment.