This content marks the beginning of our in-depth VR testing efforts, part of an ongoing test series that aims to determine distinct advantages and disadvantages on today’s hardware. VR hasn’t been a major focus of our performance content to date, but we believe it’s an important one for this release of Kaby Lake & Ryzen CPUs: Both brands have boasted high VR performance, “VR Ready” tags, and other marketing that hasn’t been validated – mostly because it’s hard to do so. We’re leveraging a hardware capture rig to intercept frames sent to the headsets, FCAT VR, and a suite of five games across the Oculus Rift & HTC Vive to benchmark the R7 1700 vs. i7-7700K. This testing includes benchmarks at stock and overclocked configurations, totaling four devices under test (DUTs) across two headsets and five games. Although this is “just” 20 total tests (with multiple passes), the process takes significantly longer than testing our entire suite of GPUs. Executing 20 of these VR benchmarks, ignoring parity tests, takes several days; we could run the same count for a GPU suite and have it done in a day.
VR benchmarking is hard, as it turns out, and there are a number of imperfections in any existing test methodology for VR. We’ve got a testing approach that has proven reliable, but in no way do we claim that it’s perfect. Fortunately, by combining hardware and software capture, we’re able to validate numbers for each test pass. Using multiple test passes over the past five months of working with FCAT VR, we’ve also been able to build up a database that gives us a clear margin of error; to this end, we’ve added error bars to the bar graphs to help illustrate when results are within usual variance.
Speaking of variance, there’s a lot more of it in VR benchmarking than in “standard” benchmarking. Controlling input by way of body movement (Vive) or head tracking + controller input (Rift) introduces some new challenges to repeatable testing. We’ll explain all of the methods below, and will dedicate the first game tested to walking through the charts one at a time.
Interpreting these charts isn’t necessarily obvious. The type of data presented here is different from what we normally present, and because VR testing is still new, we need to spend some time explaining what the performance metrics mean. Also, note that we generated roughly 10 charts per game on average (frametimes, SW+HW frametimes, competitive frametimes, bar graphs with drops/warps/averages, etc.) and will include most, but not all, charts here. The video presents the same material in a more condensed manner, if that interests you.
VR Benchmarking Methodology
We previously ran an introductory piece on the behind-the-scenes process of figuring out VR testing, something we started in September. To go through some of the basics:
Two rigs are established. There is a game benchmark machine and a hardware capture machine, which must meet high specifications for storage and for incoming data from the split headsets. The configurations are as follows:
Intel VR Game Test Bench

| Component | | Provided by | Price |
| --- | --- | --- | --- |
| CPU | Intel i7-7700K | GamersNexus | $345 |
| Cooler | Asetek 570LC w/ Gentle Typhoon | Asetek, GamersNexus | - |
| Motherboard | Gigabyte Z270 Gaming 7 | Gigabyte | $230 |
| RAM | Corsair Vengeance LPX 3200MHz | Corsair | $135 |
| GPU | GTX 1080 Ti Hybrid | NVIDIA | $700 |
| Storage 1 | Plextor M7V | Plextor | $96 |
| Storage 2 | Crucial MX300 1TB | GamersNexus | $280 |
| PSU | NZXT Hale90 v2 1200W | NZXT | $270 |
| Case | Open-air test bench | GamersNexus | - |
And for AMD:
AMD VR Game Test Bench

| Component | | Provided by | Price |
| --- | --- | --- | --- |
| CPU | AMD R7 1700 | AMD | $330 |
| Cooler | Asetek 570LC w/ Gentle Typhoon | Asetek, GamersNexus | - |
| Motherboard | Gigabyte Gaming 5 X370 | Gigabyte | $213 |
| RAM | Corsair Vengeance LPX 3000MHz | AMD | $120 |
| GPU | GTX 1080 Ti Hybrid | NVIDIA | $700 |
| Storage 1 | Plextor M7V | Plextor | $96 |
| Storage 2 | Crucial MX300 1TB | GamersNexus | $280 |
| PSU | NZXT Hale90 v2 1200W | NZXT | $270 |
| Case | Open-air test bench | GamersNexus | - |
Our hardware capture system is as follows:
Hardware Capture VR Test Bench

| Component | | Provided by | Price |
| --- | --- | --- | --- |
| CPU | Intel i7-4790K | GamersNexus | $330 |
| Cooler | Stock | GamersNexus | - |
| Motherboard | Gigabyte Z97X Gaming 7 G1 BK | GamersNexus | $300 |
| RAM | HyperX Genesis 2400MHz | HyperX | - |
| GPU | ASUS GTX 960 Strix 4GB | ASUS | - |
| Storage 1 | Intel 750 SSD 1.2TB | BS Mods | $880 |
| Capture Card | Vision SC-HD4 | NVIDIA | $2,000 |
| PSU | Antec Edge 550W | Antec | - |
| Case | Open-air test bench | GamersNexus | $250 |
The hardware capture system is the most important. We need to sustain heavy IO, and so use a 1.2TB Intel 750 SSD as provided by our friends at BS Mods. The 1.2TB capacity isn’t just for show, either: Our VR capture files can run 30-50GB per capture. GamersNexus uses an in-house compression script (programmed by Patrick Lathan & Steve Burke) to compress our files into a playable format for YouTube, while also allowing us to retain the files without high archival storage requirements. The files compress down to around 200-500MB without perceptibly losing quality for YouTube playback.
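The script itself isn’t reproduced here, but for readers curious about the general approach, a minimal sketch of this kind of re-encode looks something like the below. The codec, CRF value, preset, and filenames are illustrative assumptions, not the actual settings our script uses.

```python
# Hypothetical sketch: re-encode a raw VR capture into an H.264 file for archival/YouTube.
# The codec, CRF, preset, and filenames are assumptions for illustration only.
import subprocess

def compress_capture(src: str, dst: str, crf: int = 20) -> None:
    subprocess.run(
        [
            "ffmpeg",
            "-i", src,            # raw capture file (tens of GB)
            "-c:v", "libx264",    # H.264 is broadly compatible with YouTube uploads
            "-preset", "slow",    # favor compression ratio over encode speed
            "-crf", str(crf),     # constant-quality target; lower = larger file, higher quality
            "-c:a", "aac",        # re-encode audio, if the capture includes it
            dst,
        ],
        check=True,
    )

compress_capture("raw_capture.avi", "compressed_capture.mp4")
```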
Prior to compression, though, we analyze the files with an extractor tool, which looks at color overlays frame-by-frame to determine (1) if any frames were dropped by the capture machine (they never are, because our storage device is fast enough and the $2,000 capture card supports the throughput), and (2) if any frames were dropped by the game machine. The latter happens when the DUT cannot sustain fluid playback, e.g. if a low-end GPU or CPU gets hammered by the VR application in a way that causes drop frames, warp misses, or other unpleasant frame output.
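The extractor itself is internal tooling, but the underlying idea is simple: each captured frame carries a colored overlay that advances in a fixed sequence, so a color that fails to advance means a repeated frame. A minimal sketch of that idea, with an assumed overlay position and color palette (both placeholders, not our actual values), could look like this:

```python
# Minimal sketch of overlay-based analysis on the hardware capture. The overlay
# region (top-left 16x16 px) and the four-color palette are assumptions for
# illustration; the real extractor's layout and color sequence differ.
import cv2
import numpy as np

PALETTE = [  # assumed repeating overlay sequence, in BGR order
    np.array([0, 0, 255]),      # red
    np.array([0, 255, 0]),      # green
    np.array([255, 0, 0]),      # blue
    np.array([255, 255, 255]),  # white
]

def classify(frame: np.ndarray) -> int:
    """Return the index of the palette color closest to the overlay region."""
    patch = frame[0:16, 0:16].reshape(-1, 3).mean(axis=0)
    return int(np.argmin([np.linalg.norm(patch - c) for c in PALETTE]))

def count_repeated_frames(path: str) -> int:
    """Count frames whose overlay color failed to advance (i.e., repeated frames)."""
    cap = cv2.VideoCapture(path)
    repeats, prev = 0, None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        label = classify(frame)
        if label == prev:   # overlay color did not advance: the game machine repeated a frame
            repeats += 1
        prev = label
    cap.release()
    return repeats

print(count_repeated_frames("raw_capture.avi"))
```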
The VR gaming machine outputs DisplayPort to our monitor and HDMI to a splitter box. One splitter output feeds into the capture machine via a splitter cable and into the capture card; the other goes either to the headset or to the HTC Vive Link Box, which then connects to the headset & the game machine (USB, audio, display).
In total, it’s roughly ten cables in true octopus fashion to connect everything. The cables must be connected in the correct order to get everything working; no output will reach the HMD if they’re connected out of sequence.
Data Interpretation: We’re Still Learning
The gaming machine, meanwhile, is running FCAT VR software capture to intercept frame delivery at a software level, which then generates files that look something like this:
Each file contains tens of thousands of cells of data. We feed this data into our own spreadsheets and into FCAT VR, then generate both chart types from that data. The hard part, it turns out, is still data interpretation. We can identify what a “good” and a “really bad” experience is in VR, but identifying anything in between is still a challenge. You could drop 100 frames per minute on DUT A and 0 on DUT B, and the experience would be perceptibly/appreciably the same to the end user. If you think about it, 100 dropped frames in a 5400-interval period is still only about 1.85% of all intervals missed, which isn’t all that bad. Likely not noticeable, unless they’re all clumped together and dotted with warp misses.
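The arithmetic behind that 1.85% figure is straightforward; a quick one-off check in Python, using the numbers from the example above:

```python
# 90 Hz over a 60-second pass gives 5400 refresh intervals, so 100 dropped
# frames in that window is under 2% of all intervals.
REFRESH_HZ = 90
TEST_SECONDS = 60

intervals = REFRESH_HZ * TEST_SECONDS        # 5400
dropped = 100
print(f"{dropped / intervals:.2%} of intervals missed")   # 1.85%
```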
Defining the Terminology
We still haven’t defined those terms, so let’s do that.
Drop Frame: When the VR runtime takes the prior frame and modifies it to incorporate the latest head position. The VR HMD is reprojecting or adjusting the prior frame, but failing to update animation in time for the next runtime hit. With regard to animation, this is a dropped frame. With regard to user experience, we are updating in a way that avoids inducing user sickness or discomfort (provided there aren’t too many in rapid succession). We can get synthesized frames out of this.
Warp Miss: The VR HMD has missed the refresh interval (90Hz, so every ~11ms +/- 2ms), and doesn’t have time to show a new frame. There is also not enough time to synthesize a new frame. We’ve instead replayed an old frame in its entirety, effectively equivalent to a “stutter” in regular nomenclature. Nothing moves. Animation does not update and head tracking does not update. This is a warp miss, which means that the runtime couldn’t get anything done in time, and so the video driver recasts an old frame with no updates.
Delivered Frame: A frame delivered to the headset successfully (see also: Delivered Synthesized Frame).
Unconstrained FPS: A convenient metric to help extrapolate theoretical performance of the DUT when ignoring the fixed refresh rate (90Hz effective v-sync) of the HMD. This helps bring VR benchmarks back into a realm of data presentation that people are familiar with for “standard” benchmarks, and aids in the transition process. It’s not a perfect metric, and we’re still up in the air about how useful it is; for now, we’re showing it. Unconstrained FPS is a calculation of 1000ms/AVG frametime (a short worked example follows at the end of these definitions). This shows what our theoretical frame output would be, given no fixed refresh interval, and helps with the challenge of demonstrating high-end device advantages over DUTs which may otherwise appear equivalent in delivered frame output.
Average Frametime: The average time in milliseconds to generate a frame and send it to the HMD. We want this to be low; ideally, this is below 11ms.
Interval Plot: A type of chart we’re using to better visualize frame performance over the course of the headset’s refresh intervals. In a 60-second test, there are 5400 refresh intervals.
Warp misses are miserable experiences, particularly with multiple in a big clump. Warp misses intermixed with drop frames illustrate that the hardware cannot keep up with the game, and so the user experiences VR gameplay that could feel unpleasant physiologically in addition to mechanically.
Learn more about these definitions here, in our previous content. We also spoke with Tom Petersen about these terms.
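As a quick illustration of how these numbers relate, here is a minimal calculation over a handful of made-up frametimes (not measured data):

```python
# Minimal sketch of the metrics defined above, computed from per-frame times in
# milliseconds. The values below are invented for illustration, not test results.
frametimes_ms = [9.4, 10.1, 8.9, 11.6, 9.8]

avg_frametime = sum(frametimes_ms) / len(frametimes_ms)
unconstrained_fps = 1000.0 / avg_frametime   # ignores the 90 Hz refresh cap

BUDGET_MS = 1000.0 / 90                      # ~11.1 ms to make each refresh interval
over_budget = sum(1 for ft in frametimes_ms if ft > BUDGET_MS)

print(f"avg frametime: {avg_frametime:.2f} ms")
print(f"unconstrained FPS: {unconstrained_fps:.1f}")
print(f"frames over the ~{BUDGET_MS:.1f} ms budget: {over_budget}")
```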
Games Tested
For now, we’re testing with these games:
HTC Vive Benchmarks
- Raw Data (High + Epic settings tested, mostly using High for publication, MRS0; first level)
- Everest (settings configured so that all bars are at equal length, two ticks down from max, MRS/LMS0; Khumbu Icefall)
- Arizona Sunshine (Very High textures & quality, Advanced CPU features; horde round 1)
Oculus Rift Benchmarks
- DiRT Rally (High settings, Advanced Blending enabled, standardized bench)
- Elite: Dangerous (VR High in VR training level)
Test Variance & The Scientific Method
As for benchmarking itself, the level of uncertainty for VR test execution is higher than for “standard” game benchmarking. This is for a few reasons, one of which is that there is naturally some variance from one pass to the next, since there’s such a large human element. Many of the King of the Hill games also have randomized enemy spawns, which can impact things further.
There is also concern about the HMD and runtime modifying frames in ways we can’t fully see, in order to better deliver them. The Rift, for instance, can reduce resolution at the borders of the output when necessary, thus permitting a smooth experience without impacting the visual quality of what’s immediately in front of the eyes. Fortunately, this is only done in the Portal tech demo, and does not come into play with our test titles. Choosing VR titles for testing will require more diligence from media outlets than usual game testing, as there are a lot of adaptive technologies at play; implementing just one of them could mean that a GTX 1060 or RX 480 appears equivalent to a GTX 1080 in delivered framerate (90FPS), but is actually spitting out lower quality images. This re-introduces old benchmark concerns of image quality, not just framerate.
We have a few months of experience running VR benchmarks, but that’s nowhere near our near-decade of standard benchmarking experience. This means that we are still learning, and that these are the early days of VR benchmarking. With dozens of data sets for the games we’ve benchmarked, we are fortunately able to devise a margin of error from one pass to the next. We have built margin-of-error bars into the bar graphs to illustrate our level of uncertainty at this time (+/- 1.5%), which also accounts for randomization in enemy spawns, potential technician head tracking variance, and locomotion variance. This means that, for instance, an unconstrained framerate of 160FPS and an unconstrained 155FPS could be effectively equal as far as we’re concerned. There is not a significant difference in this example (read: the two values are performing within margin of each other, or very close to it).
We might also have, for instance, average frametime variance of +/- 0.25ms for some games (it could be up to ~0.5ms in other games, which we’ve excluded from today’s tests). This is mostly unavoidable at this stage. Part of taking a scientific approach is acknowledging when the difference between two numbers may not be statistically significant, and outlining potential inaccuracies that result from the test methodology. For us, that is +/- 1.5% in unconstrained framerate and +/- 0.25ms in frametimes for the tested titles. This means that, for instance, the R7 1700 and i7-7700K may produce frametimes of 9.4ms and 9.8ms, which we can consider effectively equal, especially when taking the more ad-hoc perceptual approach to data analysis (does the user see this difference? No. This is different from 144Hz arguments, since the HMD physically does not allow >90FPS). Similarly, we might see the R7 1700 produce a 9.4ms AVG frametime in one pass, then 9.7ms in the next. The results are not perfectly consistent, but are close enough to serve as benchmarks given an understanding of the variance.
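We haven’t formalized the statistics beyond these working margins, but as a rough illustration of how a run-to-run margin can be derived from repeated passes (the pass results below are made up, not measured data):

```python
# One way to derive a run-to-run margin from repeated passes of the same test:
# take the sample standard deviation relative to the mean. This is a sketch with
# invented numbers, not the exact statistic behind our +/- 1.5% figure.
import statistics

unconstrained_fps_passes = [158.2, 160.9, 159.5, 161.3]   # hypothetical repeat passes

mean_fps = statistics.mean(unconstrained_fps_passes)
stdev_fps = statistics.stdev(unconstrained_fps_passes)
margin_pct = stdev_fps / mean_fps * 100

print(f"mean: {mean_fps:.1f} FPS, margin: +/- {margin_pct:.2f}%")
```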
How to Read the Charts
The left axis shows frametimes in milliseconds – lower is better here, and we have roughly an 11ms window in which to deliver each frame on time. The Rift has some programming in the runtime that allows us to go a little over the 11ms refresh window (closer to 13ms). The magenta line represents the hardware capture, while the red line represents the software capture. The hardware capture cannot see what’s going on at a software level, and so only validates findings by illustrating dropped frames never delivered to the headset. The software line is more of what we’re interested in; the HW capture line also helps validate that the SW capture is functioning correctly.
The lower third of the chart is an interval plot. This one helps visualize delivered synthesized frames, dropped frames, and delivered new frames.
We have stripped the bar labels, as we’re not in the data analysis section yet. The point is to show how to read the charts, not present data (yet).
This chart plots delivered FPS to the headset, which is the most important metric, then drop frames as the second most important metric, and unconstrained FPS as a calculation. Unconstrained FPS is an imperfect prediction of how many frames would be delivered per second given an HMD without VSync forced, since the HMDs refresh at a hard 90Hz. This is calculated by taking 1000ms and dividing it by the average frametime, which is done in the new FCAT VR tool automatically.
The two hard metrics are Delivered FPS – which we can validate with an effectively infallible hardware capture – and drop frames, also validated by hardware capture. Drop frames are an absolute measure of total frame count over the test period, which is 60 seconds. At 90Hz, a 60s test pass produces 5400 refresh intervals on the headset.
A final note: we have margin-of-error bars on these charts that are based on multiple test passes. We hope to tighten these margins in the future, but VR testing is still young, so we are leaving some extra room for error. We currently have a test variance of roughly +/- 1.5%.
This is an average frametime chart; it’s just an averaged view of the over-time chart.
Time to get on to GN’s first round of published VR benchmarks, featuring the AMD Ryzen R7 1700 vs. Intel i7-7700K. Both CPUs are similarly priced (~$330 and ~$345), and both advertise VR prowess.