New CPU Test Methodology 2020: Code Compile, Updated Gaming, Transcoding, & More

By Patrick Lathan & Steve Burke Published May 02, 2020 at 11:45 pm

It’s time again for our CPU testing methodology to be updated, alongside the test bench. We’ve done some significant streamlining behind the scenes that makes these tests easier to run and the results easier to process accurately, but on the public side, we’ve completely overhauled the software suite we’re using. Last time we updated our testing methodology, we added a code compile benchmark that was short-lived. The test featured GCC, Cygwin, some other environments, and ended up being a top-to-bottom sort by cache. We ditched that test (and consulted Wendell of Level1 Techs on it in this video), and we’re just now replacing it. New, more useful code compile benchmarking has been added for 2020, alongside Handbrake H.264 to H.265 transcoding (ranked by time), updated Adobe Premiere video rendering and Adobe Photoshop benchmarks, updated file compression and decompression benchmarks, and more. Gaming gets a total overhaul, too, with a big suite of new games added.

Additionally, we’ve updated several existing game and production benchmarks from last year’s suite, with a few left unchanged. This is to keep producing data that we can still compare to old data, which is useful for rapid analysis of parts that may not have been re-tested in the current year. For example, if we were testing a 10700K and wanted to reference its performance vs. a 2600K, but didn’t have a fresh retest, we could reference data from GTA V, Tomb Raider, Civilization, and ACO to form an understanding without fully retesting. We try to limit this, but time often gets the better of us, and it’s good to have reference points to ensure ongoing accuracy.

Here’s the suite (graphics settings noted in each section):

GAMES:

  • Red Dead Redemption 2 DX12
  • Red Dead Redemption 2 Vulkan
  • The Division 2 DX12
  • Total War: Three Kingdoms DX12
  • Shadow of the Tomb Raider DX12
  • F1 2019 DX12
  • Hitman 2 DX12
  • Grand Theft Auto V DX11
  • Civilization VI DX11
  • Assassin’s Creed: Origins DX11

PRODUCTION:

  • Chromium Windows build (using clang-cl)
  • Blender 2.81 GN Logo
  • Blender 2.81 GN Monkey Heads
  • 7-Zip Compression
  • 7-Zip Decompression
  • Adobe Photoshop
  • Adobe Premiere
  • Chaos Group V-Ray
  • Handbrake H.264 to H.265

VALIDATION: 

  • Cinebench R15 (internal use only)
  • CPU-Z Validation report (internal use only)
  • HWiNFO64 (frequency, thermal validation)

We’ve been hard at work running our inventory of CPUs through this new test suite and recording the results, and that work is ongoing. In other words, we know not every CPU is on the chart yet, but this isn’t a review. Our selection of CPUs here was meant to cast a wide net to try and locate shortcomings in the testing methodology. We’ve found them, at this point, but we’re also going to publicly present a few of those shortcomings (and our solutions to fix them). Again, rather than asking “WhY iSn’T MY CpU oN HeRe?”, remember that the goal today is an outline of the methodology.

Starter CPUs include, but aren’t limited to:

  • AMD TR 3990X (Amazon)
  • AMD TR 3970X (Amazon)
  • AMD R9 3900X (Amazon)
  • AMD R7 3700X (Amazon)
  • AMD R7 2700
  • AMD R5 3600 (Amazon)
  • AMD R5 3500X
  • AMD R5 3400G
  • AMD R5 1600 (AF) (Amazon)
  • Intel i9-10980XE (Amazon)
  • Intel i9-10900X (Amazon)
  • Intel i9-9900K (Amazon)
  • Intel i7-9700K (Amazon)
  • Intel i7-8700K
  • Intel i7-8086K
  • Intel i7-5775C
  • Intel i5-9600K
  • Intel Pentium G5500

NVIDIA driver version 445.75 was used for all of today’s tests on the EVGA RTX 2080 Ti XC Ultra.

All tests are run with the Windows High Performance power plan selected or the AMD Ryzen High Performance power plan. The newest Windows version at time of writing was used, and we will update as necessary for major changes.

Separate Windows installs are used for each chipset--i.e. distinct boot drives for TRX40, X570, X299, Z390, Z270, and Z97. We never mix-and-match installs and have reserved SSDs for each platform tested (we’ve spent a lot on SSDs). Right now, unless otherwise stated for PCIe Gen 4 testing or validation, we are using Samsung 860 Pro 512GB drives for all of our host drives.

We use four 8GB sticks of G.Skill TridentZ RGB 3200MHz memory unless otherwise noted, with all primary and with select secondary and tertiary timings manually controlled for test consistency. We’ve found this to be very important for test-to-test consistency, as allowing the motherboard to auto-select timings (which it will do) could ruin the data. We use four 8GB sticks of HyperX 2400MHz memory for DDR3 testing, but DDR3 platforms frequently have trouble maintaining this speed with four sticks of RAM installed, so we make note of the specific frequency used for each test using DDR3.

NZXT Kraken X62 280mm CLCs at max pump and fan speed are used for all CPUs not considered HEDT parts. This is an intentional decision that we have followed for the last ~4-5 years of testing CPUs, although it has particular relevance given how thermally sensitive AMD’s CPU performance has become with Ryzen 3000’s Precision Boost 2, and how sensitive Intel’s will soon be with TVB. CPUs that are designated HEDT, like Intel X99 and X299 parts and AMD Threadripper parts, use 360mm coolers instead. We draw the line at the platform used. All platforms have full IHS coverage, including Threadripper.

Moving forward, we’ll be listing frequency and not voltage on charts for our OC tests. As a general rule, we do quick-and-dirty OCs for benchmarking, with the goal of completing all tests without crashing, throttling, or hitting dangerous temperatures. The exception to this will be our extreme overclocks with liquid nitrogen. Those only occasionally wind up on these charts, but we’ll note their voltages with an LN2 disclaimer. The voltages we use for review purposes under normal liquid coolers aren’t necessarily the minimum we could get away with, and they certainly aren’t a “do this at home” recommendation from us, so to avoid confusion and to avoid users copying our settings without realizing this, they will no longer be included on charts. We don’t want people damaging their CPUs in long-term use. Ours only go through a few hours of testing at a time, so it’s of no concern for us. As for new CPU reviews, we always try for a lower voltage if we have time in the review cycle, and so that’ll be discussed in the review relating to the chip being tested. You’ll need to check the reviews as they come out for the settings used, alongside discussion of whether we had enough time to really optimize.

NEW TESTS:

We’ll start by going over our new tests, including two new production or workstation tests and several new game tests. We’ve also modified or resurrected several tests from our previous bench suite, which will be discussed toward the end of the content.

Chromium Build

We’ve abandoned the difficult-to-wrangle GCC benchmark we previously experimented with, and we’ve moved to building Google Chromium on Windows using Google’s thorough and user-friendly instructions. Unfortunately, this means we’re also leaving behind the interesting cache-dependent results from the GCC benchmark, but it’s worth it. We discussed the shortcomings of our previous GCC compile in a video where we were joined by Wendell of Level1 Techs, who helped us break down why the test was basically the same as sorting the CPUs by cache. You really didn’t need to do the test to know which would perform best in it -- it was a cache sort, and not necessarily realistic for user environments.

Our newer process uses MSVC tools, but performs the compilation using clang-cl. The build system is Ninja, which has the added benefit of being able to generate a detailed report afterwards. We selected this project because it’s easy to configure, easy to reproduce, it’s explicitly compatible with Windows, and it’s a large project that someone might actually want to build themselves at home. The one significant problem we’ve run into is that extremely high core count processors bottleneck on the 32GB of memory that we use for our standard test bench, forcing the system to page out to the SSD, making the project take significantly longer as a whole. We’ve corrected this by using 64GB of memory in specific cases; so far this has only proven necessary on the 64C/128T 3990X.
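As a rough illustration of how a build like this can be timed, the sketch below wraps an arbitrary command in a wall-clock timer. The `time_command` helper is a hypothetical name used only for this example, not part of Ninja or Chromium’s tooling, and the placeholder command should be swapped for whatever build command Google’s instructions give for your checkout.

```python
import subprocess
import sys
import time


def time_command(cmd, cwd=None):
    """Run a command to completion; return (exit_code, elapsed_minutes)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd, cwd=cwd)
    elapsed_minutes = (time.perf_counter() - start) / 60.0
    return completed.returncode, elapsed_minutes


# Placeholder command so the sketch runs anywhere; replace it with the
# real build command (and cwd) for the project being compiled.
exit_code, minutes = time_command([sys.executable, "-c", "print('build done')"])
print(f"exit code {exit_code}, {minutes:.3f} minutes")
```

Wall-clock time is the useful figure for this kind of test, because time lost to paging out to the SSD shows up as elapsed time rather than as CPU time.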

1 cpu methods chromium compile benchmark

Here’s a quick sample chart. Please remember that we’ve only just started using our new test methods, so charts are explicitly not fully populated yet. We used parts that might seem random to the viewer, but they were chosen to analyze results across the price spectrum to determine if what we were doing made sense for testing.

The 3990X and 3970X have such an overwhelming core count advantage that they top the chart handily, but frequency and core count both play a part here, giving Intel an edge in some instances. The overclocked 8C/16T 9900K, for example, almost makes up the gap and ties the 10C/20T 10900X, while the 8C/8T 9700K outperformed the 8C/16T R7 2700 at stock frequencies.

AMD’s top-of-the-line 3950X does very well here, beating out even the Intel 10980XE. Intel and AMD CPU frequencies aren’t comparable to each other, but we can compare the 3GHz base/4.6GHz boost 10980XE to the rest of the Intel stack and know that it’s a relatively low-frequency HEDT part when compared to gaming-focused CPUs like the 3.6GHz base/5GHz boost 9900K, while the 3950X is still part of AMD’s non-HEDT lineup and is therefore a relatively fast CPU compared to other Ryzen 3000s with a single-core boost of 4.7GHz. This is part of the reason why the 16C/32T 3950X can beat the 18C/36T 10980XE in the Chromium test.

As stated earlier, the 3990X is actually too fast for its own good when working with limited memory: If you don’t buy enough RAM for your 3990X build and you intend to do something similar to this, it’s possible your workload would run better on a cheaper CPU with a more appropriate amount of RAM. The 3990X with 64GB takes 22 minutes, but requires 209 minutes with 32GB of RAM due to SSD reliance. Fast I/O would also be recommended for instances where this is unavoidable. The 3400G sets the floor at 262 minutes, the 3990X leads, and the 3600 manages a center position at 115 minutes, flanked by the i7-9700K at 5.1GHz and at stock.
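To put the memory penalty in perspective, here is a small worked calculation using only the build times quoted above. The `slowdown` helper is a name made up for this sketch.

```python
def slowdown(slow_minutes, fast_minutes):
    """How many times longer the slow run takes than the fast run."""
    return slow_minutes / fast_minutes


# Build times in minutes, taken from the paragraph above.
tr_3990x_64gb = 22
tr_3990x_32gb = 209
r5_3600 = 115

print(f"3990X, 32GB vs 64GB: {slowdown(tr_3990x_32gb, tr_3990x_64gb):.1f}x longer")
print(f"3990X (32GB) vs R5 3600: {slowdown(tr_3990x_32gb, r5_3600):.1f}x longer")
```

The memory-starved 3990X takes 9.5x as long as the same chip with 64GB, and roughly 1.8x as long as the 3600.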

2 cpu methods 3950x chromium 1

2 cpu methods 3950x chromium all

This chart will be barely legible, but the important parts are clear. We did one pass of the Chromium build with HWiNFO running in the background to reveal that the CPU maintains high usage throughout the test, but rarely hits 100% usage on all cores. This allows the limited-core boosting behavior to come into effect, with single cores occasionally hitting the maximum 4.7GHz boost clock. We’ve intentionally zoomed in on this chart’s scale, despite not normally liking doing that, just to help make the data a little bit more discernible. The important take-away is that boost gets leveraged here.

3 cpu methods 3950x blender all

Contrast this with the Blender workload, which maintains constant 100% load on all cores and keeps all core clocks under 4GHz. This is expected behavior and is how AMD Ryzen Precision Boost 2 works, not to be confused with Precision Boost Overdrive. We’ve been over this countless times before, but it’s worth reiterating these points.

Handbrake Transcoding

We’ve also added Handbrake to our benchmarks to accompany compile testing. When the day comes that we need more room on the massive storage server that Wendell helped us to build, we plan to run our old Handbrake video compression script as well as transcoding older videos from H.264 to H.265, which inspired us to add this test to our suite. We’ll be using the AMD VCE hardware encoder on our RX 5700 for transcoding when that time comes, but for our CPU benchmarking, we’re sticking with the CPU-only x265 encoder. The input is a real video that was uploaded to our channel and has already been stored at 1080p.

4 cpu methods handbrake

It’s a known characteristic of the encoder that it has limited scaling with thread counts higher than 16, so although the Threadripper chips top the chart again here, they’re not much faster than parts with significantly lower core counts like the 3900X and 10980XE. The 9900K is at the sweet spot for thread count and frequency, so with a completion time of 15.6 minutes, it’s closer than might be expected to Intel HEDT parts like the 10900X at 14.7 minutes and the 10980XE at 12.4 minutes. AMD’s 3950X lands near the top of the chart again with a completion time of 11.4 minutes, with the Threadripper chips finishing only slightly faster. Diminishing returns from increased core count allow the 3970X to finish ahead of the 3990X here, since its higher stock frequency matters more than having only half the threads.

Total War: Three Kingdoms [DX12]

Total War: Three Kingdoms is the most recent main-line Total War title. It includes three baked-in benchmarks, of which we’ve selected Battle and Campaign to mirror our previous Total War: Warhammer 2 benchmarks. As before, Battle covers two armies fighting on the RTS map from several camera angles, and Campaign uses sweeping camera movements across the strategy map to measure performance there. We’re testing at both 1080p and 1440p for the battle benchmark, but only 1080p for the campaign.

5 three kingdoms battle 1080p

First up is the Battle chart, where we see the overclocked 9900K topping the chart, as it so often does in games, at 158FPS AVG at 1080p. We’ve achieved 5.1-5.2GHz overclocks on various 9900K and 9900KS samples, the fastest of any CPUs in our inventory without using extreme cooling. 8 cores and 16 threads is the most any modern mainstream game desires, so the low core count relative to high-end Ryzen parts isn’t a concern. A 64C/128T part like the 3990X would have had a problem in our old Total Warhammer benchmark, but updates to the engine allow the Threadripper chips to pass with framerates that aren’t the best, but aren’t bugged out. Those high core-count parts are only really tested to determine if the gaming experience is ruined by compatibility issues, since no one should be buying them for a gaming-only machine anyway. The 3950X also had no issues, but at 140FPS AVG it’s essentially tied with the 3700X and even the overclocked 6C/12T 3600.

Frametimes are overall improved versus previous Total War games we’ve tested. The consistency is up overall, with the scaling relatively uniform versus average FPS across all the CPUs. We see a bigger hit to frametimes on the 3400G part by nature of its overall reduced performance.

6 three kingdoms battle 1080p deviation

If you’re curious about how consistent the data is run-to-run, here’s a quick chart we threw together to better understand how accurate the test data is. This shows the square root of the variance amongst the data, or standard deviation, and helps illustrate the deviation from the mean for a group of test data against a single CPU. A few notes that are very important here: First of all, we run four test passes as a baseline and average that data. Secondly, because a lot of people forget this, remember that a test pass doesn’t produce just one set of numbers. It actually can produce thousands or even tens of thousands of data points to be averaged per run, depending on how high the FPS is for that particular game. What we’re actually doing is this: For each test pass, we average the frametime, or the average interval required to draw the next frame, against the entirety of that test pass. We then do that three more times, a total of four, and then average those sets of four averages together. You end up with no fewer than tens of thousands of data points averaged together in nearly all instances of AVG FPS, with an exception for very low-scoring parts. For this chart, the standard deviation is about 0.6 for 9900K AVG FPS, 0.9 for 1% lows, and 2.6 for 0.1% lows. We traditionally expect lows to have higher variance in general, because it’s more sensitive due to having less data to draw from. The most variable result we saw was in our 9900K stock pass for 0.1% lows. This is still better than Total War games from a few years ago, but it’s more variable than we’d typically like. When we see results like this, our next step is to retest the CPU for validation. The next most variable was the 3900X at 5.5 for 0.1%. 
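The averaging described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our actual tooling: the function names are hypothetical, the lows are assumed to be the framerate over the slowest 1% or 0.1% of frames in a pass, and the population standard deviation is used. The frametimes are made up.

```python
import statistics


def avg_fps(frametimes_ms):
    """Average FPS for one pass, from the mean frametime in milliseconds."""
    return 1000.0 / statistics.fmean(frametimes_ms)


def low_fps(frametimes_ms, fraction):
    """FPS over the slowest `fraction` of frames (assumed definition of lows)."""
    slowest_first = sorted(frametimes_ms, reverse=True)
    count = max(1, int(len(slowest_first) * fraction))
    return 1000.0 / statistics.fmean(slowest_first[:count])


def summarize(passes):
    """Per metric: (mean across passes, standard deviation across passes)."""
    per_pass = {
        "AVG FPS": [avg_fps(p) for p in passes],
        "1% low": [low_fps(p, 0.01) for p in passes],
        "0.1% low": [low_fps(p, 0.001) for p in passes],
    }
    return {
        name: (statistics.fmean(values), statistics.pstdev(values))
        for name, values in per_pass.items()
    }


# Made-up frametimes (ms) for four passes, purely for illustration.
passes = [
    [6.5] * 990 + [9.0] * 9 + [14.0],
    [6.4] * 990 + [9.5] * 9 + [20.0],
    [6.6] * 990 + [9.2] * 9 + [16.0],
    [6.5] * 990 + [9.1] * 9 + [24.0],
]
for name, (mean, deviation) in summarize(passes).items():
    print(f"{name}: {mean:.1f} (standard deviation {deviation:.1f})")
```

Because the 0.1% low draws on only a handful of frames per pass, a single spike moves it far more than it moves the average, which is why its deviation comes out larger.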

Overall, we’re very happy with these numbers, especially AVG FPS, where our greatest standard deviation is just 1.0, and think that even the 0.1% numbers are within reason for a number which we have always been careful to present as being more variable than AVG FPS. This data looks good and consistent overall, and so we can make more confident statements about comparative product performance. Anyone presenting a 0.1% low number is doing so knowing that it is inherently more prone to variance, but as long as that is stated to and known by viewers, that’s OK. It’s still extremely useful data, we just have to remember this caveat.

One more key note that’s important: run-to-run variance isn’t inherently a test problem or technician error -- it could also be an important indicator of a CPU problem or of CPU boosting behavior. In some instances, higher run-to-run variance may be indicative of a CPU throttling (that wasn’t happening here, though) or of a CPU hitting a power duration limit.

7 three kingdoms battle frametimes 9900k 1

7 three kingdoms battle frametimes 9900k all

Here’s a look at frametimes of that 9900K result, just so we can better understand the numbers. This is what our data looks like before we add the rate abstraction layer and do all of the averaging to make legible numbers. These charts will still pop up on occasion for important data, like those spikes you’re seeing in the first pass of the 9900K run. We find that an 8-12ms excursion from the previous frame time starts to become noticeable to players as a hitch or stutter. The 9900K is overall consistent in delivery in this game, but those occasional, large spikes are what created the higher standard deviation than in other tests. It wasn’t test error or technician error, but an outlier to 24ms that distorted the average, followed by several other spikes toward 20ms and 17ms. Remember that 60FPS is about 16.7ms per frame, just as a baseline, so it’s an objectively good experience overall with a few outliers. No major stutters, but enough to distort the low-end data that’s inherently more limited in quantity.
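For anyone who wants to look for this kind of spike in their own frametime logs, here is a minimal sketch. The helper names and the sample trace are made up for illustration, and the 8ms default threshold is simply the low end of the 8-12ms range mentioned above.

```python
def fps_to_ms(fps):
    """Frametime in milliseconds for a steady framerate."""
    return 1000.0 / fps


def find_excursions(frametimes_ms, threshold_ms=8.0):
    """List (frame_index, frametime_ms, change_ms) for large frame-to-frame jumps."""
    hits = []
    for i in range(1, len(frametimes_ms)):
        change = frametimes_ms[i] - frametimes_ms[i - 1]
        if abs(change) >= threshold_ms:
            hits.append((i, frametimes_ms[i], change))
    return hits


# Made-up trace (ms): mostly ~6.5ms frames with one 24ms spike.
trace = [6.5, 6.4, 6.6, 24.0, 6.5, 6.6, 6.4]
print(f"60 FPS = {fps_to_ms(60):.1f} ms per frame")
for index, frametime, change in find_excursions(trace):
    print(f"frame {index}: {frametime:.1f} ms ({change:+.1f} ms vs previous)")
```

Note that a single spike is flagged twice, once on the way up and once on the way back down, since both are large changes relative to the previous frame.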

We also have 1440p tests for this one that are GPU-bound, but we’ll talk about those more in the upcoming CPU series.

8 three kingdoms campaign 1080p

Surprisingly, average FPS is lower in the campaign benchmark for most CPUs, which hasn’t always been true in past iterations of this test in other Total War games. The overclocked 9900K is again at the top of the chart with a 152FPS average, and the 3950X is again in the middle and at the same performance level as the 3700X, at 127 and 125 FPS averages respectively. All three CPUs have 0.1% lows above 60FPS, as do almost all of the CPUs we’ve tested so far, which is a welcome change from the frequent stuttering issues we’ve encountered with past Total War titles. We can see that frequency still matters most to Total War, illustrated best by the 9600K at 5.1GHz outdoing the 9900K stock CPU -- or roughly matching, at best. This is good data to have so that we understand how CPUs perform in each game, and is also part of why AMD runs lower on the charts. The 10980XE and 3990X establish that all those extra cores don’t matter if there’s no frequency to back them up in Three Kingdoms, but the 9900K at 5.1GHz shows that extra threads can post a difference when frequency is equal, shown by highlighting the 9700K at 5.1GHz. We noticed that variance is a little higher in this one, although still tight in tolerances, and our range is about 1.4FPS AVG.

We’ve completely cut Total War: Warhammer 2. It’s an aging game, and we don’t trust the DX12 version enough to switch it. Total War: Three Kingdoms fulfills every function that TWW2 did, it’s up-to-date, and it’s still relevant to our audience. Last we heard, the Total Warhammer games were planned as a trilogy, so we may see the series return to our benchmark suite in the future.

Division 2 [DX12]

We’ve found Division 2 to be a relatively undemanding game for CPUs that almost unavoidably bottlenecks hard with GPUs when using high settings and/or resolutions higher than 1080p. This title will definitely make its way into our next round of GPU benchmarks, but we’ve kept it in the CPU suite as well since it has reasonable performance on low-end CPUs and can show differentiation between Pentiums and Athlons without dropping to a completely unplayable framerate. There are fewer results on the final chart for now because we decided the level of GPU bottlenecking was unacceptable midway through testing and picked new settings for benchmarking; that chart will be repopulated over time, but let’s start with the unusable chart with bottlenecked, bad data.

9 division 2 bad data 1080p

Let’s take a look at the results we’ve discarded first, since they’re a prime example of what we have to avoid when CPU benchmarking. We initially chose the “high” settings preset and modified a couple settings to increase CPU load, with initially promising results. As the chart filled out, though, it became clear that CPUs were falling into two buckets and had a clear division: All Intel and Threadripper parts would average about 160 FPS, and all other AMD parts would average about 150 FPS, almost entirely unaffected by the frequency or core count of the CPUs except for the extreme bottom of the stack with the Pentium G5500 and R5 3400G. This indicates a definite non-CPU bottleneck, and one that favors Intel in this particular instance. We are no longer testing the CPU, we’re testing the GPU, and the results range is mostly noise from frametimes bouncing off of an artificial ceiling imposed by the GPU. These are still real results, but they’re more of a benchmark of our 2080 Ti than anything else and therefore aren’t much use to anyone looking for a CPU review. We switched to a settings profile based on the medium preset, which we’ll be using for our published results going forward.

10 division2 good data 1080p

The results we’ve gathered so far using a profile based on the medium preset are like something we’d expect to see from F1 2019, with the 9900K in particular breaking the 200FPS mark. By moving to lower settings we’ve gone beyond what a typical user actually wants, but like F1 2019, we’re making an intentional choice to force the CPU to be the limiting factor so we can make comparisons. The previous test showed the 3950X losing by a wide margin to the 9600K, but with the CPU-bound profile AMD’s CPU has a moderate lead. This still isn’t a title where high core counts can really flex, but the stack makes sense. Note that the 9900K is bottlenecking on the GPU at the top-end of the chart, hence why the 5.1GHz result is not meaningfully different from the stock result.

11 division2 good data 1440p

There’s even some scaling at 1440p, although very little. The top-ranking CPUs are all within variance and are clearly GPU-bound. The 9900K at 5.1GHz lost about 25% of its performance versus the 1080p result, thanks to the GPU getting hit too hard. Even the 3950X loses performance, despite not being fully maxed relative to the 9900K -- there’s some overhead here that lowers its ceiling. Same goes for the rest of the AMD CPUs -- even the low-end is getting limited, thanks to the higher-performing areas of the bench now being GPU-bound. Our experience with The Division 2 so far has made it a prime candidate for our GPU bench as well.

F1 2019 [DX12]

We need a game with known CPU hierarchical scaling that bypasses a GPU bottleneck while maintaining high settings, representing both a good benchmark scenario and real-world play. We’ve replaced F1 2018 with F1 2019. We’re using the DX12 version of the game, but this is basically the same benchmark as it was in 2018, right down to the track and weather used. As before, this is an undemanding game that can easily run at 200+ FPS, but the results are still CPU-bound and therefore useful for comparing relative performance.

15 f1 2019 1080p

The 5.1GHz 9900K leads the pack again at 280 FPS average, finally establishing more of a known-good CPU-only scaling, or CPU-mostly scaling. The first AMD chips appear around the 240 FPS mark, allowing the 9900K stock CPU a lead at 268FPS AVG, or about 12.1% versus the highest stock AMD CPU -- the 3700X. The 3900X and 3950X are practically tied at both their stock frequencies and when overclocked to 4.3GHz. This game will be a good indicator of full-on CPU performance in gaming, as will Civilization and some of our other older titles. Like always with F1, we’re seeing issues with 0.1% low consistency that we can’t explain well.

16 f1 2019 1440p

Even 1440p works well for a scaling demonstration, although it’s obviously becoming GPU-bound. The 9900K runs at 229FPS AVG at 5.1GHz, with the stock 9900K at 224FPS AVG. The margin has slimmed between these top-half CPUs, but there’s still a difference being shown.
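A quick way to see the margin slimming is to compare the uplift from the 9900K overclock at each resolution, using the averages quoted above. The `percent_gain` helper is a hypothetical name for this sketch.

```python
def percent_gain(overclocked_fps, stock_fps):
    """Percentage uplift of the overclocked result over stock."""
    return (overclocked_fps - stock_fps) / stock_fps * 100.0


# 9900K average FPS in F1 2019, from the text: (5.1GHz, stock).
results = {"1080p": (280, 268), "1440p": (229, 224)}
for resolution, (oc, stock) in results.items():
    print(f"{resolution}: {percent_gain(oc, stock):.1f}% from the overclock")
```

With these numbers the overclock is worth roughly 4.5% at 1080p and 2.2% at 1440p, which is the kind of shrinking return that points to the GPU becoming the limit.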

Red Dead Redemption 2 [Vulkan, DX12]

Red Dead has proven to be an absolute pain in the ass to benchmark for various boring reasons that we won’t go into, but it’s new, it’s demanding, and it has both Vulkan and DX12 support, so we’ve put the work in to include it in our suite. Both Vulkan and DX12 tests are run with settings as identical as possible, and generally our results have shown a higher framerate using Vulkan regardless of whether the game is CPU or GPU bound. We’re looking forward to seeing arguments about that in the comments section for the rest of our lives, particularly from people who have no idea what they’re talking about but read half of a comment on reddit somewhere. We’re aware of the -cpuLoadRebalancing command line flag that was suggested as a bugfix at launch, but this flag should no longer be necessary and we don’t use it.

12 red dead dx12 1080p

Red Dead Redemption 2 also becomes GPU bottlenecked at the upper end, with the 9900K showing no significant improvement from an overclock and barely placing above the stock 8700K. Still, there’s definitely scaling for CPUs lower on the stack. The R7 2700 and R5 1600AF both show positive scaling with overclocks here, the 3600 shows no improvement, and the 3900X actually averaged a couple FPS lower when locked down to 4.3GHz across all cores. The 3950X performed adequately here, but clearly can’t leverage its core count advantage and only tied the overclocked 3900X.

13 red dead vulkan 1080p

The Vulkan chart also shows a GPU bottleneck around the 130FPS mark, but more CPUs are able to hit it. The 9900K both stock and overclocked is tied with Intel CPUs further down the stack, even down to the 8700K and 8086K. The AMD CPUs seem to hit their own limit around 123FPS, with the 3950X averaging 122 FPS and most of the other 3000 series parts tying it. Overall, the Vulkan benchmark seems less useful to us as a CPU test with these settings, but we’re keeping it around as a comparison to the DX12 results and because it’s the API that the game defaults to.

13 red dead vulkan vs dx12 avg

Here’s the chart of Vulkan next to DX12 for AVG FPS. Overall, Vulkan trends higher in averages, despite having the buggier implementation back at launch. The Vulkan headroom is also more constrained on the CPUs, leaving us with less to compare scaling against. Still, at some point, real-world play has to be considered.

14 red dead vulkan vs dx12 lows

The 0.1% differences establish that Vulkan is superior as well. Its biggest downside appears to be implementation issues: some players have reported needing to use DX12 to prevent crashing, but we can’t test for that, so we’ll just test both APIs.

Red Dead 2 Launch-Day Benchmarks (Medium)

rdr2 1080p benchmark cpu

After seeing these results, and being reminded of our Red Dead 2 launch-day CPU benchmarks, we’re going to add a set of 1080p/Medium tests to show more scaling. If we pull one of our charts from the launch-day CPU benchmarks, you’ll see that the top-to-bottom ranking was wider. We’ll also drop 1440p with DX12 going forward, since DX12 seems the lower performing of the two and since 1440p is already of questionable usefulness. We’ll keep 1440p Vulkan. The end result will be Red Dead tested at 1080p, 1440p, DX12, Vulkan, and at both Medium and High settings. That should cover our bases.

Moving on, we can next talk about some of our validation and data verification tests to help ensure good data is being produced. 

HWiNFO Logging

17 cinebench validation all

We’ve made it part of our standard procedure to record two HWiNFO logs. The first is taken while running a full Cinebench R20 single-threaded pass, and the second is taken while rendering a special version of the GN Logo scene in Blender 2.81 for twenty minutes. This gives us a log of single-core and multi-core boost behavior and thermals, as well as a hard record that everything was running at the correct frequency during testing. We discard the results of the Cinebench run and the Blender render; none of the benchmarks we publish are run with logging software in the background. Logging software is only used in these discarded tests for verification.
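As an illustration of the kind of check a log like this enables, the sketch below reads a CSV export and confirms that a clock column stayed near an expected value, as it should for a fixed all-core overclock. Everything here is an assumption made for the example: the helper names, the column name, the tolerance, and the log excerpt are invented, and real logs name their columns differently. A stock CPU that boosts dynamically would need a different check.

```python
import csv
import io


def clock_range(csv_text, column):
    """Return (min, max) of a numeric column in a CSV log."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    return min(values), max(values)


def held_frequency(csv_text, column, expected_mhz, tolerance_mhz=25.0):
    """True if every logged sample stayed within tolerance of expected_mhz."""
    lowest, highest = clock_range(csv_text, column)
    return (lowest >= expected_mhz - tolerance_mhz
            and highest <= expected_mhz + tolerance_mhz)


# Made-up log excerpt with a hypothetical column name.
sample_log = "Time,Core Clock [MHz]\n0:00:01,5100.0\n0:00:02,5099.8\n0:00:03,5100.1\n"
print(held_frequency(sample_log, "Core Clock [MHz]", expected_mhz=5100))
```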

CPU-Z Report

In the same vein, we export a CPU-Z log file before every round of tests to have a record of the state of all hardware that can be referred to if a result seems questionable. This aids in our janitorial work later and helps automate some of the simpler processes that are required for useful data analytics. This is for internal validation and won’t be published, since it’s mostly the same as looking at a spec sheet on the vendor website.

RETURNING/MODIFIED TESTS:

Cinebench R15 multithreaded and singlethreaded (OpenGL eliminated)

Cinebench R15 is a benchmark that we haven’t published results from for some time now, but we’ve continued to use it as a quick internal test for consistency with older results. Since we’ve been running this test for years, we have years’ worth of data stored up, and if there’s ever a significant difference in score we instantly know that something has gone wrong and that we need to rerun our tests. We don’t publish the numbers because Cinebench’s usefulness as an actual benchmark is limited: the R15 benchmark runs too quickly to be a good test for high core count CPUs, and although the newer R20 benchmark addresses that problem, the results of the multicore test correlate almost directly with the number of CPU cores. We’re trying to provide valuable information here, not prove that 16 is higher than 8 over and over. If we’re going to publish results like that, we’d rather they be from some software that people actually use, like Blender. We run both the single-threaded and multi-threaded CBR15 benchmarks; we also ran the included OpenGL benchmark until recently, but we’ve eliminated that from the suite as of this year.

Adobe Photoshop Puget Systems benchmark

18 photoshop benchmark

Photoshop is a returning test. We’re continuing to use the Puget Systems “extended” benchmark, now with Photoshop version 21.0.3. We may choose to move away from this benchmark in the future as Puget Systems has made some changes to their Adobe benchmarking software, but for now the older version continues to work just fine. Photoshop continues to be far more responsive towards increases in frequency than increases in core count, and we like it for that reason: it’s a good example of a real-world enterprise workload that benefits from overclocking and boosting rather than just scaling directly with core count, as render workloads typically do.

Adobe Premiere in-house benchmark (Charts render eliminated)

19 premiere benchmark 1080

19 premiere benchmark 4k

We’ve tested Adobe Premiere for some time now, even before we published results for it regularly, but we now perform this test as part of our standard suite. We use an in-house benchmark that renders one 1080p H.264 project and one 4K H.264 project, currently using Premiere and Encoder versions 14.0.1. Both projects are real GN videos that have been posted to the channel and both use our normal render presets, so this benchmark is particularly useful to us when we’re shopping for new hardware for rendering videos. CUDA acceleration is enabled for this test.

Civilization VI Gathering Storm AI Benchmark

20 civ vi

With Civilization VI’s Gathering Storm update, the option to run the original Civ AI benchmark that we’d been using since we first downloaded the game was replaced with the much longer Gathering Storm benchmark. The developers have since remedied this, but not until we’d completely changed over testing to the longer benchmark, which we’ll continue to use going forward. The benchmark loads a late-game Civ save, ends a turn, and completes a full cycle of AI player turns. This is done five times within the in-game benchmark, and we run the benchmark four times, making this an extremely time-consuming but consistent benchmark. It’s also the only game benchmark we run that isn’t scored by FPS. Results for this title tend not to scale strongly with high core counts.
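To make the structure of this test concrete, here is a minimal sketch of how four passes of five turns each might be reduced to a single turn-time figure. The function name and the sample timings are hypothetical, and averaging per-pass means is an assumption; the article does not specify how the final number is calculated.

```python
from statistics import mean


def average_turn_time(passes):
    """Average the per-pass mean turn time (seconds) across all passes.

    `passes` is a list of benchmark passes, each a list of turn times.
    """
    if not passes or any(not turns for turns in passes):
        raise ValueError("every pass needs at least one turn time")
    per_pass_means = [mean(turns) for turns in passes]
    return mean(per_pass_means)


# Hypothetical data: four passes, five AI turn cycles per pass.
sample_passes = [
    [30.0, 31.0, 29.0, 30.0, 30.0],
    [30.5, 30.5, 30.5, 30.5, 30.5],
    [29.5, 29.5, 29.5, 29.5, 29.5],
    [30.0, 30.0, 30.0, 30.0, 30.0],
]

print(f"{average_turn_time(sample_passes):.2f} s per turn")
```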

Hitman 2 [DX12]

For these next few games, we won't produce charts until our first review using these methods. You've already seen these before, so not much has changed.

Hitman 2 returns unchanged. Our first benchmarks with Hitman 2 were done using DX11 due to reports of instability in the DX12 version, but we’re tired of explaining that, and the DX12 benchmark usually manages to complete without crashing anyway. Our version of this benchmark has some tweaked settings to increase CPU load as much as possible, like maxing out Simulation Quality and Sound Simulation Quality.

Grand Theft Auto V [DX11]

By force of popularity, Grand Theft Auto V remains in the suite. It’s one of our oldest and most reliable benchmarks, and one that remains CPU-bound.

Assassin’s Creed: Origins [DX11]

This isn’t the newest Assassin’s Creed game, but we’ve kept it around as another representative of DirectX 11 and a title that produces reliable, CPU-bound results. 

Shadow of the Tomb Raider [DX12]

Shadow of the Tomb Raider at 1440p has been eliminated from the suite, but 1080p remains. It’s still difficult to create a CPU bottleneck at 1440p in many modern games without using the lowest possible graphical settings. The SotTR 1080p test remains mostly unchanged, but we’re sticking to build number 294 for this round of tests.

7-Zip

21 7zip decompression

We’ve tested 7-Zip in the past, but we’ve now updated to version 19.00. The baked-in benchmark is run three times and the results are averaged.

Chaos Group V-Ray

22 vray bench ksamples

We’re now using V-Ray version 4.10.07, which has switched to scoring based on “ksamples” rather than time to render, similar to the way Cinebench scores in cb marks rather than seconds. More is now better rather than worse on this chart.

Blender

We’re using the same render scenes as always for Blender testing, namely the monkey head scene and a frame of the GN logo from our video intro, both created by Andrew. We’ve now updated to Blender version 2.81, which we use for both renders.

Power Testing

23 power blender

We’ve been doing power testing in one form or another on CPUs for a long time now, but we’ve now officially made it part of the standard benchmarking process. The charts won’t look much different in the end, but it means we won’t keep forgetting to run the tests and having to reassemble the test bench afterwards. We test using a current clamp directly over the CPU 12V power cable while running Cinebench R15 single-threaded and multi-threaded, Cinebench R20 single-threaded and multi-threaded, and F1 2019 at 1080p. We then allow Blender to render a specially-designed scene and spot-check the amperage after five minutes.
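As a rough illustration of how a clamp reading on the 12V CPU cable relates to power, the sketch below multiplies measured current by rail voltage (P = V × I). It assumes a nominal 12.0 V rail unless a measured voltage is supplied; the function name and the sample readings are hypothetical, and the article does not state whether rail voltage is measured alongside current.

```python
NOMINAL_RAIL_VOLTAGE = 12.0  # assumption: nominal value, real rails vary slightly


def eps12v_power_watts(current_amps, rail_voltage=NOMINAL_RAIL_VOLTAGE):
    """Convert a current-clamp reading on the CPU 12V cable into watts."""
    if current_amps < 0:
        raise ValueError("current must be non-negative")
    return current_amps * rail_voltage


# Hypothetical spot check: 10.5 A read five minutes into a Blender render.
print(eps12v_power_watts(10.5))  # 126.0
```

Note that this is power delivered over the cable to the motherboard's VRM, so it includes VRM conversion losses and is not the same as power consumed by the CPU package itself.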

Conclusion

We’re not going to try to squash this down into a consumable quote in a conclusion. If you want to know how we test things, read the article. Most people jump to this section for a shortcut, but there’s no shortcut to testing methods.

Thanks for your interest! Stay tuned for the upcoming reviews using this testing.

Editorial, Testing: Patrick Lathan
Editorial, Test Lead: Steve Burke
Video: Keegan Gallick, Andrew Coleman

Last modified on May 02, 2020 at 11:45 pm
