We set time aside at AMD’s Threadripper & Vega event to speak with leading architects and engineers at the company, including Corporate Fellow Mike Mantor. The conversation eventually became one that we figured we’d film, as we delved deeper into discussion on small primitive discarding and methods to cull unnecessary triangles from the pipeline. Some of the discussion is generic – rules and concepts applied to rendering overall – while some gets more specific to Vega’s architecture.
The interview was sparked from talk about Vega’s primitive shader (or “prim shader”), draw-stream binning rasterization (DSBR), and small primitive discarding. We’ve transcribed large portions of the first half below, leaving the rest in video format. GN’s Andrew Coleman used Unreal Engine and Blender to demonstrate key concepts as Mantor explained them, so we’d encourage watching the video to better conceptualize the more abstract elements of the conversation.
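For readers who want something concrete to hold onto, here is a minimal Python sketch of the generic idea behind culling and small primitive discarding. It is not Vega's hardware logic and nothing here comes from Mantor's explanation: the function names, the counter-clockwise front-face convention, and the one-sample-per-pixel-center assumption are all ours, chosen for illustration. A triangle is thrown away if it faces away from the camera (or has zero area), or if its screen-space bounding box slips between pixel centers so that it could never light up a sample.

```python
import math

def signed_area(tri):
    """Signed area of a 2D screen-space triangle (positive = counter-clockwise)."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return 0.5 * ((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))

def misses_all_pixel_centers(tri):
    """True if the triangle's bounding box contains no pixel center.

    Assumption: one sample per pixel, located at (i + 0.5, j + 0.5).
    This is a conservative bounding-box test, not an exact coverage test.
    """
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    first_x = math.ceil(min(xs) - 0.5)   # first pixel column whose center is >= min x
    last_x = math.floor(max(xs) - 0.5)   # last pixel column whose center is <= max x
    first_y = math.ceil(min(ys) - 0.5)
    last_y = math.floor(max(ys) - 0.5)
    return first_x > last_x or first_y > last_y

def should_discard(tri):
    """Hypothetical discard rule: back-facing, degenerate, or too small to hit a sample."""
    if signed_area(tri) <= 0:            # assumed convention: CCW is front-facing
        return True
    return misses_all_pixel_centers(tri)

big_front = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
big_back = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0)]    # same triangle, reversed winding
tiny_miss = [(1.6, 1.6), (1.9, 1.6), (1.6, 1.9)]   # sits between pixel centers
tiny_hit = [(1.4, 1.4), (1.7, 1.4), (1.4, 1.7)]    # box contains center (1.5, 1.5)

for name, tri in [("big_front", big_front), ("big_back", big_back),
                  ("tiny_miss", tiny_miss), ("tiny_hit", tiny_hit)]:
    print(name, "discard" if should_discard(tri) else "keep")
```

The point of doing this early is that every triangle discarded here never costs rasterizer or pixel shader time later in the pipeline.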
Every now and then, a content piece falls by the wayside and is archived indefinitely -- or just lost under a mountain of other content. That’s what happened with our AMD Ryzen pre-launch interview with Sam Naffziger, AMD Corporate Fellow, and Michael Clark, Chief Architect of Zen. We interviewed the two leading Zen architects at the Ryzen press event in February, were placed under embargo on releasing the interview, and then simply had too many other content pieces to make a push for this one.
The interview discusses the uOp cache on Ryzen CPUs, power optimizations, shadow tags, and the victim cache. Parts of the interview have been transcribed below, though you’ll have to check the video for discussion on L1 writeback vs. writethrough cache designs and AMD’s shadow tags.
“Disillusioned and confused” could describe much of the response to initial AMD Vega: Frontier Edition testing and reviews. The card’s market positioning is somewhat confusing, possessing neither the professional-level driver certification nor the gaming-level price positioning. This makes Vega: FE ($1000) a very specifically placed card that, like the Titan Xp, doesn’t exactly look like the best price:performance argument for a large portion of the market. But that’s OK – it doesn’t have to be, and it’s not trying to be. The thing is, though, that AMD’s Vega architecture has been so long hyped, so long overdue, that users in our segment are looking for any sign of competition with nVidia’s high-end. It just so happens that, largely thanks to AMD’s decision to go with “Vega” as the name of its first Vega arch card, the same users saw Vega: FE as an inbound do-all flagship.
But it wasn’t really meant to compete under those expectations, it turns out.
Today, we’re focusing our review efforts most heavily on power, thermals, and noise, with the heaviest focus on power and thermals. Some of this includes power draw vs. time charts, like when Blender is engaged in long render cycles, and other tests include noise-normalized temperature testing. We’ve also got gaming benchmarks, synthetics (FireStrike, TimeSpy), and production benchmarks (Maya, 3DS Max, Blender, Creo, Catia), but those all receive less focus than our primary thermal/power analysis. This focus is because the thermal and power behavior can be extrapolated most linearly to Vega’s future supplements, and we figure it’s a way to offer a unique set of data for a review.
NVidia’s Volta GV100 GPU and Tesla V100 Accelerator were revealed yesterday, delivering on a 2015 promise of Volta arrival by 2018. The initial DGX servers will ship by 3Q17, containing multiple V100 Accelerator cards at a cost of $150,000, with individual units priced at $18,000. These devices are obviously for enterprise, machine learning, and compute applications, but will inevitably work their way into gaming through subsequent V102 (or equivalent) chips. This is similar to the GP100 launch, where we get the Accelerator server-class card prior to consumer availability, which ultimately helps consumers by recuperating some of the initial R&D cost through major B2B sales.
Our third and final interview featuring Scott Wasson, current AMD RTG team member and former EIC of Tech Report, has just gone live with information on GPU architecture. This video focuses more on a handful of reader and viewer questions, pooled largely from our Patreon backer Discord, with the big item being “GPU IPC.” Patreon backer “Streetguru” submitted the question, asking why a ~1300-1400MHz RX 480 could perform comparably to an ~1800MHz GTX 1060 card. It’s a good question – it’s easy to say “architecture,” but to learn more about the why aspect, we turned to Wasson.
The main event starts at 1:04, with some follow-up questions scattered throughout Wasson’s explanation. We talk about pipeline stage length and its impact on performance, wider versus narrower machines with frequencies that match, and voltage “spent” on each stage.
We’ll leave this content piece primarily to video, as Wasson does a good job of conveying the information quickly.
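As a rough back-of-envelope companion to the “wider versus narrower machines” point, the Python sketch below multiplies shader count by clock to get peak FP32 throughput. The shader counts (2304 for the RX 480, 1280 for the GTX 1060 6GB) are the publicly listed specs and are not from the interview; the two-FLOPs-per-shader-per-clock figure assumes one fused multiply-add per cycle; the function names are ours. Treat this as paper math only, not a performance prediction.

```python
def peak_fp32_gflops(shader_count, clock_mhz, flops_per_clock=2):
    """Theoretical peak FP32 throughput in GFLOPS.

    Assumption: each shader retires one fused multiply-add (2 FLOPs) per clock.
    """
    return shader_count * flops_per_clock * clock_mhz / 1000.0

def clock_to_match_mhz(target_gflops, shader_count, flops_per_clock=2):
    """Clock (MHz) a machine of the given width would need to hit target_gflops."""
    return target_gflops * 1000.0 / (shader_count * flops_per_clock)

# Wider machine at a lower clock (RX 480, public spec: 2304 stream processors).
rx480 = peak_fp32_gflops(2304, 1300)
# Narrower machine at a higher clock (GTX 1060 6GB, public spec: 1280 CUDA cores).
gtx1060 = peak_fp32_gflops(1280, 1800)

print(f"RX 480   @ 1300 MHz: {rx480:.1f} GFLOPS peak")
print(f"GTX 1060 @ 1800 MHz: {gtx1060:.1f} GFLOPS peak")
print(f"Clock for a 1280-wide part to match on paper: "
      f"{clock_to_match_mhz(rx480, 1280):.0f} MHz")
```

On paper the wider, slower part comes out ahead, yet the question that prompted the interview notes the two cards perform comparably in practice. That gap between theoretical width-times-clock and delivered performance is exactly the “architecture” part Wasson addresses in the video.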
Between its visit to the White House and Intel’s annual Investor Day, we’ve collected a fair bit of news regarding Intel’s future.
Beginning with the former, Intel CEO Brian Krzanich elected to use the White House Oval Office as the backdrop for announcing Intel’s plans to bring Fab 42 online, with the intention of preparing the fab for 7nm production. Based in Chandler, Arizona, Fab 42 was originally built between 2011 and 2013, but Intel shelved plans to finalize the fab in 2014. The rebirth of the Arizona-based factory is expected to support up to 10,000 jobs, with completion projected in 3-4 years. Additionally, Intel is prepared to invest as much as $7 billion to up-fit the fab for its 7nm manufacturing process, although little is known about said process.
AMD’s Vega GPU architecture has received cursory details pertaining to high-bandwidth caching, an iterative step to CUs (NCUs), and a unified-but-not-unified memory configuration.
Going into this, note that we’re still not 100% briefed on Vega. We’ve worked with AMD to try and better understand the architecture, but the details aren’t fully organized for press just yet; we’re also not privy to product details at this time, which would be those more closely associated with shader counts, memory capacity, and individual SKUs. Instead, we have some high-level architecture discussion. It’s enough for a start.
Taiwan Semiconductor Manufacturing Co. (TSMC) has set its sights on building a new $15.7 billion facility geared towards future 5 and 3 nanometer process nodes. TSMC is the world’s biggest chip maker by revenue, accounting for 55% of the market share. TSMC’s deep-pocketed clients include Qualcomm, nVidia, and Apple, whose iPhone 7 launch was especially pivotal in the record quarter-to-quarter profits TSMC has been reporting, as TSMC produces the A10 processor for the iPhone 7.
Taiwan Semiconductor houses its base of operations in Northern Taiwan, where several of their fabs are located. This is in addition to leading-edge fabs in Southern Taiwan and Central Taiwan, not to mention manufacturing bases in China.
Abstraction layers that sit between the game code and hardware create transactional overhead that worsens software performance on CPUs and GPUs. This has been a major discussion point as DirectX 12 and Vulkan have rolled out to the market, particularly with DOOM's successful implementation. Long-standing API incumbent Dx 11 sits unmoving between the game engine and the hardware, preventing developers from leveraging specific system resources to efficiently execute game functions or rendering.
Contrary to this, it is possible, for example, to optimize tessellation performance by making explicit changes in how its execution is handled on Pascal, Polaris, Maxwell, or Hawaii architectures. A developer could accelerate performance by directly commanding the GPU to execute code on a reserved set of compute units, or could leverage asynchronous shaders to process render tasks without getting “stuck” behind other instructions in the pipeline. This can't be done with higher-level APIs like Dx 11, but DirectX 12 and Vulkan both allow this lower-level hardware access; you may have seen this referred to as “direct to metal,” or “programming to the metal.” These phrases reference that explicit hardware access, and have historically been used to describe what Xbox and PlayStation consoles enable for developers. It wasn't until recently that this level of support came to PC.
In our recent return trip to California (see also: Corsair validation lab tour), we visited AMD's offices to discuss shader intrinsic functions and performance acceleration on GPUs by leveraging low-level APIs.
This episode of Ask GN (#28) addresses the concept of HBM in non-GPU applications, primarily concerning its imminent deployment on CPUs. We also explore GPU Boost 3.0 and its variance within testing when working on the new GTX 1080 cards. The question of Boost's functionality arose as a response to our EVGA GTX 1080 FTW Hybrid vs. MSI Sea Hawk 1080 coverage, and asked why one 1080 was clock-dropping differently from another. We talk about that in this episode.
Discussion begins with proof that the Cullinan finally exists and has been sent to us – because it was impossible to find, after Computex – and carries into Knights Landing (Intel) coverage for MCDRAM, or “CPU HBM.” Testing methods are slotted in between, with an explanation of why some hardware choices are made when building a test environment.