July 20, 2017
By: Michael Feldman
While a number of commentators have written off AMD’s prospects of competing against Intel in HPC, testing of the latest server silicon from each chipmaker has revealed that the EPYC chip offers some surprising performance advantages over Intel’s newest “Skylake” Xeon destined for the datacenter.
Since Intel integrated the 512-bit Advanced Vector Extensions (AVX-512) feature into its new Xeon Skylake scalable processor (SP) platform, the platform can theoretically double floating-point performance (and integer performance) compared to its previous “Broadwell” generation Xeon line. The latter chips supported vector widths of only 256 bits. With EPYC, AMD decided to forgo any extra-wide vector support, implementing its floating-point unit with a modest 128-bit capability. That leaves it with a distinct disadvantage on vector-friendly codes.
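To make the width argument concrete, here is a minimal sketch of how vector width translates into theoretical per-instruction throughput. It assumes 64-bit double-precision elements and, purely for illustration, equal clocks, core counts, and numbers of vector units across chips, none of which holds exactly for the real parts. The function names are hypothetical.

```python
# Illustrative only: relates vector width to theoretical per-instruction
# throughput. Real peak FLOPS also depends on clock speed, core count, and
# the number of vector units per core, which are assumed equal here.

def lanes(vector_bits: int, element_bits: int = 64) -> int:
    """Number of elements (default: 64-bit doubles) in one vector register."""
    return vector_bits // element_bits


def relative_peak(vector_bits: int, baseline_bits: int) -> float:
    """Theoretical throughput of one vector width relative to a baseline width."""
    return lanes(vector_bits) / lanes(baseline_bits)


# Vector widths as described in the article.
widths = {
    "EPYC (128-bit FP unit)": 128,
    "Broadwell Xeon (256-bit)": 256,
    "Skylake-SP Xeon (AVX-512)": 512,
}

for name, bits in widths.items():
    print(f"{name}: {lanes(bits)} doubles per vector, "
          f"{relative_peak(bits, 256):.1f}x relative to 256-bit")
```

Under those assumptions, each doubling of the vector width doubles the number of elements processed per instruction, which is where the theoretical 2x figure for AVX-512 over 256-bit vectors comes from.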
However, the majority of HPC codes don’t take advantage of AVX-512 today, since prior to Skylake the only platform that supported it was Intel’s “Knights Landing” Xeon Phi, a processor specifically designed for vector-intensive software. Many HPC applications could certainly be enhanced to use the extra-wide vectors, although for others, like sparse matrix codes, it may not be worth the trouble. In any case, adding AVX-512 support to the code base will be done one application at a time.
Without the performance boost from extra-wide vector instructions, the theoretical floating-point advantage of the new Xeon over the AMD EPYC processor disappears. At least that is what can be concluded from testing done by the gang over at Anandtech. They recently ran a series of floating-point-intensive tests, among others, pitting the EPYC 7601 (32 cores, 2.2 GHz) against the comparable Xeon Platinum 8176 (28 cores, 2.1 GHz). Both are considered high-bin server chips from their respective product lines.
The testing comprised benchmarks based on three real-world codes: C-ray, a ray-tracing code that runs out of L1 cache; POV-Ray, a ray-tracing code that runs out of L2 cache; and NAMD, a molecular dynamics code that requires consistent use of main memory. The tests were performed on dual-socket servers running Ubuntu Linux.
Somewhat surprisingly, the EPYC processor outran the Xeon in all three floating-point benchmarks. For C-ray, the 7601 delivered about 50 percent more renders than the 8176 in a given amount of time, while for POV-Ray, the 7601 scored a more modest 16 percent performance advantage. For NAMD, Anandtech used two implementations, a non-AVX version and an AVX version that uses Intel’s compiler vectorization smarts (but not specifically for AVX-512). In both cases, the EPYC processor prevailed: by 41 percent with the older implementation, and by 22 percent with AVX turned on. Anandtech’s conclusion was that even though “the Zen FP might not have the highest ‘peak FLOPS’ in theory, there is lots of FP code out there that runs best on EPYC.”
It’s worth noting that Anandtech also performed a “Big Data” benchmark, in which the Xeon edged the EPYC by a little less than 5 percent. In this case, the test was a collection of five Spark-based codes, which measured mostly integer performance and memory accesses. In general, the EPYC processors should do better on data-demanding codes due to their superior memory bandwidth, but it was not clear how memory-intensive these particular codes were. It would be interesting to see how these two architectures match up on in-memory database benchmarks.
Execution speed aside, AMD silicon looks even more attractive when you consider price-performance. The Xeon 8176 lists for $8,719, while the EPYC 7601 is priced at $4,000. With the Xeon line, you could move up to a faster clock (2.5 GHz) with the top-of-the-line 8180 for around $10,000, or move down to the Xeon 8160 (same clock, 24 cores) for $4,700. But either way, AMD looks to be undercutting Intel on price for comparably performing server silicon.
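As a rough back-of-the-envelope check, the list prices can be combined with the floating-point results reported by Anandtech to estimate a performance-per-dollar ratio. The sketch below assumes that chip list price is the only cost that matters (ignoring memory, motherboards, power, and volume discounts) and treats the rounded percentages quoted in this article as exact. The names used are hypothetical.

```python
# Back-of-the-envelope price-performance comparison using figures quoted
# in the article. Chip list price is assumed to be the only cost.

# List prices in USD per chip.
PRICE = {"EPYC 7601": 4000, "Xeon 8176": 8719}

# EPYC 7601 speed relative to the Xeon 8176 (Xeon = 1.00), taken from the
# rounded percentage advantages reported for each benchmark.
EPYC_SPEEDUP = {
    "C-ray": 1.50,
    "POV-Ray": 1.16,
    "NAMD (non-AVX)": 1.41,
    "NAMD (AVX)": 1.22,
}


def perf_per_dollar_ratio(speedup: float) -> float:
    """EPYC performance per dollar divided by Xeon performance per dollar."""
    return speedup * PRICE["Xeon 8176"] / PRICE["EPYC 7601"]


for bench, speedup in EPYC_SPEEDUP.items():
    print(f"{bench}: EPYC delivers {perf_per_dollar_ratio(speedup):.2f}x "
          f"the performance per dollar")
```

On these assumptions, the EPYC 7601 comes out somewhere between roughly 2.5 and 3.3 times the floating-point performance per dollar of the Xeon 8176 across the four results. The price gap alone accounts for a factor of about 2.2.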
Of course, if an application can take full advantage of AVX-512, the performance advantage would shift to Intel. (Perhaps not a price-performance advantage though.) One other thing to consider is that for AVX-512-friendly codes, the Xeon Phi itself offers the best performance and price, not to mention energy efficiency. The only caveat here is that threads on the Xeon Phi execute about 1 GHz slower than on their Xeon counterparts, so if single-threaded performance is critical to some portion of your code or codes, you’re going to take a pretty significant performance hit.
In a discussion posted on Facebook earlier this week, Forrest Norrod, AMD’s SVP and GM of Enterprise, Embedded & Semi-Custom Products, said he was pleased with how the company’s new server chip is positioned against its rival. He made particular mention of the favorable floating-point performance, noting “the results on EPYC have been tremendous, head-to-head, against the competitor.”
He went on to explain that while the EPYC design team considered implementing a wide vector capability, they felt it was too expensive in terms of die area and power requirements to load down the CPU with such a capability. Instead they opted for a more general-purpose floating-point unit, plumbed with dedicated FP pipes to improve performance.
Also part of the Facebook discussion was AMD Engineering Fellow Kevin Lepak, who explained that another reason for keeping the EPYC floating-point unit more generalized was AMD’s GPU computing product line, which essentially fulfills the role of a dedicated vector processor. The company felt it didn’t make much sense to duplicate this capability in its CPU platform as long as it was offering both. As noted earlier, Intel made the exact opposite decision, vis-à-vis its Xeon and Xeon Phi lines.
Norrod and Lepak also delved into the rationale for implementing EPYC as a multi-chip module (MCM) processor, rather than as a monolithic chip, as Intel has done with its Skylake Xeons. A 32-core EPYC processor, for example, is composed of four eight-core dies glued together with the Infinity Fabric. Intel has been critical of AMD for its MCM approach, claiming it hinders performance at various choke points. AMD counters that it’s a more effective way to get its extra-large feature set (eight memory channels, 128 PCIe lanes, built-in encryption, and so on) into the processor, while also serving to lower costs via increased manufacturing yields.
None of these technical arguments amount to much for customers, who will be focused on performance, price-performance and performance-per-watt across their own applications. If AMD can deliver superior numbers on even two of these criteria, Intel will likely lose its 90 percent-plus market share in HPC for the first time in nearly ten years. And that would be a true EPYC event.