Nov. 15, 2016
Just as the choice of processor architectures in supercomputing is expanding with GPUs, FPGAs, ARM and Power, memory is beginning to diversify as well. Novel technologies like 3D XPoint, resistive RAM/memristors, and 3D memory stacks are already starting to work their way into the hands of HPC users. At SC16 this week in Salt Lake City, one of the Friday panels, “The Future of Memory Technology for Exascale and Beyond IV,” will delve into this subject more deeply.
In advance of the conference, we spoke with panel moderator Richard Murphy to offer some context for the upcoming discussion. Murphy, who directs the Advanced Computing Solutions Pathfinding team in Micron’s Advanced Computing Solutions Group, has devoted a lot of attention to data-centric supercomputing over the course of his career. He spent much of that time as a researcher and academic, including a long stint at Sandia National Laboratories. In this interview, he offers his perspective on what is driving the current thinking in memory technologies as the HPC community reaches toward exascale computing.
TOP500 News: In a nutshell, can you describe the memory wall problem, especially with regard to supercomputing applications?
Richard Murphy: The performance of all computers is generally dominated by the memory and storage system. While different measures have been proposed over the years, the memory wall refers to the relative divergence between processing power, the processor’s memory management sophistication, and memory performance. Computers are becoming less and less balanced: they support ever more compute with relatively less ability to move data in support of that compute, because the community has focused on compute performance rather than application performance, even though application performance is what really matters. Supercomputers are particularly susceptible to this problem because their application base tends to have large, unstructured data sets compared to commercial workloads. In a nutshell, we bought better calculators when we should have bought memory systems that are better at moving data.
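To put rough numbers on that imbalance, here is a back-of-envelope, roofline-style calculation. The peak-compute and bandwidth figures are illustrative assumptions, not values from the interview.

```python
# Roofline-style sketch of the memory wall, using assumed (not quoted)
# machine numbers: 10 TFLOP/s of peak compute and 500 GB/s of memory
# bandwidth per node.
peak_flops = 10e12   # flop/s
mem_bw = 500e9       # bytes/s

# Arithmetic intensity (flops per byte of memory traffic) at which the
# machine stops being memory-bound.
balance_point = peak_flops / mem_bw
print(f"machine balance point: {balance_point:.0f} flops/byte")

# A streaming triad, a[i] = b[i] + s * c[i], performs 2 flops per
# 24 bytes of double-precision traffic.
triad_intensity = 2 / 24
achievable = min(peak_flops, triad_intensity * mem_bw)
print(f"triad bound: {achievable / 1e9:.0f} GFLOP/s "
      f"({100 * achievable / peak_flops:.2f}% of peak)")
```

Under these assumptions a bandwidth-bound kernel reaches well under one percent of the machine’s nominal peak, which is the wall Murphy is describing.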
TOP500 News: Why is the problem receiving so much attention now?

Murphy: We’ve hit the perfect storm of an application base that’s shifting away from simulation and towards large-scale unstructured data analytics, and a point in the CMOS technology curve where we simply won’t get improved transistor performance “for free” anymore. The exponential scaling we saw at essentially constant energy up until 2003, when Dennard scaling ended, meant that a lot of fundamental computer architecture questions could be ignored. Now that the free ride is over, the architecture knob is the only one left to turn, and the memory system is the biggest source of potential gains left. It’s quite simply where the action is.
TOP500 News: For future systems – pre-exascale and exascale supercomputers – the driving criterion looks to be energy-efficient data movement. How does one pursue that, while at the same time demanding greater bandwidth and lower latency?
Murphy: The fundamental technology is capable of doing a lot, and as we face scaling challenges with Moore’s Law, the question will be less “how many transistors are there?” or “how fast can you clock them?” than “how do you use them?” What we’re fundamentally seeing is a commoditization of the processor, and to get energy-efficient application performance we’ll have to build balanced systems rather than focusing on one-dimensional performance criteria like peak FLOPS.
TOP500 News: How does the co-design approach help inform the design of the memory subsystem, especially when you consider that different types of supercomputing applications, like informatics and physics simulations, have different behavior, and thus requirements, with regard to memory usage? Do we need to develop distinct architectures for each?
Murphy: I’m personally a big proponent of the “unified architecture” and believe that physics and informatics applications share a lot in common from a data movement perspective. Economically, I think traditional HPC risks being left behind by informatics – the data analytics market surpassed HPC in size last year and has a CAGR of 28 to 32 percent, compared to HPC’s CAGR of around 8 percent. While 8 percent is a great growth rate, the analytics market isn’t going to slow down and the HPC community risks falling behind on a technology base that they developed and should be leading. NSCI objective number two, which looks for commonality between the two approaches, could be achieved somewhat backwards – by the HPC community adopting technologies developed for informatics because informatics problems push the memory system harder.
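For a sense of how quickly the growth rates Murphy cites diverge, the quick compound-growth projection below normalizes both markets to the same size at the crossover point he mentions; the interview does not give absolute dollar figures, so the starting value is an assumption.

```python
# Compound-growth projection using the CAGRs cited above: ~30% for data
# analytics (midpoint of 28-32%) versus ~8% for HPC. Both markets are
# normalized to 1.0 at the crossover point.
def project(cagr, years, start=1.0):
    return start * (1 + cagr) ** years

for years in (5, 10):
    ratio = project(0.30, years) / project(0.08, years)
    print(f"after {years:2d} years, analytics/HPC size ratio ~ {ratio:.1f}x")
```

Even from equal starting points, the analytics market ends up roughly two and a half times larger after five years and over six times larger after ten, which is the economic pressure behind Murphy’s point.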
I do see a potential for some application bases to support more highly-customized compute environments. We’re seeing processors, GPUs, and their associated memory systems customized for different workloads – mobile, server, machine learning, etc. This process is likely to continue and, in my opinion, accelerate as we demand more from the architecture. Co-design enables system-level tradeoffs to be considered rather than myopically focusing on one aspect of performance, like how many floating point units are available.
TOP500 News: What new types of memory technologies and memory hierarchies will help us address these challenges?
Murphy: I don’t think there’s a single silver bullet technology. System balance is going to matter as we optimize for performance and energy, and that balance will often be workload dependent. This is especially true of supercomputers, large-scale data center applications, and (ironically) mobile applications, where energy and performance constraints are really challenging. This is a big shift for the memory industry, which has historically focused on turning out a small number of part types at the lowest cost possible. We’ve started to see a proliferation of parts optimized for different use cases, as well as emerging memory technologies like 3D XPoint memory.
For example, GPGPUs are used heavily in machine learning and memories like GDDR5X are already evolving in support of both graphics and these application bases. We’re also seeing ASICs being deployed in the data center for tensor analysis and other important application bases. These types of analytics place different demands on the memory system and will drive their own memory hierarchies.
Given the economics behind data analytics, I believe there will be strong drivers for memory hierarchies capable of exploring large-scale, unstructured data. That pushes toward a fundamentally more capable memory system, which, when combined with the energy constraints of exascale systems, will bring the system more into balance.
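A simple counting argument illustrates why unstructured data pushes the memory system so much harder than dense simulation kernels. The figures below are textbook flop-and-byte tallies, not measurements.

```python
# Arithmetic-intensity comparison: a dense kernel versus an irregular,
# analytics-style kernel. These are counting arguments only.

# Dense matrix multiply (n x n doubles): 2*n^3 flops over roughly
# 3*n^2*8 bytes of compulsory traffic when well blocked.
n = 4096
gemm_intensity = (2 * n**3) / (3 * n**2 * 8)

# Sparse matrix-vector product in CSR: ~2 flops per nonzero against
# ~12 bytes per nonzero (8-byte value + 4-byte column index), plus
# irregular gathers into the source vector that caches handle poorly.
spmv_intensity = 2 / 12

print(f"dense GEMM : {gemm_intensity:7.1f} flops/byte")
print(f"CSR SpMV   : {spmv_intensity:7.2f} flops/byte")
```

The dense kernel can be made compute-bound almost anywhere; the irregular one is limited almost entirely by how well the memory system can feed it.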
TOP500 News: Which technology do you think is the most important?
Murphy: Any technology that meaningfully increases performance while simultaneously reducing energy.
There are generally two approaches. The first is to address the problem with technology and architecture – specialized memories like GDDR5X do precisely this. The second is to move to a hierarchical system, which generally requires more work on the part of the programmer or the hardware.
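As a sketch of what that extra programmer work looks like, the snippet below stages a large array through a small fast tier one tile at a time; the tier size and the sum-of-squares kernel are invented purely for illustration.

```python
# Minimal sketch of explicit tiering: data living in a large "slow"
# tier is staged through a small "fast" tier tile by tile.
import numpy as np

FAST_TIER_BYTES = 1 << 20                 # pretend fast tier: 1 MiB
TILE = FAST_TIER_BYTES // 8               # doubles that fit at once

slow_data = np.random.rand(10 * TILE)     # resident in the slow tier

total = 0.0
for start in range(0, slow_data.size, TILE):
    tile = np.array(slow_data[start:start + TILE])   # explicit copy-in
    total += float(np.sum(tile * tile))              # compute on the fast copy

# The flat-address-space alternative is simply:
#     total = float(np.sum(slow_data * slow_data))
# The hierarchy shifts that staging burden onto the programmer or the
# hardware in exchange for bandwidth and energy.
print(total)
```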
Advanced nonvolatile memory technologies like 3D XPoint memory that enable very large address spaces and reap the benefit of sparsity of access will provide tremendous energy benefits while simultaneously easing the burden on the programmer, which is particularly important for adoption.
Finally, I think that emerging capabilities in near-memory processing and in enabling the memory system to better manage data objects and their placement in support of the processor will prove absolutely critical to future platforms.
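To see why near-memory processing matters for energy, a simple data-movement model helps: count the bytes that must cross the processor-memory link when a filter-and-sum runs on the host versus near the memory. The data size and selectivity below are assumptions chosen for illustration.

```python
# Data-movement model for near-memory processing: bytes crossing the
# processor-memory link for a filter-and-sum over 1 billion 8-byte
# values. The size and 1% selectivity are illustrative assumptions.
n_elements = 1_000_000_000
elem_bytes = 8
selectivity = 0.01        # fraction of values passing the filter

host_traffic     = n_elements * elem_bytes                     # stream everything to the CPU
filtered_traffic = int(selectivity * n_elements) * elem_bytes  # ship only the survivors
reduced_traffic  = elem_bytes                                  # ship only the final sum

print(f"filter+sum on the host : {host_traffic / 1e9:8.1f} GB over the link")
print(f"filter near memory     : {filtered_traffic / 1e9:8.1f} GB over the link")
print(f"filter+sum near memory : {reduced_traffic:8d} bytes over the link")
```

The more of the operation that executes next to the data, the less traffic crosses the link, which is where most of the energy goes.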
TOP500 News: Here we are in 2016, presumably just five years or so away from our first exascale systems. Will the memory solutions in the pipeline, real or imagined, meet the criteria set out for these machines? Are there challenges that remain?
Murphy: The answer is a resounding yes to both questions. It’s an exciting time to be a computer architect and to be working in memory systems, and there are advanced technologies emerging to address these specific challenges. On the flip side, five years is a short time, and the challenges are still big. The goal isn’t just to create an exascale platform, it’s to create a capable exascale platform. That generally means that within the constraints of cost and energy, which is really a proxy for operational cost, we’re looking for the highest performance on a set of applications and the best system programmability. I think the last point’s an important one, and it is often lost. Much of addressing all three challenges really boils down to building the most capable memory system, rather than worrying about peak processor performance, even though the latter tends to get the most attention.
Murphy’s SC16 panel will be held on Friday, November 18, 8:30-10:00am in the Salt Palace Convention Center in Salt Lake City.