April 9, 2018
By: Michael Feldman
The US Department of Energy (DOE) has announced a Request for Proposals (RFP) to develop at least two new exascale supercomputers for the DOE at a cost of up to $1.8 billion. The RFP was issued under the moniker of CORAL-2, representing the second phase of procurements for exascale machinery for Oak Ridge, Argonne, and Lawrence Livermore national labs.
“These new systems represent the next generation in supercomputing and will be critical tools both for our nation’s scientists and for US industry,” U.S. Secretary of Energy Rick Perry said in a prepared statement announcing the RFP. “They will help ensure America’s continued leadership in the vital area of high performance computing, which is an essential element of our national security, prosperity, and competitiveness as a nation.”
The RFP covers the first exascale computers for Oak Ridge National Laboratory (ORNL) and Lawrence Livermore National Laboratory (LLNL), with a possible third system to be deployed at Argonne National Laboratory (ANL). The ANL option would be either an upgrade to the Intel/Cray-built Aurora system, which is slated to be the first exascale machine up and running in the US, or an entirely new supercomputer. The RFP appears to leave it up to ANL whether or not it wants to make an award under this contract. The deployment timeline for these new systems begins in the third quarter of 2021, with ORNL’s exascale supercomputer, followed by a third-quarter 2022 system installation at LLNL. The ANL addition or upgrade, if it happens, will also take place in the third quarter of 2022.
Putting aside the strange tale of the Aurora system, the choice of Oak Ridge and Lawrence Livermore for these first exascale systems is pretty straightforward, inasmuch as they represent two of the Energy department’s premier national labs – ORNL for the Office of Science, and LLNL for the National Nuclear Security Administration (NNSA). ORNL currently has the country’s fastest supercomputer, in Titan, a 27-petaflop (peak) system that is about to be superseded by the 200-petaflop Summit machine. LLNL operates the nation’s number two system, the 20-petaflop Sequoia.
Not only will the two new systems under the RFP issued this week be 50 to 100 times more powerful than Titan and Sequoia, they will also be expected to run applications in the emerging areas of machine learning and data analytics. This is not merely for checking off the AI and big data boxes for the sake of appearing relevant. The DOE believes it needs to address what it refers to as “the emerging convergence of data science, machine learning, and simulation science,” and already has plans in motion to use these new application approaches for things like uncertainty quantification, simulation optimization, and nuclear threat identification, among others.
Besides the requirement for exaflop-level performance and new applications, the DOE is also putting an upper limit on power consumption. According to the RFP, the new systems can’t exceed 40 MW, with the preferred power draw in the 20 to 30 MW range. At first blush, that might seem rather forgiving, since the original idea was to cap these exascale machines at 20 MW. But in this case the DOE is also counting storage, cooling, and any other auxiliary equipment in these power measurements. As a result, all of these subsystems – computation, storage, and cooling – will need to deliver state-of-the-art efficiency at their respective tasks.
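To put those power limits in perspective, here is a rough back-of-the-envelope sketch of the whole-system efficiency they imply. It assumes a nominal target of exactly one exaflop (the RFP's actual performance metric is not spelled out here), and the function name is purely illustrative.

```python
# Illustrative arithmetic only. Assumption: a nominal 1-exaflop target.
EXAFLOP = 1e18  # floating-point operations per second


def gflops_per_watt(flops: float, megawatts: float) -> float:
    """Return efficiency in gigaflops per watt for a given total power draw."""
    watts = megawatts * 1e6
    return flops / watts / 1e9


# The RFP's preferred range (20 to 30 MW) and its hard cap (40 MW).
for mw in (20, 30, 40):
    print(f"{mw} MW -> {gflops_per_watt(EXAFLOP, mw):.1f} GFLOPS/W")
```

Because the power figure covers storage, cooling, and auxiliary equipment as well, the compute hardware alone would have to do better than these whole-system numbers.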
The other critical requirement is that the ORNL and ANL systems are architecturally diverse from one another. For now, that means the Oak Ridge system can’t be like Aurora, an Intel-based machine composed of as-yet unknown componentry. Lawrence Livermore has more leeway, since it can match whatever either ORNL or ANL chooses or select a different system altogether.
Right now, ORNL and LLNL are installing identical systems, architecturally speaking, for their pre-exascale machines of Summit and Sierra, respectively. Both are IBM-built supercomputers powered by Power9 CPUs and NVIDIA Tesla V100 GPUs. Given Oak Ridge’s long history with GPU acceleration, which began with Titan in 2012, it seems unlikely that this particular lab would switch architectures for its first exascale machine.
Lawrence Livermore, on the other hand, is less invested in GPUs, with Sierra representing the lab’s first big commitment to the CPU-GPU model. Sequoia is an unadorned Power-based Blue Gene system, and the lab has a long history with Blue Gene and other IBM Power-based machines that goes back nearly two decades. ANL also has an extended history of Blue Gene and other IBM deployments, but has procured Intel Xeon and Xeon Phi-based supercomputers at times as well.
Given all that, IBM, Cray, Intel, and Hewlett Packard Enterprise (HPE) will be the most likely bidders on this RFP. Dell and Penguin Computing both have the capability to build exascale systems in the same timeframe, but don’t have a history of delivering leadership-class supercomputers to the DOE – so they are pretty much long shots. Bidders are allowed to submit two proposals, one for the Oak Ridge machine and the other for Lawrence Livermore, which will be the basis for the potential selection by Argonne as well.
IBM will almost certainly be bidding a Power10-based supercomputer, likely accelerated by whatever top-of-the-line Tesla GPUs NVIDIA is hawking in the 2020 timeframe – so basically an updated version of the Summit and Sierra platform. One would expect at least HDR InfiniBand (200 Gbps) as the system interconnect, with NDR InfiniBand (400 Gbps?) a possibility.
Cray will surely propose a Shasta system, which is the company’s upcoming exascale architecture and the one that Aurora is based on. Unlike Aurora though, Cray needn’t necessarily rely on Intel parts for the componentry and could team up with NVIDIA, AMD, or even Cavium (ARM) on the processor side. The system network fabric is up for grabs since Cray intends to support a variety of interconnects on Shasta.
Intel’s bid is probably going to resemble its Aurora machine, but could include some updated processors, interconnect technology, and memory components. Since not much is known about the specific Aurora hardware at this point, that could mean almost anything, but a likely design would include a manycore Xeon (which the company may brand as a Xeon Phi), the second-generation Omni-Path fabric integrated with silicon photonics, and some combo of conventional memory, 3D high bandwidth memory and non-volatile memory based on 3D XPoint technology.
For its bid, HPE is probably going to propose some version of its memory-centric architecture, formally known as “The Machine.” As reflected by this nomenclature, the company is going to be pushing the envelope on the memory side, with some flavor of NVDIMMs as the central feature, even if they’re not based on memristors. Since Gen-Z appears to be central to HPE’s memory-centric ambitions, one would expect them to specify that technology for the system interconnect. As with Cray, the processor mix could be almost anything.
Proposals are due in May and, according to the RFP, the bidders will be selected before the end of the second quarter. Acquisition contracts will follow in 2019. Keep in mind there is currently no funding in place to procure any of these systems, each of which is expected to cost between $400 million and $600 million. That’s going to be up to the administration and the US Congress as the fiscal years for 2021 and 2022 come into play.
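As a quick sanity check on how the per-system estimate squares with the $1.8 billion ceiling, the sketch below multiplies out the range for two and three systems. It assumes every system, including a possible ANL machine, falls in the same $400 million to $600 million band; an Aurora upgrade could well cost something different, and the names used here are illustrative only.

```python
# Illustrative arithmetic only, in millions of US dollars.
COST_LOW_M, COST_HIGH_M = 400, 600  # per-system estimate cited in the article
CAP_M = 1800                        # the "up to $1.8 billion" program ceiling


def total_cost_range(n_systems: int) -> tuple[int, int]:
    """Return (low, high) total cost in $M if every system falls in the band."""
    return n_systems * COST_LOW_M, n_systems * COST_HIGH_M


# Two systems are the minimum under the RFP; three includes the ANL option.
for n in (2, 3):
    low, high = total_cost_range(n)
    print(f"{n} systems: ${low}M to ${high}M (within cap: {high <= CAP_M})")
```

Under that assumption, three systems at the top of the range land exactly on the $1.8 billion figure.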