Dec. 10, 2016
By: Michael Feldman
The US Department of Energy’s Exascale Computing Project (ECP) has changed its timeline for getting the first post-petascale system into the field. The new goal is to get an initial exascale system deployed sometime in 2021, with acceptance nine months after that. That shrinks the schedule significantly and puts the country back on a more competitive trajectory with regard to China and Japan.
According to Paul Messina, who heads up the ECP effort, the new plan came about as a result of internal discussions over the summer within the DOE. The compressed timeline was ratified on November 17th, Messina told TOP500 News. From his perspective, the more ambitious plan is “a signal to the vendor community that there is room for more options,” although he wouldn’t speculate on the agency’s rationale for compressing the timeline.
The project has come under fire, both within the agency and from industry watchers, over concerns that the original ECP roadmap would cede supercomputing leadership to other nations with more aggressive exascale programs. China has plans to deploy its first exascale system in 2020, while Japan’s original plan to bring its first system online in 2019/2020 was recently pushed back by at least a year. France, meanwhile, is scheduled to get its initial exascale system deployed at CEA, the French atomic energy agency, sometime in 2020.
Although the first US system will be deployed a full year earlier than the original date, the second exascale system is still scheduled for deployment in 2022, with a 2023 acceptance date. Originally both systems – or more precisely, at least two systems – were on track for 2022/2023 deployment/acceptance. Keep in mind, the ECP effort itself does not entail the acquisition of these exascale systems by the DOE, just the R&D and NRE support efforts necessary to propel those designs to realization.
In another departure from the original plan, the 2021 system is going to be based on “a novel architecture.” Asked what exactly that meant, Messina indicated that the design would be based on technologies and architectures not represented in the current TOP500 systems. That opens the possibility for an ARM-based design or perhaps an AMD APU-powered platform. Either would have been considered a long shot before the change in direction, but now has an opportunity to go to the head of the line. Of course, there could be other emerging technologies in the mix as well, such as resistive RAM (ReRAM), silicon photonics, or perhaps even an optical computer along the lines of what Optalysys is developing.
Which brings up the prospect that this first exascale system may end up as a one-off supercomputer, affectionately known as a “stunt machine.” The idea behind such a system would be to deploy a machine capable of exascale computing, but not necessarily something that would have a commercial future. Historically, it’s been difficult to avoid such a fate with these milestone supercomputers. The first teraflop system, ASCI Red (Intel), and the first petaflop system, Roadrunner (IBM), were both systems that became dead ends from a commercial standpoint. By the time ASCI Red was deployed, Intel had already shut down its supercomputer division. And a year after Roadrunner came online, the development of the processor upon which it was based, the PowerXCell 8i, was abandoned by IBM.
Nobody at the DOE would likely admit to such an intention for the first exascale system, and no one building it would dismiss it as something with no commercial future. But the fact is it’s easier to build a machine without the constraints imposed by commercial acceptance. Nevertheless, both ASCI Red and Roadrunner had long productive lives and paved the way for technology that would establish itself in more mainstream supercomputers – x86 processors in the case of ASCI Red, and floating point accelerators in the case of Roadrunner.
The other, perhaps more charitable interpretation of the motivation at the DOE is that the agency simply wanted a more powerful system sooner than was being promised, and didn’t care if that system was a “mainstream” supercomputer. The two most likely architectures under consideration were, up until now, the IBM Power/NVIDIA GPU hybrid platform and an Intel machine based on some future version of Xeon Phi and Omni-Path, undergirded by the chipmaker’s silicon photonics technology. Given the change of plans, support for at least one of these platforms under the ECP now seems to be in doubt.
It’s impossible to know what was going on behind the scenes at the DOE, but it’s possible the powers that be have soured on one or both of these architectures and/or become attracted to a more exotic design or set of technologies. We might get some indication of the direction of the desired hardware when the first FastForward contract is announced by the ECP in January. Whether any money gets tossed to vendors with novel technologies remains to be seen. Messina said there will be a second FastForward RFP announced sometime in the next fiscal year, and one would surmise that proposal would explicitly call for these more exotic solutions.
In addition to the call for a 2021 machine, the ECP is also chopping three years off the original 10-year project. For the most part that means the tail end of the effort, in which the proxy exascale applications were to be tested, tuned, and accepted on the deployed machines, will be done after the project concludes in 2023. That’s not a substantial change, since that work will still be performed, just not under the purview of the ECP.
All of the other pieces of the project will go forward as planned, specifically, the co-design work, the system software development, and the application development. Obviously much of this work will now have to be completed a year sooner to coincide with the deployment of the first system, but the basic framework of the project remains in place.
Whatever one may think of the prospects for novel HPC architectures, the ECP now looks a lot more interesting than it did when Messina reported on the status of the work last month at SC16. Not only should the US get an exascale system up and running a year sooner, it will also house more unique technologies than might have been the case otherwise. As Messina correctly noted, “we’re being more ambitious.”