Nov. 14, 2017
By: Michael Feldman
Summit, the most powerful supercomputer in the United States, is currently under construction at the US Department of Energy’s Oak Ridge National Lab (ORNL). ORNL director Thomas Zacharia updates us on its status and reveals the opportunities the new machine will provide scientists when it comes online next summer.
Zacharia, who was named lab director in June, takes the reins at ORNL just as Summit is poised to become the number one system in the United States, and with any luck, the fastest supercomputer in the world. The current champ, by Linpack standards, is the Sunway TaihuLight, a 93-petaflop supercomputer installed at the National Supercomputing Center in Wuxi, China. Summit is slated to hit at least 200 peak petaflops when completed next summer, and if China or anyone else can’t double up on the Sunway machine’s flops in the interim, the US will recapture the number one spot on the next TOP500 list. If that happens, it will be the first time the US has occupied the top position since 2012.
Currently the most powerful supercomputer at Oak Ridge is Titan, a five-year-old machine that, at 27 peak petaflops, is the fastest in the US and ranked fifth in the world. It’s powered by AMD Opteron CPUs and NVIDIA K20x GPUs. When it was purchased, GPU accelerators were pretty much a new thing for top-flight HPC machinery, and Oak Ridge established itself as something of a pioneer in developing supercomputing applications for this CPU-GPU architecture. With Summit they will continue the heterogeneous tradition, but this time with IBM’s upcoming Power9 processors and NVIDIA’s latest GPU accelerator, the V100.
With more than 25,000 V100 GPUs, Summit will not only top 200 peak petaflops in double precision math (64-bit floating point) for traditional HPC simulations, but will also deliver more than 3 exaflops for the 32-bit/16-bit mixed arithmetic employed by machine learning codes. Oak Ridge is well aware of the opportunity this presents, and according to Zacharia, plans to leverage the V100 extensively for this new application category.
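As a rough sanity check on those figures, the quoted totals can be divided by the quoted GPU count to see what they imply per accelerator. The sketch below uses only the numbers in this article, treats the "more than" lower bounds as exact, and assumes all of the flops come from the GPUs (the Power9 CPUs would contribute some as well), so the results are approximations rather than official V100 specifications. The function and constant names are illustrative.

```python
# Back-of-envelope arithmetic using only the figures quoted in the article.
# Assumptions: the quoted lower bounds are treated as exact, and every flop
# is attributed to the GPUs (CPU contribution is ignored).

GPU_COUNT = 25_000         # "more than 25,000 V100 GPUs"
PEAK_FP64_PETAFLOPS = 200  # "top 200 peak petaflops" in double precision
PEAK_MIXED_EXAFLOPS = 3    # "more than 3 exaflops" in 32-bit/16-bit mixed math


def per_gpu_teraflops(total_flops: float, gpus: int) -> float:
    """Convert a system-wide flops total into teraflops per GPU."""
    return total_flops / gpus / 1e12


fp64_per_gpu = per_gpu_teraflops(PEAK_FP64_PETAFLOPS * 1e15, GPU_COUNT)
mixed_per_gpu = per_gpu_teraflops(PEAK_MIXED_EXAFLOPS * 1e18, GPU_COUNT)
mixed_to_fp64_ratio = mixed_per_gpu / fp64_per_gpu

print(f"Implied double-precision peak per GPU: {fp64_per_gpu:.1f} teraflops")
print(f"Implied mixed-precision peak per GPU:  {mixed_per_gpu:.1f} teraflops")
print(f"Mixed-precision advantage:             {mixed_to_fp64_ratio:.0f}x")
```

Under these assumptions, the article's numbers work out to roughly 8 teraflops per GPU in double precision and about 120 teraflops per GPU in mixed precision, a ratio of about 15 to 1, which is why the machine learning figure is so much larger than the traditional HPC one.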
“Many of our users are already developing sophisticated machine learning algorithms that, when combined with traditional modeling and simulation, will give us entirely new capabilities,” Zacharia told TOP500 News. “One team from Oak Ridge is working on a machine learning algorithm for Summit to help select the best treatment for cancer in a given patient. We have a team using Titan today to develop the deep learning tools to design and monitor fusion reactors. Another team is using machine learning to help classify types of neutrino tracks seen in experiments.”
The melding of HPC simulations with machine learning is emerging as one of the more important developments in supercomputing for the coming decade, and with Summit, Oak Ridge is positioned to be a leading center for some of the most ambitious work in this area. Of course, the system will also be tackling more traditional supercomputing applications, and Zacharia points to codes like ACME (climate modeling), DIRAC (relativistic quantum chemistry), FLASH (astrophysics), NWCHEM (computational chemistry), GTC (plasma physics), and NAMD (biophysics), which are expected to get a huge performance boost from Summit’s power.
These applications, and others, will be ported to the Power9/V100 platform as part of an effort known as the Center for Accelerated Application Readiness (CAAR) program. It will involve “redesigning, porting, and optimizing application codes for Summit’s hybrid CPU–GPU architecture.” The development teams for these applications will get early access to Summit and receive support from the new IBM/NVIDIA Center of Excellence at ORNL.
The installation of Summit appears to be proceeding at a steady pace. All of the cabinets are installed, and most of the interconnects are now wired. NVIDIA has been shipping the V100 GPUs for a while now, so there shouldn’t be any holdup on the accelerator side. And even though IBM hasn’t officially launched the Power9 processors, the company expects they will be in production before the end of the year.
But this is the first time this Power9/V100 platform will be assembled, so it’s going to be a somewhat protracted installation. According to Buddy Bland, project director at the Oak Ridge Leadership Computing Facility, they expect the compute nodes to start arriving in late fall, but he doesn’t think all of the hardware will be onsite until February or March of next year, with formal acceptance scheduled for late summer 2018. Though the timeline seems unusually long, Bland maintains that this is pretty much in line with other large system installations.
“It takes a long time to install and test 4600 nodes,” explained Bland. “Once the hardware can get through the diagnostic tests, then we have to start testing the system software and scaling that out to the size of the system. Again, there isn’t any other place to test this, so there will be several months of testing and debugging the software before the system will be ready to run the acceptance test.”
When all of the dust settles, ORNL and its users will have at their disposal a truly unique machine, which Zacharia believes will advance scientific discovery for years to come. “Oak Ridge National Laboratory today is in a tremendously enviable position,” he noted. “We have signature strengths in materials, in neutrons, in computing, nuclear science and engineering. And we can apply that and the tremendous talent that we have to solve challenging problems in energy, in national security, in manufacturing, and in grid cyber security. So our opportunity is to put it all together and challenge ourselves to take the next step and be that premier research institution in the world. That, in and of itself, is exciting.”