Dec. 19, 2016
By: Michael Feldman
Cray’s recent announcement that its XC50 supercomputer, paired with Microsoft’s Cognitive Toolkit, was used to scale up the training of a neural network serves as a proof point for how topflight HPC technologies can push the boundaries of deep learning. But the company’s long game is to bring supercomputing into the realm of deep learning, and the broader category of data analytics, in a more generalized fashion.
The XC50 in question is “Piz Daint,” the 9.8-petaflop system housed at the Swiss National Supercomputing Centre (CSCS), which was recently upgraded with 4,500 of NVIDIA’s new P100 GPUs. Each of those GPUs can deliver up to 5.3 teraflops of double precision floating point performance, 10.6 teraflops at single precision, or 21.2 teraflops at half precision. Those lower precision flops make these devices especially useful for deep learning codes, where 64-bit floating point operations are, for the most part, overkill.
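As a back-of-the-envelope illustration of what those per-GPU figures add up to across Piz Daint’s 4,500 accelerators, the sketch below uses only the peak numbers quoted above; these are theoretical peaks, not sustained application performance, and the throughput doubles with each step down in precision:

```python
# Aggregate theoretical peak for Piz Daint's 4,500 P100s, using the per-GPU
# peak figures quoted above. Real training throughput is far lower than peak.
num_gpus = 4500
peak_tflops_per_gpu = {"double (FP64)": 5.3, "single (FP32)": 10.6, "half (FP16)": 21.2}

for precision, tflops in peak_tflops_per_gpu.items():
    total_pflops = num_gpus * tflops / 1000.0   # 1 petaflop = 1,000 teraflops
    print(f"{precision}: ~{total_pflops:.0f} petaflops aggregate peak")
```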
Despite all this concentrated performance, training neural networks is so computationally intensive that many GPUs are often needed to train these models in a reasonable amount of time – minutes or hours, rather than weeks or months. The problem is that with conventional compute clusters and first-generation deep learning frameworks, it’s hard to scale these applications without running into some rather daunting bottlenecks.
According to Mark Staveley, Cray’s director of deep learning and machine learning, the consensus from the research literature in this area is that once you get beyond about 200 “workers,” you start to encounter diminishing returns. “In some of the deep learning codes, you have to perform an exchange of data at each step of the calculation,” explains Staveley. “And if you have eight GPUs worth of data trying to be exchanged over a small network pipe, then that becomes a challenge when you get to large scale.”
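In most data-parallel training schemes, the exchange Staveley describes amounts to an allreduce of the gradients at every step. The sketch below illustrates that pattern with mpi4py and NumPy; the gradient size and the choice of library are illustrative assumptions, not details of the Piz Daint setup:

```python
# Minimal sketch of the per-step gradient exchange in data-parallel training.
# Each rank ("worker") computes gradients on its own shard of the mini-batch,
# then all ranks average them with an allreduce before updating the weights.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
num_workers = comm.Get_size()

# Stand-in for locally computed gradients; the 25M element size is an assumption.
local_grads = np.random.rand(25_000_000).astype(np.float32)
avg_grads = np.empty_like(local_grads)

# Every worker sends and receives the full gradient buffer at every step;
# this is the traffic that overwhelms a "small network pipe" at large scale.
comm.Allreduce(local_grads, avg_grads, op=MPI.SUM)
avg_grads /= num_workers

# ...apply avg_grads to the local copy of the model weights...
```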
For supercomputing vendors like Cray, building systems that can deliver linear performance scaling (“strong scaling”) on applications is one of their specialties. It involves not just high-end hardware, but high-end software as well. In this case, Cray, Microsoft and CSCS were able to demonstrate scaling to more than 1,000 GPUs on Piz Daint using an image recognition training application applied to the ImageNet database.
To achieve this, a multi-pronged approach was used. The team employed ResNet, a training model that provides a few shortcuts compared to a typical convolutional neural network; Microsoft’s Cognitive Toolkit, a deep learning framework that supports MPI; and Cray’s native implementation of MPI targeted to its high performance Aries network. The use of MPI was a critical element in achieving scalability, since the engineers were able to optimize the communication pattern across the XC50’s Aries network by tuning various MPI parameters. That enabled the training to execute in a reasonable amount of time even on a neural network with 152 layers.
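The communication volume involved helps explain why that tuning matters. A 152-layer ResNet has very roughly 60 million parameters (an approximate figure used here only for illustration), which translates into a sizable gradient buffer that every worker must exchange at every step:

```python
# Rough per-step gradient traffic for a ResNet-152-scale model.
# The ~60M parameter count is an approximation used only for illustration.
params = 60_000_000
bytes_per_value = 4                    # FP32 gradients
per_step_mb = params * bytes_per_value / 1e6
print(f"Each worker exchanges roughly {per_step_mb:.0f} MB of gradients per training step")
```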
Another element that helped this effort is the relatively large amount of network bandwidth per processor, or in this case, coprocessor. The XC50 has a single GPU per node, which means the internode communication is not being overwhelmed by data from multiple graphics coprocessors on the server blade. Thanks to this high ratio of network bandwidth to GPUs, the powerful P100, and the high performance Aries network, the XC50 can offer a path to scale in a way that others can’t, claims Staveley.
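A simplified way to see the effect: all of the accelerators on a node share that node’s injection point into the network, so packing more GPUs per node multiplies the time spent just pushing gradients onto the wire. The numbers below (per-node bandwidth and gradient size) are assumptions chosen only to show the ratio, not measured Aries figures:

```python
# Simplified comparison of gradient injection time for 1 vs. 8 GPUs per node.
# The 10 GB/s per-node bandwidth and 0.24 GB gradient buffer are assumptions
# for illustration; this ignores the details of the allreduce algorithm itself.
assumed_node_bandwidth_gb_s = 10.0
gradient_gb = 0.24

for gpus_per_node in (1, 8):
    seconds = gpus_per_node * gradient_gb / assumed_node_bandwidth_gb_s
    print(f"{gpus_per_node} GPU(s) per node: ~{seconds * 1000:.0f} ms just to inject gradients")
```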
For Cray though, the bigger picture is that its customers are increasingly looking to add analytics and other data-intensive codes to their scientific workflows. Data analytics has been described as the “Fourth Paradigm” of science, the first three being empirical description, theoretical modeling, and computational simulation. Deep learning is just one instance of this latest paradigm, but one that has drawn a significant amount of attention to the fact that these applications are enabling a whole new set of capabilities for scientists and engineers.
Image: The Fourth Paradigm, Data-Intensive Scientific Discovery
Cray, of course, has built a successful business out of the third paradigm of computational simulation and is in the process of developing a portfolio for the fourth. The XC50 is just its latest offering in this area; CS-Storm, its accelerator-equipped cluster platform, and Urika-GX, a CPU-powered system for graph analytics, are the other platforms aimed at data-intensive work.
With the addition of the multi-faceted P100 hardware, the XC50 is able to provide a supercomputing platform for both traditional computational simulation and analytics based on neural networks. While that might seem like an esoteric type of flexibility, HPC users are starting to add these neural nets and other types of data-intensive applications to their standard technical and business workflows. “I really feel like we’re in that transition between the third and fourth paradigm,” says Staveley.
In fact, a few instances of workflows that fuse the third and fourth paradigms have started to appear. One of these is DeepChem, a drug discovery package that combines computational chemistry code with machine learning. The latter capability is used for geometry optimization for candidate drug molecules. Likewise, PGS, an oil & gas services company, employs machine learning for its reservoir imaging phase of seismic analysis. There has also been a good deal of interest in applying neural networks for weather forecasting, using image recognition to predict rainfall patterns, rather than relying entirely on computational models. Asset portfolio analysis is another area actively being explored for this combined workflow.
As it has done with computational modeling, Cray will stake out the high end of the analytics space. The good news here is that a significant chunk of this work, and especially the rapidly growing application set defined by neural networks, is as computationally demanding as traditional HPC workloads. When applications outrun conventional hardware, customers turn to vendors who can design around those limitations. That places differentiated HPC providers like Cray in an enviable position, especially if the company’s vision of a third/fourth paradigm fusion comes to pass. Time will tell.