By: Matthew Ziegler, Director Lenovo Neptune® & Sustainability, Lenovo
Years ago, when artificial intelligence (AI) began to emerge as a technology that could change the way the world works, organizations began to kick the AI tires by exploring its potential to enhance their research or business. To get started, however, neural networks had to be built and trained on data sets, and that required processors well suited to the computationally demanding matrix-multiplication calculations involved. Enter the accelerator.
As companies and research institutions began purchasing systems loaded with accelerators, it was often the high-performance computing (HPC) teams that were handed the box to deploy, manage and troubleshoot. These systems consumed far more energy than a standard CPU-only system. They needed high-speed networks. They generated a lot of heat. And they could be integrated into larger HPC clusters that leveraged workload management. That was then. Now, we are watching as large AI factories are being assembled that consume enormous amounts of energy and require innovative power and cooling solutions. Additionally, as AI adoption increases, data centers need to be re-engineered to consume even more resources without harming the surrounding ecosystem or negatively affecting local communities.
Looking back, the convergence of AI and HPC started with just hardware integration and management. The systems were designed to do entirely different things. HPC relied on double-precision floating-point operations for accuracy. AI didn’t need that level of precision and could use the single-precision, half-precision, and integer formats that accelerators were designed to deliver. Interconnects were different too: InfiniBand versus Ethernet. But overall, the systems could be integrated and managed similarly, which put HPC system administrators on the front lines of AI and HPC convergence.
Today, the convergence is happening inside the system. Our understanding of how best to design systems, with ideal PCIe routing, external networking, and memory sizing for data sets, has improved considerably. AI and HPC systems now look very similar. That design convergence was showcased recently at Lenovo Tech World 2024, when Lenovo Chairman and CEO Yuanqing Yang and NVIDIA CEO Jensen Huang shared the stage to unveil the Lenovo ThinkSystem SC777 V4 Neptune® server. This system boasts an NVIDIA GB200-based design featuring two Grace processors and four Blackwell GPUs in one vertical 21-inch (53 centimeter) compute tray. The SC777 V4 is housed in the Lenovo ThinkSystem N1380 Neptune® chassis, a 13U enclosure featuring eight tray slots, up to four Titanium 15kW power conversion stations (which can be connected directly to the facility power source) and a 100% liquid-cooled Neptune® environment. Power is supplied directly to an internal 48V busbar, making the enclosure ready for high-powered HPC and AI workloads. In total, three ThinkSystem N1380 enclosures fit comfortably in a standard 42U rack. That adds up to 24 trays, 48 processors and 96 accelerators in two standard floor tiles.
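The rack-level totals follow directly from the per-tray and per-enclosure figures. A minimal sketch of that arithmetic is shown below; the constant and function names are illustrative only, and the values are the ones quoted in this article.

```python
# Figures quoted in the article; names are hypothetical, chosen for illustration.
ENCLOSURE_HEIGHT_U = 13      # N1380 enclosure height in rack units
TRAYS_PER_ENCLOSURE = 8      # tray slots per N1380 enclosure
ENCLOSURES_PER_RACK = 3      # enclosures in a standard 42U rack
GRACE_PER_TRAY = 2           # Grace processors per SC777 V4 tray
BLACKWELL_PER_TRAY = 4       # Blackwell GPUs per SC777 V4 tray


def rack_totals(enclosures: int = ENCLOSURES_PER_RACK) -> dict:
    """Return rack-level counts for a given number of enclosures."""
    trays = enclosures * TRAYS_PER_ENCLOSURE
    return {
        "rack_units_used": enclosures * ENCLOSURE_HEIGHT_U,
        "trays": trays,
        "processors": trays * GRACE_PER_TRAY,
        "accelerators": trays * BLACKWELL_PER_TRAY,
    }


totals = rack_totals()
print(totals)
```

Three 13U enclosures occupy 39U, which is why they fit in a 42U rack with a few rack units to spare.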
With four PCS units supporting N+1 operation and 120% oversubscription, the total power capacity for the enclosure reaches 54kW DC. Given a 96% peak AC/DC conversion efficiency and a near-perfect power factor of 99% above 50% load, the total apparent power is 58 kVA. When you place three enclosures in a rack, you can achieve a total power density of 162kW DC or approximately 175 kVA!
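The arithmetic behind these figures can be sketched roughly as follows. This is an illustration under stated assumptions, not Lenovo's sizing method: it assumes N+1 means three of the four 15kW PCS units carry the load, and that apparent power is DC power divided by conversion efficiency and then by power factor. The function names are hypothetical. With the peak 96% efficiency this estimate lands slightly below the quoted 58 kVA (about 57 kVA per enclosure), which would be consistent with the quoted figure assuming efficiency somewhat below its peak.

```python
def enclosure_dc_kw(pcs_count: int = 4, pcs_kw: float = 15.0,
                    redundant: int = 1, oversubscription: float = 1.2) -> float:
    """DC capacity with N+1 redundancy and oversubscription (assumed model)."""
    active = pcs_count - redundant          # units assumed to carry the load
    return active * pcs_kw * oversubscription


def apparent_kva(dc_kw: float, efficiency: float = 0.96,
                 power_factor: float = 0.99) -> float:
    """Apparent power drawn from the facility for a given DC load."""
    ac_kw = dc_kw / efficiency              # real AC power before conversion loss
    return ac_kw / power_factor             # apparent power in kVA


dc = enclosure_dc_kw()
kva = apparent_kva(dc)
print(f"Enclosure: {dc:.0f} kW DC, about {kva:.1f} kVA at peak efficiency")
print(f"Rack of 3: {3 * dc:.0f} kW DC, about {3 * kva:.1f} kVA at peak efficiency")
```

The DC side reproduces the 54kW per enclosure and 162kW per rack stated above; the kVA side is sensitive to the efficiency assumed, so it should be read as an estimate.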
At first glance, this system looks very much suited for AI. In fact, it will run AI workloads extremely well. The HPC influence is hidden in the scale-out design features that are built into each compute node. When NVIDIA announced its GB200 NVL72 system, one prominent feature was the NVLink network natively designed into the backplane to allow for full NVLink scalability across all nodes. With the SC777 V4, Lenovo has instead leveraged traditional InfiniBand networking, which makes the system well suited for HPC workloads. This system supports the next-generation NVIDIA Quantum-X800 InfiniBand and Spectrum-X800 Ethernet platforms for high-performance accelerated networking. In addition, it supports NVIDIA AI Enterprise, a cloud-native software platform that streamlines development and deployment of production-grade AI solutions, including generative AI, computer vision, and speech AI. The ThinkSystem SC777 V4 Neptune® offers the best of both the AI and HPC worlds!
Lenovo Neptune® has led the world in liquid cooling technology for more than a decade, pioneering a method to remove heat and reduce power consumption compared to traditional air-cooled systems. With more than 40 patents and the most widely used liquid cooling platform around the world, Lenovo has fine-tuned its liquid cooling technology with revolutionary engineering. It utilizes state-of-the-art materials, including custom brazed copper water loops and patented CPU cold plates, for full system water-cooling. Unlike systems that use low-quality FEP plastic, Neptune® features durable stainless steel and reliable EPDM hoses.
With advanced Neptune® water cooling, the SC777 V4 allows critical components to operate at lower temperatures and effectively removes all heat from all components, including GPUs, memory, I/O, local storage, and voltage regulators. Combined, these innovations enable HPC workloads and AI training along with real-time LLM inference for models scaling up to 10 trillion parameters while increasing performance, energy-efficiency, and reliability in a compact, ultra-quiet system.
Lenovo coupled its industry-leading approach to liquid cooling with its drive for innovation to bring one of the most powerful HPC and AI platforms to the market. The ThinkSystem SC777 V4 provides:
- Maximized Standard Form Factor: Engineered from the ground up to provide the additional space needed without breaking data center standards, the N1380 chassis maintains scalability to make the newest and highest performing technologies available to customers of every size.
- Reduced Power Consumption: The unique design eliminates the need for internal airflow and power-consuming fans, reducing power consumption by up to 40% compared with similar air-cooled systems in a typical data center. Rack insulation also significantly minimizes radiant heat.
- Cutting-Edge Power Conversion Station (PCS): Beyond the Neptune® water-cooling infrastructure, the N1380 enclosure houses up to four ThinkSystem 15kW Titanium PCSs, supplying internal system power to a 48V busbar. This innovative design merges power conversion, rectification, and distribution into a single PCS, a departure from traditional setups that demand separate units, resulting in best-in-class efficiency.
- Reliable Cooling: N1380 features an integrated manifold with a patented blind-mate mechanism and aerospace-grade dripless connectors to the compute trays for safer operation.
- Complete Heat Removal: With enhanced waterflow safety, the efficient connections facilitate 100% heat removal.
- Power Repurposed: Neptune® is designed to operate at water inlet temperatures as low as the dew point allows and as high as 45°C, which eliminates the need for additional chilling and allows the generated heat to be reused efficiently for building heat or for cold-water generation through adsorption chilling.
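The inlet-temperature window described in the last item (no colder than the dew point, no warmer than 45°C) can be illustrated with a short check. This is a sketch under stated assumptions: the dew point is estimated with the Magnus approximation, and the function names, the example room conditions, and the absence of any safety margin are illustrative choices rather than Lenovo specifications.

```python
import math

MAX_INLET_C = 45.0  # upper inlet limit quoted in the article


def dew_point_c(air_temp_c: float, rel_humidity_pct: float) -> float:
    """Estimate the dew point (°C) with the Magnus approximation."""
    a, b = 17.62, 243.12  # commonly used Magnus coefficients (b in °C)
    gamma = math.log(rel_humidity_pct / 100.0) + a * air_temp_c / (b + air_temp_c)
    return b * gamma / (a - gamma)


def inlet_ok(inlet_c: float, air_temp_c: float, rel_humidity_pct: float) -> bool:
    """True if the inlet water is at or above the dew point and at or below 45°C."""
    return dew_point_c(air_temp_c, rel_humidity_pct) <= inlet_c <= MAX_INLET_C


# Hypothetical data hall at 25°C and 50% relative humidity.
print(round(dew_point_c(25.0, 50.0), 1))
print(inlet_ok(40.0, 25.0, 50.0))
```

Water colder than the dew point would risk condensation on the loop, which is why the lower bound depends on room conditions rather than being a fixed number.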
The industry is leaps and bounds ahead of where we started in the early days of running AI workloads. Since its infancy, AI has grown into a powerhouse, churning through millions of inference and training data sets, using more power, and producing more heat in the data center. With these increasing demands, Lenovo has remained two steps ahead, delivering solutions that generate the outputs our customers need to do business in the most innovative and energy-efficient ways. To learn more, visit the Neptune®, HPC and AI sites.
Attending SC24? You can catch up with Lenovo at the industry’s leading HPC technical conference in Atlanta, Georgia from November 18 – 21, 2024. Visit booth #2201 at the Georgia World Congress Center for booth theater sessions and interactive demos. Also, be sure to register for our Lenovo Innovation Forums.