NVIDIA Ethernet Networking Accelerates World’s Largest AI Supercomputer, Built by xAI

NVIDIA today announced that xAI’s Colossus supercomputer cluster in Memphis, Tennessee, comprising 100,000 NVIDIA Hopper Tensor Core GPUs, achieved its massive scale by using the NVIDIA Spectrum-X™ Ethernet networking platform for its Remote Direct Memory Access (RDMA) network. Spectrum-X is designed to deliver superior performance to multi-tenant, hyperscale AI factories using standards-based Ethernet.

Colossus, the world’s largest AI supercomputer, is being used to train xAI’s Grok family of large language models, with chatbots offered as a feature for X Premium subscribers. xAI is in the process of doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs.

The supporting facility and state-of-the-art supercomputer were built by xAI and NVIDIA in just 122 days; systems of this size typically take many months to years. It took 19 days from the time the first rack rolled onto the floor until training began.

While training the extremely large Grok model, Colossus achieves unprecedented network performance. Across all three tiers of the network fabric, the system has experienced zero application latency degradation or packet loss due to flow collisions. It has maintained 95% data throughput enabled by Spectrum-X congestion control.

This level of performance cannot be achieved at scale with standard Ethernet, which creates thousands of flow collisions while delivering only 60% data throughput.
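To put the two throughput figures side by side, the short sketch below applies the 95% and 60% values quoted above to a single hypothetical link. The 800 Gb/s link speed and the `effective_gbps` helper are assumptions made for illustration only. They are not measurements from Colossus, and the calculation ignores protocol overhead and traffic patterns.

```python
def effective_gbps(link_gbps: float, throughput_fraction: float) -> float:
    """Usable bandwidth on a link that sustains a given fraction of line rate."""
    if not 0.0 <= throughput_fraction <= 1.0:
        raise ValueError("throughput_fraction must be between 0 and 1")
    return link_gbps * throughput_fraction


# Assumption: an illustrative 800 Gb/s link, not a stated Colossus link speed.
LINK_GBPS = 800.0

# Throughput fractions as quoted in the article.
spectrum_x_gbps = effective_gbps(LINK_GBPS, 0.95)
standard_ethernet_gbps = effective_gbps(LINK_GBPS, 0.60)

# Relative gain of the higher throughput figure over the lower one.
ratio = spectrum_x_gbps / standard_ethernet_gbps

print(f"Spectrum-X (95%):        {spectrum_x_gbps:.0f} Gb/s")
print(f"Standard Ethernet (60%): {standard_ethernet_gbps:.0f} Gb/s")
print(f"Ratio:                   {ratio:.2f}x")
```

Under these assumptions, the gap between the two figures works out to roughly 1.6 times the usable bandwidth per link, independent of the link speed chosen.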

“AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency,” said Gilad Shainer, senior vice president of networking at NVIDIA. “The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions.”

“Colossus is the most powerful training system in the world,” said Elon Musk on X. “Nice work by xAI team, NVIDIA and our many partners/suppliers.”

“xAI has built the world’s largest, most-powerful supercomputer,” said a spokesperson for xAI. “NVIDIA’s Hopper GPUs and Spectrum-X allow us to push the boundaries of training AI models at a massive scale, creating a super-accelerated and optimized AI factory based on the Ethernet standard.”

At the heart of the Spectrum-X platform is the Spectrum SN5600 Ethernet switch, which supports port speeds of up to 800Gb/s and is based on the Spectrum-4 switch ASIC. xAI chose to pair the Spectrum-X SN5600 switch with NVIDIA BlueField-3® SuperNICs for unprecedented performance.

Spectrum-X Ethernet networking for AI brings advanced features that deliver highly effective and scalable bandwidth with low latency and short tail latency, previously exclusive to InfiniBand. These features include adaptive routing with NVIDIA Direct Data Placement technology, congestion control, as well as enhanced AI fabric visibility and performance isolation — all key requirements for multi-tenant generative AI clouds and large enterprise environments.


Credit: NVIDIA

Danit
