NVIDIA Networks NVLink
I attended several sessions at last week's Hot Chips, and NVIDIA's NVSwitch talk was a standout. Ashraf Eassa did a great job of covering the talk's contents in an NVIDIA blog, so I will focus on analysis here.
SuperPOD Bids Adieu to InfiniBand
From a system-architecture perspective, the biggest change is extending NVLink beyond a single chassis. NVLink Network is a new protocol built on the NVLink4 link layer. It reuses 400G Ethernet cabling to enable passive-copper (DAC), active-copper (AEC), and optical links. To build its DGX H100 SuperPOD, NVIDIA designed a 1U switch system around a pair of NVSwitch chips. The pod ("scalable unit") includes a central rack with 18 NVLink Switch systems, connecting 32 DGX H100 nodes in a two-level fat-tree topology. This pod interconnect yields 460.8Tbps of bisection bandwidth.
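That figure checks out with a little arithmetic. The sketch below is a back-of-the-envelope estimate, not from the talk; it assumes each DGX H100 exposes 72 NVLink4 links to the NVLink Network, each carrying two lanes at 100Gbps per direction.

    # Back-of-the-envelope check of the 460.8Tbps pod figure.
    # Assumption (not stated above): each DGX H100 exposes 72 NVLink4 links
    # to the NVLink Network, each running 2 lanes x 100Gbps per direction.
    nodes = 32
    links_per_node = 72          # assumed external NVLink4 links per DGX H100
    lanes_per_link = 2
    gbps_per_lane = 100
    per_node_gbps = links_per_node * lanes_per_link * gbps_per_lane   # 14,400
    pod_tbps = nodes * per_node_gbps / 1_000
    print(pod_tbps)              # 460.8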
NVLink Network replaces InfiniBand (IB) as the first level of interconnect in DGX SuperPODs. The A100-generation pod uses an IB HDR leaf/spine interconnect with only 51.2Tbps of bisection bandwidth. Benchmarks for AI training showed excellent scaling over NVLink within a DGX chassis, but scaling fell off dramatically across multiple chassis connected with IB. NVLink Network should improve pod-level scaling with its greater bandwidth and, presumably, lower latency. NVIDIA was silent on the latter metric, however, and NVLink Network adds forward error correction (FEC) on some links, which works against latency.
Mellanox Sharpens NVLink
The NVLink4-generation switch chip introduces a major new feature: in-network computing. The Scalable Hierarchical Aggregation Protocol (SHARP) comes from Mellanox, which patented the technology and incorporated it into its IB switch silicon before NVIDIA acquired the company. The protocol executes collective operations in the network, thereby reducing inter-node communications. For IB, Mellanox demonstrated sub-3us latency for Allreduce operations with up to 128 nodes. It never disclosed, however, the performance of the compute engines (ALUs) integrated into the Quantum IB switch. For Quantum-2, NVIDIA says only that it has 32x the ALU performance of the prior generation.
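To see why in-network reduction helps, consider a rough sketch (illustrative only, not NVIDIA's implementation) of the sequential transfer steps each rank performs in a classic ring Allreduce versus a SHARP-style reduction done in the switch.

    # Illustrative comparison (not NVIDIA code): sequential transfer steps per
    # rank for a classic ring Allreduce versus a SHARP-style in-network
    # reduction, where each rank sends its data up once and gets the result back.
    def ring_allreduce_steps(n_ranks: int) -> int:
        # reduce-scatter phase plus all-gather phase
        return 2 * (n_ranks - 1)

    def in_network_allreduce_steps(n_ranks: int) -> int:
        # one send to the switch tree, one reduced result back
        return 2

    for n in (8, 32, 256):
        print(f"{n} ranks: ring={ring_allreduce_steps(n)}, "
              f"in-network={in_network_allreduce_steps(n)}")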
The NVLink4 switch chip includes 400GFLOPS of FP32 compute plus SRAM to support SHARP operations. Although this is modest performance compared with a GPU, it targets only collective operations and adds less than 20% to the switch chip's die area. Offloading collectives to the network requires software changes in the MPI or CUDA stack on each node plus a central agent to manage the switch. For DGX systems, SHARP will likely deliver large performance gains in Allreduce operations that fit within the switch chip's limited SRAM, but the benefit will fall off as message sizes overflow that memory.
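Crucially, those changes stay below the application. A minimal mpi4py sketch (hypothetical usage, assuming an MPI build with SHARP support) shows that the Allreduce call itself is unchanged whether the reduction runs on the hosts or in the switch.

    # Hypothetical mpi4py sketch: the application-level call is identical whether
    # the reduction runs on the hosts or is offloaded to the switch. Enabling
    # SHARP is a property of the MPI/NCCL stack and the switch-management agent,
    # not of this code.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    local = np.full(1024, comm.Get_rank(), dtype=np.float32)
    result = np.empty_like(local)
    comm.Allreduce(local, result, op=MPI.SUM)   # may be executed in-network
    if comm.Get_rank() == 0:
        print(result[0])   # sum of all ranks' values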
NVLink4 Leaves CXL Behind
NVIDIA takes some heat for its use of proprietary protocols, but its latest NVLink iteration is well ahead of standardized alternatives. NVLink4 uses PAM4 modulation to deliver 100Gbps per lane, matching the speed of the fastest network interconnects. CXL represents the standardized alternative for coherent interconnects, but its first two generations (1.1/2.0) use PCIe Gen5 electrical signaling with NRZ modulation to produce only 32Gbps per lane. CXL 3.0 uses PCIe Gen6 with PAM4 to double per-lane speed, but products won't reach the market for several years. Like it or not, NVLink is years ahead of open alternatives.
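The per-lane gap is easy to quantify from the figures above; the short sketch below compares raw signaling rates, ignoring encoding and FEC overhead.

    # Raw per-lane signaling rates cited above (encoding and FEC overhead ignored).
    rates_gbps = {
        "NVLink4 (PAM4)": 100,
        "CXL 1.1/2.0 over PCIe Gen5 (NRZ)": 32,
        "CXL 3.0 over PCIe Gen6 (PAM4)": 64,
    }
    for name, gbps in rates_gbps.items():
        print(f"{name}: {gbps} Gbps/lane, {gbps / 100:.0%} of NVLink4's rate")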
On-Target ASIC
The NVLink4 NVSwitch chip is a true ASIC, tuned specifically for its application. Its switching logic is lean, keeping latency, power, and die area in check. Its serdes are state-of-the-art, driving 100Gbps lanes over PCB traces, cables, or optics. SHARP adds power and die area in return for what should be measurable reductions in AI-training time. By replacing IB at the pod level, NVIDIA is eating its own lunch before a competitor does.