
AI / ML training & inference

GPU clusters at 400G and 800G: InfiniBand and Ethernet fabrics with low BER and predictable latency.

AI training fabrics are the highest-bandwidth and highest-density deployments of optical transceivers in production today. NVIDIA SuperPOD-class clusters and hyperscale equivalents (Meta Grand Teton, Google TPU pods) push 200G / 400G / 800G per port at thousands of ports per pod, with cluster fabric and inter-pod links accounting for the majority of optical SKU spend. The collective-communication latency tail (NCCL all-reduce, all-gather) is sensitive to BER, jitter, and the occasional retransmit, so optical-layer health and DDM telemetry matter more here than in any other use case.
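Why the latency tail matters can be illustrated with the standard ring all-reduce cost model: each of N ranks transfers 2·(N−1)/N · S bytes, so the step finishes only as fast as the slowest link, and a single retransmit stretches the whole collective. The sketch below is a generic textbook model, not any vendor's implementation; the rank count and buffer size are illustrative assumptions, and 800 Gb/s is line rate rather than realistic goodput.

```python
# Ring all-reduce cost model: with N ranks and a message of S bytes over
# links of bandwidth B bytes/s, each rank transfers 2*(N-1)/N * S bytes.
# Illustrative only -- real NCCL goodput is below line rate (assumption).

def ring_allreduce_seconds(n_ranks: int, msg_bytes: float,
                           link_bytes_per_s: float) -> float:
    """Ideal time for one ring all-reduce, bandwidth term only."""
    return 2 * (n_ranks - 1) / n_ranks * msg_bytes / link_bytes_per_s

# 1 GiB gradient buffer across 64 ranks on 800 Gb/s links
t = ring_allreduce_seconds(64, 1 << 30, 800e9 / 8)
print(f"{t * 1e3:.2f} ms")  # prints 21.14 ms
```

Because every rank blocks on this transfer each step, even a small per-link BER-induced retransmit rate shows up directly in the collective's tail latency.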

Operating constraints

  • Per-port density: 64–128 ports of 400G or 800G in a single 2U switch chassis
  • Power and thermal budget: an 800G OSFP draws 15–20 W; a full chassis may exceed 2 kW of optics alone
  • Per-lane BER: PAM4 signalling is sensitive to dirty connectors
  • Predictable latency for collective communications (NCCL, RCCL)
  • Fabric choice: Ethernet (RoCEv2) and InfiniBand need different optical SKUs, even where the host hardware is interchangeable
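The density and power constraints above compose into a quick back-of-the-envelope check. This sketch reuses the 15–20 W per-module range quoted above for 800G OSFP; it is illustrative only and no substitute for the switch vendor's cage power specification.

```python
# Back-of-the-envelope optics power for a fully populated chassis,
# using the 15-20 W per-module range quoted above for 800G OSFP.

def optics_power_w(ports: int, watts_per_module: float) -> float:
    """Total power drawn by optics alone in a fully populated chassis."""
    return ports * watts_per_module

for ports in (64, 128):
    lo = optics_power_w(ports, 15.0)
    hi = optics_power_w(ports, 20.0)
    print(f"{ports} ports: {lo:.0f}-{hi:.0f} W of optics")
```

At 128 ports the worst case lands at 2560 W, which is where the "full chassis may exceed 2 kW of optics alone" figure comes from.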

Use cases

Intra-rack GPU-to-leaf

1–3 m links from GPU host NICs to the rack ToR. AOC or short-reach DAC at 400G / 800G. AOC dominates because the cable management is cleaner than DAC at 8-lane PAM4 cross-sections.

Recommended form factors: QSFP-DD · 400G, OSFP · 800G

Leaf-to-spine east-west

ToR-to-spine in a leaf-spine fabric. Typically 5–30 m AOC, or SR8/DR8 over MM/SM fibre for runs that cross rows. NCCL-class collectives drive the bulk of east-west bandwidth.

Recommended form factors: OSFP · 800G

InfiniBand HDR / NDR fabric

200G HDR (QSFP56) on legacy NVIDIA Quantum platforms; 400G / 800G NDR (OSFP) on Quantum-2. NetAPI supports both with InfiniBand-coded SKUs that drop into Mellanox/NVIDIA Spectrum and Quantum line cards.

Recommended form factors: QSFP56 · 200G, OSFP · 800G

Pod-to-pod inter-rack

100–500 m fibre between aggregation rows in hyperscale AI buildings. DR8 / FR8 single-mode parallel at 400G / 800G is the dominant choice, delivering clean BER over the full reach.

Recommended form factors: QSFP-DD · 400G, OSFP · 800G

Storage fabric for training data

NVMe-oF or distributed file-system (Lustre, BeeGFS, Weka) backplane for the training set. 100G / 400G Ethernet or HDR / NDR InfiniBand depending on storage architecture.

Recommended SKUs for this segment

  • NAP-OSFP-AOC-3M…30M (800G AOC, the dominant intra-rack interconnect)
  • NAP-OSFP-DR8 (800G, 500 m on parallel single-mode for inter-row)
  • NAP-OSFP-SR8 (800G, 100 m / 150 m multi-mode for shorter inter-rack)
  • NAP-QSFPDD-DR4 (400G ToR fabric)
  • NAP-QSFP56-AOC (200G HDR InfiniBand intra-rack)

Design notes

  • Verify the OSFP cage power budget on the switch before populating every port: full 800G OSFP populations draw well over 1 kW of optics power in a 64-port leaf, and past 2 kW in a 128-port chassis.
  • PAM4 at 100G/lane is sensitive to connector contamination and to marginal OM3 fibre. New AI builds should standardise on OM4 / OM5 for multi-mode runs and OS2 for single-mode.
  • InfiniBand and Ethernet OSFP modules look identical mechanically but are coded differently. Order the right variant for your fabric.
  • Per-lane DDM telemetry is exposed via CMIS; pipe it into your fleet observability stack (Datadog, Prometheus) and alert on pre-FEC BER trend, not just link state.
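The last note can be sketched as a minimal trend alert on per-lane pre-FEC BER samples. The 2.4e-4 threshold is the commonly quoted pre-FEC limit for RS(544,514) KP4 FEC; the alert margin and sample data are illustrative assumptions, and in production the samples would come from your NOS's CMIS DDM readout rather than a hard-coded list.

```python
import math

# Minimal sketch: alert on pre-FEC BER *trend*, not just link state.
# FEC_LIMIT is the commonly quoted RS(544,514) KP4 pre-FEC BER limit;
# margin_decades and the sample series are illustrative assumptions.

FEC_LIMIT = 2.4e-4

def ber_slope(samples):
    """Least-squares slope of log10(BER) per sampling interval."""
    n = len(samples)
    ys = [math.log10(s) for s in samples]
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def should_alert(samples, margin_decades=1.0):
    """Alert when BER is rising AND within margin_decades of the FEC limit."""
    rising = ber_slope(samples) > 0
    close = math.log10(FEC_LIMIT) - math.log10(samples[-1]) < margin_decades
    return rising and close

healthy = [1e-8, 9e-9, 1.1e-8, 1e-8]     # flat, far from the limit
degrading = [1e-7, 1e-6, 1e-5, 9e-5]     # rising toward the FEC limit
print(should_alert(healthy), should_alert(degrading))  # prints False True
```

Alerting on the slope in log space catches a lane that is drifting toward the FEC limit days before post-FEC errors, and therefore link flaps, ever appear.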

Designing an AI / ML training & inference deployment?

Our pre-sales engineers will review your topology, fibre plant, and switch firmware and spec the SKU list before you order. Free.