University of Edinburgh - Shop


AI Data Center Network Design and Technologies (ePub eBook)

eBook by Subramaniam, Mahesh/Styszynski, Michal/Tambakuwala, Himanshu

£25.99

ISBN: 9780135436356
Publication Date: 20 Jan 2026
Publisher: Pearson ITP
Imprint: Addison-Wesley Professional
Pages: 384 pages
Format: eBook
For delivery: Download available
Description

Artificial intelligence is redefining the scale, architecture, and performance expectations of modern data centers. Training large ML models demands infrastructure capable of moving massive data sets through highly parallel, compute-intensive environments, where traditional data center designs simply can't keep up. AI Data Center Network Design and Technologies is the first comprehensive, vendor-agnostic guide to the design principles, architectures, and technologies that power AI training and inference clusters. Written by leading experts in AI data center design, this book helps engineers, architects, and technology leaders understand how to design and scale networks purpose-built for the AI era.

INSIDE, YOU'LL LEARN HOW TO

• Architect scalable, high-radix network fabrics to support xPU (GPU, TPU)-based AI clusters
• Integrate lossless Ethernet/IP fabrics for high-throughput, low-latency data movement
• Align network design with AI/ML workload characteristics and server architectures
• Address challenges in cooling, power, and interconnect design for AI-scale computing
• Evaluate emerging technologies from the Ultra Ethernet Consortium (UEC) and their effect on future AI data centers
• Apply best practices for deployment, validation, and performance measurement in AI/ML environments

With broad coverage of both foundational concepts and emerging innovations, this book bridges the gap between network engineering and AI infrastructure design. It empowers readers to understand not only how AI data centers work, but why they must evolve.

Contents

Foreword
Preface
Acknowledgments
About the Authors

1 Wonders in the Workload
    What's New in AI Data Center Workloads
    The Life Cycle of an AI Model
    Training an AI Model
    Parallelism
    Job Completion Time (JCT)
    Tail Latency
    Summary
    Test Your Knowledge

2 "The Common-Man View" of AI Data Center Fabrics
    Training vs. Inference AI Data Centers
    InfiniBand vs. Ethernet for AI Training Data Centers
    Ethernet Hardware Switches and Advanced Software Features
    Handling Elephant Flows
    Load-Balancing Techniques
    Congestion Management and Mitigation Techniques
    Summary
    Test Your Knowledge

3 Network Design Considerations
    Background Introduction
    Training Data Center Architecture
    Rail-Optimized Design (ROD)
    Rail-Unified Design (RUD)
    Rack Design
    Scheduled Fabric
    Topologies
    Inference Data Center Architecture
    Multi-Planar Scale-Out Architectures
    Summary
    Test Your Knowledge
    References

4 Optics and Cable Management
    Scaling Optics for AI Clusters
    Challenges in Optical Innovation
    Packet Flow
    Transmission Modes
    Transceiver Types
    Cable and Connector Types
    Standards
    Further Innovations in Optics
    Summary
    Test Your Knowledge
    References

5 Thermal and Power Efficiency Considerations
    Thermal Footprints in AI Data Centers
    Airflow Options
    Liquid Cooling
    Summary
    Test Your Knowledge
    References

6 Efficient Load Balancing
    Per-Flow Load Balancing
    Per-Packet Load Balancing
    Load-Balancing Mechanism Comparison
    Summary
    Test Your Knowledge

7 RoCEv2 Transport and Congestion Management
    Congestion Points
    Explicit Congestion Notification (ECN)
    Data Center Quantized Congestion Notification (DCQCN)
    Source Flow Control (SFC)
    Congestion Signaling
    Summary
    Test Your Knowledge

8 IP Routing for AI/ML Fabrics
    Dynamic IP Routing Options
    eBGP Underlay for Three-Stage/Five-Stage Fabric for an AI Data Center
    Multi-tenancy for an AI/ML Cluster Data Center Network
    Microsegmentation and Multi-tenancy for an AI/ML Data Center
    Extending IP Routing to the Server
    Traffic Engineering in the AI Data Center Fabric
    Segment Routing and SRv6 for AI/ML Fabrics
    Summary
    Test Your Knowledge
    References

9 Storage Network Design and Technologies
    The AI Data Center Life Cycle and Storage Networks
    Storage Network Design Types
    Block, Object, and File Storage Systems
    NVMe-oF for Block-Level Access
    NVMe-o-RDMA/RoCEv2 State Machine
    High-Performance File Systems
    GPUDirect Storage
    Summary
    Test Your Knowledge
    References

10 AI Network Performance KPIs
    Significance of Performance Benchmarking
    MLCommons for AI Data Centers
    MLCommons Initiatives
    MLCommons Benchmarking Suites
    Benchmarking a Data Center for Machine Learning
    Summary
    Test Your Knowledge
    References

11 Monitoring and Telemetry
    Exploring Monitoring Options
    Network Monitoring in an AI/ML Data Center Network
    In-Band Flow Analyzer (IFA)
    Corrective Actions
    Summary
    Reference

12 Ultra Ethernet Consortium (UEC)
    UEC Developments and Working Groups
    UEC Key Terminology
    The UEC and Network Architectures
    A New Protocol Stack
    Data Plane: Packet Forwarding Options
    Packet Delivery Modes
    Congestion Management (CM) in the UEC Specification
    Packet Trimming and Fast Retransmissions
    Link Layer Reliability (LLR) Mechanism
    In-Network Collectives (INC) and xCCL
    Management and Orchestration
    Interoperability and Backward Compatibility
    Compliance and Certification
    UEC Challenges and Future Directions
    Comparing UEC to InfiniBand and RoCEv2
    Summary
    Test Your Knowledge
    References

13 Scale-Up Systems
    Key Building Blocks of Scale-Up Systems
    Scale-Up Ethernet Transport (SUE-T)
    Ultra Accelerator Link (UALink)
    Memory Coherence in Scale-Up Systems
    Scale-Up Systems: Key Differences and Similarities
    Summary
    Test Your Knowledge
    References

14 Conclusion
    DC Network Role for AI
    Caveats and Challenges
    Future Developments
    Final Remarks
    References

Appendix A Questions and Answers
Appendix B Acronyms

Accessing your eBook through Kortext

Once purchased, you can view your eBook through the Kortext app, available to download for Windows, Android and iOS devices. Once you have downloaded the app, your eBook will be available on your Kortext digital bookshelf and can even be downloaded to view offline anytime, anywhere, helping you learn without limits.

In addition, you'll have access to Kortext's smart study tools, including highlighting, note-taking, copy and paste, and easy reference export.

To download the Kortext app, head to your device's app store or visit https://app.kortext.com to sign up and read through your browser.

This is a Kortext title.

NB: This eBook is available only under a single-user licence (i.e. not for multiple or networked users).
