Just-in-time in-network gradient compression for distributed training via packet trimming

Speaker

Xiaoqi Chen, Purdue University

Time

2024-12-15 10:30:00 ~ 2024-12-15 12:00:00

Location

Room 1319, Expert Building, Software Building, Shanghai Jiao Tong University

Host

Shizhen Zhao (赵世振)

Abstract

Distributed training of large AI models has become the most demanding workload for data center networks because of its high traffic volume and its sensitivity to congestion and delay. Traditional network fabrics such as InfiniBand and RoCE were designed for HPC workloads with an application-agnostic philosophy: they provide reliable delivery semantics that training does not strictly need and avoid packet loss at all costs, which adds latency and creates scalability challenges. Meanwhile, large AI models embed significant redundancy and are quite robust to imperfections in training. We also observe that in tightly synchronized, large-scale training, delivering incomplete data on time is better than waiting for retransmission: it is too costly for thousands of GPUs to wait for one slow-finishing "straggler".

In this talk, I will discuss my recent work on designing a low-latency, lossy transport tailored to distributed AI training (HotNets'24). By exploiting the packet trimming feature already supported by today's switches, we propose a novel application-transport co-design in which gradients are sent in a specially crafted trimmable compression encoding, allowing switches to perform gradient compression on demand while avoiding retransmission delays. Evaluation results show that a trimmable encoding based on the Randomized Hadamard Transform has low computational overhead and achieves good training quality and shorter time to accuracy at high trim rates of up to 50%.
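To give a rough feel for why a Randomized Hadamard Transform suits trimming, here is a minimal Python sketch. It is not the encoding presented in the talk: the function names, the seed-sharing scheme, and the trimming model (a switch simply drops the trailing coefficients and the receiver fills them with zeros) are all assumptions made for illustration. The transform spreads each gradient value across every coefficient, so losing part of the payload degrades all values slightly instead of erasing some of them entirely.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def random_signs(n, seed):
    """Pseudo-random +1/-1 signs; sender and receiver are assumed to share `seed`."""
    rng = random.Random(seed)
    return [1.0 if rng.random() < 0.5 else -1.0 for _ in range(n)]


def encode(gradient, seed):
    """Randomized Hadamard Transform: flip signs at random, then apply the transform."""
    signs = random_signs(len(gradient), seed)
    return fwht([g * s for g, s in zip(gradient, signs)])


def trim(coefficients, keep):
    """Hypothetical trimming model: only the first `keep` coefficients survive."""
    return coefficients[:keep]


def decode(received, n, seed):
    """Zero-fill the trimmed tail, invert the transform, then undo the sign flips."""
    padded = list(received) + [0.0] * (n - len(received))
    signs = random_signs(n, seed)
    return [v * s for v, s in zip(fwht(padded), signs)]
```

For example, `decode(trim(encode(g, seed), len(g) // 2), len(g), seed)` models a 50% trim of a gradient `g`. Because the transform is orthonormal, the squared reconstruction error equals the energy of the dropped coefficients.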

Bio

Xiaoqi Chen recently started as an Assistant Professor in the School of Electrical and Computer Engineering at Purdue University. He obtained his PhD degree at Princeton University in 2023 and was a postdoctoral researcher in the VMware Research Group prior to joining Purdue. His research focuses on applying algorithm design to various programmable targets in the network data plane -- high-speed programmable switches, SmartNICs, the host network stack, and their combinations -- to improve network performance, security, and privacy.
