Just-in-time in-network gradient compression for distributed training via packet trimming
Speaker
Xiaoqi Chen, Purdue University
Time
2024-12-15 10:30:00 ~ 2024-12-15 12:00:00
Location
Conference Room 1319, Expert Building, Software Building, Shanghai Jiao Tong University
Host
Shizhen Zhao (赵世振)
Abstract
Distributed training of large AI models has become the most demanding workload for data center networks, due to its significant traffic volume and sensitivity to congestion and delay. Traditional network fabrics such as InfiniBand and RoCE were designed with an application-agnostic philosophy for HPC workloads: they provide reliable delivery semantics that this workload does not need and avoid packet loss at all costs, which adds latency and creates scalability challenges. Meanwhile, large AI models embed significant redundancy and are quite robust against imperfections in training. We also observe that in tightly synchronized, large-scale training, delivering incomplete data on time is better than waiting for retransmission: it is too costly for thousands of GPUs to wait for one slow-finishing "straggler".
In this talk, I will discuss my recent work on designing a low-latency, lossy transport tailored for distributed AI training (HotNets'24). By exploiting the packet trimming feature already supported by today's switches, we propose a novel application-transport co-design in which gradients are sent in a specially crafted, trimmable compression encoding, allowing switches to perform gradient compression on demand while avoiding retransmission delays. Evaluation results show that a trimmable encoding based on the Randomized Hadamard Transform has low computational overhead and achieves good training quality and shorter time to accuracy at high trim rates of up to 50%.
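As a rough illustration of what a trimmable encoding could look like, the following sketch rotates a gradient with a Randomized Hadamard Transform and treats each coordinate as a sign bit plus a full-precision payload. It is not the encoding from the HotNets'24 paper: the function names (`fwht`, `encode`, `decode`), the sign-bit/payload split, the mean-magnitude scale, and the boolean trim mask standing in for switch behavior are all assumptions made for illustration.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    x = np.asarray(x, dtype=np.float64).copy()
    n = x.size
    h = 1
    while h < n:
        x = x.reshape(-1, 2, h)
        a = x[:, 0, :] + x[:, 1, :]
        b = x[:, 0, :] - x[:, 1, :]
        x = np.stack([a, b], axis=1).reshape(-1)
        h *= 2
    return x / np.sqrt(n)  # with this scaling the transform is its own inverse

def random_signs(n, seed):
    # Sender and receiver derive the same +/-1 diagonal from a shared seed (assumption).
    return np.random.default_rng(seed).choice([-1.0, 1.0], size=n)

def encode(grad, seed):
    """Rotate the gradient; return rotated values and a scale for trimmed coordinates."""
    rotated = fwht(grad * random_signs(grad.size, seed))
    scale = np.mean(np.abs(rotated))  # assumed to travel in the untrimmed header
    return rotated, scale

def decode(rotated, scale, trimmed_mask, seed):
    """Where a coordinate was trimmed, only its sign survives; estimate it as sign * scale."""
    received = np.where(trimmed_mask, np.sign(rotated) * scale, rotated)
    return fwht(received) * random_signs(rotated.size, seed)

if __name__ == "__main__":
    g = np.random.default_rng(0).standard_normal(1024)
    rot, scale = encode(g, seed=7)
    # Hypothetical congestion: about half of the coordinates lose their payload.
    mask = np.random.default_rng(1).random(g.size) < 0.5
    approx = decode(rot, scale, mask, seed=7)
    print("relative error:", np.linalg.norm(approx - g) / np.linalg.norm(g))
```

The rotation spreads each gradient entry across all coordinates, so losing the payload of some coordinates degrades the whole vector slightly instead of wiping out individual entries. With no trimming, `decode` recovers the gradient exactly.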
Bio
Xiaoqi Chen recently started as an Assistant Professor in the School of Electrical and Computer Engineering at Purdue University. He obtained his PhD from Princeton University in 2023 and was a postdoctoral researcher in the VMware Research Group prior to joining Purdue. His research focuses on applying algorithm design to various programmable targets in the network data plane -- high-speed programmable switches, SmartNICs, the host network stack, and their combinations -- to improve network performance, security, and privacy.