AutoHete: 面向大语言模型的自动化高效异构训练系统 | An Automatic and Efficient Heterogeneous Training System for LLMs

论文链接 / Paper: arXiv:2503.01890

作者 / Authors: Zihao Zeng, Chubo Liu, Xin He, Juan Hu, Yong Jiang, Fei Huang, Kenli Li, Wei Yang Bryan Lim


中文版

研究动机

大语言模型(LLMs)的训练通常需要大规模 GPU 集群,但现实中的计算资源往往是异构的——不同型号的 GPU、不同的网络带宽、不同的内存容量混合在一起。现有的并行训练框架大多假设同构环境,在异构集群上会出现严重的负载不均衡和资源浪费问题。

如何自动且高效地在异构计算环境中训练 LLMs,是一个亟待解决的系统性难题。

核心方法

AutoHete 提出了一个自动化的异构训练系统,核心设计目标是在无需人工干预的情况下,自动适配异构计算资源并实现高效训练:

关键特性

  1. 自动并行策略搜索:针对异构集群中不同设备的算力差异,自动搜索最优的混合并行策略(数据并行、张量并行、流水线并行等的组合),避免人工调参
  2. 负载均衡优化:动态调整每个设备的计算负载,确保快设备不空等、慢设备不拖后腿
  3. 通信优化:针对异构网络环境(如不同节点间带宽差异),优化梯度通信策略,减少通信瓶颈

应用场景

  • 使用混合 GPU 型号(如 A100 + V100 + 消费级 GPU)进行 LLM 训练
  • 跨数据中心的分布式训练
  • 资源受限场景下的高效训练

English Version

Motivation

Training Large Language Models (LLMs) typically requires large-scale GPU clusters, but real-world computing resources are often heterogeneous—mixing different GPU models, network bandwidths, and memory capacities. Most existing parallel training frameworks assume homogeneous environments, leading to severe load imbalance and resource waste on heterogeneous clusters.

How to automatically and efficiently train LLMs in heterogeneous computing environments is a critical systems challenge.

Key Methods

AutoHete proposes an automated heterogeneous training system designed to automatically adapt to heterogeneous computing resources and achieve efficient training without manual intervention:

Key Features

  1. Automatic Parallel Strategy Search: Automatically searches for optimal hybrid parallelism strategies (combinations of data, tensor, and pipeline parallelism) tailored to the computational capabilities of different devices in a heterogeneous cluster
  2. Load Balancing Optimization: Dynamically adjusts computational workload for each device, ensuring fast devices don’t idle and slow devices don’t become bottlenecks
  3. Communication Optimization: Optimizes gradient communication strategies for heterogeneous network environments (e.g., varying inter-node bandwidths), reducing communication bottlenecks

Use Cases

  • LLM training with mixed GPU models (e.g., A100 + V100 + consumer GPUs)
  • Cross-datacenter distributed training
  • Efficient training under resource-constrained environments