ExpertFlow: Efficient MoE Inference via Predictive Expert Caching and Token Scheduling
🎉 This paper has been accepted by DAC 2026!
Paper: arXiv:2410.17954
Authors: Xin He, Shunkang Zhang, Kaijie Tang, Shaohuai Shi, Yuxin Wang, Zihao Zeng, Zhenheng Tang, Xiaowen Chu, Haiyan Yin, Ivor Tsang, Ong Yew Soon
Motivation
Sparse Mixture-of-Experts (MoE) models achieve efficient inference by activating only a subset of experts per token. However, real-world deployment faces two critical challenges:
- Massive GPU memory footprint: MoE models contain a huge number of expert parameters that far exceed the memory capacity of a single GPU.
- High expert swapping overhead: Different tokens activate different experts during inference, so expert parameters are frequently transferred between GPU and CPU memory, creating a severe I/O bottleneck.
Key Methods
ExpertFlow introduces two core techniques:
1. Predictive Expert Caching
- A lightweight predictor forecasts expert routing paths before actual computation begins.
- Predicted routing results enable proactive prefetching of experts onto the GPU, overlapping parameter transfer with computation instead of loading reactively on demand.
- A real-time error correction mechanism quickly adjusts when predictions are incorrect, maintaining high cache hit ratios.
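The caching scheme above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `ExpertCache` and `predict_next_experts` are hypothetical names, the frequency-based predictor is a toy stand-in for the paper's learned lightweight predictor, and expert weights are placeholder strings rather than real tensors.

```python
# Sketch: predictive expert caching with prefetch and miss correction.
# All names here are illustrative, not from the ExpertFlow codebase.
from collections import OrderedDict

class ExpertCache:
    """Fixed-capacity GPU-resident expert cache with LRU eviction."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # expert_id -> weights placeholder

    def prefetch(self, expert_ids):
        """Proactively load predicted experts (simulated CPU -> GPU copy)."""
        for eid in expert_ids:
            self._load(eid)

    def fetch(self, expert_id):
        """On-demand access; returns True on a cache hit. On a miss the
        expert is loaded reactively (the error-correction path)."""
        hit = expert_id in self.cache
        self._load(expert_id)
        return hit

    def _load(self, eid):
        if eid in self.cache:
            self.cache.move_to_end(eid)  # refresh LRU position
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least-recently used
            self.cache[eid] = f"weights[{eid}]"  # stand-in for a tensor

def predict_next_experts(history, top_k=2):
    """Toy predictor: assume routing is sticky and reuse the most frequent
    recently activated experts. A real predictor is a small learned model."""
    freq = {}
    for step in history[-4:]:
        for eid in step:
            freq[eid] = freq.get(eid, 0) + 1
    return sorted(freq, key=freq.get, reverse=True)[:top_k]

# Simulated decode loop: prefetch predicted experts, then count hits.
cache = ExpertCache(capacity=4)
history = [[0, 3], [0, 3], [0, 5]]          # past routing decisions
actual_steps = [[0, 3], [0, 5], [2, 3]]     # experts actually activated
hits = misses = 0
for actual in actual_steps:
    cache.prefetch(predict_next_experts(history))
    for eid in actual:
        if cache.fetch(eid):
            hits += 1
        else:
            misses += 1  # misprediction: loaded reactively, then cached
    history.append(actual)
print(hits, misses)  # prints: 4 2
```

Even the naive frequency heuristic keeps most accesses on the GPU in this toy trace; the point is the control flow (predict, prefetch, correct on miss), not the predictor itself.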
2. Dynamic Token Scheduling
- Input tokens are rearranged across batches to minimize the number of activated experts per batch.
- This reduces the number of expert swaps per inference step, improving computational throughput.
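The rescheduling idea can be sketched as follows, assuming routing decisions (or predictions) are available per token. `schedule_tokens` and the routing table are hypothetical: tokens with identical expert sets are grouped into the same micro-batch so each batch touches as few experts as possible.

```python
# Sketch: similarity-based token scheduling. Tokens routed to the same
# experts are packed into the same batch. Names and data are illustrative.
from collections import defaultdict

def schedule_tokens(routing, batch_size):
    """Group token indices by their expert set, then greedily pack whole
    groups into batches of at most `batch_size` tokens. Returns a list of
    (token_indices, activated_expert_set) pairs."""
    groups = defaultdict(list)
    for tok, experts in enumerate(routing):
        groups[frozenset(experts)].append(tok)
    batches = []
    current, current_experts = [], set()
    # Largest groups first, so homogeneous batches form early.
    for experts, toks in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        for tok in toks:
            if len(current) == batch_size:
                batches.append((current, current_experts))
                current, current_experts = [], set()
            current.append(tok)
            current_experts = current_experts | set(experts)
    if current:
        batches.append((current, current_experts))
    return batches

# Eight tokens, each routed to 2 of 6 experts.
routing = [
    {0, 1}, {2, 3}, {0, 1}, {4, 5},
    {2, 3}, {0, 1}, {4, 5}, {0, 1},
]
batches = schedule_tokens(routing, batch_size=4)
for toks, experts in batches:
    print(sorted(toks), sorted(experts))
# prints: [0, 2, 5, 7] [0, 1]
#         [1, 3, 4, 6] [2, 3, 4, 5]
```

In this toy example the first batch activates only 2 experts, whereas processing tokens 0-3 in arrival order would activate all 6; fewer activated experts per batch means fewer CPU-GPU expert swaps per inference step.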
Results
- Up to 93.72% GPU memory savings
- 2-10× inference speedup compared to baselines
- Practical and deployable solution for resource-constrained environments running large-scale MoE models