Publications
Publications by category in reverse chronological order. Generated by jekyll-scholar.
2024
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression. arXiv preprint arXiv:2410.12707, 2024
- Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing. 2024
2023
2022
2021
2020
- Benchmarking the performance and energy efficiency of AI accelerators for AI training. In 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID) Workshop, 2020