[[测试]]

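Answering my own questions

We do not model communication, only computation, and we aim to push that as far as it will go.

This study focuses on **Single-Node Computational Performance Prediction** for LLM training. On a single GPU, we simulate the computational slice that each parallelism strategy assigns to one device, and measure its runtime and memory footprint precisely. The model is meant to supply accurate computation costs to higher-level analytical models, decoupling the complexity of computation from that of communication.

The full computation of one LLM training run is broken down into three levels: layer, submodule, and kernel.
The latter two are computation-graph nodes at different granularities.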
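A minimal sketch of how the three levels nest, assuming a layer's time is simply the sum of its submodules and a submodule's time is the sum of its kernels (this ignores launch gaps and any overlap). The `Node` class, the node names, and all timings are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node at the layer, submodule, or kernel level."""
    name: str
    time_ms: float = 0.0                      # measured time, used for leaves only
    children: list = field(default_factory=list)

    def total_ms(self) -> float:
        # A leaf reports its own measurement; an inner node sums its children.
        if not self.children:
            return self.time_ms
        return sum(child.total_ms() for child in self.children)


# Hypothetical kernel timings (ms), not measurements.
attention = Node("attention", children=[
    Node("qkv_gemm", 0.5), Node("softmax", 0.25), Node("out_gemm", 0.25),
])
mlp = Node("mlp", children=[
    Node("up_gemm", 0.75), Node("gelu", 0.125), Node("down_gemm", 0.625),
])
layer = Node("decoder_layer", children=[attention, mlp])

for node in (attention, mlp, layer):
    print(f"{node.name}: {node.total_ms()} ms")
```

The same tree can be filled from any of the three sources in the table: a looked-up layer time, per-submodule measurements, or kernel entries from an operator database.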

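| Level | Ground truth | Proxy |
| --- | --- | --- |
| layer (1, 2, 4, 8, ...) | CUDA events (pure timing): "I have GPUs and have already run the full training, so I just look the numbers up." Profiler (adds overhead): "I want more detailed data." | |
| submodule (switched from AIOB to calling Hugging Face transformers modules) | Hooks (pure timing), profiler (adds overhead) | CUDA events: "I have a GPU, but running the full model is too much trouble, so I piece the estimate together from fragments." Fairly risky: hand-written modules have not been through compile, and the backward pass is hard to measure |
| kernel | Profiler: "I cannot afford the GPU, or I am still designing it, so I piece the prediction together from an operator timing database." | Profiler |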

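For the submodule level, it is best to assert that the two outputs are numerically close enough with `torch.allclose(output_profiler, output_cuda_event, atol=1e-5)`.
Both numerics (allclose) and performance (compiled vs. eager) need to be validated.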

model_aiob = AIOB_DecoderBlock(...)
model_hf = LlamaDecoderLayer(...) (from Hugging Face)
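A rough sketch of the numeric-plus-timing check described above. It does not use `AIOB_DecoderBlock` or `LlamaDecoderLayer`: a small MLP and its deep copy stand in for the two implementations, and `time_forward_ms` and `compare_modules` are hypothetical helper names. Timing uses CUDA events when a GPU is present and falls back to wall-clock time on CPU; only the forward pass is covered.

```python
import copy
import time

import torch


def time_forward_ms(module, x, warmup=3, iters=10):
    """Average forward time in ms (CUDA events on GPU, perf_counter on CPU)."""
    with torch.no_grad():
        for _ in range(warmup):          # warm-up runs are not timed
            module(x)
        if x.is_cuda:
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            torch.cuda.synchronize()
            start.record()
            for _ in range(iters):
                module(x)
            end.record()
            torch.cuda.synchronize()     # wait until the end event has completed
            return start.elapsed_time(end) / iters
        t0 = time.perf_counter()
        for _ in range(iters):
            module(x)
        return (time.perf_counter() - t0) * 1e3 / iters


def compare_modules(reference, candidate, x, atol=1e-5):
    """Check numeric agreement, then time both modules on the same input."""
    with torch.no_grad():
        same = torch.allclose(reference(x), candidate(x), atol=atol)
    return {
        "allclose": same,
        "reference_ms": time_forward_ms(reference, x),
        "candidate_ms": time_forward_ms(candidate, x),
    }


device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
# Stand-in modules: identical weights, so the outputs should match.
reference = torch.nn.Sequential(
    torch.nn.Linear(64, 256), torch.nn.GELU(), torch.nn.Linear(256, 64)
).to(device)
candidate = copy.deepcopy(reference)
x = torch.randn(8, 64, device=device)

report = compare_modules(reference, candidate, x)
print(report)
```

To cover the compiled vs. eager comparison, one option is to pass `torch.compile(candidate)` as the candidate and use more warm-up iterations so compilation time is not counted.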

![[Pasted image 20251106145408.png]]

![[Pasted image 20251106150141.png]]

Treat torch.compile as an experimental variable: "How much does it actually gain across different models, configurations, and hardware?"

A Survey on Performance Modeling and Prediction for Distributed DNN Training

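2 Background

The familiar story: the space of training configurations is huge, so how do we find the most suitable one before the real training run starts? It saves money and time!

The configuration space in detail

1. Parallelism strategy

Data parallelism

Used when the dataset is large

Model parallelism (pipeline, tensor, and expert parallelism (EP) for MoE)

Used when the model does not fit on one device

Hybrid parallelism

Several papers discuss how to search for a strategy automatically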
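To get a feel for how quickly the parallelism part of the configuration space grows, here is a small sketch that enumerates every (data, tensor, pipeline) split of a fixed GPU count. The function name `parallel_configs` and the `max_tp` cap (tensor parallelism kept within one 8-GPU node) are assumptions for illustration, not taken from the surveys.

```python
def parallel_configs(world_size, max_tp=8):
    """Return all (dp, tp, pp) triples whose product equals world_size."""
    configs = []
    for tp in range(1, min(max_tp, world_size) + 1):
        if world_size % tp:
            continue                      # tp must divide the GPU count
        rest = world_size // tp
        for pp in range(1, rest + 1):
            if rest % pp == 0:            # pp must divide what is left
                configs.append((rest // pp, tp, pp))
    return configs


configs = parallel_configs(world_size=8)
for dp, tp, pp in configs:
    print(f"dp={dp} tp={tp} pp={pp}")
```

Each triple is one candidate whose per-device compute slice would need a time and memory estimate, which is why searching by real training runs is too expensive.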

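2. Computation optimization

One side is optimizing how compute tasks are distributed (across nodes)
The other is settings and operator optimization (within a node)

3. Communication optimization

Omitted

4. Memory and data loading

On the memory side, the goal is to train larger models, e.g. the ZeRO family
On the loading side, I/O must never be what limits training speed! 🫲😭🫱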
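For the memory point, a sketch of the per-GPU model-state footprint under the ZeRO stages. It assumes mixed-precision Adam as accounted for in the ZeRO paper: 2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter. Activations, buffers, and fragmentation are left out, and the function name is hypothetical.

```python
def zero_model_state_bytes(num_params, num_gpus, stage):
    """Per-GPU bytes for parameters, gradients, and optimizer state.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer state partitioned
    stage 2: optimizer state and gradients partitioned
    stage 3: optimizer state, gradients, and parameters partitioned
    """
    params = 2 * num_params      # fp16 parameters
    grads = 2 * num_params       # fp16 gradients
    optim = 12 * num_params      # fp32 master weights, momentum, variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


# Example values (assumed): a 7.5B-parameter model on 64 GPUs.
for stage in range(4):
    gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```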

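5. Cluster design

How to design the topology, and which network transport protocol to use
(How do you choose the GPUs? The survey does not cover this)

6. Co-design

All of the above affect one another and are coupled. For example, computation and communication overlap.
This is hard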

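3 Main body

Models are split into three types:

Analytical models: formulas, basically. Iteration time is decomposed into computation and communication, and each part is estimated with its own analytical model.
Graph-based:
Execution-driven: simulation.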
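A minimal sketch of the analytical approach, under stated assumptions rather than any specific paper's model: compute time is FLOPs divided by achieved FLOPS, with the common approximation of about 6 FLOPs per parameter per token for forward plus backward; communication is one bandwidth-only ring all-reduce of the gradients with no latency term; and the two are added with no overlap. All function names and example numbers are illustrative.

```python
def compute_time_s(num_params, tokens_per_gpu, peak_flops, utilization):
    """Compute time from the ~6 * params * tokens FLOPs approximation."""
    flops = 6 * num_params * tokens_per_gpu
    return flops / (peak_flops * utilization)


def ring_allreduce_time_s(message_bytes, num_gpus, bandwidth_bytes_per_s):
    """Bandwidth term of ring all-reduce: each GPU sends 2(N-1)/N of the message."""
    if num_gpus == 1:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * message_bytes / bandwidth_bytes_per_s


def iteration_time_s(compute_s, comm_s):
    """No-overlap estimate: computation and communication run back to back."""
    return compute_s + comm_s


# Assumed example: 1B parameters, 8192 tokens per GPU, 312 TFLOPS peak
# at 40% utilization, fp16 gradients, 8 GPUs, 100 GB/s per link.
comp = compute_time_s(1e9, 8192, 312e12, 0.4)
comm = ring_allreduce_time_s(2 * 1e9, 8, 100e9)
print(f"compute {comp:.3f} s, comm {comm:.3f} s, "
      f"iteration {iteration_time_s(comp, comm):.3f} s")
```

The utilization factor is where empirical calibration would enter; the fixed 0.4 here is only a placeholder.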

A Survey of End-to-End Modeling for Distributed DNN Training: Workloads, Simulators, and TCO

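2 Workload representation

B Workload representation

Two classes: configuration-based and intermediate representation (IR)
Figure 3 makes this easy to follow. IRs are further split into framework-agnostic and framework-specific; the marked ones are not brand-new IRs invented from scratch

Configuration-based representations lack detail: easy for users to define, but not suitable as a trace

C Insights and outlook

Things beyond computation and communication also need modeling, such as I/O and checkpointing
Although many IRs exist, many simulators in practice use their own IR because it eases feature integration
torch.fx is the most common choice
MLIR is widely used in compiler optimization but rarely in simulators
Machine-level representations are uncommon, probably because the granularity is so fine that simulation becomes laborious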
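Since torch.fx comes up as the most common IR, here is a small sketch of what a workload looks like once traced. The toy `Sequential` model is a stand-in, not one of the models discussed here; the graph nodes it yields are the kind of submodule-level entries a simulator could attach timings to.

```python
import torch
import torch.fx

# Toy model standing in for a real network.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 16),
)

# symbolic_trace records the forward pass as a graph of nodes.
traced = torch.fx.symbolic_trace(model)

# Each node has an op kind (placeholder, call_module, output, ...) and a target.
workload = [(node.op, str(node.target)) for node in traced.graph.nodes]
for op, target in workload:
    print(f"{op:12s} {target}")
```

Symbolic tracing captures only the forward graph and does not follow data-dependent control flow, so a training workload would still need the backward pass from another source.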

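3 Simulators

A Background and motivation

How to trade off fidelity against speed
The general framework is essentially Figure 4.

B Studies of basic simulator architecture

C Simulator taxonomy and comparison

Models fall into three types, as in Figure 5

Analytical: pure mathematical analysis, fast but limited in accuracy. Differences in workload granularity lead to differences in accuracy.
Profiling: analytical models combined with empirical data, balancing speed and accuracy. Watch the cost of profiling: if simulation takes as long as real execution, simulation loses its point.
Execution: very accurate but very slow

On top of this, simulators are also distinguished by granularity, see Figure 5. The finer the granularity, the more accurate and the slower
Table 6 is the core of this survey. LLMCompass and LLMServingSim target inference, but their angle is novel.
Almost all of them model computation and communication separately, then coordinate the overlap.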
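One simple way to picture "model separately, then coordinate the overlap" is sketched below. It assumes a single scalar `overlap` in [0, 1] giving the fraction of communication that can hide behind computation, capped by the available compute time. This is an illustrative simplification, not how any particular simulator in Table 6 does it.

```python
def iteration_time(compute, comm, overlap):
    """Combine separately modeled compute and communication times.

    overlap = 0.0 -> fully serialized (compute + comm)
    overlap = 1.0 -> fully overlapped (max(compute, comm))
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    hidden = min(comm * overlap, compute)   # cannot hide more than compute lasts
    return compute + comm - hidden


# Made-up times in milliseconds.
for ov in (0.0, 0.5, 1.0):
    print(f"overlap={ov}: {iteration_time(10.0, 4.0, ov)} ms")
```

This makes clear why the compute estimate matters even when communication is the focus: the exposed communication time depends on how much computation is available to hide it.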

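D Insights and outlook

Workloads are driven by LLMs and lack generality
Configuration-based workloads are being phased out
99% of the hardware is NVIDIA; can the results generalize?
Validation is a major weakness: nobody can actually run ground truth on thousands of GPUs
Network models are all too simple and ignore congestion and topology
ASTRA-sim is a highly influential "modular" framework
Python handles the upper layers, C++ handles the low-level details
Simulators generally lack an energy-consumption model
Simulators overlook hardware purchase cost (capital cost)
Memory (local and remote) is badly neglected and modeled very coarsely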

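Table

| Selected | Status | The work | Citations | Source | Compute data collection | Prediction method | What we can still do |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | Read; mainly targets CV; reproduction TBD | Modeling the Training Iteration Time for Heterogeneous Distributed Deep Learning Systems [69] (CCF-C journal, IJIS 2023) | | TPD A. Analytical models, Table 1, ref. [138] | Wall-clock time of one iteration | Core is deriving FLOPs/FLOPS, followed by Taylor-based empirical calibration | Use as a very basic baseline |
| | Read; focuses on transformers; example runs | [[AMPeD]]: An Analytical Model for Performance in Distributed Training of Transformers [72] (CCF-C conference, ISPASS 2023) | | TPD A. Analytical models, Table 1, ref. [75]; arxiv-sim | | Core is the GPU's MACs (Multiply-Accumulate); also needs calibration afterwards | Use as a very basic baseline |
| | A bit old; not sure whether to read | Optimus: an efficient dynamic resource scheduler for deep learning clusters [73] (CCF-A conference, 2018) | | TPD A. Analytical models, Table 1, ref. [85]; arxiv-sim | | | |
| | Weak venue; skipping | Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs [74] | | TPD A. Analytical models, Table 1, ref. [] | | | |
| | Weak venue; skipping | Performance analysis of distributed deep learning frameworks in a multi-GPU environment [75] | | TPD A. Analytical models, Table 1, ref. [] | | | |
| | Read; not open source; reproduction TBD | Predicting throughput of distributed stochastic gradient descent [76] (CCF-A journal, 2022) | | TPD A. Analytical models, Table 1, ref. [62] | | Compute time is a known input | Dataset as a supplementary reference |
| | Weak venue; skipping | SMSG: Profiling-Free Parallelism Modeling for Distributed Training of DNN [77] | | TPD A. Analytical models, Table 1, ref. [123] | | | |
| | | [[FasterMoE]]: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models [78] (CCF-A conference, PPoPP 2022) | | TPD A. Analytical models, Table 1, ref. [31] | | Pure formula-based calculation | |
| | Read; work aimed at LLMs; example runs | [[Calculon]]: A methodology and tool for high-level codesign of systems and large language models [80] (CCF-A conference, SC 2023) | | TPD A. Analytical models, Table 1, ref. [38]; arxiv-sim | | Theoretical FLOPS calculation | baseline |
| | | [[Paleo]] (good conference, ICLR 2017, but not on the CCF list; too classic, a must-read) | | TPD B. Graph-based, Table 2, ref. [89]; arxiv-sim | | | |
| | BERT | Beyond data and model parallelism for deep neural networks [93] (good conference, MLSys 2019, but not on the CCF list) | | TPD B. Graph-based, Table 2, ref. [43]; arxiv-sim | | | |
| | | Daydream: Accurately estimating the efficacy of optimizations for DNN training [95] (CCF-A conference, ATC 2020) | | TPD B. Graph-based, Table 2, ref. [146]; arxiv-sim | | | |
| | | dPRO: A generic performance diagnosis and optimization toolkit for expediting distributed DNN training [98] (good conference, MLSys 2022, but not on the CCF list) | | TPD B. Graph-based, Table 2, ref. [36] | | | |
| | LLM | DistSim: A performance model of large-scale hybrid distributed DNN training [99] (CCF-C conference, CF 2023) | | TPD B. Graph-based, Table 2, ref. [69]; arxiv-sim | | Compute time embedded as a known input | |
| | LLM | Proteus: Simulating the performance of distributed DNN training [102] (CCF-A journal, TPDS 2024) | | TPD B. Graph-based, Table 2, ref. [20] | Measured directly with a profiler | Compute time embedded as a known input | |
| | | (TAG) Expediting distributed DNN training with device topology-aware graph deployment [104] (CCF-A journal, TPDS 2023) | | TPD B. Graph-based, Table 2, ref. [140] | | Compute time embedded as a known input | |
| | | | | TPD C. Execution-driven, Table 3, ref. [] | | | |
| | | | | TPD C. Execution-driven, Table 3, ref. [] | | | |
| | | | | TPD C. Execution-driven, Table 3, ref. [] | | | |
| | | PerfEstimator | | TPD C. Execution-driven, Table 3, ref. [] | | | |
| | | DNNEmu | | TPD C. Execution-driven, Table 3, ref. [] | | | |
| | | DistIR | | TPD C. Execution-driven, Table 3, ref. [] | | | |
| | | ASTRA-Sim (both versions are CCF-C conference, ISPASS 2020/2023) | | TPD C. Execution-driven; SimAI; arxiv-sim | | | |
| | | Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces [115] (unpublished, 2023) | | arxiv-IR | | | |
| | | Distir: An intermediate representation for optimizing distributed neural networks. [106] (unpublished, 2023) | | arxiv-sim | | | |
| | | SimAI [124] | | arxiv-sim | | | |
| | | (multiverse) Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation [43] | | arxiv-sim | | Focuses on communication and simply assumes compute time is known. Uses the Chakra tool to generate compute times, which in practice overlaps heavily with PyTorch's official Kineto. We can just use Kineto (the profiler) directly. | |
| | | (Neusight) Forecasting GPU Performance for Deep Learning Training and Inference [77] (CCF-A conference, ASPLOS) | 15 | arxiv-sim | | | |
| | | ATLAHS: An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage [109] (unpublished) | 1 | arxiv-sim | | | |
| | | Echo: Simulating distributed training at scale. [35] (unpublished) | | arxiv-sim | | | |
| | | Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms [79] (CCF-A journal, TPDS 2024) | 4 | arxiv-sim | | | |
| | | vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training (CCF-A conference, MICRO 2024) | 24 | arxiv-sim | | | |
| | | LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale (weak venue) | 19 | arxiv-sim | | | |
| | | Proteus: Simulating the performance of distributed dnn training. (CCF-A journal, TPDS 2024) | 13 | arxiv-sim | | | |
| | | Llmem: Estimating gpu memory usage for fine-tuning pre-trained llms. (unpublished) | 16 | arxiv-sim | | | |
| √ | | LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference (CCF-A conference, ISCA 2024) | 45 | arxiv-sim | | | |
| | | Deepflow: A cross-stack pathfinding framework for distributed ai systems. [6] (CCF-B journal, TODAES 2024) | 20 | arxiv-sim | | | |
| √ | | (DNNMem) Estimating GPU Memory Consumption of Deep Learning Models (CCF-A conference, FSE 2020) | 186 | arxiv-sim; senior labmate | | | |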

https://suanlilog.com/2025/12/25/Dataset is all you need/

Author

zihan12ai

Published

December 25, 2025
