This is the artifact repository for evaluating the FAST'26 paper "GPU Checkpoint/Restore Made Fast and Lightweight".
bash run.sh
A results summary will be written to results.txt.
Detailed results are stored in eval/log/ with subdirectories for each experiment type.
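As a convenience, the outputs described above can be inspected with a small script. The following is a minimal sketch, not part of the artifact: the function name `summarize_outputs` is hypothetical, and it assumes it is run from the repository root after `bash run.sh` has finished. It relies only on the two locations named above (results.txt and eval/log/) and makes no assumption about the names of the experiment subdirectories.

```python
from pathlib import Path


def summarize_outputs(root="."):
    """Return (summary text or None, {experiment subdirectory: file count}).

    Hypothetical helper: only the results.txt and eval/log/ locations
    come from this README; everything else is illustrative.
    """
    root = Path(root)
    summary_path = root / "results.txt"
    log_dir = root / "eval" / "log"

    # The summary file exists only after the evaluation has been run.
    summary = summary_path.read_text() if summary_path.is_file() else None

    # Count the files stored under each experiment-type subdirectory.
    per_experiment = {}
    if log_dir.is_dir():
        for sub in sorted(p for p in log_dir.iterdir() if p.is_dir()):
            per_experiment[sub.name] = sum(1 for f in sub.rglob("*") if f.is_file())
    return summary, per_experiment


if __name__ == "__main__":
    summary, per_experiment = summarize_outputs()
    if summary is None:
        print("results.txt not found; run `bash run.sh` first")
    else:
        print(summary)
    for name, count in per_experiment.items():
        print(f"{name}: {count} log file(s)")
```

An empty per-experiment listing usually means the evaluation has not been run yet or was started from a different working directory.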
This artifact is a comprehensive evaluation framework for GPU checkpoint/restore systems. It includes:
- Source code: GCR/ - GCR implementation
- Analysis scripts: eval/analysis/ - Scripts for generating the summary and analyzing results
- Configuration and utilities:
  - run.sh - Main evaluation script
  - run_analysis.sh - Result analysis script
- Hardware:
  - 2× NVIDIA A100-40GB GPUs with NVLink and PCIe 4.0 (tested configuration)
- Software:
  - CUDA toolkit 12.6
  - PyTorch 2.7.1
  - vLLM 0.9.1
  - DeepSpeed 0.17.5
Note on framework versions: Since PhOS only supports Transformers 4.30.0, we use the same Transformers version when evaluating workloads that are supported by PhOS to ensure fair comparison. For workloads not supported by PhOS, we use newer versions of Transformers (4.53.3) to enable broader workload coverage.
  - ServerlessLLM is installed following the ServerlessLLM repository (commit id: 76d472f)
  - Kernel PTX is generated using Neutrino
  - Python library dependencies are listed in environment.yml
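Before running the evaluation, it can help to confirm that the installed Python packages match the versions listed above. The following is a minimal sketch under stated assumptions, not part of the artifact: the names `EXPECTED`, `TRANSFORMERS_OK`, `installed_version`, and `check_versions` are hypothetical, and the packages are assumed to be installed under their usual PyPI distribution names (`torch`, `vllm`, `deepspeed`, `transformers`). It does not check the CUDA toolkit, ServerlessLLM, or Neutrino.

```python
from importlib import metadata

# Versions copied from the software list in this README.
# Keys are assumed PyPI distribution names.
EXPECTED = {
    "torch": "2.7.1",
    "vllm": "0.9.1",
    "deepspeed": "0.17.5",
}
# Transformers may be either version, depending on whether the
# workload is supported by PhOS (4.30.0) or not (4.53.3).
TRANSFORMERS_OK = {"4.30.0", "4.53.3"}


def installed_version(dist):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def check_versions(lookup=installed_version):
    """Return a list of human-readable mismatch messages (empty if all match)."""
    problems = []
    for dist, want in EXPECTED.items():
        got = lookup(dist)
        # Accept local build tags such as "2.7.1+cu126".
        if got is None or not (got == want or got.startswith(want + "+")):
            problems.append(f"{dist}: expected {want}, found {got}")
    got = lookup("transformers")
    if got not in TRANSFORMERS_OK:
        allowed = " or ".join(sorted(TRANSFORMERS_OK))
        problems.append(f"transformers: expected {allowed}, found {got}")
    return problems


if __name__ == "__main__":
    issues = check_versions()
    print("\n".join(issues) if issues else "All checked package versions match.")
```

The `lookup` parameter exists only so the check can be exercised without the packages installed; in normal use it can be left at its default.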
We will release our paper after it is camera-ready.