thustorage/GCR

GPU Checkpoint/Restore Made Fast and Lightweight

This is the artifact repository for evaluating the FAST'26 paper "GPU Checkpoint/Restore Made Fast and Lightweight".

Evaluating the Artifact

Run the main evaluation script:

bash run.sh

A summary of the results is written to results.txt.

Detailed results are stored in eval/log/ with subdirectories for each experiment type.
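As a rough illustration of the output layout described above, the following sketch checks whether results.txt exists and lists the experiment subdirectories under eval/log/. The helper name `summarize_outputs` and its `repo_root` parameter are hypothetical and not part of the artifact; the only assumption is that both paths are relative to the repository root.

```python
from pathlib import Path


def summarize_outputs(repo_root="."):
    """Return (summary_exists, experiment_dirs) for an evaluated checkout.

    Hypothetical helper: assumes results.txt and eval/log/ sit under repo_root.
    """
    root = Path(repo_root)
    summary = root / "results.txt"
    log_dir = root / "eval" / "log"
    # Each experiment type is expected to have its own subdirectory.
    if log_dir.is_dir():
        experiments = sorted(p.name for p in log_dir.iterdir() if p.is_dir())
    else:
        experiments = []
    return summary.is_file(), experiments


if __name__ == "__main__":
    has_summary, experiments = summarize_outputs()
    print("results.txt present:", has_summary)
    for name in experiments:
        print("experiment log directory:", name)
```

Running it after `bash run.sh` finishes gives a quick check that the evaluation produced both the summary and the per-experiment logs.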

Artifact Overview

This artifact is a comprehensive evaluation framework for GPU checkpoint/restore systems. It includes:

  • Source code:

    • GCR/ - GCR implementation
  • Analysis scripts (in eval/analysis/):

    • Scripts for generating the results summary and analyzing results
  • Configuration and utilities:

    • run.sh - Main evaluation script
    • run_analysis.sh - Result analysis script

Environment Setup

  • Hardware:

    • 2× NVIDIA A100-40GB GPUs with NVLink and PCIe 4.0 (tested configuration)
  • Software:

    • CUDA toolkit 12.6
    • PyTorch 2.7.1
    • vLLM 0.9.1
    • DeepSpeed 0.17.5
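Before running the evaluation, it may help to confirm that the installed Python packages match the tested versions listed above. The sketch below is not part of the artifact: the helper names are hypothetical, and it assumes the packages are installed under their usual PyPI distribution names (`torch`, `vllm`, `deepspeed`). It does not check the CUDA toolkit version.

```python
from importlib import metadata

# Versions from the tested configuration; keys are assumed PyPI distribution names.
TESTED_VERSIONS = {
    "torch": "2.7.1",
    "vllm": "0.9.1",
    "deepspeed": "0.17.5",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected=TESTED_VERSIONS, lookup=installed_version):
    """Map each mismatching package name to (expected, found)."""
    mismatches = {}
    for name, want in expected.items():
        found = lookup(name)
        # Ignore local build tags such as "+cu126" when comparing.
        base = found.split("+")[0] if found else None
        if base != want:
            mismatches[name] = (want, found)
    return mismatches


if __name__ == "__main__":
    for name, (want, found) in find_mismatches().items():
        print(f"{name}: expected {want}, found {found}")
```

An empty result from `find_mismatches()` means the three packages match the tested configuration; other versions may still work but were not the ones evaluated.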

Note on framework versions: Since PhOS only supports Transformers 4.30.0, we use that same Transformers version when evaluating workloads supported by PhOS, to ensure a fair comparison. For workloads not supported by PhOS, we use a newer version of Transformers (4.53.3) to enable broader workload coverage.
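The version rule in this note can be written as a small lookup. This is only a sketch of the stated policy, not code from the artifact: the function name and the workload names in the usage example are hypothetical placeholders.

```python
PHOS_TRANSFORMERS = "4.30.0"   # the only Transformers version PhOS supports
NEWER_TRANSFORMERS = "4.53.3"  # used for workloads PhOS cannot run


def transformers_version_for(workload, phos_supported):
    """Pick the Transformers version for a workload.

    phos_supported is a set of workload names that PhOS can run
    (the names themselves are hypothetical in this sketch).
    """
    if workload in phos_supported:
        return PHOS_TRANSFORMERS
    return NEWER_TRANSFORMERS


if __name__ == "__main__":
    supported = {"workload-a"}  # placeholder name, not from the artifact
    for name in ("workload-a", "workload-b"):
        print(name, "->", transformers_version_for(name, supported))
```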

Dependencies

  • ServerlessLLM is installed by following the ServerlessLLM repository (commit id: 76d472f)
  • Kernel PTX is generated using Neutrino
  • Python library dependencies are listed in environment.yml

Citation

We will release our paper after it is camera-ready.
