
SVI

Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

Wuyang Li · Wentao Pan · Po-Chien Luan · Yang Gao · Alexandre Alahi

VITA@EPFL

Technical introduction (unofficial): AI Papers Slop (English); WeChat article (Chinese)

Watch the video
Quick Glance at the SVI Family
Watch the video
8‑minute crazy Tom & Jerry video made with SVI‑Tom
Watch the video
14‑minute videos made with SVI‑2.0 (based on Wan 2.1) and SVI‑Talk.

🚀 [26 Dec 2025 News] SVI 2.0 Pro updated for Wan 2.2

✨ SVI 2.0 Pro ComfyUI Workflows and Videos from the Community (Not us!)

Thanks to the many enthusiastic community users who keep creating and updating SVI workflows, we now have a growing collection covering different features and use cases. Please refer to the pinned issue for a summarized overview of these workflows; we will continuously update that issue to showcase more interesting and useful ones. When using them, please also check the pinned issue for up-to-date important tips, e.g.:

  • Use different seeds for different clips; this is very important!
  • Enhance prompts, reduce LightX2V usage, and use a more suitable resolution (480p) to mitigate slow motion.
  • Avoid using the SVI 1.0 workflow from this repo by mistake.

Some Community Workflow Tutorials

We really appreciate the attention from community YouTubers and Bilibili creators.

  • ❤️ Big thanks to the amazing YouTuber @AI Search for his fantastic SVI tutorial [Link]!

  • ❤️ Big thanks to the amazing YouTuber @ComfyUI Workflow Blog for the tutorial on generating 40-second, highly dynamic videos without any color degradation [Link].

  • ❤️ Big thanks to the amazing Bilibili creator @AI Aiwood for his three amazing SVI tutorials on long-shot videos ([Link]), multi-shot videos ([Link]), and video extension ([Link])!

  • ❤️ Big thanks to the amazing Bilibili creators: @AI 与AI同行1996 for his 1-min stress test of SVI without color drift, @AI绘视玩家 for his stress test of storytelling long videos, and @三当家AI for testing different Wan base model variants, as well as the videos from the amazing YouTuber @Jaevlon.

Use Cases from the Community

Here are some beautiful videos generated by creative community users (not us) using SVI 2.0 Pro workflows! Please don’t hesitate to share your SVI creations with us!

If your video quality differs significantly from the community examples below (e.g., flickering or noticeable degradation), please double-check that you are using the workflow correctly. Also, please turn on the sound for the following videos for the best experience.

teaser.mp4

Caption: Please turn on the sound first! Video credit to community creator @PT. This is an unsolicited, non-paid promotional video with sound for SVI Pro 2.0, created independently by a community user (not affiliated with us). The video is first generated with SVI, and the lip alignment is then refined using InfiniteTalk@Meituan (PS: Big thanks to the Longcat team!). The English voiceover says: “Many people ask what SVI Pro can do, it's about generating long videos without quality degradation. I love continuous camera moves and narration. Combined with amazing Wan 2.2, it’s simply an epic ride westward.”

from_community_with_music.mp4

Caption: Please turn on the sound first! Big thanks to @ ̮ (̲͡-̲̅ .̲̅ ̲̅͡- ̲). Happy New Year!

community_demo1.mp4

Big thanks to @PT.

community_demo_11.mp4

Big thanks to @邂逅2004.

community_demo3.mp4

Big thanks to @RuneGjerde.

community_demo_10.mp4

Big thanks to @XXX.

community_demo5.mp4

Big thanks to @Jaevlon.

community_demo9.mp4

Big thanks to @高姿态的浅唱.

community_demo6.mp4

Big thanks to @Aiwood.

@aiwood.mp4

Big thanks to @Aiwood.

@PT-1.mp4

Big thanks to @PT.

community_demo8.mp4

Big thanks to @wallen.

community_demo4.mp4

Big thanks to @RuneGjerde.

community_demo7.mp4

Big thanks to @CUDA out of memory.

What is our next release? Wan 2.2 Animate SVI. We found that tuning with only 1k samples is sufficient to unlock infinite-length generation for Wan 2.2 Animate, and we are now working on scaling up. The performance is far better than our original SVI-Dance based on UniAnimate-DiT.

✨ Highlight

Stable Video Infinity (SVI) is able to generate ANY-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines in ANY domain.

  • OpenSVI: Everything is open-sourced: training & evaluation scripts, datasets, and more.
  • Infinite Length: No inherent limit on video duration; generate arbitrarily long stories (see the 10‑minute “Tom and Jerry” demo).
  • Versatile: Supports diverse in-the-wild generation tasks: multi-scene short films, single‑scene animations, skeleton-/audio-conditioned generation, cartoons, and more.
  • Efficient: Only LoRA adapters are tuned, requiring very little training data, so anyone can easily make their own SVI.

📧 Contact: wuyang.li@epfl.ch

😀 SVI 1.0 ComfyUI Workflow

Official ComfyUI

We've recently discovered that some users have been using SVI workflows incorrectly. We apologize for any confusion. Please note that the SVI LoRA cannot be used directly with the original Wan 2.1 workflow; it requires modified padding settings.

Please use our official workflow: Stable-Video-Infinity/comfyui_workflow, which supports independent prompts for each video clip. Big thanks to @RuneGjerde, @Kijai, and @Taiwan1912!

Due to the significant impact of quantization and step distillation on the SVI-Film workflow, we currently only open-source the SVI-Shot workflow. Using our official workflow will generate infinite-length videos without drifting or forgetting. Below is a 3-minute interactive video demo (a distinct prompt for each 5-second video continuation):

SVI-Shot-.Interactive-3min.mp4

Some Important To-Checks

If you can’t wait for the official ComfyUI release, first try the testing versions of the Shot and Film workflows on consumer GPUs, which rely on quantization and distillation LoRAs: Here. The official (more stable) one might be updated soon. Due to model quantization, video quality may be affected (it is better to try more sampling steps than 4/8).

  • Please ensure that every video clip uses a different seed.
  • SVI-Film uses 5 motion frames (last 5 frames) for i2v, not 1.
  • SVI-Tom shares the workflow with SVI-Film, but uses 1 motion frame.
  • SVI-Shot uses 1 motion frame (last frame) and uses extra VACE-based padding (the given reference image).
  • Use the boat and cat demos for 50s generation and compare them with the reproduced ones to verify correctness.
  • SVI-Shot also supports using different text for clips. See here. Thanks @Taiwan1912!
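To make the checklist above concrete, below is a minimal, illustrative Python sketch of the clip-by-clip continuation loop. generate_clip is a hypothetical placeholder, not the actual repo or ComfyUI API; only the seed handling and motion-frame carryover follow the rules listed above.

# Illustrative sketch only; generate_clip is a placeholder for the real SVI/ComfyUI clip generation.
from typing import Any, List

def generate_clip(prompt: str, motion_frames: List[Any], reference: Any, seed: int) -> List[Any]:
    # Placeholder: swap in the actual SVI inference call here.
    return [f"frame(prompt={prompt!r}, seed={seed})" for _ in range(81)]

def generate_long_video(ref_image: Any, prompts: List[str], variant: str = "film", base_seed: int = 42) -> List[Any]:
    num_motion_frames = 5 if variant == "film" else 1        # SVI-Film: 5 motion frames; SVI-Shot/Tom: 1
    frames: List[Any] = []
    motion_frames: List[Any] = [ref_image]                   # the first clip starts from the reference image
    for clip_idx, prompt in enumerate(prompts):
        seed = base_seed + clip_idx                          # every clip uses a different seed
        clip = generate_clip(
            prompt=prompt,
            motion_frames=motion_frames,
            reference=ref_image if variant == "shot" else None,  # SVI-Shot also pads with the reference image
            seed=seed,
        )
        frames.extend(clip)
        motion_frames = clip[-num_motion_frames:]            # carry the last frame(s) into the next clip
    return frames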

Thank you for playing with SVI!

🔥 News

  • [12-26-2025] SVI-2.0 Pro released!
  • [12-07-2025] SVI-2.0 WanVideoWrapper ComfyUI workflow (native ComfyUI workflow is under deployment)
  • [12-04-2025] SVI-2.0 released, supporting both Wan 2.1 and Wan 2.2
  • [10-31-2025] Official SVI-Shot ComfyUI workflow!
  • [10-23-2025] Preview of Wan 2.2-5B-SVI and some tips for custom SVI implementation: See DevLog!
  • [10-21-2025] The error-banking strategy is optimized, further improving stability. See details in DevLog!
  • [10-13-2025] SVI is now fully open-sourced and online!

❓ Frequently Asked Questions

Bidirectional or Causal (Self-Forcing)?

Self-Forcing achieves frame-by-frame causality, whereas SVI, a hybrid version, operates with clip-by-clip causality and bidirectional attention within each clip.

Targeting film and creative content production, our SVI design mirrors a director's workflow: (1) Directors repeatedly review clips in both forward and reverse directions to ensure quality, often calling "CUT" and "AGAIN" multiple times during the creative process. SVI maintains bidirectionality within each clip to emulate this process. (2) After that, directors seamlessly connect different clips along the temporal axis with causality (and some scene-transition animation), which aligns with SVI's clip-by-clip causality. The Self-Forcing series is better suited for scenarios prioritizing real-time interaction (e.g., gaming). In contrast, SVI focuses on story content creation, requiring higher standards for both content and visual quality. Intuitively, SVI's paradigm has unique advantages in end-to-end high-quality video content creation.

Paradigm comparison

Please refer to the FAQ for more questions.

🔧 Environment Setup

We have tested the environment with an A100 80G, CUDA 12.0, and torch 2.8.0 (our reproduced environment). The following script will automatically install the older torch==2.5.0; we have also tested the lower versions torch==2.4.1 and torch==2.5.0. Feel free to let us know if you run into any issues.

conda create -n svi python=3.10 
conda activate svi

# For svi family
pip install -e .
pip install flash_attn==2.8.0.post2
# If you encounter issues with flash-attn installation, please refer to the details at https://github.com/vita-epfl/Stable-Video-Infinity/issues/3.

conda install -c conda-forge ffmpeg
conda install -c conda-forge librosa
conda install -c conda-forge libiconv
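
(Optional) As a quick sanity check after installation, a short Python snippet like the one below can confirm that torch sees the GPU and that flash-attn imports correctly; it is not part of the official setup.

# Optional environment sanity check (unofficial).
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError:
    print("flash_attn is not installed; see issue #3 for installation help.")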

📦 Model Preparation

Download Wan 2.1 I2V 14B

huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./weights/Wan2.1-I2V-14B-480P

Download SVI Family

| Model | Task | Input | Output | Hugging Face Link | Comments |
| --- | --- | --- | --- | --- | --- |
| SVI-2.0 | Single-scene (supports some transitions) | Image + Text prompt stream | Long video | 🤗 Model | Generate a consistent long video with 1 text prompt stream. |
| ALL SVI-1.0 | Infinite possibility | Image + X | X video | 🤗 Folder | Family bucket! I want to play with all! |
| SVI-Shot | Single-scene generation | Image + Text prompt | Long video | 🤗 Model | Generate a consistent long video with 1 text prompt. (Never drifted or forgot in our 20-min test) |
| SVI-Film-Opt-10212025 (Latest) | Multi-scene generation | Image + Text prompt stream | Film-style video | 🤗 Model | Generate a creative long video with 1 text prompt stream (5 seconds per text). |
| SVI-Film | Multi-scene generation | Image + Text prompt stream | Film-style video | 🤗 Model | Generate a creative long video with 1 text prompt stream (5 seconds per text). |
| SVI-Film (Transition) | Multi-scene generation | Image + Text prompt stream | Film-style video | 🤗 Model | Generate a creative long video with 1 text prompt stream. (More scene transitions due to the training data) |
| SVI-Tom&Jerry | Cartoon animation | Image | Cartoon video | 🤗 Model | Generate creative long cartoon videos with 1 text prompt stream. (Never drifted or forgot in our 20-min test) |
| SVI-Talk | Talking head | Image + Audio | Talking video | 🤗 Model | Generate long videos with audio-conditioned human speaking. (Never drifted or forgot in our 10-min test) |
| SVI-Dance | Dancing animation | Image + Skeleton | Dance video | 🤗 Model | Generate long videos with skeleton-conditioned human dancing. |

Note: If you want to play with T2V, you can directly use SVI with an image generated by any T2I model!
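
For example, a minimal sketch using the diffusers library with an SDXL checkpoint (the model choice and output path are just examples, not part of SVI) to create a starting image:

# Illustrative only: create a first frame with any T2I model, then use it as SVI's reference image.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe("a lighthouse on a stormy coast, cinematic lighting").images[0]
image.save("data/demo/my_first_frame.png")  # then point the SVI inference script at this image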

SVI-2.0

For this model, you can try the sample in 100-prompt-sample with the SVI-Shot inference script. It should generate results similar to the ones shown in our 14-min YouTube video.

# This uses the SVI-Shot inference script and workflow, supporting both 5 and 1 motion frames
huggingface-cli download vita-video-gen/svi-model version-2.0/SVI_Wan2.1-I2V-14B_lora_v2.0.safetensors --local-dir ./weights/Stable-Video-Infinity

SVI-1.0

# login with your fine-grained token
huggingface-cli login

# Option 1: Download SVI Family bucket!
huggingface-cli download vita-video-gen/svi-model --local-dir ./weights/Stable-Video-Infinity --include "version-1.0/*"

# Option 2: Download individual models
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-shot.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-film-opt-10212025.safetensors  --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-film.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-film-transitions.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-tom.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-talk.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-dance.safetensors --local-dir ./weights/Stable-Video-Infinity

Download Multitalk Cross-Attention for SVI-Talk Training/Test

# Download audio encoder
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./weights/chinese-wav2vec2-base 
huggingface-cli download TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./weights/chinese-wav2vec2-base

# Download multitalk weight
huggingface-cli download MeiGen-AI/MeiGen-MultiTalk --local-dir ./weights/MeiGen-MultiTalk

# Link Multitalk
ln -s $PWD/weights/MeiGen-MultiTalk/multitalk.safetensors weights/Wan2.1-I2V-14B-480P/

Download UniAnimate-DiT LoRA for SVI-Dance Training

huggingface-cli download ZheWang123/UniAnimate-DiT --local-dir ./weights/UniAnimate-DiT

Check Model

After downloading all the models, your weights/ directory structure should look like this:

weights/
├── Wan2.1-I2V-14B-480P/
│   ├── diffusion_pytorch_model-00001-of-00007.safetensors
│   ├── diffusion_pytorch_model-00002-of-00007.safetensors
│   ├── diffusion_pytorch_model-00003-of-00007.safetensors
│   ├── diffusion_pytorch_model-00004-of-00007.safetensors
│   ├── diffusion_pytorch_model-00005-of-00007.safetensors
│   ├── diffusion_pytorch_model-00006-of-00007.safetensors
│   ├── diffusion_pytorch_model-00007-of-00007.safetensors
│   ├── diffusion_pytorch_model.safetensors.index.json
│   ├── models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   ├── Wan2.1_VAE.pth
│   ├── multitalk.safetensors (symlink)
│   └── README.md
├── Stable-Video-Infinity/
│   ├── version-2.0/
│   │   └── SVI_Wan2.1-I2V-14B_lora_v2.0.safetensors (Improved Wan 2.1 14B SVI )
│   └── version-1.0/
│       ├── svi-shot.safetensors
│       ├── svi-film.safetensors
│       ├── svi-film-transitions.safetensors
│       ├── svi-tom.safetensors
│       ├── svi-talk.safetensors
│       └── svi-dance.safetensors
├── chinese-wav2vec2-base/ (for SVI-Talk)
│   ├── config.json
│   ├── model.safetensors
│   ├── preprocessor_config.json
│   └── README.md
├── MeiGen-MultiTalk/ (for SVI-Talk)
│   ├── diffusion_pytorch_model.safetensors.index.json
│   ├── multitalk.safetensors
│   └── README.md
└── UniAnimate-DiT/ (for SVI-Dance)
    ├── dw-ll_ucoco_384.onnx
    ├── UniAnimate-Wan2.1-14B-Lora-12000.ckpt
    ├── yolox_l.onnx
    └── README.md
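
To quickly confirm that the main checkpoints are in place, you can use an optional, unofficial snippet like the following (paths mirror the tree above; drop the entries for variants you do not need):

# Unofficial helper: check that the main checkpoints listed above exist.
from pathlib import Path

expected = [
    "weights/Wan2.1-I2V-14B-480P/Wan2.1_VAE.pth",
    "weights/Wan2.1-I2V-14B-480P/models_t5_umt5-xxl-enc-bf16.pth",
    "weights/Wan2.1-I2V-14B-480P/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth",
    "weights/Stable-Video-Infinity/version-2.0/SVI_Wan2.1-I2V-14B_lora_v2.0.safetensors",
    "weights/Stable-Video-Infinity/version-1.0/svi-shot.safetensors",
    "weights/chinese-wav2vec2-base/model.safetensors",                # SVI-Talk only
    "weights/MeiGen-MultiTalk/multitalk.safetensors",                 # SVI-Talk only
    "weights/UniAnimate-DiT/UniAnimate-Wan2.1-14B-Lora-12000.ckpt",   # SVI-Dance only
]

for path in expected:
    print(f"[{'ok' if Path(path).exists() else 'MISSING'}] {path}")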

🎮 Play with Official SVI

Inference Scripts

The following scripts use the data in data/demo for inference. You can also run inference on custom data by simply changing the data path.

# SVI-2.0
bash scripts/test/svi_2.0.sh 

# SVI-Shot
bash scripts/test/svi_shot.sh 

# SVI-Film
bash scripts/test/svi_film.sh 

# SVI-Talk
bash scripts/test/svi_talk.sh 

# SVI-Dance
bash scripts/test/svi_dance.sh 

# SVI-Tom&Jerry
bash scripts/test/svi_tom.sh 

Gradio Demo

Currently, the Gradio demo only supports SVI-Shot and SVI-Film.

bash gradio_demo.sh

🔥 Train Your Own SVI

We have prepared the toy training data in data/toy_train/. You can simply follow its data format to train SVI with your custom data. Please modify --num_nodes if you use more nodes for training. We have tested training with both 8 and 64 GPUs, where a larger batch size gave better performance.

SVI-Shot

# (Optionally) Use scripts/data_preprocess/process_mixkit.py from CausVid to pre-process data
# start training
bash scripts/train/svi_shot.sh 

SVI-Film

# (Optionally) Use scripts/data_preprocess/process_mixkit.py from CausVid to pre-process data
# start training
bash scripts/train/svi_film.sh 

SVI-Talk

# Preprocess the toy training data
python scripts/data_preprocess/prepare_video_audio.py 

# Start training
bash scripts/train/svi_talk.sh 

SVI-Dance

# Preprocess the toy training data
python scripts/data_preprocess/prepare_video_audio.py 

# Start training
bash scripts/train/svi_dance.sh 

📝 Test Your Trained SVI

Model Post-processing

# Convert .pt files to .safetensors files
# zero_to_fp32.py is automatically generated in your model directory; change $DIR_WITH_SAFETENSORS to your desired output directory
python zero_to_fp32.py . $DIR_WITH_SAFETENSORS --safe_serialization

# (Optionally) Extract and only save LoRA parameters to reduce disk space
python utils/extract_lora.py --checkpoint_dir $DIR_WITH_SAFETENSORS --output_dir $XXX
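
For reference, LoRA extraction conceptually just filters the merged state dict down to the LoRA tensors; below is a rough, illustrative sketch (not the actual utils/extract_lora.py, whose behavior may differ):

# Rough illustration only: keep the tensors whose names contain "lora" and re-save them.
from safetensors.torch import load_file, save_file

full_state = load_file("model.safetensors")   # merged checkpoint produced by zero_to_fp32.py
lora_state = {k: v for k, v in full_state.items() if "lora" in k.lower()}
save_file(lora_state, "svi_lora_only.safetensors")
print(f"kept {len(lora_state)} / {len(full_state)} tensors")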

Inference

Please modify the inference scripts in ./scripts/test/ accordingly, changing the inference samples and pointing them to your new weights.

🗃️ Datasets

You can also use our benchmark datasets, built with our Automatic Prompt Stream Engine (see Appendix A for more details), where you can find images and the associated prompt streams for specific storylines.

| Data | Use | HuggingFace Link | Comment |
| --- | --- | --- | --- |
| Consistent Video Generation | Test | 🤗 Dataset | Generate 1 long video using 1 text prompt |
| Creative Video Generation | Test | 🤗 Dataset | Generate 1 long video using 1 text prompt stream following a storyline (1 prompt per 5-sec clip) |
| Creative Video Generation (More prompts) | Test | 🤗 Dataset | Generate 1 long video using 1 text prompt stream following a storyline (1 prompt per 5-sec clip) |

The following is the training data we used for the SVI family.

| Data | Use | HuggingFace Link | Comment |
| --- | --- | --- | --- |
| Customized Datasets | Train | 🤗 Dataset | You can make your own customized datasets using this format |
| Consistent/Creative Video Generation | Train | 🤗 Dataset | MixKit dataset |
| Consistent/Creative Video Generation | Train | 🤗 Dataset | UltraVideo dataset |
| Human Talking | Train | 🤗 Dataset | 5k subset from Hallo 3 |
| Human Dancing | Train | 🤗 Dataset | TikTok |

huggingface-cli download --repo-type dataset vita-video-gen/svi-benchmark --local-dir ./data/svi-benchmark

📋 TODO List

  • Release everything about SVI 1.0
  • SVI 2.0 for Wan 2.1 and Wan 2.2
  • Wan 2.2 Animate SVI
  • Customizable video generation

🙏 Acknowledgement

We greatly appreciate the tremendous effort behind the following fantastic projects!

[1] Wan: Open and Advanced Large-Scale Video Generative Models
[2] UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer
[3] Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation

❤️ Citation

If you find our work helpful for your research, please consider citing our paper. Thank you so much!

@article{li2025stable,
  title={Stable Video Infinity: Infinite-Length Video Generation with Error Recycling},
  author={Li, Wuyang and Pan, Wentao and Luan, Po-Chien and Gao, Yang and Alahi, Alexandre},
  journal={arXiv preprint arXiv:2510.09212},
  year={2025}
}

📌 Abstract

We propose Stable Video Infinity (SVI) that is able to generate infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting (e.g., modified noise scheduler, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this hypothesis gap, SVI incorporates Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer (DiT)'s self-generated errors into supervisory prompts, thereby encouraging DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, autoregressively learning from error-injected feedback. Specifically, we (i) inject historical errors made by DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and calculate errors with residuals; (iii) dynamically bank errors into replay memory across discretized timesteps, which are resampled for new input. SVI is able to scale videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks, including consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art role.
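
For intuition only, the following heavily simplified Python sketch illustrates one error-recycling training step (error injection, one-step prediction, residual computation, and timestep-binned banking). It is not the actual training code and omits the DiT architecture, the exact flow-matching formulation, conditioning details, and distributed training.

# Heavily simplified, illustrative sketch of Error-Recycling Fine-Tuning (not the real training code).
import random
from collections import defaultdict, deque

import torch
import torch.nn.functional as F

error_bank = defaultdict(lambda: deque(maxlen=256))      # replay memory keyed by discretized timestep

def error_recycling_step(model, clean_latent, cond, t):
    """One illustrative step; `model` is assumed to predict the flow-matching velocity."""
    bin_id = int(t * 10)                                  # discretize t in [0, 1] into bins

    # (i) inject a previously banked error to simulate error-accumulated inputs
    if error_bank[bin_id] and random.random() < 0.5:
        clean_latent = clean_latent + random.choice(error_bank[bin_id])

    noise = torch.randn_like(clean_latent)
    noisy = (1 - t) * clean_latent + t * noise            # flow-matching interpolation
    pred_velocity = model(noisy, t, cond)

    # (ii) one-step approximation of the clean prediction and its residual error
    pred_clean = noisy - t * pred_velocity
    error_bank[bin_id].append((pred_clean - clean_latent).detach())   # (iii) bank the error

    target_velocity = noise - clean_latent
    return F.mse_loss(pred_velocity, target_velocity)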

SVI intro