Wuyang Li · Wentao Pan · Po-Chien Luan · Yang Gao · Alexandre Alahi
Technical introduction (unofficial): AI Papers Slop (English); WechatApp (Chinese)
- Quick Glance at the SVI Family
- 8-minute crazy Tom & Jerry video made with SVI-Tom
- 14-minute videos made with SVI-2.0 (based on Wan 2.1) and SVI-Talk
- main (this branch): SVI using the Wan 2.1 base model (both SVI 1.0 and 2.0)
- svi_wan22 branch: SVI using the Wan 2.2 base model (both SVI 2.0 and 2.0 Pro)
Thanks to the many enthusiastic community users who keep creating and updating SVI workflows, we now have a growing collection of features and use cases. Please refer to the pinned issue for a summarized overview of these workflows; we will keep updating it to showcase more interesting and useful SVI workflows. When using them, check the pinned issue for up-to-date important tips, e.g.,
- Use different seeds for different clips; this is very important!
- Enhance prompts, reduce LightX2V usage, and use a more suitable resolution (480p) to mitigate slow motion.
- Avoid using the wrong SVI 1.0 workflow in this repo.
We really appreciate the attention from community YouTubers and Bilibili creators.
- ❤️ Big thanks to the amazing YouTuber @AI Search for his fantastic SVI tutorial [Link]!
- ❤️ Big thanks to the amazing YouTuber @ComfyUI Workflow Blog for making a tutorial on generating 40-second, highly dynamic videos without any color degradation [Link].
- ❤️ Big thanks to the amazing Bilibili creator @AI Aiwood for his three amazing SVI tutorials on long-shot videos ([Link]), multi-shot videos ([Link]), and video extension ([Link])!
- ❤️ Big thanks to the amazing Bilibili creators: @AI 与AI同行1996 for his 1-minute stress test of SVI without color drift, @AI绘视玩家 for his stress test of storytelling long videos, and @三当家AI for testing different Wan base model variants, as well as the amazing YouTuber @Jaevlon for his videos.
Here are some beautiful videos generated by creative community users (not us) using SVI 2.0 Pro workflows! Please don’t hesitate to share your SVI creations with us!
If your video quality differs significantly from the community examples below (e.g., flickering or noticeable degradation), please double-check that you are using the workflow correctly. Also, please turn on the sound for the following videos for the best experience.
teaser.mp4
Caption: Please turn on the sound first! Video credit to community creator @PT. This is an unsolicited, unpaid promotional video with sound for SVI 2.0 Pro, created independently by a community user (not affiliated with us). The video was first generated with SVI, and the lip alignment was then refined using InfiniteTalk@Meituan (PS: big thanks to the LongCat team!). The English voiceover says: “Many people ask what SVI Pro can do; it's about generating long videos without quality degradation. I love continuous camera moves and narration. Combined with the amazing Wan 2.2, it’s simply an epic ride westward.”
from_community_with_music.mp4
Caption: Please turn on the sound first! Big thanks to @ ̮ (̲͡-̲̅ .̲̅ ̲̅͡- ̲). Happy New Year!
- community_demo1.mp4: Big thanks to @PT.
- community_demo_11.mp4: Big thanks to @邂逅2004.
- community_demo3.mp4: Big thanks to @RuneGjerde.
- community_demo_10.mp4: Big thanks to @XXX.
- community_demo5.mp4: Big thanks to @Jaevlon.
- community_demo9.mp4: Big thanks to @高姿态的浅唱.
- community_demo6.mp4: Big thanks to @Aiwood.
- @aiwood.mp4: Big thanks to @Aiwood.
- @PT-1.mp4: Big thanks to @PT.
- community_demo8.mp4: Big thanks to @wallen.
- community_demo4.mp4: Big thanks to @RuneGjerde.
- community_demo7.mp4: Big thanks to @CUDA out of memory.
What is our next release? Wan 2.2 Animate SVI. We found that tuning with only 1k samples is sufficient to unlock infinite-length generation for Wan 2.2 Animate, and we are trying to scale up now. The performance is far better than our original SVI-Dance based on UniAnimate-DiT.
Stable Video Infinity (SVI) is able to generate ANY-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines across ANY domain.
- OpenSVI: Everything is open-sourced: training & evaluation scripts, datasets, and more.
- Infinite Length: No inherent limit on video duration; generate arbitrarily long stories (see the 10‑minute “Tom and Jerry” demo).
- Versatile: Supports diverse in-the-wild generation tasks: multi-scene short films, single‑scene animations, skeleton-/audio-conditioned generation, cartoons, and more.
- Efficient: Only LoRA adapters are tuned, requiring very little training data: anyone can make their own SVI easily.
📧 Contact: wuyang.li@epfl.ch
We've recently discovered that some users have been using SVI workflows incorrectly. We apologize for any confusion. Please note that the SVI LoRA cannot be used directly with the original Wan 2.1 workflow; it requires modified padding settings.
Please use our official workflow: Stable-Video-Infinity/comfyui_workflow, which supports independent prompts for each video clip. Big thanks to @RuneGjerde, @Kijai, and @Taiwan1912!
Due to the significant impact of quantization and step distillation on the SVI-Film workflow, we currently only open-source the SVI-Shot workflow. Using our official workflow will generate infinite-length videos without drifting or forgetting. Below is a 3-minute interactive video demo (distinct prompts for each 5-second video continuation):
SVI-Shot-.Interactive-3min.mp4
If you can’t wait for the official ComfyUI release, try the testing versions of the Shot and Film workflows first; they run on consumer-grade GPUs thanks to quantization and distillation LoRAs: Here. The official (more stable) version may be updated soon. Due to model quantization, video quality may be affected (better to try more sampling steps than 4/8).
- Please ensure that every video clip uses a different seed.
- SVI-Film uses 5 motion frames (last 5 frames) for i2v, not 1.
- SVI-Tom shares the workflow with SVI-Film, but uses 1 motion frame.
- SVI-Shot uses 1 motion frame (the last frame) plus extra VACE-based padding (the given reference image); see the sketch after this list.
- Use the boat and cat demos for 50-second generation and compare your results with the reproduced ones to verify correctness.
- SVI-Shot also supports using different text for clips. See here. Thanks @Taiwan1912!
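To make the motion-frame carry-over above concrete, here is a minimal, hypothetical Python sketch (not the repository's actual code; `generate_clip` is a toy stand-in for one SVI denoising call): the last N frames of each generated clip condition the next clip, and every clip gets its own seed.

```python
import numpy as np

def generate_clip(cond_frames, prompt, seed, num_frames=81):
    """Stand-in for a single SVI clip generation call (toy placeholder)."""
    rng = np.random.default_rng(seed)
    # Real SVI would run the DiT conditioned on `cond_frames` and `prompt`;
    # here we just emit random frames with the same spatial shape.
    return rng.random((num_frames, *cond_frames.shape[1:]), dtype=np.float32)

def generate_long_video(first_frame, prompts, motion_frames=5, base_seed=0):
    """Clip-by-clip continuation: carry the last `motion_frames` frames forward.
    SVI-Film uses 5 motion frames, SVI-Tom/SVI-Shot use 1 (SVI-Shot also adds
    reference-image padding, omitted in this toy sketch)."""
    history = first_frame[None]             # (1, H, W, C)
    video = []
    for i, prompt in enumerate(prompts):
        cond = history[-motion_frames:]     # last N frames as conditioning
        clip = generate_clip(cond, prompt, seed=base_seed + i)  # new seed per clip!
        video.append(clip)
        history = clip                      # only the newest clip is carried on
    return np.concatenate(video, axis=0)

video = generate_long_video(np.zeros((64, 64, 3), np.float32),
                            prompts=["clip 1", "clip 2", "clip 3"])
print(video.shape)
```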
Thank you for playing with SVI!
- [12-26-2025] SVI-2.0 Pro released!
- [12-07-2025] SVI-2.0 WanVideoWrapper ComfyUI workflow (native ComfyUI workflow is under deployment)
- [12-04-2025] SVI-2.0 released, supporting both Wan 2.1 and Wan 2.2
- [10-31-2025] Official SVI-Shot ComfyUI workflow!
- [10-23-2025] Preview of Wan 2.2-5B-SVI and some tips for custom SVI implementation: See DevLog!
- [10-21-2025] The error-banking strategy is optimized, further improving stability. See details in DevLog!
- [10-13-2025] SVI is now fully open-sourced and online!
Self-Forcing achieves frame-by-frame causality, whereas SVI, a hybrid version, operates with clip-by-clip causality and bidirectional attention within each clip.
Targeting film and creative content production, our SVI design mirrors a director's workflow: (1) Directors repeatedly review clips in both forward and reverse directions to ensure quality, often calling "CUT" and "AGAIN" multiple times during the creative process. SVI maintains bidirectionality within each clip to emulate this process. (2) After that, directors seamlessly connect different clips along the temporal axis with causality (and some scene-transition animation), which aligns with SVI's clip-by-clip causality. The Self-Forcing series is better suited for scenarios prioritizing real-time interaction (e.g., gaming). In contrast, SVI focuses on story content creation, requiring higher standards for both content and visual quality. Intuitively, SVI's paradigm has unique advantages in end-to-end high-quality video content creation.
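As an illustration of the contrast above (a hedged sketch, not code from this repository), the two paradigms can be summarized with toy attention masks: Self-Forcing-style generation is frame-causal, whereas a single SVI step is fully bidirectional over the current clip and its carried-over motion frames, with causality arising only across the autoregressive clip-by-clip loop.

```python
import numpy as np

def frame_causal_mask(num_frames):
    """Self-Forcing style: each frame attends only to itself and earlier frames."""
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))

def svi_step_mask(cond_frames, clip_frames):
    """SVI style (one autoregressive step, illustrative): the current clip and its
    carried-over motion frames attend to each other bidirectionally; causality only
    exists across steps, since later clips have not been generated yet."""
    n = cond_frames + clip_frames
    return np.ones((n, n), dtype=bool)

print(frame_causal_mask(4).astype(int))   # lower-triangular: frame-by-frame causal
print(svi_step_mask(1, 4).astype(int))    # dense block: bidirectional within a clip
```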
Please refer to the FAQ for more questions.
We have tested the environment with an A100 80G, CUDA 12.0, and torch 2.8.0; this is our reproduced environment. The following script will automatically install the older version torch==2.5.0. We have also tested the lower versions torch==2.4.1 and torch==2.5.0. Feel free to let us know if you run into issues.
conda create -n svi python=3.10
conda activate svi
# For svi family
pip install -e .
pip install flash_attn==2.8.0.post2
# If you encounter issues with flash-attn installation, please refer to the details at https://github.com/vita-epfl/Stable-Video-Infinity/issues/3.
conda install -c conda-forge ffmpeg
conda install -c conda-forge librosa
conda install -c conda-forge libiconv

# Download the Wan 2.1 I2V base model
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./weights/Wan2.1-I2V-14B-480P

| Model | Task | Input | Output | Hugging Face Link | Comments |
|---|---|---|---|---|---|
| SVI-2.0 | Single-scene (supports some transitions) | Image + Text prompt stream | Long video | 🤗 Model | Generate consistent long video with 1 text prompt stream. |
| ALL SVI-1.0 | Infinite possibility | Image + X | X video | 🤗 Folder | Family bucket! I want to play with all! |
| SVI-Shot | Single-scene generation | Image + Text prompt | Long video | 🤗 Model | Generate consistent long video with 1 text prompt. (This will never drift or forget in our 20 min test) |
| SVI-Film-Opt-10212025 (Latest) | Multi-scene generation | Image + Text prompt stream | Film-style video | 🤗 Model | Generate creative long video with 1 text prompt stream (5 seconds per text). |
| SVI-Film | Multi-scene generation | Image + Text prompt stream | Film-style video | 🤗 Model | Generate creative long video with 1 text prompt stream (5 seconds per text). |
| SVI-Film (Transition) | Multi-scene generation | Image + Text prompt stream | Film-style video | 🤗 Model | Generate creative long video with 1 text prompt stream. (More scene transitions due to the training data) |
| SVI-Tom&Jerry | Cartoon animation | Image | Cartoon video | 🤗 Model | Generate creative long cartoon videos with 1 text prompt stream (This will never drift or forget in our 20 min test) |
| SVI-Talk | Talking head | Image + Audio | Talking video | 🤗 Model | Generate long videos with audio-conditioned human speaking (This will never drift or forget in our 10 min test) |
| SVI-Dance | Dancing animation | Image + Skeleton | Dance video | 🤗 Model | Generate long videos with skeleton-conditioned human dancing |
Note: If you want to play with T2V, you can directly use SVI with an image generated by any T2I model!
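For example, a first frame can be produced with any open text-to-image model and then fed to SVI as the conditioning image. The snippet below is only a hedged illustration using the diffusers library with SDXL-Turbo as one possible choice; the output path is arbitrary, and you would point your SVI inference script at it.

```python
# Hypothetical T2V-via-T2I recipe: generate the first frame with a T2I model,
# then use it as the input image for an SVI inference script.
# Assumes `pip install diffusers transformers accelerate` and a CUDA GPU.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

prompt = "a cat sailing a small wooden boat at sunset, cinematic lighting"
image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]

# Save the frame, then point the SVI test script's input-image path at it.
image.save("data/demo/t2i_first_frame.png")
```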
For this model, you can try the sample in 100-prompt-sample with the SVI-Shot inference script. It should generate results similar to the ones shown in our 14-min YouTube video.
# This uses the SVI-Shot inference script and workflow, supporting both 5 and 1 motion frames
huggingface-cli download vita-video-gen/svi-model version-2.0/SVI_Wan2.1-I2V-14B_lora_v2.0.safetensors --local-dir ./weights/Stable-Video-Infinity
# login with your fine-grained token
huggingface-cli login
# Option 1: Download SVI Family bucket!
huggingface-cli download vita-video-gen/svi-model --local-dir ./weights/Stable-Video-Infinity --include "version-1.0/*"
# Option 2: Download individual models
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-shot.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-film-opt-10212025.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-film.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-film-transitions.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-tom.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-talk.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-dance.safetensors --local-dir ./weights/Stable-Video-Infinity

# Download audio encoder
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./weights/chinese-wav2vec2-base
# Download multitalk weight
huggingface-cli download MeiGen-AI/MeiGen-MultiTalk --local-dir ./weights/MeiGen-MultiTalk
# Link Multitalk
ln -s $PWD/weights/MeiGen-MultiTalk/multitalk.safetensors weights/Wan2.1-I2V-14B-480P/

# Download UniAnimate-DiT weights (for SVI-Dance)
huggingface-cli download ZheWang123/UniAnimate-DiT --local-dir ./weights/UniAnimate-DiT

After downloading all the models, your weights/ directory structure should look like this:
weights/
├── Wan2.1-I2V-14B-480P/
│ ├── diffusion_pytorch_model-00001-of-00007.safetensors
│ ├── diffusion_pytorch_model-00002-of-00007.safetensors
│ ├── diffusion_pytorch_model-00003-of-00007.safetensors
│ ├── diffusion_pytorch_model-00004-of-00007.safetensors
│ ├── diffusion_pytorch_model-00005-of-00007.safetensors
│ ├── diffusion_pytorch_model-00006-of-00007.safetensors
│ ├── diffusion_pytorch_model-00007-of-00007.safetensors
│ ├── diffusion_pytorch_model.safetensors.index.json
│ ├── models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth
│ ├── models_t5_umt5-xxl-enc-bf16.pth
│ ├── Wan2.1_VAE.pth
│ ├── multitalk.safetensors (symlink)
│ └── README.md
├── Stable-Video-Infinity/
│ ├── version-2.0/
│ │ └── SVI_Wan2.1-I2V-14B_lora_v2.0.safetensors (Improved Wan 2.1 14B SVI )
│ └── version-1.0/
│ ├── svi-shot.safetensors
│ ├── svi-film.safetensors
│ ├── svi-film-transitions.safetensors
│ ├── svi-tom.safetensors
│ ├── svi-talk.safetensors
│ └── svi-dance.safetensors
├── chinese-wav2vec2-base/ (for SVI-Talk)
│ ├── config.json
│ ├── model.safetensors
│ ├── preprocessor_config.json
│ └── README.md
├── MeiGen-MultiTalk/ (for SVI-Talk)
│ ├── diffusion_pytorch_model.safetensors.index.json
│ ├── multitalk.safetensors
│ └── README.md
└── UniAnimate-DiT/ (for SVI-Dance)
├── dw-ll_ucoco_384.onnx
├── UniAnimate-Wan2.1-14B-Lora-12000.ckpt
├── yolox_l.onnx
└── README.md
The following scripts will use the data in data/demo for inference. You can also run inference on custom data by simply changing the data path.
# SVI-2.0
bash scripts/test/svi_2.0.sh
# SVI-Shot
bash scripts/test/svi_shot.sh
# SVI-Film
bash scripts/test/svi_film.sh
# SVI-Talk
bash scripts/test/svi_talk.sh
# SVI-Dance
bash scripts/test/svi_dance.sh
# SVI-Tom&Jerry
bash scripts/test/svi_tom.sh

Currently, the Gradio demo only supports SVI-Shot and SVI-Film.
bash gradio_demo.sh

We have prepared the toy training data in data/toy_train/. You can simply follow this data format to train SVI with your custom data.
Please modify --num_nodes if you use more nodes for training. We have tested training with both 8 and 64 GPUs, where a larger batch size gave better performance.
# SVI-Shot
# (Optionally) Use scripts/data_preprocess/process_mixkit.py from CausVid to pre-process data
# Start training
bash scripts/train/svi_shot.sh

# SVI-Film
# (Optionally) Use scripts/data_preprocess/process_mixkit.py from CausVid to pre-process data
# Start training
bash scripts/train/svi_film.sh

# SVI-Talk
# Preprocess the toy training data
python scripts/data_preprocess/prepare_video_audio.py
# Start training
bash scripts/train/svi_talk.sh

# SVI-Dance
# Preprocess the toy training data
python scripts/data_preprocess/prepare_video_audio.py
# Start training
bash scripts/train/svi_dance.sh

# Change .pt files to .safetensors files
# zero_to_fp32.py will be automatically generated in your model dir; change $DIR_WITH_SAFETENSORS to your desired directory
python zero_to_fp32.py . $DIR_WITH_SAFETENSORS --safe_serialization

# (Optionally) Extract and save only the LoRA parameters to reduce disk space
python utils/extract_lora.py --checkpoint_dir $DIR_WITH_SAFETENSORS --output_dir $XXX

Please modify the inference scripts in ./scripts/test/ accordingly by changing the inference samples and pointing them to your new weights.
You can also use our benchmark datasets built by our Automatic Prompt Stream Engine (see Appendix A for more details), where you can find images and associated prompt streams following specific storylines.
| Data | Use | HuggingFace Link | Comment |
|---|---|---|---|
| Consistent Video Generation | Test | 🤗 Dataset | Generate 1 long video using 1 text prompt |
| Creative Video Generation | Test | 🤗 Dataset | Generate 1 long video using 1 text prompt stream according to a storyline (1 prompt per 5-second clip) |
| Creative Video Generation (More prompts) | Test | 🤗 Dataset | Generate 1 long video using 1 text prompt stream according to a storyline (1 prompt per 5-second clip) |
The following is the training data we used for the SVI family.
| Data | Use | HuggingFace Link | Comment |
|---|---|---|---|
| Customized Datasets | Train | 🤗 Dataset | You can make your customized datasets using this format |
| Consistent/Creative Video Generation | Train | 🤗 Dataset | MixKit Dataset |
| Consistent/Creative Video Generation | Train | 🤗 Dataset | UltraVideo Dataset |
| Human Talking | Train | 🤗 Dataset | 5k subset from Hallo 3 |
| Human Dancing | Train | 🤗 Dataset | TikTok |
huggingface-cli download --repo-type dataset vita-video-gen/svi-benchmark --local-dir ./data/svi-benchmark

- Release everything about SVI 1.0
- SVI 2.0 for Wan 2.1 and Wan 2.2
- Wan 2.2 Animate SVI
- Customizable video generation
We greatly appreciate the tremendous effort behind the following fantastic projects!
[1] Wan: Open and Advanced Large-Scale Video Generative Models
[2] UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer
[3] Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
If you find our work helpful for your research, please consider citing our paper. Thank you so much!
@article{li2025stable,
title={Stable Video Infinity: Infinite-Length Video Generation with Error Recycling},
author={Li, Wuyang and Pan, Wentao and Luan, Po-Chien and Gao, Yang and Alahi, Alexandre},
journal={arXiv preprint arXiv:2510.09212},
year={2025}
}

We propose Stable Video Infinity (SVI), which is able to generate infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting (e.g., modified noise schedulers, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this hypothesis gap, SVI incorporates Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer (DiT)'s self-generated errors into supervisory prompts, thereby encouraging the DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, autoregressively learning from error-injected feedback. Specifically, we (i) inject historical errors made by the DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and calculate errors with residuals; (iii) dynamically bank errors into a replay memory across discretized timesteps, which are resampled for new inputs. SVI is able to scale videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks, including consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art performance.
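To make the training loop described above easier to parse, here is a heavily simplified, hypothetical sketch of Error-Recycling Fine-Tuning; the linear "model", shapes, timestep binning, and bank size are illustrative placeholders and do not reflect the repository's actual implementation.

```python
# Toy sketch of Error-Recycling Fine-Tuning (illustrative only, not the repo's code).
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 16)                   # stand-in for the (LoRA-tuned) DiT
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

num_bins = 10                                     # discretized timestep bins
error_bank = {b: [] for b in range(num_bins)}     # replay memory of past errors

for step in range(100):
    x0 = torch.randn(4, 16)                       # "clean" latent clip (toy)
    noise = torch.randn_like(x0)
    t = torch.rand(1).item()
    b = min(int(t * num_bins), num_bins - 1)

    # (i) inject a previously banked error to simulate error-accumulated inputs
    x0_in = x0
    if error_bank[b]:
        idx = torch.randint(len(error_bank[b]), (1,)).item()
        x0_in = x0 + error_bank[b][idx]

    x_t = (1 - t) * x0_in + t * noise             # flow-matching interpolation
    v_pred = model(x_t)
    loss = torch.mean((v_pred - (noise - x0)) ** 2)   # velocity target toward clean x0

    opt.zero_grad()
    loss.backward()
    opt.step()

    # (ii) one-step integration approximates the model's own clean-data prediction;
    # (iii) bank the residual error for this timestep bin (resampled above)
    with torch.no_grad():
        x0_hat = x_t - t * v_pred
        error_bank[b].append(x0_hat - x0)
        error_bank[b] = error_bank[b][-32:]       # keep the replay memory bounded

print({b: len(v) for b, v in error_bank.items()})
```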


