LTX Video
A text-to-video generation model from Lightricks, based on a DiT (Diffusion Transformer) architecture with T5-XXL text encoding and a 3D causal video VAE. Generates short video clips from text prompts.
"Northern lights dancing over a frozen lake in Iceland, green and purple aurora ribbons reflected in the ice" — ltx-video-0.9.6-distilled:bf16, 8 steps, 33 frames, seed 1234
"Underwater footage of a jellyfish pulsing through deep blue water, bioluminescent glow, particles floating" — ltx-video-0.9.6-distilled:bf16, 8 steps, 33 frames, seed 707
- Developer: Lightricks
- License: LTXV Open Weights License (custom, revenue-gated at $10M)
- HuggingFace: Lightricks/LTX-Video
Note: Video output defaults to APNG format (lossless, with embedded metadata). Also supports GIF, WebP, and MP4 via
--format. Frame count must be 8n+1 (9, 17, 25, 33, 49, ...) due to the VAE's 8x temporal compression.
Variants
| Model | Steps | Approx total pull | Notes |
|---|---|---|---|
ltx-video-0.9.6:bf16 | 40 | ~17.4 GB | Higher-quality 2B path, 30 FPS defaults |
ltx-video-0.9.6-distilled:bf16 | 8 | ~17.4 GB | Fast default single-pass path |
ltx-video-0.9.8-2b-distilled:bf16 | 7+3 | ~17.8 GB | 0.9.8 checkpoint plus spatial upscaler asset |
ltx-video-0.9.8-13b-dev:bf16 | 30 | ~38.5 GB | Highest-quality 13B multiscale dev path |
ltx-video-0.9.8-13b-distilled:bf16 | 7+3 | ~38.5 GB | Faster 13B checkpoint |
The 0.9.8 variants require the published spatial upscaler asset. mold pulls and tracks that file explicitly.
These sizes are approximate full-download totals, including the shared T5 encoder, tokenizer, VAE, and the 0.9.8 spatial upscaler where applicable.
The 0.9.8 family now runs the full two-pass multiscale refinement path. mold keeps the shared T5 assets in shared/flux/..., stores the 0.9.8 spatial upscaler under shared/LTX-Video/..., and intentionally continues using the compatible LTX-Video-0.9.5 VAE source until the newer VAE layout is ported.
Defaults
- Resolution: 1216x704
- Frames: 25
- FPS: 30
- Default model:
ltx-video-0.9.6-distilled:bf16 - Steps: 8 on
0.9.6-distilled, 40 on0.9.6, 7+3 on0.9.8distilled multiscale presets - Output format: APNG (animated PNG with metadata)
Output Formats
| Format | Flag | Quality | Metadata | Notes |
|---|---|---|---|---|
| APNG | --format apng (default) | Lossless | Yes (tEXt chunks) | Opens as .png everywhere |
| GIF | --format gif | 256 colors | No | Pipe-friendly |
| WebP | --format webp | Lossy | No | Requires webp feature |
| MP4 | --format mp4 | H.264 | No | Requires mp4 feature |
Recommended Dimensions
| Width | Height | Aspect Ratio |
|---|---|---|
| 1216 | 704 | current mold default |
| 1024 | 576 | 16:9 |
| 768 | 512 | 3:2 |
| 512 | 768 | 2:3 (portrait) |
| 512 | 512 | 1:1 (square) |
Dimensions must be multiples of 32. Frame count must be 8n+1.
Architecture
LTX Video uses a 3-stage sequential pipeline:
- T5-XXL text encoder (shared with FLUX) — encodes the prompt into 4096-dim embeddings
- LTXVideoTransformer3DModel — 28-layer DiT with 3D rotary position embeddings, self-attention + cross-attention, flow matching denoising
- 3D Causal Video VAE — decodes latents to video frames with 32x spatial and 8x temporal compression (128 latent channels)
Each component is loaded, used, then dropped to free VRAM for the next stage. The T5-XXL encoder is shared with FLUX via mold's shared component cache.
VRAM Usage
The sequential pipeline keeps peak VRAM manageable on 24GB cards for the 2B checkpoints:
- T5-XXL FP16: ~10 GB (dropped after encoding)
- Transformer BF16: model-dependent; 2B fits comfortably, 13B requires much more VRAM
- VAE: ~2.5 GB (dropped after decoding)
Example
# Fast default path
mold run ltx-video-0.9.6-distilled:bf16 "A cat walking across a sunlit windowsill" --frames 25
# Higher-quality 2B path
mold run ltx-video-0.9.6:bf16 "waves crashing on a rocky coastline at sunset" --frames 17 --steps 40
# GIF output for piping
mold run ltx-video-0.9.6-distilled:bf16 "a campfire at night" --format gif | mpv -
# 0.9.8 checkpoint family
mold run ltx-video-0.9.8-2b-distilled:bf16 "a humanoid robot walking" --frames 49If you want the safest current quality path in mold, start with ltx-video-0.9.6-distilled:bf16. If you want the newer upstream 0.9.8 checkpoint family with full multiscale refinement, try ltx-video-0.9.8-2b-distilled:bf16.
