LTX Video

A text-to-video generation model from Lightricks, based on a DiT (Diffusion Transformer) architecture with T5-XXL text encoding and a 3D causal video VAE. Generates short video clips from text prompts.

Northern lights (LTX Video 0.9.6 distilled): "Northern lights dancing over a frozen lake in Iceland, green and purple aurora ribbons reflected in the ice". ltx-video-0.9.6-distilled:bf16, 8 steps, 33 frames, seed 1234.

Jellyfish (LTX Video 0.9.6 distilled): "Underwater footage of a jellyfish pulsing through deep blue water, bioluminescent glow, particles floating". ltx-video-0.9.6-distilled:bf16, 8 steps, 33 frames, seed 707.

Note: Video output defaults to APNG format (lossless, with embedded metadata). GIF, WebP, and MP4 are also supported via --format. Frame count must be 8n+1 (9, 17, 25, 33, 41, 49, ...) due to the VAE's 8x temporal compression.
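
The 8n+1 rule from the note above is easy to check before starting a run. The following is a minimal sketch in plain shell arithmetic; `is_valid_frames` and `next_valid_frames` are illustrative helper names, not part of mold, and the minimum of 9 frames is an assumption taken from the first value listed in the note.

```bash
# Hypothetical helpers (not part of mold): check and round frame counts
# against the 8n+1 rule. The lower bound of 9 is an assumption based on
# the first value listed in the note.

# Succeeds when the frame count has the form 8n+1 with n >= 1.
is_valid_frames() {
  [ "$1" -ge 9 ] && [ $(( ($1 - 1) % 8 )) -eq 0 ]
}

# Prints the smallest valid frame count that is >= the requested one.
next_valid_frames() {
  if [ "$1" -le 9 ]; then
    echo 9
  else
    echo $(( (($1 - 1 + 7) / 8) * 8 + 1 ))
  fi
}

next_valid_frames 30   # prints 33
```

Rounding up rather than down keeps the clip at least as long as requested; trim the extra frames afterwards if the exact length matters.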

Variants

| Model | Steps | Approx. total pull | Notes |
| --- | --- | --- | --- |
| ltx-video-0.9.6:bf16 | 40 | ~17.4 GB | Higher-quality 2B path, 30 FPS defaults |
| ltx-video-0.9.6-distilled:bf16 | 8 | ~17.4 GB | Fast default single-pass path |
| ltx-video-0.9.8-2b-distilled:bf16 | 7+3 | ~17.8 GB | 0.9.8 checkpoint plus spatial upscaler asset |
| ltx-video-0.9.8-13b-dev:bf16 | 30 | ~38.5 GB | Highest-quality 13B multiscale dev path |
| ltx-video-0.9.8-13b-distilled:bf16 | 7+3 | ~38.5 GB | Faster 13B checkpoint |

The 0.9.8 variants require the published spatial upscaler asset. mold pulls and tracks that file explicitly.

These sizes are approximate full-download totals, including the shared T5 encoder, tokenizer, VAE, and the 0.9.8 spatial upscaler where applicable.

The 0.9.8 family now runs the full two-pass multiscale refinement path. mold keeps the shared T5 assets in shared/flux/..., stores the 0.9.8 spatial upscaler under shared/LTX-Video/..., and intentionally continues using the compatible LTX-Video-0.9.5 VAE source until the newer VAE layout is ported.

Defaults

  • Resolution: 1216x704
  • Frames: 25
  • FPS: 30
  • Default model: ltx-video-0.9.6-distilled:bf16
  • Steps: 8 on 0.9.6-distilled, 40 on 0.9.6, 30 on 0.9.8-13b-dev, 7+3 on the 0.9.8 distilled multiscale presets
  • Output format: APNG (animated PNG with metadata)

Output Formats

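| Format | Flag | Quality | Metadata | Notes |
| --- | --- | --- | --- | --- |
| APNG | --format apng (default) | Lossless | Yes (tEXt chunks) | Opens as .png everywhere |
| GIF | --format gif | 256 colors | No | Pipe-friendly |
| WebP | --format webp | Lossy | No | Requires webp feature |
| MP4 | --format mp4 | H.264 | No | Requires mp4 feature |

| Width | Height | Aspect Ratio |
| --- | --- | --- |
| 1216 | 704 | current mold default |
| 1024 | 576 | 16:9 |
| 768 | 512 | 3:2 |
| 512 | 768 | 2:3 (portrait) |
| 512 | 512 | 1:1 (square) |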

Dimensions must be multiples of 32. Frame count must be 8n+1.
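
To adapt an arbitrary target size to the multiple-of-32 rule, one option is to round each dimension down. This is a small sketch only; `snap32` is a hypothetical helper name, and rounding down (rather than to the nearest multiple) is a choice made here, not something mold prescribes.

```bash
# Hypothetical helper (not part of mold): round a pixel dimension down
# to a multiple of 32, never going below 32.
snap32() {
  snapped=$(( ($1 / 32) * 32 ))
  if [ "$snapped" -lt 32 ]; then
    snapped=32
  fi
  echo "$snapped"
}

snap32 1080   # prints 1056
snap32 704    # prints 704
```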

Architecture

LTX Video uses a 3-stage sequential pipeline:

  1. T5-XXL text encoder (shared with FLUX) — encodes the prompt into 4096-dim embeddings
  2. LTXVideoTransformer3DModel — 28-layer DiT with 3D rotary position embeddings, self-attention + cross-attention, flow matching denoising
  3. 3D Causal Video VAE — decodes latents to video frames with 32x spatial and 8x temporal compression (128 latent channels)

Each component is loaded, used, then dropped to free VRAM for the next stage. The T5-XXL encoder is shared with FLUX via mold's shared component cache.

VRAM Usage

The sequential pipeline keeps peak VRAM manageable on 24 GB cards for the 2B checkpoints:

  • T5-XXL FP16: ~10 GB (dropped after encoding)
  • Transformer BF16: model-dependent; 2B fits comfortably, 13B requires much more VRAM
  • VAE: ~2.5 GB (dropped after decoding)
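
The benefit of loading one stage at a time can be illustrated with back-of-the-envelope arithmetic: peak usage follows the largest stage rather than the sum of all stages. The sketch below uses the T5 and VAE figures from the list above. The transformer figure is an assumption (weights only, 2 bytes per BF16 parameter), so real usage will be higher once activations, latents, and framework overhead are included.

```bash
# Sketch: compare keeping every stage resident vs. one stage at a time.
# Values are tenths of a GB so plain integer arithmetic works.
# T5 and VAE figures come from the list above; the transformer figure is
# an ASSUMPTION: weights only, 2 bytes per BF16 parameter.
peak_estimate() {
  params_billions=$1
  t5=100                                 # ~10 GB
  vae=25                                 # ~2.5 GB
  dit=$(( params_billions * 2 * 10 ))    # 2 bytes per parameter
  sum=$(( t5 + dit + vae ))
  peak=$t5
  if [ "$dit" -gt "$peak" ]; then peak=$dit; fi
  if [ "$vae" -gt "$peak" ]; then peak=$vae; fi
  echo "sequential=$(( peak / 10 )).$(( peak % 10 ))GB all_resident=$(( sum / 10 )).$(( sum % 10 ))GB"
}

peak_estimate 2    # prints sequential=10.0GB all_resident=16.5GB
peak_estimate 13   # prints sequential=26.0GB all_resident=38.5GB
```

Under these assumptions the T5 encoder is the largest stage for the 2B checkpoints, while the 13B transformer weights alone exceed a 24 GB card.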

Example

```bash
# Fast default path
mold run ltx-video-0.9.6-distilled:bf16 "A cat walking across a sunlit windowsill" --frames 25

# Higher-quality 2B path
mold run ltx-video-0.9.6:bf16 "waves crashing on a rocky coastline at sunset" --frames 17 --steps 40

# GIF output for piping
mold run ltx-video-0.9.6-distilled:bf16 "a campfire at night" --format gif | mpv -

# 0.9.8 checkpoint family
mold run ltx-video-0.9.8-2b-distilled:bf16 "a humanoid robot walking" --frames 49
```

For the most stable path in mold today, start with ltx-video-0.9.6-distilled:bf16. To try the newer upstream 0.9.8 checkpoint family with full multiscale refinement, use ltx-video-0.9.8-2b-distilled:bf16.
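
When scripting batches of runs, it can help to validate arguments before anything is downloaded or loaded. The wrapper below is a hypothetical sketch, not part of mold: it checks the frame count and output format against the rules on this page and then prints the command instead of executing it. It uses only the `--frames` and `--format` flags shown above, and assumes they can be combined in one invocation.

```bash
# Hypothetical wrapper (not part of mold): validate the arguments, then
# print the mold command rather than running it.
mold_dry_run() {
  model=$1; prompt=$2; frames=$3; format=$4
  if [ "$frames" -lt 9 ] || [ $(( (frames - 1) % 8 )) -ne 0 ]; then
    echo "error: frames must be 8n+1, got $frames" >&2
    return 1
  fi
  case "$format" in
    apng|gif|webp|mp4) ;;
    *) echo "error: unknown format $format" >&2; return 1 ;;
  esac
  printf 'mold run %s "%s" --frames %s --format %s\n' \
    "$model" "$prompt" "$frames" "$format"
}

mold_dry_run ltx-video-0.9.6-distilled:bf16 "a campfire at night" 33 gif
```

Printing the command keeps the sketch safe to run anywhere; to execute the result, paste the printed line into a shell.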