Performance

mold performance depends mostly on three things:

  1. model family and quantization
  2. your GPU memory headroom
  3. whether offloading or CPU text encoders are in play

This page gives practical expectations, not a formal benchmark suite. Exact timings vary by GPU, driver, storage speed, and whether a model is already loaded.

Representative Starting Points

Reference hardware: an RTX 4090-class GPU, warm model cache, default resolution.

| Model | Typical Steps | Ballpark Time | Notes |
| --- | --- | --- | --- |
| flux-schnell:q8 | 4 | ~8-12s | Fastest high-quality default |
| flux-dev:q4 | 25 | ~20-40s | Better quality, slower denoising |
| z-image-turbo:q8 | 9 | ~10-20s | Strong quality/speed trade-off |
| sdxl-turbo:fp16 | 4 | ~3-8s | Very fast when you want 1024 output |
| sd15:fp16 | 25 | ~5-15s | Lightest full-featured family |

What Slows Things Down

Offloading

--offload can drop FLUX-class VRAM usage from roughly 24 GB to roughly 2-4 GB, but generation typically runs 3-5x slower.

Use it when a model otherwise would not fit. Do not use it when the model already fits comfortably in VRAM.
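
When it is not obvious whether a model fits, a free-VRAM check can make the decision explicit. This is a rough sketch: `needs_offload` is a hypothetical helper (not part of mold), and the 24000 MiB requirement simply mirrors the "roughly 24 GB" FLUX figure above.

```bash
# needs_offload is a hypothetical helper: compare free VRAM (MiB)
# against an assumed requirement and answer yes/no.
needs_offload() {
  free_mib=$1
  required_mib=$2
  if [ "$free_mib" -lt "$required_mib" ]; then
    echo yes
  else
    echo no
  fi
}

# Query free VRAM when the tools are available, then run accordingly.
if command -v nvidia-smi >/dev/null 2>&1 && command -v mold >/dev/null 2>&1; then
  free=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)
  if [ "$(needs_offload "$free" 24000)" = yes ]; then
    mold run flux-dev:bf16 "studio product photo" --offload
  else
    mold run flux-dev:bf16 "studio product photo"
  fi
fi
```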

CPU text encoders

mold may place text encoders on CPU when VRAM is tight. That reduces memory pressure, but prompt encoding takes longer.

If your GPU has headroom, --eager can improve repeat generation speed by keeping more components resident.
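
One way to apply that rule of thumb mechanically is to append --eager only when free VRAM clears a threshold. The helper name and the 16000 MiB cutoff below are illustrative assumptions, not mold defaults.

```bash
# eager_flag is a hypothetical helper: emit --eager only when free
# VRAM (MiB) clears an illustrative 16000 MiB threshold.
eager_flag() {
  if [ "$1" -ge 16000 ]; then
    echo "--eager"
  fi
}

if command -v mold >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
  free=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)
  # $(eager_flag ...) expands to nothing when VRAM is tight
  mold run flux-schnell:q8 "studio product photo" $(eager_flag "$free")
fi
```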

Cold starts

The first request for a model pays for:

  • loading model weights
  • tokenizer setup
  • loading a prompt expansion model, if one is configured

The second request is usually faster unless the model or encoder was dropped to save memory.
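
To see the cold-start penalty on your own machine, time the same command twice. `time_run` below is a small helper of ours with whole-second resolution, not a mold feature.

```bash
# time_run: print the wall-clock seconds a command took (1 s resolution).
time_run() {
  start=$(date +%s)
  "$@" >/dev/null 2>&1
  echo $(( $(date +%s) - start ))
}

if command -v mold >/dev/null 2>&1; then
  cold=$(time_run mold run flux-schnell:q8 "a product photo" --seed 42)
  warm=$(time_run mold run flux-schnell:q8 "a product photo" --seed 42)
  # warm should be noticeably lower unless components were dropped
  echo "cold: ${cold}s, warm: ${warm}s"
fi
```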

Practical Tuning

| Goal | Use this first |
| --- | --- |
| Faster iteration | flux-schnell:q8, sdxl-turbo:fp16, or sd15:fp16 |
| Lower VRAM | smaller quantization or --offload |
| Better repeat latency | keep the same model loaded; try --eager if VRAM allows |
| Faster remote workflow | keep mold serve running on the GPU host |
| Smaller startup penalty | pre-pull models with mold pull |

Example Tuning Workflow

```bash
# Start with a fast baseline
mold run flux-schnell:q8 "studio product photo"

# Move up in quality if the baseline is good enough operationally
mold run flux-dev:q6 "studio product photo"

# Only use offload when necessary
mold run flux-dev:bf16 "studio product photo" --offload
```

Benchmarking Your Own Setup

The most honest benchmark is your own prompt mix. Use fixed seeds and a warm model:

```bash
time mold run flux-schnell:q8 "a product photo" --seed 42
time mold run flux-dev:q4 "a product photo" --seed 42
```

For remote setups, also compare local CLI latency against the server’s generation_time_ms from the SSE complete event to separate network time from pure inference time.
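
A quick way to pull that number out of a captured event stream, assuming the complete event carries a JSON payload with a generation_time_ms field (the exact payload shape is an assumption here):

```bash
# extract_gen_ms: pull generation_time_ms out of captured SSE lines.
# Assumes a payload shaped like:
#   data: {"event":"complete","generation_time_ms":8421}
extract_gen_ms() {
  sed -n 's/.*"generation_time_ms":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Capture the SSE stream however your deployment exposes it, then:
# extract_gen_ms < events.log
```

Compare the extracted server-side milliseconds against the wall-clock time of the CLI call; the difference is roughly your network and client overhead.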