
perf: stateful streaming VAE decode — eliminate redundant overlap#212

Merged
Labmem-Zhouyx merged 1 commit into OpenBMB:main from KevinAHM:fix/stateful-streaming-vae
Apr 15, 2026

Conversation

@KevinAHM (Contributor) commented Apr 8, 2026

Summary

  • Streaming decode re-decoded 4 overlapping patches through the VAE each step, discarding 75% of the output
  • Replace with stateful decode that carries causal conv padding buffers between calls — one patch in, one patch out, no overlap
  • StreamingVAEDecoder caches CausalConv1d and CausalTransposeConv1d left-pad state, matching the approach used in the WebGPU ONNX VAE port
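The left-pad caching trick can be sketched in a few lines. This is a hypothetical illustration of the technique, not the repo's actual `StreamingVAEDecoder`: the class name, shapes, and `reset()` API below are assumptions. The key invariant is that feeding the sequence chunk-by-chunk, with each layer keeping its last `kernel_size - 1` input samples as the next call's left context, reproduces a single full-sequence causal pass exactly.

```python
import torch
import torch.nn as nn

class StatefulCausalConv1d(nn.Module):
    """Causal 1D conv that carries its left-pad context between calls,
    so streaming chunk-by-chunk matches one full-sequence pass."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)
        self.cache = None  # left context carried over from the previous chunk

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        if self.cache is None:
            # First chunk: zero left-padding, as in a normal causal conv.
            self.cache = x.new_zeros(x.shape[0], x.shape[1], self.pad)
        x = torch.cat([self.cache, x], dim=-1)
        # Keep the last (kernel_size - 1) samples for the next call.
        self.cache = x[..., -self.pad:].detach()
        return self.conv(x)

    def reset(self):
        self.cache = None

torch.manual_seed(0)
layer = StatefulCausalConv1d(channels=2, kernel_size=3)
full = torch.randn(1, 2, 12)

layer.reset()
streamed = torch.cat([layer(c) for c in full.split(4, dim=-1)], dim=-1)
layer.reset()
whole = layer(full)
assert torch.allclose(streamed, whole, atol=1e-6)
```

Because the cached context makes each patch's decode exact rather than approximated from overlap, this also explains why the stateful path is *more* accurate than the old 4-patch-overlap scheme, not just faster.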

Changes

  • Add StreamingVAEDecoder to audiovae/audio_vae_v2.py
  • AudioVAE.streaming_decode() context manager for clean lifecycle
  • _inference yields single-patch latents in streaming mode (was 4-patch chunks)
  • _generate and _generate_with_prompt_cache use StreamingVAEDecoder
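The context-manager lifecycle mentioned above can be sketched as follows. `DemoStreamingDecoder` is a stand-in, not the PR's real class; the point is only that cached conv state is always dropped when the stream ends, even on error, so a crashed generation cannot leak stale padding buffers into the next one.

```python
from contextlib import contextmanager

class DemoStreamingDecoder:
    """Stand-in for the PR's StreamingVAEDecoder (names hypothetical)."""
    def __init__(self):
        self.cache_cleared = False

    def decode(self, patch):
        # A real decoder would push the patch through stateful conv layers.
        return patch

    def reset(self):
        # Drop cached left-pad buffers so the next stream starts fresh.
        self.cache_cleared = True

@contextmanager
def streaming_decode():
    dec = DemoStreamingDecoder()
    try:
        yield dec
    finally:
        dec.reset()  # runs even if decoding raises mid-stream

with streaming_decode() as dec:
    decoded = [dec.decode(p) for p in [10, 20, 30]]
```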

Benchmarks

Streaming VAE decode (isolated, 35 patches):

| Method | Time | Max diff vs full decode |
| --- | --- | --- |
| 4-patch overlap (before) | 289 ms | 0.0011 |
| Stateful (after) | 148 ms | 0.0005 |

2x faster VAE decode, and more accurate (half the error of the overlap approach).

Correctness

  • Cosine similarity generate() vs generate_streaming(): 1.0000
  • Tested with voice cloning (reference audio + timbre transfer)

Test plan

  • Cosine similarity: stateful decode vs full decode
  • A/B audio comparison with voice clone (Nellie reference)
  • End-to-end generate() vs generate_streaming() match
  • Streaming output length matches non-streaming
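The two metrics used in the checks above (cosine similarity and max absolute difference against a full non-streaming decode) can be computed with a small harness like this; the function name and threshold values here are illustrative, not from the repo.

```python
import torch
import torch.nn.functional as F

def compare(streamed: torch.Tensor, full: torch.Tensor):
    """Cosine similarity and max abs diff between two decoded waveforms."""
    cos = F.cosine_similarity(streamed.flatten(), full.flatten(), dim=0).item()
    max_diff = (streamed - full).abs().max().item()
    return cos, max_diff

# Synthetic stand-ins for the two decode outputs.
torch.manual_seed(0)
full = torch.linspace(-1.0, 1.0, 1000)
streamed = full + 1e-4 * torch.randn(1000)
cos, max_diff = compare(streamed, full)
```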

🤖 Generated with Claude Code

Streaming decode previously re-decoded 4 overlapping patches through
the VAE each step, discarding 75% of the output. Replace with stateful
decode that carries causal conv padding buffers between calls — one
patch in, one patch out, no overlap.

Changes:
- Add StreamingVAEDecoder to audiovae/audio_vae_v2.py — caches
  CausalConv1d and CausalTransposeConv1d left-pad state between calls
- AudioVAE.streaming_decode() context manager for clean lifecycle
- _inference yields single-patch latents in streaming mode
- _generate and _generate_with_prompt_cache use StreamingVAEDecoder

Streaming VAE decode time (isolated): 289ms → 148ms (2x faster)
Stateful vs full decode: cosine 1.0000, max diff 0.0005
(more accurate than the previous overlap approach at max diff 0.0011)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@liuxin99 liuxin99 requested a review from Labmem-Zhouyx April 11, 2026 05:16
Labmem-Zhouyx added a commit that referenced this pull request Apr 15, 2026
…ant overlap

- StreamingVAEDecoder caches CausalConv1d/CausalTransposeConv1d left-pad
  state between calls — one patch in, one patch out, no overlap
- _inference yields single-patch latents in streaming mode
- 2x faster streaming VAE decode, more accurate (max diff 0.0005 vs 0.0011)
@Labmem-Zhouyx Labmem-Zhouyx merged commit 6620513 into OpenBMB:main Apr 15, 2026