
perf: stateful streaming VAE decode — eliminate redundant overlap#212

Merged
Labmem-Zhouyx merged 1 commit into OpenBMB:main from KevinAHM:fix/stateful-streaming-vae
Apr 15, 2026

Conversation

@KevinAHM (Contributor) commented Apr 8, 2026

Summary

  • Streaming decode re-decoded 4 overlapping patches through the VAE each step, discarding 75% of the output
  • Replace with stateful decode that carries causal conv padding buffers between calls — one patch in, one patch out, no overlap
  • StreamingVAEDecoder caches CausalConv1d and CausalTransposeConv1d left-pad state, matching the approach used in the WebGPU ONNX VAE port
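The left-pad caching trick can be sketched in a few lines. This is a hypothetical illustration of the technique, not the repo's actual `StreamingVAEDecoder`: the class name, shapes, and `reset()` API below are assumptions. The key invariant is that feeding the sequence chunk-by-chunk, with each layer keeping its last `kernel_size - 1` input samples as the next call's left context, reproduces a single full-sequence causal pass exactly.

```python
import torch
import torch.nn as nn

class StatefulCausalConv1d(nn.Module):
    """Causal 1D conv that carries its left-pad context between calls,
    so streaming chunk-by-chunk matches one full-sequence pass."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)
        self.cache = None  # left context carried over from the previous chunk

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        if self.cache is None:
            # First chunk: zero left-padding, as in a normal causal conv.
            self.cache = x.new_zeros(x.shape[0], x.shape[1], self.pad)
        x = torch.cat([self.cache, x], dim=-1)
        # Keep the last (kernel_size - 1) samples for the next call.
        self.cache = x[..., -self.pad:].detach()
        return self.conv(x)

    def reset(self):
        self.cache = None

torch.manual_seed(0)
layer = StatefulCausalConv1d(channels=2, kernel_size=3)
full = torch.randn(1, 2, 12)

layer.reset()
streamed = torch.cat([layer(c) for c in full.split(4, dim=-1)], dim=-1)
layer.reset()
whole = layer(full)
assert torch.allclose(streamed, whole, atol=1e-6)
```

Because the cached context makes each patch's decode exact rather than approximated from overlap, this also explains why the stateful path is *more* accurate than the old 4-patch-overlap scheme, not just faster.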

Changes

  • Add StreamingVAEDecoder to audiovae/audio_vae_v2.py
  • AudioVAE.streaming_decode() context manager for clean lifecycle
  • _inference yields single-patch latents in streaming mode (was 4-patch chunks)
  • _generate and _generate_with_prompt_cache use StreamingVAEDecoder
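The context-manager lifecycle mentioned above can be sketched as follows. `DemoStreamingDecoder` is a stand-in, not the PR's real class; the point is only that cached conv state is always dropped when the stream ends, even on error, so a crashed generation cannot leak stale padding buffers into the next one.

```python
from contextlib import contextmanager

class DemoStreamingDecoder:
    """Stand-in for the PR's StreamingVAEDecoder (names hypothetical)."""
    def __init__(self):
        self.cache_cleared = False

    def decode(self, patch):
        # A real decoder would push the patch through stateful conv layers.
        return patch

    def reset(self):
        # Drop cached left-pad buffers so the next stream starts fresh.
        self.cache_cleared = True

@contextmanager
def streaming_decode():
    dec = DemoStreamingDecoder()
    try:
        yield dec
    finally:
        dec.reset()  # runs even if decoding raises mid-stream

with streaming_decode() as dec:
    decoded = [dec.decode(p) for p in [10, 20, 30]]
```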

Benchmarks

Streaming VAE decode (isolated, 35 patches):

| Method | Time | Max diff vs full decode |
| --- | --- | --- |
| 4-patch overlap (before) | 289 ms | 0.0011 |
| Stateful (after) | 148 ms | 0.0005 |

2x faster VAE decode, and more accurate (half the error of the overlap approach).

Correctness

  • Cosine similarity generate() vs generate_streaming(): 1.0000
  • Tested with voice cloning (reference audio + timbre transfer)

Test plan

  • Cosine similarity: stateful decode vs full decode
  • A/B audio comparison with voice clone (Nellie reference)
  • End-to-end generate() vs generate_streaming() match
  • Streaming output length matches non-streaming
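The two metrics used in the checks above (cosine similarity and max absolute difference against a full non-streaming decode) can be computed with a small harness like this; the function name and threshold values here are illustrative, not from the repo.

```python
import torch
import torch.nn.functional as F

def compare(streamed: torch.Tensor, full: torch.Tensor):
    """Cosine similarity and max abs diff between two decoded waveforms."""
    cos = F.cosine_similarity(streamed.flatten(), full.flatten(), dim=0).item()
    max_diff = (streamed - full).abs().max().item()
    return cos, max_diff

# Synthetic stand-ins for the two decode outputs.
torch.manual_seed(0)
full = torch.linspace(-1.0, 1.0, 1000)
streamed = full + 1e-4 * torch.randn(1000)
cos, max_diff = compare(streamed, full)
```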

🤖 Generated with Claude Code

Streaming decode previously re-decoded 4 overlapping patches through
the VAE each step, discarding 75% of the output. Replace with stateful
decode that carries causal conv padding buffers between calls — one
patch in, one patch out, no overlap.

Changes:
- Add StreamingVAEDecoder to audiovae/audio_vae_v2.py — caches
  CausalConv1d and CausalTransposeConv1d left-pad state between calls
- AudioVAE.streaming_decode() context manager for clean lifecycle
- _inference yields single-patch latents in streaming mode
- _generate and _generate_with_prompt_cache use StreamingVAEDecoder

Streaming VAE decode time (isolated): 289ms → 148ms (2x faster)
Stateful vs full decode: cosine 1.0000, max diff 0.0005
(more accurate than the previous overlap approach at max diff 0.0011)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@liuxin99 liuxin99 requested a review from Labmem-Zhouyx April 11, 2026 05:16
Labmem-Zhouyx added a commit that referenced this pull request Apr 15, 2026
…ant overlap

- StreamingVAEDecoder caches CausalConv1d/CausalTransposeConv1d left-pad
  state between calls — one patch in, one patch out, no overlap
- _inference yields single-patch latents in streaming mode
- 2x faster streaming VAE decode, more accurate (max diff 0.0005 vs 0.0011)
@Labmem-Zhouyx Labmem-Zhouyx merged commit 6620513 into OpenBMB:main Apr 15, 2026