Add practical long-audio and memory notes for VibeVoice-ASR by voidful · Pull Request #393 · microsoft/VibeVoice

voidful · 2026-05-12T01:28:35Z

This PR adds practical notes for running VibeVoice-ASR on long recordings and memory-constrained GPUs.

It documents:

chunked long-audio inference with timestamp-offset stitching;
coverage, timestamp monotonicity, and repetition checks for long-form ASR;
cross-chunk speaker-label caveats;
YaRN RoPE scaling observations for single-pass long-audio robustness;
memory-oriented Hugging Face generation options such as logits_to_keep=1 and chunked prefill.
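
As a rough illustration of the first three items, the sketch below stitches per-chunk segments onto one timeline and runs simple coverage, monotonicity, and repetition checks. It is not the code from this PR: the `Segment` type, the function names, the `chunkN:` speaker prefix, and the exact definition of coverage (merged segment duration divided by audio duration) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    speaker: str
    text: str


def stitch_chunks(chunks):
    """Merge per-chunk segments into one timeline.

    `chunks` is a list of (chunk_start_seconds, segments) pairs, where each
    segment's timestamps are relative to the start of its own chunk.
    """
    stitched = []
    for index, (offset, segments) in enumerate(chunks):
        for seg in segments:
            stitched.append(Segment(
                start=seg.start + offset,
                end=seg.end + offset,
                # Speaker labels are only comparable within one chunk, so the
                # chunk index is kept in the label (hypothetical convention).
                speaker=f"chunk{index}:{seg.speaker}",
                text=seg.text,
            ))
    return stitched


def coverage(segments, audio_duration):
    """Fraction of the audio spanned by segments, counting overlaps once."""
    covered, cursor = 0.0, 0.0
    for seg in sorted(segments, key=lambda s: s.start):
        start = max(seg.start, cursor)
        if seg.end > start:
            covered += seg.end - start
            cursor = seg.end
    return covered / audio_duration


def is_monotonic(segments):
    """True if start times never go backwards and no segment ends before it starts."""
    starts_ok = all(a.start <= b.start for a, b in zip(segments, segments[1:]))
    return starts_ok and all(s.end >= s.start for s in segments)


def max_repeat_run(segments):
    """Length of the longest run of consecutive segments with identical text."""
    longest = run = 0
    previous = None
    for seg in segments:
        text = seg.text.strip().lower()
        run = run + 1 if text == previous else 1
        longest = max(longest, run)
        previous = text
    return longest


# Example: two chunks, the second starting 10 s into the recording.
chunks = [
    (0.0, [Segment(0.0, 4.0, "0", "hello"), Segment(4.0, 9.0, "1", "hi there")]),
    (10.0, [Segment(0.5, 5.0, "0", "welcome back")]),
]
merged = stitch_chunks(chunks)
print(coverage(merged, audio_duration=20.0), is_monotonic(merged), max_repeat_run(merged))
```

Keeping the chunk index in the speaker label reflects the cross-chunk caveat: speaker "0" in one chunk is not guaranteed to be the same person as speaker "0" in the next, so any merging of labels across chunks would need a separate matching step.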

Long-Audio Findings

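In an 11-item long-form stress test, applying YaRN RoPE scaling (factor=1.5) with the model's existing context length as the base (original_max=131072) improved robustness on the 90-minute e22 and TED items and removed both collapses: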

| Setting | e22 90m WER | e22 coverage | TED 90m WER | TED coverage | 11-item mean WER | Collapses |
| --- | --- | --- | --- | --- | --- | --- |
| No RoPE override | 0.5824 | 77.6% | 0.8250 | 21.8% | 0.2328 | 2 |
| YaRN, factor=1.5, original_max=131072 | 0.4859 | 82.0% | 0.3422 | 91.0% | 0.2542 | 0 |

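YaRN is documented as a robustness trade-off, not as a memory optimization: it removes the collapses on the 90-minute items, but the 11-item mean WER rises slightly (0.2328 to 0.2542).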
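
For orientation, the YaRN setting from the table can be written as a Hugging Face style `rope_scaling` dictionary. This is only a sketch: the key names follow the general `transformers` convention, and where (or whether) VibeVoice-ASR reads such an override from its config is an assumption not confirmed by this PR.

```python
# Values taken from the YaRN row of the table above.
YARN_FACTOR = 1.5
ORIGINAL_MAX_POSITIONS = 131072

# Key names follow the Hugging Face `rope_scaling` convention; how this
# override is attached to the VibeVoice-ASR config is an assumption.
rope_scaling = {
    "rope_type": "yarn",
    "factor": YARN_FACTOR,
    "original_max_position_embeddings": ORIGINAL_MAX_POSITIONS,
}

# Nominal position range after scaling: base length times the factor.
scaled_positions = int(ORIGINAL_MAX_POSITIONS * YARN_FACTOR)
print(scaled_positions)  # 196608
```

The override changes how positions are encoded, not how much memory a sequence of a given length needs, which is why it is listed separately from the memory-oriented options.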

Docs-only change; no code tests were run.

voidful added 2 commits May 12, 2026 09:22

Add practical long-audio ASR inference notes

7c5cf24

Document YaRN long-audio ASR findings

2d056c8

matteozanettii approved these changes May 12, 2026
