
Add practical long-audio and memory notes for VibeVoice-ASR #393

Open
voidful wants to merge 2 commits into microsoft:main from voidful:docs/asr-long-audio-memory-notes

voidful commented May 12, 2026

This PR adds practical notes for running VibeVoice-ASR on long recordings and memory-constrained GPUs.

It documents:

  • chunked long-audio inference with timestamp-offset stitching;
  • coverage, timestamp monotonicity, and repetition checks for long-form ASR;
  • cross-chunk speaker-label caveats;
  • YaRN RoPE scaling observations for single-pass long-audio robustness;
  • memory-oriented Hugging Face generation options such as logits_to_keep=1 and chunked prefill.
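The first two items above can be sketched roughly as follows. This is a minimal illustration, not the code documented in the PR: the `Segment` type and the helpers `plan_chunks`, `stitch`, `coverage`, `is_monotonic`, and `max_repeat_run` are hypothetical names, and it assumes each chunk's transcript comes back as segments with chunk-local start and end times in seconds. It makes no attempt to reconcile speaker labels across chunks, which is the caveat the third item refers to.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def plan_chunks(total_s, chunk_s, overlap_s=0.0):
    """Return (start, end) windows, in seconds, that cover [0, total_s]."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must exceed overlap_s")
    windows, start = [], 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start = end - overlap_s
    return windows


def stitch(chunk_results):
    """Shift chunk-local times by each chunk's start offset and merge.

    chunk_results: list of (chunk_start_s, [Segment, ...]) pairs.
    """
    merged = [
        Segment(seg.start + offset, seg.end + offset, seg.text)
        for offset, segments in chunk_results
        for seg in segments
    ]
    merged.sort(key=lambda s: s.start)
    return merged


def coverage(segments, total_s):
    """Fraction of the audio duration covered by the union of segments."""
    covered, cursor = 0.0, 0.0
    for seg in sorted(segments, key=lambda s: s.start):
        lo, hi = max(seg.start, cursor), min(seg.end, total_s)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return covered / total_s if total_s > 0 else 0.0


def is_monotonic(segments):
    """True if each segment has start <= end and starts never go backwards."""
    prev = float("-inf")
    for seg in segments:
        if seg.start > seg.end or seg.start < prev:
            return False
        prev = seg.start
    return True


def max_repeat_run(segments):
    """Length of the longest run of consecutive identical segment texts."""
    best, run, prev = 0, 0, None
    for seg in segments:
        text = seg.text.strip().lower()
        run = run + 1 if text == prev else 1
        best = max(best, run)
        prev = text
    return best
```

Low coverage or a long repeat run on a stitched transcript is a cheap signal that a chunk dropped content or looped. The thresholds for flagging either are left to the caller, since the PR text does not state any.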

Long-Audio Findings

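In an 11-item long-form stress test, YaRN with the model's existing context length as the base improved 90-minute robustness: WER fell and coverage rose on both 90-minute items shown, and no run collapsed. The 11-item mean WER rose slightly.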

| Setting | e22 90m WER | e22 coverage | TED 90m WER | TED coverage | 11-item mean WER | Collapses |
| --- | --- | --- | --- | --- | --- | --- |
| No RoPE override | 0.5824 | 77.6% | 0.8250 | 21.8% | 0.2328 | 2 |
| YaRN, factor=1.5, original_max=131072 | 0.4859 | 82.0% | 0.3422 | 91.0% | 0.2542 | 0 |
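For orientation, the YaRN row might correspond to a RoPE scaling override along these lines. The key names follow the Hugging Face Transformers convention and are an assumption here; the PR text only gives `factor=1.5` and `original_max=131072`, and does not say where the override sits in the VibeVoice-ASR config.

```python
# Assumed Hugging Face-style RoPE scaling override. The key names are not
# confirmed by the PR text; only the two numeric values come from the table.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 1.5,
    "original_max_position_embeddings": 131072,
}

# Nominal position range after scaling: factor x original context length.
scaled_positions = int(
    rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"]
)
print(scaled_positions)  # 196608
```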

YaRN is documented as a robustness trade-off, not as a memory optimization.
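By contrast, the memory effect of an option like `logits_to_keep=1` can be estimated with simple arithmetic, assuming it restricts logit computation to the last position during prefill. The sequence length and vocabulary size below are hypothetical round numbers chosen for illustration, not measurements from the PR, and `logits_bytes` is a hypothetical helper.

```python
def logits_bytes(positions, vocab_size, bytes_per_value=4):
    """Size of a logits tensor with one row of vocab_size values per position."""
    return positions * vocab_size * bytes_per_value


SEQ_LEN = 50_000   # hypothetical prefill length in tokens
VOCAB = 150_000    # hypothetical vocabulary size

full = logits_bytes(SEQ_LEN, VOCAB)  # logits for every prefill position
last = logits_bytes(1, VOCAB)        # logits for the final position only

print(f"all positions: {full / 2**30:.1f} GiB")  # all positions: 27.9 GiB
print(f"last position: {last / 2**20:.1f} MiB")  # last position: 0.6 MiB
```

The default of 4 bytes per value assumes float32 logits; pass `bytes_per_value=2` for a half-precision estimate.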

Testing

Docs-only change; no code tests were run.
