whisper: set no_context to prevent quality drift over a session #79

Merged
cjpais merged 3 commits into cjpais:main from anton-averich:fix/whisper-no-context
Apr 8, 2026

Conversation

@anton-averich
Contributor

Problem

Whisper transcription quality degrades progressively over a long push-to-talk session in Handy:

  • Short clips (one or two words dictated into a different chat) frequently get mis-recognized or returned as empty.
  • Language detection sticks to the previous language. After dictating in Russian, the next English utterance often comes back transcribed as Russian.
  • Reloading the model fully restores quality.
  • Punctuation also tends to drift in the direction of whatever was dictated earlier.

I've been hitting this for a while and was reloading Handy a couple of times a day to clear it.

Cause

whisper.cpp's whisper_full defaults to using prompt_past — the last decoded tokens are fed back as a prompt for the next decode. That is the right thing for continuous speech (lectures, meetings) where consecutive segments are connected. It is the wrong thing for push-to-talk and similar workloads where each call to transcribe is an independent utterance: residue from prior, unrelated decodes biases the next one.

  • Short clips suffer most because they have less acoustic evidence to overcome the stale prompt.
  • Language switches suffer because the prompt is in the previous language and steers detection.
  • Punctuation drift is a side-effect of the same mechanism — the model imitates the style of the (stale) prompt.

Fix

Call FullParams::set_no_context(true) in WhisperEngine::infer so each decode starts from a clean prompt. It's a one-line change.

The user-supplied initial_prompt is unaffected — it goes through a different FullParams field. I verified this with the existing test_prompt_product_names test, which still passes (it asserts that initial_prompt influences the output, and it does).

Performance

If anything, slightly cheaper — fewer prompt tokens for the decoder to process at the start of each decode. No allocations, no API change.

Compatibility

Public API unchanged. If anyone needs the old behaviour for streaming or continuous-speech use cases, happy to expose no_context as an opt-in field on WhisperInferenceParams in a follow-up.

Testing

  • cargo test --features whisper-cpp — all 3 whisper tests pass (test_jfk_transcription, test_prompt_product_names, test_timestamps).
  • Built Handy locally against this patch and used it for a full day of normal push-to-talk dictation. The drift symptoms above no longer reproduce.

Whisper transcription quality degrades progressively over a long
push-to-talk session: short clips get mis-recognized or returned
empty, and language detection sticks to the previous language
(e.g. RU→EN switches keep producing Russian). Reloading the model
restores quality.

The cause is whisper.cpp's default prompt_past behaviour — the last
decoded tokens are fed back as a prompt for the next decode. That's
the right thing for continuous speech (lectures, meetings) where
consecutive segments are connected, but the wrong thing for
push-to-talk and similar workloads where each call to transcribe
is an independent utterance: stale prompt tokens bias the next
decode. Short clips suffer most because they have less acoustic
evidence to overcome the stale prompt; language switches suffer
because the prompt is in the previous language and steers detection.

Set no_context = true so each decode starts from a clean prompt.
The user-supplied initial_prompt continues to work — it goes through
a different field and is unaffected.
@cjpais
Owner

cjpais commented Apr 8, 2026

I think we can definitely add this, but we need to add it as an option someone can change, since this is a library. We can have the default be what you suggest.

@anton-averich
Contributor Author

Makes sense, thanks for the quick reply. I'll add it as a field on WhisperInferenceParams with the default set to true and push to this branch.

Per review, make no_context an opt-in field on WhisperInferenceParams
so callers can override it for continuous-speech use cases (lectures,
meetings, streaming) where carrying prompt_past across segments
improves consistency. Default stays true — the right choice for
independent utterances such as push-to-talk dictation, which is the
case the previous commit fixed.
@anton-averich
Contributor Author

Done! Added no_context: bool to WhisperInferenceParams with true as the default, so the previous commit's behaviour carries over for anyone using ..Default::default(). Continuous-speech callers can now opt out:

WhisperInferenceParams { no_context: false, ..Default::default() }

Thanks again for the quick review! Happy to tweak naming or docs if you'd prefer something different.

@cjpais cjpais merged commit d97ae65 into cjpais:main Apr 8, 2026
4 checks passed
@cjpais
Owner

cjpais commented Apr 8, 2026

Thank you!
