Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts

This is the repo for IEEE ASRU 2025 paper Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts. You can reach the preprint at Arxiv.

Our work offers a novel approach for blending transcript syntax&semantics into training chunks. We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. We measure syntactic&semantic success by measuring NER success, punctuation and capitalization success.

A key innovation of our approach is the use of an extended context window during representation-level distillation to provide Whisper with richer textual clues. Instead of distilling knowledge solely from LLaMA’s representation of the chunk text (e.g., a 30-second segment), we incorporate a broader context from the entire transcript. Specifically, for a fixed context size of tokens, we include half the tokens preceding the chunk and half following it within the transcript, effectively expanding the semantic scope available during distillation. We experimented with various context sizes—64, 128, 256, 512, and 1024 tokens—and observed that increasing the context size significantly improved NER performance by enabling Whisper to leverage more semantic information from the surrounding text. This extended context trick proved particularly effective for handling long-tail entities, where global transcript-level understanding is critical for accurate recognition and formatting. Below figure exhibits an overview of our approach.

Data preparation

We worked on SpokenWikipedia dataset,chosen for its long-audio and rich set of entity types and formats. The data came in aligned version and multiword entities are given by the XML format, so we processed whole transcripts such that they include entity tags like this:

Though multiword entities are provided by the dataset format, still we needed to extract single-word entities , also all the entity types. The below diagram shows the outline of burying entity tags to the transcript:

This pipeline is called the ReverseNormalizer, consisting of two modules, first one is NER to tag the entities with their types; the second module is called EntityFormatter which formats entities according to their types, calculates entity boundaries and inserts the tags into the transcript text.

Results

Results are ultra interesting.

Whisper-Medium achieved a 0.38, which is reasonable given the complexity of the transcripts. Planed fine-tuned version reduced the WER to 0.26. Our distilled Whisper managed to take WER down to 0.20.

Now comes the punctuation results:

and semantic success per entity type:

Entities are evaluated according to their types. numerical types with CER and textual entities with Jaro-Winkler distance. Please find more explanations of the results in the paper.

Running the experiment

To run the experiments

Generate and persist speech features by generate_features/featurize.py
Generate LlaMa tensors by generate_llama_tensors/get_vectors_quantized.py
Run the experiment by distill/run.sh

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
distill		distill
fine_tune_llama		fine_tune_llama
generate_features		generate_features
generate_llama_tensors		generate_llama_tensors
images		images
testing_scripts		testing_scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts

Data preparation

Results

Running the experiment

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts

Data preparation

Results

Running the experiment

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages