feat: add Qwen3-ASR batch transcription engine#48

Open
andrewleech wants to merge 6 commits into cjpais:main from andrewleech:feat/qwen3-batch

Conversation

@andrewleech
Contributor

@andrewleech andrewleech commented Mar 4, 2026

Summary

Adds engines/qwen3 module implementing TranscriptionEngine for Qwen3-ASR, Alibaba's multilingual speech recognition model. Supports 0.6B and 1.7B model variants.

New qwen3 Cargo feature — follows the same pattern as existing ONNX-based engines (parakeet, moonshine, sense_voice): feature-gated, uses ort + ndarray, CPU execution.

Engine details

  • Encoder-decoder architecture with autoregressive token generation
  • Log-mel spectrogram feature extraction (80-bin, via rustfft)
  • HuggingFace BPE tokenizer
  • Language prefix stripping — the model outputs language <Name><text>, matched against known language names to find the boundary
  • Supports both FP32 and INT8 quantized models (selected via Qwen3ModelParams)
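The prefix-stripping step above can be sketched as follows (a minimal Python sketch of logic the PR implements in Rust; the language list and helper name here are illustrative, not the engine's actual identifiers):

```python
# Illustrative language list; the engine matches against its full set of
# known language names.
KNOWN_LANGUAGES = ("English", "Chinese", "German", "French", "Japanese")

def strip_language_prefix(raw: str) -> str:
    """The model emits 'language <Name><text>'; find the boundary by
    matching known language names and return only the text."""
    prefix = "language "
    if raw.startswith(prefix):
        rest = raw[len(prefix):]
        for name in KNOWN_LANGUAGES:
            if rest.startswith(name):
                return rest[len(name):]
    return raw  # no recognised prefix: pass through unchanged
```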

Pre-exported ONNX models

Export scripts and methodology: andrewleech/qwen3-asr-onnx

Resolves #30

@andrewleech
Contributor Author

Oh, I only just saw that #46 beat me by a day. I see the note about the refactor and I'm happy to rebuild this afterwards if it looks useful!

This is also very much AI-driven code; I'm more of an embedded C / MicroPython / Python developer professionally and not yet fluent in Rust.

If you're interested, I've got some other changes "ready" to push up that enable various ort GPU integrations through here into Handy, though on this model they ended up being slower than CPU on my machine with an AMD integrated GPU.

andrewleech pushed a commit to andrewleech/Handy that referenced this pull request Mar 4, 2026
Points at andrewleech/transcribe-rs feat/qwen3-batch (PR cjpais/transcribe-rs#48).
Drop this commit once qwen3 support is published to crates.io.
@cjpais
Owner

cjpais commented Mar 4, 2026

Thank you for the contribution; I largely prefer ONNX implementations. I will try to test it in the coming days and pull it in.

I also do want to bring in acceleration support, and it will come here first before Handy. That will probably come after the refactor so we have a cleaner base to work from. This PR will probably have to wait for the refactor as well, but I would guess it will be an easy port.

@cjpais
Owner

cjpais commented Mar 7, 2026

@andrewleech if you don't mind refactoring this into the new structure, that would be great. Maybe parts of it can be simplified. Let me know.

@andrewleech
Contributor Author

Thanks for the review feedback on the initial draft. Here's a summary of what changed between the original submission and this revision:

Rebased onto current main (post-PR #51 "Reorganize Library into Engines more clearly"). The engine now implements the SpeechModel trait (capabilities() + transcribe(&mut self, &[f32], &TranscribeOptions)) and uses the shared src/onnx/session.rs helpers (create_session, resolve_model_path, Quantization) and TranscribeError throughout. The old TranscriptionEngine associated-type API and Qwen3Error type are gone.

Why mel.rs is not shared with src/features/mel.rs: Qwen3-ASR requires a Slaney-normalized mel filterbank (matching Whisper's feature extractor) computed in f64. The shared mel pipeline uses HTK normalization in f32. These are incompatible at the numeric level. A MelScale enum in the shared module would be the right long-term fix; that's noted in a TODO comment in qwen3/mel.rs.
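For illustration, the Slaney-vs-HTK filterbank difference boils down to whether each triangular filter is area-normalized. A pure-Python sketch (filter edges given in FFT-bin units here for brevity; the real pipeline computes edges in Hz and, for Qwen3, in f64):

```python
def mel_triangle(center, lo, hi, n_bins, slaney=False):
    """One triangular mel filter over FFT bins.  With slaney=True the
    filter is area-normalized (peak = 2 / bandwidth), matching Whisper's
    feature extractor; without it the triangle peaks at 1.0 (HTK-style)."""
    w = []
    for k in range(n_bins):
        if lo < k <= center:
            w.append((k - lo) / (center - lo))   # rising edge
        elif center < k < hi:
            w.append((hi - k) / (hi - center))   # falling edge
        else:
            w.append(0.0)
    if slaney:
        scale = 2.0 / (hi - lo)                  # equal-area normalization
        w = [x * scale for x in w]
    return w
```

The two conventions produce systematically different filter magnitudes, which is why the outputs are incompatible at the numeric level.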

Integration tests added (tests/qwen3.rs):

  • test_qwen3_transcribe — 0.6B model on jfk.wav, asserts exact transcript
  • test_qwen3_1_7b_transcribe — same for 1.7B model
  • test_qwen3_max_tokens_truncation — verifies transcribe_with(&Qwen3Params { max_tokens: 5 }) produces a non-empty result shorter than the full transcript

All three skip gracefully when model files are not present, so CI passes without the ~1 GB model artifacts.
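The skip-if-missing pattern amounts to an up-front existence check, roughly like this (Python sketch; the actual tests are Rust, and the required file names here are assumptions):

```python
import pathlib

# Illustrative artifact names; the engine's loader defines the real set.
REQUIRED = ("encoder.onnx", "config.json")

def model_available(model_dir) -> bool:
    """Integration tests call this first and return early (pass) when
    the ~1 GB model artifacts are absent, so CI stays green."""
    d = pathlib.Path(model_dir)
    return all((d / name).is_file() for name in REQUIRED)
```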

Other changes since the initial draft:

  • MelConfig in config.rs renamed to Qwen3MelParams to avoid a name collision with crate::features::MelConfig
  • greedy_decode made private (only called from transcribe in the same impl)
  • encode: avoids a redundant clone (mel.view().into_dyn()) and uses into_dimensionality::<Ix3>() instead of manual from_shape_vec
  • SpecialTokens non-negative validation added at load time
  • log::warn! on the right-side reflect-padding fallback path in mel.rs
  • strip_language_prefix warns before returning empty string on unrecognised-language-no-newline case

@cjpais
Owner

cjpais commented Mar 10, 2026

@andrewleech can you check your onnx export? I suspect something is not right. The raw FP32 safetensors for Qwen 0.6B is 1.88GB. I had to download a file much larger than that, which uncompresses into 6GB. I think there are probably a lot of duplicated tensors in your export. I think it would make sense to not split .onnx and .data where possible too. I would look at some other ONNX exports of other models (either by the sherpa team or istupakov) for more canonical formatting

I am not impressed by the speed of the inference either, though the transcription quality is good. I bet this performance can be improved. It is 10x slower than Parakeet of the same size, which is quite surprising.

andrewleech force-pushed the feat/qwen3-batch branch 2 times, most recently from 20043f8 to 6c61294 (March 10, 2026 09:40)
@andrewleech
Contributor Author

Follow-up to the previous comment addressing the export size concern.

Root cause: The original export produced two separate ONNX files (decoder_init.onnx + decoder_step.onnx) backed by separate .data weight files. Both wrappers held references to the same PyTorch decoder parameters, but ONNX wrote each wrapper's weights independently — resulting in full duplication (~2.38 GB × 2 = 4.76 GB for decoders alone, 5.8 GB total for the 0.6B model).

Fix: Added a unified DecoderWrapper to the export script that handles both prefill (past_seq=0) and decode steps in a single ONNX graph. The attention mask is constructed as cat([zeros(q_len, past_seq), causal_triu(q_len, q_len)], dim=1) — when past_seq=0 this reduces to just the causal block, and the whole expression traces cleanly through torch.export.
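The mask construction can be sketched in plain Python (assuming the usual additive-mask convention, 0.0 = attend, -inf = masked):

```python
NEG_INF = float("-inf")

def unified_decoder_mask(q_len, past_seq):
    """cat([zeros(q_len, past_seq), causal_triu(q_len, q_len)], dim=1):
    full attention over the cached prefix, causal over the new block.
    With past_seq=0 this reduces to a plain causal mask, so one ONNX
    graph serves both prefill and per-token decode."""
    mask = []
    for i in range(q_len):
        row = [0.0] * past_seq  # cached keys: always visible
        row += [0.0 if j <= i else NEG_INF for j in range(q_len)]
        mask.append(row)
    return mask
```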

Result:

  • FP32: 5.8 GB → ~3.1 GB (encoder 717 MB + decoder 2.38 GB + embeddings + tokenizer)
  • INT8: 4.2 GB → ~1.6 GB (encoder 734 MB + decoder 569 MB + fp16 embeddings)

The Rust library now auto-detects format at load time: tries decoder.onnx first, falls back to the legacy decoder_init.onnx + decoder_step.onnx split for backward compatibility. All 3 integration tests pass (0.6B unified, 1.7B split, max-tokens truncation), and compare.py confirms exact token agreement between unified FP32, quantized INT8, and native PyTorch inference.
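The auto-detection described here reduces to a simple existence check, sketched below (Python; the real code is Rust and the function name is illustrative):

```python
import pathlib

def detect_decoder_format(model_dir):
    """Prefer the unified decoder.onnx; fall back to the legacy
    decoder_init.onnx + decoder_step.onnx split."""
    d = pathlib.Path(model_dir)
    if (d / "decoder.onnx").is_file():
        return "unified"
    if (d / "decoder_init.onnx").is_file() and (d / "decoder_step.onnx").is_file():
        return "split"
    raise FileNotFoundError(f"no decoder model found in {d}")
```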

The export tooling is at https://github.com/andrewleech/qwen3-asr-onnx — the --split-decoder flag preserves the old format if needed.

@xkcoding

Nice work @andrewleech! 👍 I closed my PR #46 (qwen-asr crate based) in favor of this — the ONNX approach with ort fits the project much better, and the unified decoder fix cutting the model size in half is impressive.

Great to see you tackled the ONNX export quality issues too. Looking forward to seeing this merged!

@cjpais
Owner

cjpais commented Mar 11, 2026

The Rust library now auto-detects format at load time: tries decoder.onnx first, falls back to the legacy decoder_init.onnx + decoder_step.onnx split for backward compatibility. All 3 integration tests pass (0.6B unified, 1.7B split, max-tokens truncation), and compare.py confirms exact token agreement between unified FP32, quantized INT8, and native PyTorch inference.

I don't think we need to support legacy stuff. It adds bloat and this is a fresh PR.

Can you please upload the files to HF?

I'm a bit skeptical right now of pulling this in to be honest, this has been a bit sloppy so far. Was there any sanity checking done?

@cjpais
Owner

cjpais commented Mar 17, 2026

Mind giving the models you have for this?

andrewleech force-pushed the feat/qwen3-batch branch 5 times, most recently from 1789c06 to 52f49e6 (March 18, 2026 12:05)
@andrewleech
Contributor Author

Update: branch rewritten onto v0.3.2

The branch has been rebased onto upstream/main (v0.3.2) and the commit history rebuilt as two clean commits.

What changed since the initial push

Adapted to upstream v0.3 API:

Model format changes:

  • Hybrid decoder: decoder_init accepts input_ids + audio_features (embedding table in graph for prefill scatter); decoder_step accepts input_embeds (Rust-side lookup from embed_tokens.bin)
  • Split decoder preferred over unified (decoder_init.onnx + decoder_step.onnx)
  • INT8/INT4 decoder variants auto-detected via suffixed filenames (e.g. decoder_init.int4.onnx)

Performance work (model.rs 550 → 417 lines):

  • Sequential ORT execution mode + CPU arena allocator for decoder sessions (create_decoder_session)
  • Zero-copy KV cache via DynValue pass-through (eliminates per-step clone)
  • Vectorized argmax over contiguous logit slice
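The greedy step reduces to an argmax over the final position's contiguous logit slice, e.g. (Python sketch of the Rust logic; function name illustrative):

```python
def argmax_next_token(logits, vocab_size):
    """Greedy decode step: argmax over the logit slice for the last
    position only.  The Rust version vectorizes this same scan over a
    contiguous slice."""
    last = logits[-vocab_size:]
    return max(range(vocab_size), key=lambda i: last[i])
```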

New:

  • Quantization::Int4 variant for MatMulNBits models (wired through all engines)
  • bench_compare example with --help, accelerator selection, quantization flags
  • Integration tests for 0.6B, 1.7B, and 1.7B-int4

Model file structure

The model directory layout follows the same encoder/decoder split pattern used by Moonshine and Canary in this repo:

encoder.onnx        # FP32 (all variants use FP32 encoder)
decoder_init.onnx   # prefill: audio features + token IDs → first KV cache
decoder_step.onnx   # per-token: input embeds + KV cache → next token
embed_tokens.bin    # FP32 embedding table [vocab_size, hidden_size]
config.json         # model dimensions, special token IDs, quantization metadata
vocab.json          # SentencePiece vocabulary

Quantized decoder variants use suffixed filenames (e.g. decoder_init.int4.onnx) alongside the FP32 originals.

The main departure from the other engines is embed_tokens.bin. Qwen3-ASR ties its embedding and lm_head weights (standard for decoder-only transformers). decoder_init needs the embedding table in-graph for the prefill scatter, but duplicating it into decoder_step would add 594 MB of redundant weights. Extracting it as a flat binary lets Rust do a single load and fast row lookups during autoregressive decoding. We tested the alternative (keeping the table in decoder_step as a shared ONNX initializer with lm_head) but the required transpose on every token step made inference 2.5× slower.
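The flat-binary lookup is cheap because a token's embedding is just a contiguous slice of the file, e.g. (Python sketch; the engine does the equivalent in Rust, and little-endian FP32 storage is assumed here):

```python
import struct

def embed_row(table: bytes, token_id: int, hidden: int):
    """Row lookup into a flat [vocab, hidden] FP32 table as stored in
    embed_tokens.bin: seek to token_id * row_size, read one row."""
    offset = token_id * hidden * 4  # 4 bytes per f32
    return list(struct.unpack_from(f"<{hidden}f", table, offset))
```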

Model refinement

The ONNX export pipeline and quantization approach went through ~90 experiments covering AWQ smoothing, GPTQ calibration, int4/int8 MatMul-only quantization, accuracy_level tuning, and encoder quantization impact. Full experiment log: https://github.com/andrewleech/qwen3-asr-onnx/blob/main/INVESTIGATION.md

Key findings:

  • accuracy_level=4 on int4 MatMulNBits improves both speed and WER
  • INT8 encoder degrades WER by ~1pp — all variants use the FP32 encoder
  • 1.7B benefits from GPTQ on decoder_init + RTN on decoder_step (GPTQ on step is too slow to calibrate for minimal gain)
  • AWQ INT8 is not recommended for 1.7B — causes degraded special token prediction (9% WER)

Recommended model variants

Two variants are published per model size — FP32 (baseline/GPU target) and int4 (recommended for CPU):

| Model | Quantization | WER | RTF | Size |
|---|---|---|---|---|
| Qwen3 0.6B | int4 (RTN al4) | 5.08% | 0.16x | ~2.6 GB |
| Qwen3 0.6B | FP32 | 4.42% | 0.40x | 3.8 GB |
| Qwen3 1.7B | int4 (GPTQ-init + RTN al4) | 4.25% | 0.37x | ~5.6 GB |
| Qwen3 1.7B | FP32 | 3.79% | | 8.8 GB |
| Parakeet 0.6B (reference) | INT8 | 5.45% | 0.16x | |

200-sample LibriSpeech test-other, CPU inference, WSL2/Linux, ORT 2.0.0-rc.12. RTF measured on 11s JFK clip. RTF < 1 = faster than real-time. Qwen3 produces full punctuation; Parakeet produces minimal.
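For reference, the real-time factor used in the table is simply:

```python
def rtf(inference_secs: float, audio_secs: float) -> float:
    """Real-time factor: processing time divided by audio duration.
    Values below 1.0 mean faster than real time."""
    return inference_secs / audio_secs
```

So 0.16x on the 11 s JFK clip corresponds to roughly 1.76 s of compute.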

The 0.6B int4 variant matches Parakeet speed with lower WER and full punctuation output.

Model downloads

Models listed above are currently being uploaded to Hugging Face:

Export pipeline and quantization tools: andrewleech/qwen3-asr-onnx

@cjpais
Owner

cjpais commented Mar 18, 2026

Thank you for doing a bunch of deep work on this. I will take a look at it soon.

I am a little confused at the int4 download size though. It's larger than the original .safetensors?

@andrewleech
Contributor Author

andrewleech commented Mar 18, 2026

Cheers, I've been using it for a couple of days now as I clean up the repos. Handy integration incoming.

In Handy (on Windows) I prefer to use PTT, Paste: Direct, Don't modify clipboard, and auto-submit Super+Enter. On a side note, I personally think these should be the defaults, though accidentally holding Ctrl down while it typed (which mucked up the text) threw me a few times.

It still doesn't feel quite as fast as Parakeet, but I do feel like the accuracy, particularly on quiet / whispered audio, is working much better.

I am a little confused at the int4 download size though. It's larger than the original .safetensors?

The original safetensors are BF16 (~1.2 GB), which I was unable to convert into an efficient 16-bit decoder format; FP32, being the more native ONNX format, initially doubled the size.

The decoder had the biggest size impact: dropping it to int4 cost the least WER. I had some success with an FP16 encoder, but it increased WER from ~5.08 to ~5.18 for a size saving from 2.5 GB down to 2.1 GB, with no change in transcription speed (beyond model load time, which scales with size). I opted to keep the slightly lower WER at a 400 MB cost, but I'm open to changing this.

So yeah, the ONNX int4 package is larger because the encoder and embedding table are kept at FP32. I made a note of this, but it's probably a bit lost in the text.

@cjpais
Owner

cjpais commented Mar 18, 2026

Sounds good, I was mostly just curious. Overall it's fine, but 2 GB is a pretty significant memory impact.

@andrewleech
Contributor Author

Yeah, it's ended up being quite a lot bigger than Parakeet, and by the numbers I'm not sure it's actually enough better to justify its size. However, for me any improvement on low-volume voice plus the improved punctuation is making me happy, and it's been an interesting learning exercise!

@cjpais
Owner

cjpais commented Mar 18, 2026

Sweet! I'm quite curious to try it; I often speak quite softly. Going to pull it down and see how it runs in transcribe-rs.

@andrewleech
Contributor Author

andrewleech commented Mar 18, 2026

The latest models haven't finished uploading yet, and I'm still cleaning up the current Handy integration branch. Estimated 2-3 hours more to upload them all.

@andrewleech
Contributor Author

The Handy branch is updated in cjpais/Handy#957, though sorry, I haven't built and re-run the latest push; it was only minor cleanups since the copy I'm running. I just need to turn in for the night and will test more tomorrow.

@cjpais
Owner

cjpais commented Mar 18, 2026

No worries, I won't get to test until tomorrow either. Just wanna confirm that whatever I download will just work out of the box. Or you'll let me know what files to download for the models.

andrewleech force-pushed the feat/qwen3-batch branch 4 times, most recently from 4853d12 to e869323 (March 22, 2026 23:53)
@andrewleech
Contributor Author

andrewleech commented Mar 23, 2026

@cjpais the two PRs for this should be in a good state for testing. I had some performance issues a few days ago that I surprisingly found came from how many CPU cores ORT was allowed to use on my machine; I started adding a feature to set/adjust this here before finally splitting it off into its own clean pair of branches/PRs.

The models are all up to date on HF as per the URLs configured in the Handy PR.

@cjpais
Owner

cjpais commented Mar 23, 2026

thanks @andrewleech I will take a closer look soon, it may take me a week or so at this point. I do have some other things I need to focus on for a bit

In regard to the CPU thread count, overall it makes sense. I think it will definitely be an option for transcribe-rs, but I am a bit hesitant to add it to Handy. I will think more on it though.

@cjpais
Owner

cjpais commented Mar 28, 2026

@andrewleech I took another look and downloaded all the files again, I am seeing the duplicate weight thing again. Can we improve this export? It would drop gigabytes from the load time which would be quite significant and right now is blocking me from shipping this. Otherwise it looks good to go

@andrewleech
Contributor Author

I am seeing the duplicate weight thing again.

Thanks for the pointers. I'd attempted to fix that previously, but in an effort to more closely match the packaging from other ONNX export teams I'd squashed the fix out again.

Regarding the two decoder-init and decoder-step weights: the int4 copies were being quantized independently into slightly different values because 1.7B was using GPTQ for init and RTN for step. I reassessed the methods, running 200-sample WER evals comparing GPTQ+RTN vs RTN-only across three independent tests. The difference is small (+0.04pp in the latest run, within run-to-run noise) and RTN-only enables weight sharing, so the tradeoff is worth it. I switched 1.7B to RTN-only, which makes the transformer layer weights byte-identical between the two decoders.

With that, the common weights are now pulled out into a shared decoder_weights.int4.data file that both decoder protos reference. The only unshared part is the lm_head (output projection) which exists in different forms between init and step — that gets inlined into the step proto (~87-171 MB depending on model size).

Also converted embed_tokens.bin to FP16 storage (cast to FP32 at lookup time): zero WER impact, and it halves the file.
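The FP16-on-disk idea is a plain load-time widening, e.g. (Python sketch using the stdlib IEEE half-precision codec; the engine side is Rust and the function name is illustrative):

```python
import struct

def load_fp16_table(raw: bytes):
    """Expand FP16-stored embedding bytes to full-precision floats at
    load time: storage halves while lookup math stays full precision."""
    n = len(raw) // 2  # 2 bytes per half-precision value
    return list(struct.unpack(f"<{n}e", raw))
```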

I investigated storing the encoder as FP16 too, using a native autocast export approach (no Cast node overhead in theory). WER was fine, but benchmarking on native Windows showed a 9-13% per-inference slowdown from the FP16/FP32 boundary Cast nodes that ORT can't fuse across, so the encoder stays FP32 for now. A possible future optimisation would be storing the encoder weights as FP16 on disk and expanding them to FP32 in the Rust loader before creating the ORT session: that would save 359 MB (0.6B) / 608 MB (1.7B) on disk with no runtime overhead, but it needs a custom loader rather than ORT's built-in file loading.

Also added ruff + mypy to the export repo: no bugs found, but ~100 lint issues cleaned up.

Net result:

| Metric | Previous | New | Change |
|---|---|---|---|
| 0.6B int4 tar.gz | 1.57 GB | 1.26 GB | -20% |
| 1.7B int4 tar.gz | 3.55 GB | 2.67 GB | -25% |
| 0.6B RTF (Windows) | 0.17x | 0.17x | unchanged |
| 1.7B RTF (Windows) | 0.37x | 0.29x | 22% faster |
| WER (0.6B / 1.7B) | 5.16% / 4.25% | 5.16% / 4.20% | -0.05pp (1.7B) |

The 1.7B speed improvement comes from dropping GPTQ; RTN-only loads and runs faster.

andrewleech force-pushed the feat/qwen3-batch branch 2 times, most recently from 5c30fb1 to 4f26ecb (March 30, 2026 22:53)
@wangwillian0

Hi, nice PR!

@pi-anl Regarding the recently added commit about the language hint, I think there is a specific template which the official Qwen3-ASR code follows: https://github.com/QwenLM/Qwen3-ASR/blob/main/qwen_asr/core/vllm_backend/qwen3_asr.py#L981-L990

<|im_start|>user\n{audio_placeholder}<|im_end|>
<|im_start|>assistant\nlanguage {full_lang_name_to}<asr_text>

@andrewleech
Contributor Author

I think there is a specific template which the official qwen3-asr code

Thanks for that, I've reworked mine to match that!

For background: I'd recently discovered an issue in this branch where, for certain recordings of various lengths, it would return just "ology.".

It was related to noisy / very low volume audio, particularly at the start, and the output was also missing the detected language tag; it's clearly just the output produced when the transcription fails in certain ways.
I found that feeding in the expected language removed the issue, and I also protect against it by filtering out any responses that don't have the language tag at the start (all valid transcriptions do).
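The tag-based filtering can be sketched as follows (Python sketch; the real guard in the PR applies additional conditions around EOS and max-token truncation, and the token ID below comes from the PR's config):

```python
ASR_TEXT_TOKEN_ID = 151704  # <asr_text> separator token ID

def guard_transcript(token_ids, decoded_text):
    """Reject degenerate outputs such as "ology.": every valid result
    contains the <asr_text> separator token, so its absence signals a
    failed transcription and we return an empty string instead."""
    if ASR_TEXT_TOKEN_ID not in token_ids:
        return ""  # drop garbage rather than passing it to the consumer
    return decoded_text
```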

pi-anl and others added 6 commits April 7, 2026 09:23
ONNX-based Qwen3-ASR speech recognition with split encoder/decoder
architecture, Whisper-compatible mel spectrogram, SentencePiece
tokenizer, and configurable quantization (FP32/FP16/INT8).

Includes decoder session infrastructure (create_decoder_session) with
sequential execution mode, CPU arena allocator, and configurable
intra-op threads for autoregressive token generation.

Performance: INT8 auto-detection, zero-copy KV cache via DynValue
pass-through, vectorized argmax, hybrid decoder with Rust-side
embed lookup. FP16 embed_tokens.bin support (dtype from config.json).

Add Quantization::Int4 to the enum and wire it through model path
resolution, bench_compare parsing, and moonshine streaming. Add
integration tests for 1.7B int4 and 0.6B int4 with FP16 embed.

- Use div_ceil() instead of manual ceiling division (mel.rs)
- Remove needless borrow (canary/decoder.rs)
- Use range contains() instead of manual comparison (moonshine/model.rs)
- Derive Default instead of manual impl (moonshine/mod.rs)

The int4 quantized decoder can produce degenerate output (e.g.
"ology.") on non-speech audio where quantization noise flips the
argmax at the first token. These outputs lack the <asr_text> separator
token that normally separates the language prefix from the
transcription.

Check for the presence of asr_text_token_id (151704) in the generated
tokens. If absent, return empty string instead of passing garbage
through to the consumer. Logs a warning with the first 20 token IDs
for diagnostic purposes.

Adds asr_text_token_id to SpecialTokens config struct with serde
default for backward compatibility with existing config.json files.

style: fix cargo fmt formatting in session.rs

When TranscribeOptions.language is set (e.g. "English"), the decoder
prompt includes "Please transcribe the above {language} audio." which
conditions the decoder toward the specified language. This eliminates
the "ology" degenerate output on non-speech audio (see OLOGY_BUG.md)
and aligns int4 output with FP32 behavior.

Language token IDs are encoded on first use via greedy longest-match
on the BPE vocabulary and cached in RAM for reuse.

Changes:
- tokenizer.rs: Add encode() with reverse vocabulary lookup
- prompt.rs: Add build_prompt_ids_with_language() with template tokens
- model.rs: Thread language_token_ids through greedy_decode
- engine.rs: Cache language tokens, pass options.language through
  instead of warning. Qwen3Params gains a language field.

fix: address review findings for language hint implementation

- Eliminate clone on cache hit: ensure_language_cached() + borrow
  from cache instead of returning owned Vec
- Unify prompt builders: build_prompt_ids delegates to
  build_prompt_ids_with_language(_, _, None), single code path
- Add BCP-47 normalization: "en" → "English" so TranscribeOptions
  language codes work correctly (14 common codes)
- Trim normalize_language_name to common codes only

simplify: remove normalize_language_name, pass language string directly

The model tokenizes whatever language string is given and includes it
in the prompt. No need to map BCP-47 codes to full names — the model
handles both "en" and "English" in the prompt context.

fix: Qwen3Params::default() max_tokens 0 bug, empty language guard

- Implement Default manually with max_tokens=512 instead of derive
  (derive produced max_tokens=0 which silently truncated output)
- Filter empty language strings to None to avoid malformed prompt
- Document that language accepts both full names and short codes

fix: address branch review findings (5 warnings, 5 infos)

- asr_text guard: only apply when EOS was seen, not on max_tokens
  truncation (fixes conflict with truncation test)
- Add asr_text_token_id >= 0 to load-time validation
- Mark tokenizer encode() as pub(crate) to prevent misuse on long text
- Use ..Default::default() in transcribe_raw instead of hardcoded 512
- Fix dangling OLOGY_BUG.md doc reference
- Fix cfg(test) function doc reference
- Add unit tests for language-conditioned prompt structure and
  None-path equivalence with standard prompt

refactor: use official Qwen3-ASR language hint template

Replace the instruction-based language hint (modified system/user turns)
with the official Qwen3-ASR template that forces the assistant prefix:
  <|im_start|>assistant\nlanguage {Name}<asr_text>

This is a "forced generation" pattern — the model skips language
detection entirely and goes straight to transcription after <asr_text>.
Matches the reference implementation in qwen_asr/core/vllm_backend.

Changes:
- prompt.rs: Remove SYSTEM_CONTENT, USER_PREFIX, USER_SUFFIX_* constants.
  Language hint now appends to assistant prefix instead of modifying
  system/user turns. System and user turns are identical with or without
  language hint.
- engine.rs: Encode " {name}" (with leading space) to match BPE
  tokenization of "language English" → [11528, 6364].
- model.rs: Skip <asr_text> guard when language is forced (the token
  is in the prompt, not in generated output).


Development

Successfully merging this pull request may close these issues.

[model request] Qwen3 ASR

5 participants