feat: add Qwen3-ASR batch transcription engine #48
andrewleech wants to merge 6 commits into cjpais:main
Conversation
Force-pushed 7eb5920 to 45edfdb
Oh, I only just saw that #46 beat me by a day. I see the note about the refactor and I'm happy to rebuild this afterwards if it looks useful! This is also very much AI-driven code; professionally I'm more of an embedded C / MicroPython / Python developer and not yet fluent in Rust. If you're interested, I've got some other changes "ready" to push up, both here and into Handy, to enable various ort GPU integrations, though on this model they ended up being slower than CPU on my AMD integrated-GPU machine.
Points at andrewleech/transcribe-rs feat/qwen3-batch (PR cjpais/transcribe-rs#48). Drop this commit once qwen3 support is published to crates.io.
Thank you for the contribution; I would largely prefer ONNX implementations, so I will try to test this in the coming days and pull it in. I also want to bring in acceleration support, and it will come here first before Handy. That will probably come after the refactor so we have a cleaner base to work from. This PR will probably have to wait for the refactor as well, but I would guess it will be an easy port.
@andrewleech if you don't mind refactoring this into the new structure, that would be great. Maybe parts of it can be simplified. Let me know.
Force-pushed 45edfdb to 55d33b8
Thanks for the review feedback on the initial draft. Here's a summary of what changed between the original submission and this revision: rebased onto current main, with integration tests added.
All three tests skip gracefully when model files are not present, so CI passes without the ~1 GB model artifacts. Other changes since the initial draft:
@andrewleech can you check your ONNX export? I suspect something is not right. The raw FP32 safetensors for Qwen 0.6B is 1.88 GB, yet I had to download a file much larger than that, which uncompresses into 6 GB. I think there are probably a lot of duplicated tensors in your export. It would also make sense not to split .onnx and .data where possible. I would look at some other ONNX exports of other models (either by the sherpa team or istupakov) for more canonical formatting. I am also not impressed by the speed of the inference, though the transcription quality is good. I bet this performance can be improved; it is 10x slower than Parakeet of the same size, which is quite surprising.
Force-pushed 20043f8 to 6c61294
Follow-up to the previous comment addressing the export size concern.
Root cause: the original export produced two separate ONNX files (decoder-init and decoder-step) that duplicated weights between them.
Fix: added a unified decoder export.
Result:
The Rust library now auto-detects the format at load time: it tries the unified decoder first, then falls back to the split pair. The export tooling is at https://github.com/andrewleech/qwen3-asr-onnx.
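A load-time auto-detection like the one described above could be sketched roughly as follows. This is a minimal sketch, assuming filenames `decoder.onnx`, `decoder-init.onnx`, and `decoder-step.onnx`; the PR's actual filenames and types may differ.

```rust
use std::path::{Path, PathBuf};

/// Which decoder export layout was found in the model directory.
#[derive(Debug, PartialEq)]
pub enum DecoderFormat {
    /// Single decoder graph with shared weights (newer unified export).
    Unified(PathBuf),
    /// Separate init/step decoder graphs (older split export).
    Split { init: PathBuf, step: PathBuf },
}

/// Probe the model directory, preferring the unified decoder and
/// falling back to the split init/step pair.
pub fn detect_decoder_format(dir: &Path) -> Option<DecoderFormat> {
    let unified = dir.join("decoder.onnx");
    if unified.exists() {
        return Some(DecoderFormat::Unified(unified));
    }
    let init = dir.join("decoder-init.onnx");
    let step = dir.join("decoder-step.onnx");
    if init.exists() && step.exists() {
        return Some(DecoderFormat::Split { init, step });
    }
    None
}
```

Keeping detection purely filesystem-based means the caller never has to record which export variant it downloaded.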
Force-pushed 6c61294 to 20c406d
Nice work @andrewleech! 👍 I closed my PR #46 (qwen-asr crate based) in favor of this; the ONNX approach with ort fits the project much better, and the unified decoder fix cutting the model size in half is impressive. Great to see you tackled the ONNX export quality issues too. Looking forward to seeing this merged!
I don't think we need to support legacy stuff; it adds bloat, and this is a fresh PR. Can you please upload the files to HF? To be honest, I'm a bit skeptical of pulling this in right now; it has been a bit sloppy so far. Was there any sanity checking done?
Mind sharing the models you have for this?
Force-pushed 1789c06 to 52f49e6
Update: branch rewritten onto v0.3.2

The branch has been rebased onto v0.3.2.

What changed since the initial push

Adapted to upstream v0.3 API:
Model format changes:
Performance work (model.rs 550 → 417 lines):
New:
Model file structure

The model directory layout follows the same encoder/decoder split pattern used by Moonshine and Canary in this repo. Quantized decoder variants use suffixed filenames (e.g. The main departure from the other engines is

Model refinement

The ONNX export pipeline and quantization approach went through ~90 experiments covering AWQ smoothing, GPTQ calibration, int4/int8 MatMul-only quantization, accuracy_level tuning, and encoder quantization impact. Full experiment log: https://github.com/andrewleech/qwen3-asr-onnx/blob/main/INVESTIGATION.md
Key findings:
Recommended model variants

Two variants are published per model size: FP32 (baseline/GPU target) and int4 (recommended for CPU):
200-sample LibriSpeech test-other, CPU inference, WSL2/Linux, ORT 2.0.0-rc.12. RTF measured on an 11 s JFK clip; RTF < 1 means faster than real-time. Qwen3 produces full punctuation; Parakeet produces minimal. The 0.6B int4 variant matches Parakeet speed with lower WER and full punctuation output.

Model downloads

The models listed above are currently being uploaded to Hugging Face:
Export pipeline and quantization tools: andrewleech/qwen3-asr-onnx
Thank you for doing a bunch of deep work on this. I will take a look at it soon. I am a little confused at the int4 download size, though; it's larger than the original .safetensors?
Force-pushed 52f49e6 to d4530fe
Cheers, I've been using it for a couple of days now as I clean up the repos. Handy integration incoming. In Handy (on Windows) I prefer to use PTT, Paste: Direct, Don't modify clipboard, and auto-submit Super+Enter. On a side note, I personally think these should be the default, though accidentally holding Ctrl down while it typed and mucked up the text threw me a few times. It doesn't feel quite as fast as Parakeet still, but I do feel like the accuracy, particularly on quiet / whispered audio, is working much better.
The original safetensors are BF16 (~1.2 GB), which I was unable to convert into an efficient 16-bit decoder format; FP32, being the more native ONNX format, doubled the size initially. The decoder had the biggest size impact, dropping to int4 with less WER impact. I had some success with an FP16 encoder, but it increased WER from ~5.08 to ~5.18 for a size saving from 2.5 GB down to 2.1 GB and no change in transcription speed (apart from the slightly slower model load due to size). I opted to keep the slightly lower WER at a 400 MB cost, but I'm open to changing this. So yes, the ONNX int4 package is larger because the encoder and embedding table are kept at FP32; I made a note of this, but it's probably a bit lost in the text.
Sounds good; I was mostly just curious. Overall it's fine, but 2 GB is a pretty significant memory impact.
Yeah, it's ended up being quite a lot bigger a model than Parakeet, and by the numbers I'm not sure it's actually enough better to justify its size. However, for me any improvement in low-volume voice plus the improved punctuation is making me happy, and it's been an interesting learning exercise!
Sweet! I'm quite curious to try it; I often speak quite softly. Going to pull it down and see how it runs in transcribe-rs.
The latest models haven't finished uploading yet, and I'm still cleaning up the current Handy integration branch. |
The Handy branch is updated in cjpais/Handy#957, though sorry, I haven't built and re-run the latest push; it was only minor cleanups since the copy I'm running. I just need to turn in for the night and will test more tomorrow.
No worries, I won't get to test until tomorrow either. Just want to confirm that whatever I download will work out of the box, or that you'll let me know which files to download for the models.
Force-pushed 4853d12 to e869323
@cjpais the two PRs for this should be in a good state for testing. I had some performance issues a few days ago that, surprisingly, turned out to come from how many CPU cores ORT was allowed to use on my machine; I started adding a feature to set/adjust this here before finally splitting it off into its own clean pair of branches/PRs. The models are all up to date on HF as per the URLs configured in the Handy PR.
Thanks @andrewleech, I will take a closer look soon; it may take me a week or so at this point, as I have some other things I need to focus on for a bit. In regards to the CPU thread count, overall it makes sense. I think it will definitely be an option for transcribe-rs, but I am a bit hesitant to add it to Handy. I will think more on it though.
@andrewleech I took another look and downloaded all the files again, and I am seeing the duplicate-weight thing again. Can we improve this export? It would drop gigabytes from the load time, which would be quite significant, and right now it is blocking me from shipping this. Otherwise it looks good to go.
Thanks for the pointers. I'd attempted to fix that previously, but then, in an effort to more closely match the packaging from other ONNX export teams, I'd squashed it out again.

Regarding the two decoder-init and decoder-step weights: the int4 copies were being quantized independently into slightly different values because 1.7B was using GPTQ for init and RTN for step. I reassessed the methods and ran 200-sample WER evals comparing GPTQ+RTN vs RTN-only across three independent tests. The difference is small (+0.04pp in the latest, within run-to-run noise) and RTN-only enables weight sharing, so the tradeoff is worth it. Switched 1.7B to RTN-only, which makes the transformer layer weights byte-identical between the two decoders. With that, the common weights are now pulled out into a shared

Also converted

I investigated storing the encoder as FP16 too, using a native autocast export approach (no Cast node overhead in theory). WER was fine, but benchmarking on native Windows showed a 9-13% per-inference slowdown from the FP16/FP32 boundary Cast nodes that ORT can't fuse across, so the encoder stays FP32 for now. A possible future optimisation would be storing the encoder weights as FP16 on disk and expanding them to FP32 in the Rust loader before creating the ORT session; that would save 359 MB (0.6B) / 608 MB (1.7B) on disk with no runtime overhead, but it needs a custom loader rather than ORT's built-in file loading.

Also added ruff + mypy to the export repo; no bugs found, but cleaned up ~100 lint issues.

Net result:
The 1.7B speed improvement is from dropping GPTQ; RTN-only loads and runs faster.
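The "store FP16 on disk, expand to FP32 in the Rust loader" idea mentioned above could be sketched as below. This is a minimal sketch with no `half` crate dependency; the function names are hypothetical and not from the PR, and the real loader would feed the widened buffer into ORT session creation.

```rust
/// Convert one IEEE 754 binary16 value to f32, covering zero, subnormal,
/// infinity, NaN, and normal cases.
pub fn f16_to_f32(h: u16) -> f32 {
    let sign = (h >> 15) as u32;
    let exp = ((h >> 10) & 0x1f) as u32;
    let frac = (h & 0x3ff) as u32;
    let bits = match (exp, frac) {
        (0, 0) => sign << 31, // signed zero
        (0, f) => {
            // subnormal half: renormalize into an f32 normal
            let mut e = 127 - 15 + 1;
            let mut f = f;
            while f & 0x400 == 0 {
                f <<= 1;
                e -= 1;
            }
            (sign << 31) | ((e as u32) << 23) | ((f & 0x3ff) << 13)
        }
        (0x1f, 0) => (sign << 31) | 0x7f80_0000, // infinity
        (0x1f, _) => (sign << 31) | 0x7fc0_0000, // NaN
        (e, f) => (sign << 31) | ((e + 127 - 15) << 23) | (f << 13),
    };
    f32::from_bits(bits)
}

/// Expand little-endian FP16 bytes (as read from disk) into f32 values
/// before handing the tensor to the ORT session.
pub fn f16_bytes_to_f32(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(2)
        .map(|b| f16_to_f32(u16::from_le_bytes([b[0], b[1]])))
        .collect()
}
```

The trade is exactly as described: half the disk footprint for the encoder weights, at the cost of a one-time widening pass during model load.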
Force-pushed 5c30fb1 to 4f26ecb
|
Hi, nice PR! @pi-anl Regarding the recently added commit about the language hint: I think there is a specific template which the official Qwen3-ASR code follows: https://github.com/QwenLM/Qwen3-ASR/blob/main/qwen_asr/core/vllm_backend/qwen3_asr.py#L981-L990
Force-pushed b850cc7 to de179e2
Thanks for that, I've reworked mine to match it! For background, I'd recently discovered an issue in this branch where, for certain recordings of various lengths, it would return "ology." It was related to noisy / very low volume audio, particularly at the start, and the output was also missing the detected language tag; it's clearly just the output produced when the transcription failed in certain ways.
ONNX-based Qwen3-ASR speech recognition with split encoder/decoder architecture, Whisper-compatible mel spectrogram, SentencePiece tokenizer, and configurable quantization (FP32/FP16/INT8). Includes decoder session infrastructure (create_decoder_session) with sequential execution mode, CPU arena allocator, and configurable intra-op threads for autoregressive token generation. Performance: INT8 auto-detection, zero-copy KV cache via DynValue pass-through, vectorized argmax, hybrid decoder with Rust-side embed lookup. FP16 embed_tokens.bin support (dtype from config.json).
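The commit above mentions a vectorized argmax in the decode loop. As a rough illustration (the function name is hypothetical, not from the PR), the idiomatic iterator form below is the baseline such an optimization starts from; LLVM can often auto-vectorize it, and the hand-vectorized version would replace the scalar comparison loop with SIMD lane-wise maxima.

```rust
/// Index of the largest logit, i.e. the greedy next-token choice.
/// Uses total_cmp so NaN logits cannot panic the comparison.
pub fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```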
Add Quantization::Int4 to the enum and wire it through model path resolution, bench_compare parsing, and moonshine streaming. Add integration tests for 1.7B int4 and 0.6B int4 with FP16 embed.
- Use div_ceil() instead of manual ceiling division (mel.rs)
- Remove needless borrow (canary/decoder.rs)
- Use range contains() instead of manual comparison (moonshine/model.rs)
- Derive Default instead of manual impl (moonshine/mod.rs)
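The first cleanup in that list can be illustrated as follows. Both functions compute the same ceiling division (e.g. frame count from sample count and hop size); the names are hypothetical, standing in for the mel.rs code.

```rust
/// Manual ceiling division, the pattern clippy flags.
pub fn frames_manual(samples: usize, hop: usize) -> usize {
    (samples + hop - 1) / hop
}

/// The idiomatic replacement using the std method (stable since Rust 1.73).
pub fn frames_div_ceil(samples: usize, hop: usize) -> usize {
    samples.div_ceil(hop)
}
```

Besides readability, `div_ceil` avoids the overflow that `samples + hop - 1` can hit near `usize::MAX`.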
The int4 quantized decoder can produce degenerate output (e.g. "ology.") on non-speech audio, where quantization noise flips the argmax at the first token. These outputs lack the <asr_text> separator token that normally separates the language prefix from the transcription.
Check for the presence of asr_text_token_id (151704) in the generated tokens. If absent, return an empty string instead of passing garbage through to the consumer. Logs a warning with the first 20 token IDs for diagnostic purposes.
Adds asr_text_token_id to the SpecialTokens config struct with a serde default for backward compatibility with existing config.json files.

style: fix cargo fmt formatting in session.rs
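The guard described in that commit could be sketched like this. The token id 151704 comes from the commit message (the real engine reads it from config.json); the function name is hypothetical.

```rust
/// Token id of the <asr_text> separator (value from the commit message;
/// the real engine loads it from config.json with a serde default).
const ASR_TEXT_TOKEN_ID: u32 = 151704;

/// Guard against degenerate int4 decoder output: if the generated tokens
/// never contain the <asr_text> separator, the "transcription" is garbage
/// (e.g. "ology.") and an empty string is returned instead.
pub fn guard_degenerate_output(tokens: &[u32], decoded: String) -> String {
    if tokens.contains(&ASR_TEXT_TOKEN_ID) {
        decoded
    } else {
        eprintln!(
            "warning: no <asr_text> separator in output, first tokens: {:?}",
            &tokens[..tokens.len().min(20)]
        );
        String::new()
    }
}
```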
When TranscribeOptions.language is set (e.g. "English"), the decoder
prompt includes "Please transcribe the above {language} audio." which
conditions the decoder toward the specified language. This eliminates
the "ology" degenerate output on non-speech audio (see OLOGY_BUG.md)
and aligns int4 output with FP32 behavior.
Language token IDs are encoded on first use via greedy longest-match
on the BPE vocabulary and cached in RAM for reuse.
Changes:
- tokenizer.rs: Add encode() with reverse vocabulary lookup
- prompt.rs: Add build_prompt_ids_with_language() with template tokens
- model.rs: Thread language_token_ids through greedy_decode
- engine.rs: Cache language tokens, pass options.language through
instead of warning. Qwen3Params gains a language field.
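The "greedy longest-match on the BPE vocabulary" step in that commit could be sketched as below. This is a simplification: real BPE applies merge rules, but for short known strings like language names, repeatedly taking the longest vocabulary entry that prefixes the remaining text gives the same ids. The function name and vocabulary contents are hypothetical.

```rust
use std::collections::HashMap;

/// Greedy longest-match encoding of `text` against a token vocabulary.
/// Returns None if some suffix cannot be matched by any vocab entry.
pub fn encode_greedy(text: &str, vocab: &HashMap<String, u32>) -> Option<Vec<u32>> {
    let mut ids = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Longest vocabulary entry that prefixes the remaining text.
        let (piece, id) = vocab
            .iter()
            .filter(|(p, _)| !p.is_empty() && rest.starts_with(p.as_str()))
            .max_by_key(|(p, _)| p.len())?;
        ids.push(*id);
        rest = &rest[piece.len()..];
    }
    Some(ids)
}
```

Caching the resulting ids per language name (as the commit does) makes the lookup a one-time cost per session.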
fix: address review findings for language hint implementation
- Eliminate clone on cache hit: ensure_language_cached() + borrow
from cache instead of returning owned Vec
- Unify prompt builders: build_prompt_ids delegates to
build_prompt_ids_with_language(_, _, None), single code path
- Add BCP-47 normalization: "en" → "English" so TranscribeOptions
language codes work correctly (14 common codes)
- Trim normalize_language_name to common codes only
simplify: remove normalize_language_name, pass language string directly
The model tokenizes whatever language string is given and includes it
in the prompt. No need to map BCP-47 codes to full names — the model
handles both "en" and "English" in the prompt context.
fix: Qwen3Params::default() max_tokens 0 bug, empty language guard
- Implement Default manually with max_tokens=512 instead of derive
(derive produced max_tokens=0 which silently truncated output)
- Filter empty language strings to None to avoid malformed prompt
- Document that language accepts both full names and short codes
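The max_tokens bug fixed above is worth a concrete sketch: `#[derive(Default)]` zeroes numeric fields, so the derived default silently truncated all output at 0 tokens. A manual impl pins the intended value. The struct below is a reduced stand-in; only `max_tokens` and `language` are taken from the commit text.

```rust
/// Reduced sketch of the params struct from the commit message.
pub struct Qwen3Params {
    pub max_tokens: usize,
    pub language: Option<String>,
}

impl Default for Qwen3Params {
    fn default() -> Self {
        Qwen3Params {
            // derive(Default) would have produced 0 here, silently
            // truncating every transcription to zero tokens.
            max_tokens: 512,
            language: None,
        }
    }
}
```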
fix: address branch review findings (5 warnings, 5 infos)
- asr_text guard: only apply when EOS was seen, not on max_tokens
truncation (fixes conflict with truncation test)
- Add asr_text_token_id >= 0 to load-time validation
- Mark tokenizer encode() as pub(crate) to prevent misuse on long text
- Use ..Default::default() in transcribe_raw instead of hardcoded 512
- Fix dangling OLOGY_BUG.md doc reference
- Fix cfg(test) function doc reference
- Add unit tests for language-conditioned prompt structure and
None-path equivalence with standard prompt
refactor: use official Qwen3-ASR language hint template
Replace the instruction-based language hint (modified system/user turns)
with the official Qwen3-ASR template that forces the assistant prefix:
<|im_start|>assistant\nlanguage {Name}<asr_text>
This is a "forced generation" pattern — the model skips language
detection entirely and goes straight to transcription after <asr_text>.
Matches the reference implementation in qwen_asr/core/vllm_backend.
Changes:
- prompt.rs: Remove SYSTEM_CONTENT, USER_PREFIX, USER_SUFFIX_* constants.
Language hint now appends to assistant prefix instead of modifying
system/user turns. System and user turns are identical with or without
language hint.
- engine.rs: Encode " {name}" (with leading space) to match BPE
tokenization of "language English" → [11528, 6364].
- model.rs: Skip <asr_text> guard when language is forced (the token
is in the prompt, not in generated output).
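The forced-generation template from that commit can be shown at the string level (the real engine works on token ids, and the function name here is hypothetical): with a language hint, the assistant turn is pre-filled so the model never performs language detection and begins transcribing immediately after `<asr_text>`.

```rust
/// Build the assistant-prefix portion of the Qwen3-ASR prompt.
/// With a language hint, the official template forces generation:
/// "<|im_start|>assistant\nlanguage {Name}<asr_text>".
pub fn assistant_prefix(language: Option<&str>) -> String {
    match language {
        Some(name) => format!("<|im_start|>assistant\nlanguage {name}<asr_text>"),
        None => "<|im_start|>assistant\n".to_string(),
    }
}
```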
Force-pushed de179e2 to 838181f
Summary
Adds an `engines/qwen3` module implementing `TranscriptionEngine` for Qwen3-ASR, Alibaba's multilingual speech recognition model. Supports 0.6B and 1.7B model variants.

New `qwen3` Cargo feature, following the same pattern as existing ONNX-based engines (parakeet, moonshine, sense_voice): feature-gated, uses `ort` + `ndarray`, CPU execution.

Engine details

- Whisper-compatible mel spectrogram (rustfft)
- Decoder output has the form `language <Name><text>`, matched against known language names to find the boundary
- Configurable model parameters (`Qwen3ModelParams`)

Pre-exported ONNX models
Export scripts and methodology: andrewleech/qwen3-asr-onnx
Resolves #30