Thank you for the PR. Bumping to rc.11 will kill Intel macOS builds in Handy, so right now I'm hesitant to pull in this version bump. I need some time to think about this; I may end up doing a final release for Intel Macs and dropping new feature support for them. Also, for now it would be best to drop the streaming-interface side of the changes. I want to continue to evaluate the best streaming interface. But it would still be great to have offline support regardless. I'll probably use what you've done as a reference as well.
@cjpais thanks for the feedback - I didn't realise ort was a breaking change for older Macs! I can split this into two PRs (ort / streaming) if that helps. I've implemented the local-specific streaming interface for my own use but thought I'd share it in case it helps - no pressure to merge, certainly. The ort upgrade was just to support the parakeet-rs library. For some reason I thought Whispering/epicenter was a closer fit to my workflow, but after using it for a couple of days I don't think that's true; Handy might be just as close. I'll take another look at that and whether what I want fits there better!
No worries! Thank you! Yeah, it would be amazing to have the streaming PR separately. I totally understand the need for it :) I want it too; I just need to take my time with it and try to review and reason a bit on my own. I'm trying to do some major refactoring and want to make sure all the interface boundaries are clear and solid, plus all the other issues and features I'm trying to support in Handy as well. If Nemotron streaming only works with ort rc.11, we might just have to break Intel support on macOS (or at least that's what someone mentioned in cjpais/Handy#436). I know the newer ort does solve some issues on other machines. I will need to evaluate this, as it might break downstream projects as well (like Whispering).
While simplifying and breaking up my API proposal, some details were lost (they live in unwritten sub-specs). Probably the most important part of my spec is the structured `Transcript` return type. I actually specified two different APIs: one for engine implementors and one for transcribe-rs users.

The lost details still live in previous versions of the spec; they simply need to be moved into the sub-specs. Originally I specified the types/interfaces in Rust; however, I converted to pseudocode in case details like …
`push_samples` now returns `Vec<StreamingSegment>` instead of `String`. Each segment carries an `is_endpoint` flag indicating whether the text ends at a sentence boundary (`.` `?` `!`) detected from the model's punctuated output. Includes a `split_at_sentence_boundaries` helper with unit tests.
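A boundary splitter along those lines might look like this (a hypothetical sketch; the PR's actual `split_at_sentence_boundaries` may differ in signature and handling of edge cases):

```rust
/// Hypothetical sketch: split punctuated model output after '.', '?', or '!'
/// and flag each piece as an endpoint (true) or a trailing partial (false).
fn split_at_sentence_boundaries(text: &str) -> Vec<(String, bool)> {
    let mut segments = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        if matches!(ch, '.' | '?' | '!') {
            // Text up to and including the terminator ends at a boundary.
            segments.push((current.trim().to_string(), true));
            current.clear();
        }
    }
    // Any trailing text without a terminator is a non-endpoint segment.
    if !current.trim().is_empty() {
        segments.push((current.trim().to_string(), false));
    }
    segments
}
```

A consumer can then emit only endpoint segments as final and keep revising the trailing partial.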
Summary
Adds streaming transcription via parakeet-rs's Nemotron engine behind a `nemotron-streaming` feature flag. The `StreamingTranscriptionEngine` trait provides `push_samples`, `get_transcript`, and `reset`, designed for showing partial transcription during recording in Whispering/epicenter. Happy to adjust the API if a different shape would be preferred.

Audio resampling utilities (`mix_to_mono`, `create_resampler`, `resample_chunk`) are included behind a `resampling` feature that `nemotron-streaming` depends on. These were previously duplicated in the downstream consumer.

The ort/ndarray bump (rc.10 → rc.11, ndarray 0.16 → 0.17) is in the first commit. This replicates #27 on current master, with the broader testing across all ONNX engines that was requested on that PR. The upgrade required API migrations across all ONNX engines; sense_voice had two additional issues:
`input.name` became private, and `metadata().custom(key)` returns an empty string for absent keys instead of `None`.

Full disclosure: I'm not very experienced with Rust; my focus is more in the embedded/Python space. This PR was prepared with a lot of involvement from Claude Code; I hope this is acceptable. My apologies if the design/style is at odds with the project; I've made efforts to integrate this cleanly.
Relates to #31, #27.
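For concreteness, the trait described in the summary might look roughly like this (a sketch: the method names come from the PR text, but the exact signatures, error handling, and sample format are assumptions):

```rust
/// Hypothetical shape of the StreamingTranscriptionEngine trait described
/// in the summary; signatures are guesses, not the PR's actual API.
pub trait StreamingTranscriptionEngine {
    /// Feed a chunk of mono f32 samples; returns any newly decoded text.
    fn push_samples(&mut self, samples: &[f32]) -> String;
    /// The full transcript accumulated since the last reset.
    fn get_transcript(&self) -> String;
    /// Clear all internal state before a new recording.
    fn reset(&mut self);
}
```

A UI layer would call `push_samples` on each audio callback to display partials, then `get_transcript` once at the end of recording and `reset` before the next one.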
Re: streaming API spec (#4)
I saw @Leftium's streaming API proposal in #4 after implementing this. The current API is simpler: `push_samples` merges accept/decode/get_result into one call returning a `String`, whereas the spec proposes a 4-method pull-based loop with a structured `Transcript` return type carrying `is_final`, `is_endpoint`, timing, confidence, etc.

The main gaps vs. that spec are: no
`input_finished()` flush, no `is_endpoint()` for speech-boundary detection, and no structured result type. The spec's design makes more sense once there are multiple streaming backends with different chunking requirements; with only Nemotron so far, the simpler API was sufficient for the Whispering use case. Happy to refactor toward that spec if desired.

Testing
Tested on Linux (WSL2, Zen 4 / Ryzen AI 9 365) with downloaded models.
ort rc.11 migration — existing engines:
nemotron-streaming (5 tests): `get_transcript()`

audio utilities (13 unit tests): `mix_to_mono`, `create_resampler`, `resample_chunk`. No models required.
whisper: not tested — whisper-medium on llvmpipe software Vulkan ran 1h+ without completing a single transcription. Needs real GPU hardware. No code changes to whisper engine in this branch.
openai: API call succeeds but the existing exact-string assertion fails — OpenAI's model now returns slightly different punctuation. Pre-existing issue.
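As a reference point for the audio-utility tests above, a mono downmix is typically just a per-frame average of the interleaved channels (a hypothetical sketch; the crate's actual `mix_to_mono` signature may differ):

```rust
/// Hypothetical sketch of interleaved-to-mono downmixing: average the
/// channels of each frame into a single output sample.
fn mix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)          // one chunk per frame
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}
```

Averaging (rather than summing) keeps the output within the input's amplitude range, which matters before feeding samples to an ASR frontend.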
Note: pre-existing issues found during testing
These predate this branch and can be fixed in a follow-up:
`models/sense-voice`: the expected path changed from `.../onnx/merged/{variant}/` to `.../onnx/merged/{variant}/float`, but only int8 is available as a packaged download, so the test silently skips