
add nemotron streaming #36

Open
andrewleech wants to merge 4 commits into cjpais:main from andrewleech:feat/nemotron-streaming

Conversation

@andrewleech
Contributor

andrewleech commented Feb 16, 2026

Summary

Adds streaming transcription via parakeet-rs's Nemotron engine behind a nemotron-streaming feature flag. The StreamingTranscriptionEngine trait provides push_samples, get_transcript, and reset — designed for showing partial transcription during recording in Whispering/epicenter. Happy to adjust the API if a different shape would be preferred.
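To make the trait shape concrete, here is a minimal sketch of the three-method API described above. The method names come from the PR text, but the exact signatures, error type, and return values are assumptions; `EchoEngine` is a hypothetical stand-in showing only how a consumer would drive the trait, not a real decoder.

```rust
/// Sketch of the streaming trait described in the PR (signatures assumed).
pub trait StreamingTranscriptionEngine {
    /// Feed a chunk of mono f32 samples; returns the text decoded for this
    /// chunk (assumed return type).
    fn push_samples(&mut self, samples: &[f32]) -> Result<String, String>;
    /// Full transcript accumulated since the last reset.
    fn get_transcript(&self) -> String;
    /// Clear accumulated state so a new utterance can start.
    fn reset(&mut self);
}

/// Trivial in-memory stand-in; a real engine would run the Nemotron decoder.
struct EchoEngine {
    transcript: String,
}

impl StreamingTranscriptionEngine for EchoEngine {
    fn push_samples(&mut self, samples: &[f32]) -> Result<String, String> {
        // No real decoding here: just record the chunk size as "text".
        let piece = format!("[{} samples]", samples.len());
        self.transcript.push_str(&piece);
        Ok(piece)
    }
    fn get_transcript(&self) -> String {
        self.transcript.clone()
    }
    fn reset(&mut self) {
        self.transcript.clear();
    }
}
```

A recording loop would call `push_samples` per audio chunk to display partials, read `get_transcript` for the running text, and call `reset` between utterances.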

Audio resampling utilities (mix_to_mono, create_resampler, resample_chunk) are included behind a resampling feature that nemotron-streaming depends on. These were previously duplicated in the downstream consumer.
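As a rough illustration of what such a utility does, here is a sketch of an interleaved-to-mono downmix in the spirit of the `mix_to_mono` helper mentioned above. The signature and averaging strategy are assumptions, not the crate's actual API.

```rust
/// Downmix interleaved multi-channel f32 audio to mono by averaging each
/// frame's channels. Hypothetical sketch; the real utility may differ.
fn mix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels) // one frame (all channels of one sample) per chunk
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}
```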

The ort/ndarray bump (rc.10→rc.11, ndarray 0.16→0.17) is in the first commit. This replicates #27 on current master with the broader testing across all ONNX engines that was requested on that PR. The upgrade required API migrations across all ONNX engines — sense_voice had two additional issues: input.name became private, and metadata().custom(key) returns empty string for absent keys instead of None.
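The second sense_voice issue describes a behavioral change: a metadata lookup that used to yield `Option` now yields an empty string for absent keys. One way to adapt without touching the call sites' logic is a small normalization shim; `custom_meta_opt` is a hypothetical helper for illustration, not part of the ort API.

```rust
/// Restore Option-based semantics over a metadata lookup that now returns
/// an empty String for missing keys (hypothetical shim, not an ort API).
fn custom_meta_opt(value: String) -> Option<String> {
    if value.is_empty() {
        None
    } else {
        Some(value)
    }
}
```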

Full disclosure: I'm not very experienced with Rust; my focus is more in the embedded / Python space. This PR was prepared with a lot of involvement from Claude Code, which I hope is acceptable. My apologies if the design / style is at odds with the project; I've made efforts to integrate it cleanly.

Relates to #31, #27.

Re: streaming API spec (#4)

I saw @Leftium's streaming API proposal in #4 after implementing this. The current API is simpler — push_samples merges accept/decode/get_result into one call returning a String, whereas the spec proposes a 4-method pull-based loop with a structured Transcript return type carrying is_final, is_endpoint, timing, confidence, etc.

The main gaps vs that spec are: no input_finished() flush, no is_endpoint() for speech boundary detection, and no structured result type. The spec's design makes more sense once there are multiple streaming backends with different chunking requirements — with only Nemotron so far, the simpler API was sufficient for the Whispering use case. Happy to refactor toward that spec if desired.

Testing

Tested on Linux (WSL2, Zen 4 / Ryzen AI 9 365) with downloaded models.

ort rc.11 migration — existing engines:

  • parakeet: pass (2 tests)
  • moonshine: pass
  • sense_voice: pass (when using int8 model — the existing test silently skips due to a pre-existing path mismatch, see note below)

nemotron-streaming (5 tests):

  • streaming transcription of JFK audio in 560ms chunks
  • reset clears accumulated transcript
  • concatenated incremental returns match get_transcript()
  • model-free tests: empty transcript on fresh engine, error on push without model

audio utilities (13 unit tests): mix_to_mono, create_resampler, resample_chunk. No models required.

whisper: not tested — whisper-medium on llvmpipe software Vulkan ran 1h+ without completing a single transcription. Needs real GPU hardware. No code changes to whisper engine in this branch.

openai: API call succeeds but the existing exact-string assertion fails — OpenAI's model now returns slightly different punctuation. Pre-existing issue.

Note: pre-existing issues found during testing

These predate this branch, can fix in a follow-up:

  • Moonshine README URLs: HuggingFace restructured the repo — files moved from .../onnx/merged/{variant}/ to .../onnx/merged/{variant}/float/
  • SenseVoice test: expects FP32 model at models/sense-voice but only int8 is available as a packaged download, so the test silently skips

@cjpais
Owner

cjpais commented Feb 17, 2026

Thank you for the PR.

Bumping to rc.11 will kill Intel MacOS builds in Handy, right now I'm hesitant to pull in this version bump. I need some time to think about this, I may end up doing a final release for Intel Macs and dropping new feature support for them.

Also for now, it would be best to drop the streaming interface side of changes. I want to continue to evaluate the best streaming interface. But it would still be great to have offline support regardless. I'll probably use what you've done also as a reference.

@andrewleech
Contributor Author

andrewleech commented Feb 17, 2026

@cjpais thanks for the feedback - I didn't realise the ort bump was a breaking change for older Macs!

I can split this into separate ort and streaming PRs if that helps. I've implemented the local-specific streaming interface for my own use, but thought I'd share it in case it's useful; no pressure to merge it, certainly. The ort upgrade was just to support the parakeet-rs library.

For some reason I thought Whispering/epicenter was a closer fit to my workflow, but after using it for a couple of days I don't think that's true; Handy might be just as close. I'll take another look at whether what I want fits there better!

@cjpais
Owner

cjpais commented Feb 17, 2026

No worries! Thank you! Yeah, it would be amazing to have the streaming PR separately. I totally understand the need for it :) I want it too; I just need to take my time, review it, and reason a bit on my own. I'm doing some major refactoring and want to make sure all the interface boundaries are clear and solid, on top of all the other issues and features I'm trying to support in Handy.

If only ort rc.11 works for Nemotron streaming, we might just have to break Intel support on macOS (or at least that's what someone suggested in cjpais/Handy#436). I know the newer ort does solve some issues on other machines. I will need to evaluate this, as it might break downstream projects as well (like Whispering).

@Leftium

Leftium commented Feb 17, 2026

> Re: streaming API spec (#4)
>
> I saw @Leftium's streaming API proposal in #4 after implementing this. The current API is simpler — push_samples merges accept/decode/get_result into one call returning a String, whereas the spec proposes a 4-method pull-based loop with a structured Transcript return type carrying is_final, is_endpoint, timing, confidence, etc.
>
> The main gaps vs that spec are: no input_finished() flush, no is_endpoint() for speech boundary detection, and no structured result type. The spec's design makes more sense once there are multiple streaming backends with different chunking requirements — with only Nemotron so far, the simpler API was sufficient for the Whispering use case. Happy to refactor toward that spec if desired.

While simplifying and breaking up my API proposal, some details were lost (in unwritten sub-specs).

Probably the most important part of my spec is the structured Transcript return type. If only one part of the spec is implemented, it should be this.

I actually specified two different APIs: one for engine implementors and one for transcribe-rs users:

  1. Low-level pull-based StreamingTranscriptionEngine Interface
    • Most engine implementors should implement to this interface
    • Labeled (B) in current spec diagram
  • The implementations should be mostly thin wrappers over the underlying APIs (Nemotron, Vosk, etc.)
  2. High Level: callback-Based StreamingTranscriptionSource
    • Most transcribe-rs users should use this interface
    • Labeled (C) in the current spec diagram
    • The (Nemotron) StreamingTranscriptionSource is automatically available "for free" if the (Nemotron) StreamingTranscriptionEngine was implemented.

Results:

  • minimal effort for both new engine implementors and transcribe-rs consumers
  • unified streaming API: a single API for multiple engines

The lost details still live in previous versions of the spec; these simply need to be moved into the sub-specs:

Originally I specified the types/interfaces in Rust. However, I converted them to pseudocode in case details like 'static were over-specified. (I am not familiar with Rust.)
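One possible reading of the structured Transcript result the spec centers on, sketched in Rust: the field names are assumptions drawn from the discussion above (is_final, is_endpoint, timing, confidence), not the actual spec types.

```rust
/// Hypothetical structured result type in the spirit of the spec's Transcript;
/// field names and types are guesses based on the discussion, not the spec.
#[derive(Debug, Clone, PartialEq)]
pub struct Transcript {
    pub text: String,
    pub is_final: bool,          // this text will not be revised further
    pub is_endpoint: bool,       // a speech boundary was detected here
    pub start_secs: f32,         // timing within the audio stream
    pub end_secs: f32,
    pub confidence: Option<f32>, // None for engines that expose no score
}
```

Returning a struct like this (instead of a bare String) is what lets a unified streaming API carry per-engine extras such as endpointing and confidence without changing the method signatures later.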

pi-anl added 4 commits March 2, 2026 19:24
push_samples now returns Vec<StreamingSegment> instead of String.
Each segment carries an is_endpoint flag indicating whether the text
ends at a sentence boundary (. ? !) detected from the model's
punctuated output. Includes split_at_sentence_boundaries helper with
unit tests.
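The boundary-splitting helper the commit describes might look roughly like this: split text after '.', '?' or '!' so each returned piece ends at a sentence boundary. This is a hedged sketch; the real split_at_sentence_boundaries may handle whitespace, abbreviations, or segment metadata differently.

```rust
/// Split text after sentence-terminating punctuation (. ? !), keeping any
/// trailing unterminated text as a final piece. Sketch only; the helper in
/// the commit may differ in details.
fn split_at_sentence_boundaries(text: &str) -> Vec<String> {
    let mut pieces = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        if matches!(ch, '.' | '?' | '!') {
            // Sentence ended: emit the piece and start a fresh buffer.
            pieces.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        pieces.push(current); // trailing text with no terminal punctuation
    }
    pieces
}
```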
andrewleech force-pushed the feat/nemotron-streaming branch from a8aebcd to 03d5782 on March 2, 2026 at 11:29

4 participants