
accel: Add configurable ORT intra-op thread count. #64

Open
andrewleech wants to merge 2 commits into cjpais:main from andrewleech:feat/ort-thread-count

Conversation

@andrewleech
Contributor

Summary

ORT defaults to using all available CPU cores for intra-op parallelism, but this isn't always optimal: on some machines (particularly those with SMT/hyperthreading), fewer threads give better throughput for autoregressive decoding workloads. There was no way for host applications to override this.

I added a global atomic setting (set_ort_intra_threads / get_ort_intra_threads) that build_session reads when no explicit thread count is provided. This follows the same pattern as the existing set_ort_accelerator — a process-wide preference set once at startup.
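For readers unfamiliar with the pattern, here is a minimal sketch of what such a process-wide setting could look like. It is not the code from this PR: the `AtomicUsize` type and `Relaxed` ordering are assumptions, and only the function names, the `ORT_INTRA_THREADS` name, and the "0 = ORT default" convention come from the description.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Process-wide preferred intra-op thread count.
/// 0 means "leave ORT's default (all cores) in place".
/// Assumption: the real static may use a different atomic type.
static ORT_INTRA_THREADS: AtomicUsize = AtomicUsize::new(0);

/// Store the preference; intended to be called once at startup.
pub fn set_ort_intra_threads(n: usize) {
    ORT_INTRA_THREADS.store(n, Ordering::Relaxed);
}

/// Read the current preference (0 = ORT default).
pub fn get_ort_intra_threads() -> usize {
    ORT_INTRA_THREADS.load(Ordering::Relaxed)
}
```

A host application would call `set_ort_intra_threads(4)` before loading any model, and every session built afterwards without an explicit count would pick the value up.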

The setting applies to all sessions created via create_session(). Sessions created via create_session_with_threads() (used by Moonshine streaming) still take their explicit parameter. create_decoder_session() has its own thread management for sequential autoregressive decode and is not affected.
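The precedence described above (an explicit parameter wins, otherwise the global setting applies, and 0 falls back to the ORT default) can be sketched as a small helper. `resolve_intra_threads` is a hypothetical name used for illustration only; the actual logic lives inside `build_session` and may be structured differently.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Assumed shape of the global setting; 0 = ORT default (all cores).
static ORT_INTRA_THREADS: AtomicUsize = AtomicUsize::new(0);

pub fn set_ort_intra_threads(n: usize) {
    ORT_INTRA_THREADS.store(n, Ordering::Relaxed);
}

/// Hypothetical helper: decide which thread count, if any, to hand to ORT.
/// - `Some(n)` from the caller (the create_session_with_threads path) always wins.
/// - `None` (the create_session path) falls back to the global setting.
/// - A global value of 0 yields `None`, meaning "do not override ORT".
pub fn resolve_intra_threads(explicit: Option<usize>) -> Option<usize> {
    explicit.or_else(|| match ORT_INTRA_THREADS.load(Ordering::Relaxed) {
        0 => None,
        n => Some(n),
    })
}
```

Returning `Option<usize>` keeps "no override" distinct from any concrete count, so the session builder only needs to touch ORT's thread option when the result is `Some`.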

Also added set_decoder_gpu / get_decoder_gpu for controlling whether decoder sessions use GPU execution providers (useful for benchmarking GPU vs CPU decode).
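The decoder GPU toggle can be pictured the same way. This is a sketch under assumptions: the static name `DECODER_GPU`, the `AtomicBool` type, and the `false` default are all hypothetical, and only the two function names come from the description.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical static; the default of `false` is an assumption, not taken from the PR.
static DECODER_GPU: AtomicBool = AtomicBool::new(false);

/// Choose whether decoder sessions should use GPU execution providers.
pub fn set_decoder_gpu(enabled: bool) {
    DECODER_GPU.store(enabled, Ordering::Relaxed);
}

/// Read the current decoder GPU preference.
pub fn get_decoder_gpu() -> bool {
    DECODER_GPU.load(Ordering::Relaxed)
}
```

For a GPU vs CPU decode benchmark, a host could flip the flag between runs and reload the model each time, since the flag would only be consulted when a decoder session is created.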

Used by Handy PR #1120, which adds a user-facing thread count setting with an auto-tune benchmark.

Testing

  • 18 unit tests in accel.rs pass (including 2 new round-trip tests for the thread count get/set)
  • AccelGuard test helper updated to restore ORT_INTRA_THREADS on drop
  • Integrated and tested in Handy on Windows with Qwen3-ASR and Parakeet models
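As an illustration of the test-helper change, a guard that restores `ORT_INTRA_THREADS` on drop might look like the sketch below. The field name `saved_threads` and the `round_trip_with_guard` function are hypothetical, and the real `AccelGuard` presumably also saves the accelerator preference, which is omitted here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Assumed shape of the global setting; 0 = ORT default.
static ORT_INTRA_THREADS: AtomicUsize = AtomicUsize::new(0);

/// RAII guard: remembers the thread count at construction
/// and writes it back when the guard goes out of scope.
struct AccelGuard {
    saved_threads: usize, // hypothetical field name
}

impl AccelGuard {
    fn new() -> Self {
        Self { saved_threads: ORT_INTRA_THREADS.load(Ordering::Relaxed) }
    }
}

impl Drop for AccelGuard {
    fn drop(&mut self) {
        ORT_INTRA_THREADS.store(self.saved_threads, Ordering::Relaxed);
    }
}

/// Hypothetical round trip: returns the value seen while the guard was
/// alive and the value seen after the guard restored the original.
pub fn round_trip_with_guard() -> (usize, usize) {
    let inside;
    {
        let _guard = AccelGuard::new();
        ORT_INTRA_THREADS.store(8, Ordering::Relaxed);
        inside = ORT_INTRA_THREADS.load(Ordering::Relaxed);
    } // _guard dropped here, restoring the saved value
    (inside, ORT_INTRA_THREADS.load(Ordering::Relaxed))
}
```

Because the setting is process-wide, tests that mutate it can still race when run in parallel; a guard like this only restores state and does not serialize access on its own.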

Trade-offs and Alternatives

The alternative was a per-session parameter threaded through every Model::load() call, but that would require API changes across all engines. The global atomic matches the existing accelerator pattern and lets host apps set it once without modifying engine-specific code. The downside is it's process-wide — you can't have different thread counts for different models loaded simultaneously. In practice this isn't a real limitation since Handy loads one model at a time.

Global atomic setting (set_ort_intra_threads / get_ort_intra_threads)
that build_session and create_decoder_session read when no explicit
thread count is provided. 0 = ORT default (all cores). Allows host
applications to tune thread count for optimal performance.
@andrewleech force-pushed the feat/ort-thread-count branch from 3a95cb2 to dc103f9 on March 23, 2026 03:01


3 participants