accel: Add configurable ORT intra-op thread count. by andrewleech · Pull Request #64 · cjpais/transcribe-rs

andrewleech · 2026-03-23T02:39:05Z

Summary

ORT defaults to using all available CPU cores for intra-op parallelism, but this isn't always optimal. On some machines (particularly those with SMT/hyperthreading), fewer threads give better throughput for autoregressive decoding workloads. There was no way for host applications to override this.

I added a global atomic setting (set_ort_intra_threads / get_ort_intra_threads) that build_session reads when no explicit thread count is provided. This follows the same pattern as the existing set_ort_accelerator: a process-wide preference set once at startup.
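
A minimal sketch of what such a global atomic setting could look like. The function names and the ORT_INTRA_THREADS static are the ones mentioned in this PR; the `usize` type, the `Relaxed` ordering, and the function bodies are assumptions made for illustration, not the actual contents of accel.rs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Process-wide intra-op thread preference.
/// 0 means "use the ORT default" (all cores).
/// Assumption: the real static may use a different integer type.
static ORT_INTRA_THREADS: AtomicUsize = AtomicUsize::new(0);

/// Store the preferred intra-op thread count for sessions created later.
pub fn set_ort_intra_threads(n: usize) {
    // Relaxed is enough here: the value is an independent preference and
    // does not guard any other memory.
    ORT_INTRA_THREADS.store(n, Ordering::Relaxed);
}

/// Read the current preference (0 = ORT default).
pub fn get_ort_intra_threads() -> usize {
    ORT_INTRA_THREADS.load(Ordering::Relaxed)
}
```

A host application would call set_ort_intra_threads once at startup, before loading any model, in the same way it already calls set_ort_accelerator.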

The setting applies to all sessions created via create_session(). Sessions created via create_session_with_threads() (used by Moonshine streaming) still take their explicit parameter. create_decoder_session() has its own thread management for sequential autoregressive decode and is not affected.
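
The precedence described above (an explicit parameter wins, otherwise the global setting applies, and 0 leaves ORT's default in place) can be sketched as follows. `resolve_intra_threads` is a hypothetical helper introduced only for this example; the real build_session may structure this logic differently.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Assumed shape of the global setting; 0 = ORT default (all cores).
static ORT_INTRA_THREADS: AtomicUsize = AtomicUsize::new(0);

pub fn set_ort_intra_threads(n: usize) {
    ORT_INTRA_THREADS.store(n, Ordering::Relaxed);
}

/// Hypothetical helper: decide which thread count, if any, to hand to ORT.
/// Returns `None` when ORT should keep its own default.
pub fn resolve_intra_threads(explicit: Option<usize>) -> Option<usize> {
    match explicit {
        // An explicit per-session value (the create_session_with_threads
        // path) always takes priority over the global preference.
        Some(n) => Some(n),
        // Otherwise fall back to the process-wide setting, treating 0 as
        // "not configured".
        None => match ORT_INTRA_THREADS.load(Ordering::Relaxed) {
            0 => None,
            n => Some(n),
        },
    }
}
```

Returning `Option<usize>` keeps the "do not touch ORT's default" case distinct from any concrete thread count, so the caller only configures the session builder when there is a value to apply.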

Also added set_decoder_gpu / get_decoder_gpu for controlling whether decoder sessions use GPU execution providers (useful for benchmarking GPU vs CPU decode).

Used by Handy PR #1120, which adds a user-facing thread count setting with an auto-tune benchmark.

Testing

18 unit tests in accel.rs pass (including 2 new round-trip tests for the thread count get/set)
AccelGuard test helper updated to restore ORT_INTRA_THREADS on drop
Integrated and tested in Handy on Windows with Qwen3-ASR and Parakeet models
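
For readers unfamiliar with the restore-on-drop pattern, here is a rough sketch of how a test guard can snapshot and restore the global thread count. `ThreadCountGuard` and `round_trip` are hypothetical names for this example; the actual AccelGuard helper covers more state and is not reproduced here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Assumed shape of the global setting; 0 = ORT default (all cores).
static ORT_INTRA_THREADS: AtomicUsize = AtomicUsize::new(0);

pub fn set_ort_intra_threads(n: usize) {
    ORT_INTRA_THREADS.store(n, Ordering::Relaxed);
}

pub fn get_ort_intra_threads() -> usize {
    ORT_INTRA_THREADS.load(Ordering::Relaxed)
}

/// Hypothetical guard: remembers the value at creation and writes it back
/// when dropped, so a test cannot leak its setting into other tests.
pub struct ThreadCountGuard {
    saved: usize,
}

impl ThreadCountGuard {
    pub fn new() -> Self {
        Self { saved: get_ort_intra_threads() }
    }
}

impl Drop for ThreadCountGuard {
    fn drop(&mut self) {
        set_ort_intra_threads(self.saved);
    }
}

/// Round-trip check: set a value, read it back, and let the guard restore
/// the previous value when this function returns.
pub fn round_trip(n: usize) -> usize {
    let _guard = ThreadCountGuard::new();
    set_ort_intra_threads(n);
    get_ort_intra_threads()
}
```

Because the setting is process-wide and Rust runs tests in parallel by default, tests that touch it may also need to be serialized; the guard only handles restoring the value.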

Trade-offs and Alternatives

The alternative was a per-session parameter threaded through every Model::load() call, but that would require API changes across all engines. The global atomic matches the existing accelerator pattern and lets host apps set it once without modifying engine-specific code. The downside is that it's process-wide: you can't have different thread counts for different models loaded simultaneously. In practice this isn't a real limitation for Handy, which loads one model at a time.

Global atomic setting (set_ort_intra_threads / get_ort_intra_threads) that build_session and create_decoder_session read when no explicit thread count is provided. 0 = ORT default (all cores). Allows host applications to tune thread count for optimal performance.

andrewleech mentioned this pull request Mar 23, 2026

settings: Add ORT thread count setting with auto-tune benchmark. cjpais/Handy#1120

Open

andrewleech force-pushed the feat/ort-thread-count branch from 3a95cb2 to dc103f9 on March 23, 2026 03:01

Merge branch 'main' into feat/ort-thread-count

a452be7

andrewleech wants to merge 2 commits into cjpais:main from andrewleech:feat/ort-thread-count

3 participants
