Work-Stealing CPU Executor

Module Purpose

The executor module implements a work-stealing thread pool that drives all scanning parallelism in the project. External producers (discovery threads, I/O completion handlers) enqueue tasks into a global injector queue, and N worker threads consume them using a local-first, steal-on-idle strategy.

The design splits into two files:

  • executor.rs — Production types: Executor, ExecutorHandle, WorkerCtx, ExecutorConfig, threaded worker loop, and idle policy.
  • executor_core.rs — Shared policy logic: combined atomic state encoding, pop_task algorithm, worker_step function, trait abstractions for idle/tracing hooks, and loom concurrency tests. This split ensures the production executor and deterministic simulation harness share identical scheduling semantics.

Architecture

graph TB
    subgraph "External Producers"
        Discovery["Discovery Thread"]
        IO["I/O Completion"]
        API["API Callers"]
    end

    subgraph "Executor"
        Injector["Injector Queue<br/>(Crossbeam MPMC)"]

        subgraph "Worker 0"
            D0["Deque (LIFO local)"]
            S0["WorkerCtx<br/>scratch + metrics + rng"]
        end

        subgraph "Worker 1"
            D1["Deque (LIFO local)"]
            S1["WorkerCtx<br/>scratch + metrics + rng"]
        end

        subgraph "Worker N"
            DN["Deque (LIFO local)"]
            SN["WorkerCtx<br/>scratch + metrics + rng"]
        end

        SharedState["Shared State<br/>state · done · unparkers · panic"]
    end

    Discovery -->|spawn_external| Injector
    IO -->|ExecutorHandle::spawn| Injector
    API -->|ExecutorHandle::spawn_batch| Injector

    Injector -->|steal_batch_and_pop| D0
    Injector -->|steal_batch_and_pop| D1
    Injector -->|steal_batch_and_pop| DN

    D0 <-->|"FIFO steal"| D1
    D1 <-->|"FIFO steal"| DN
    D0 <-->|"FIFO steal"| DN

Key Design Decisions

| Decision | Rationale |
| --- | --- |
| Per-worker Chase-Lev deque | LIFO local pop maximizes cache locality; FIFO steal preserves fairness |
| Global crossbeam injector | MPMC queue for external producers; batch steal amortizes lock cost |
| Combined atomic state word | Eliminates TOCTOU race between spawn and join (see Shutdown Protocol) |
| Randomized victim selection | Avoids correlated contention when all workers go idle simultaneously |
| Parker/Unparker pattern | No lost wakeups; each unpark is either consumed or becomes a no-op |
| Tiered idle (spin → yield → park) | Trades CPU burn for latency on bursty workloads |

Key Types

ExecutorConfig

Configuration for the executor. All defaults are conservative.

pub struct ExecutorConfig {
    pub workers: usize,        // Number of worker threads (default: 1)
    pub seed: u64,             // RNG seed for deterministic victim selection
    pub steal_tries: u32,      // Steal attempts per idle cycle (default: 4)
    pub spin_iters: u32,       // Spin iterations before park (default: 200)
    pub park_timeout: Duration,// Park timeout after spin (default: 200µs)
    pub pin_threads: bool,     // Pin workers to cores (Linux only)
}

Source: executor.rs

| Knob | Workload Sensitivity |
| --- | --- |
| workers | CPU count, task CPU-boundedness |
| steal_tries | Task fanout pattern, worker count |
| spin_iters | Task latency distribution |
| park_timeout | External spawn frequency |
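The documented defaults can be captured in a Default impl. The sketch below mirrors the struct above with the default values listed later in this document; the actual impl lives in executor.rs and may differ in detail:

```rust
use std::time::Duration;

// Sketch of ExecutorConfig with the documented defaults; field names follow
// the doc, but the Default impl here is an illustrative assumption.
#[derive(Clone, Debug)]
pub struct ExecutorConfig {
    pub workers: usize,
    pub seed: u64,
    pub steal_tries: u32,
    pub spin_iters: u32,
    pub park_timeout: Duration,
    pub pin_threads: bool,
}

impl Default for ExecutorConfig {
    fn default() -> Self {
        Self {
            workers: 1,
            seed: 0x853c49e6748fea9b,
            steal_tries: 4,
            spin_iters: 200,
            park_timeout: Duration::from_micros(200),
            // Pinning is documented as Linux-only.
            pin_threads: cfg!(target_os = "linux"),
        }
    }
}
```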

Executor<T>

The top-level executor that owns worker threads and shared state.

pub struct Executor<T> {
    shared: Arc<Shared<T>>,
    threads: Vec<JoinHandle<WorkerMetricsLocal>>,
}

Source: executor.rs

Type parameter: T is the task type. Should be small (≤32 bytes) and Copy if possible. Avoid Box<dyn FnOnce()> — use an enum instead.

Lifecycle:

  1. Executor::new(config, scratch_init, runner) — creates and starts worker threads immediately
  2. spawn_external(task) or spawn_external_batch(tasks) — enqueue work
  3. join() — close gate, drain in-flight tasks, collect metrics, propagate panics

Key methods:

| Method | Purpose |
| --- | --- |
| new(cfg, scratch_init, runner) | Create executor; workers start and idle immediately |
| handle() | Get a cloneable ExecutorHandle<T> for external spawning |
| spawn_external(task) | Convenience for handle().spawn(task) |
| spawn_external_batch(tasks) | Batch injection; amortizes wakeups |
| shutdown() | Signal cooperative stop; rejects future spawns |
| join(self) | Close gate, wait for completion, return MetricsSnapshot |

ExecutorHandle<T>

Thin, cloneable handle for external producers. This is the seam between I/O and CPU engines.

pub struct ExecutorHandle<T> {
    shared: Arc<Shared<T>>,
}

Source: executor.rs

Thread safety: Clone + Send + Sync. Multiple producers can call spawn concurrently.

Key methods:

| Method | Purpose |
| --- | --- |
| spawn(task) | CAS loop: atomically check accepting + increment count; push to injector; unpark one worker |
| spawn_batch(tasks) | All-or-nothing batch spawn; wakeups bounded to worker count |
| is_accepting() | Check if executor is still open |
| shutdown() | Close gate + signal done |

Error handling: spawn returns Err(task) if the executor is shutting down, giving the caller back the task.


WorkerCtx<T, S>

Per-worker context passed to every task execution. Contains both user-facing state and internal scheduling machinery.

pub struct WorkerCtx<T, S> {
    // User-facing
    pub worker_id: usize,
    pub scratch: S,
    pub rng: XorShift64,
    pub metrics: WorkerMetricsLocal,

    // Internal
    local: Worker<T>,              // Chase-Lev deque
    parker: Parker,                // Crossbeam Parker
    shared: Arc<Shared<T>>,        // Shared state
    local_spawns_since_wake: u32,  // Wake-on-hoard counter
}

Source: executor.rs

Type parameters:

  • T: Task type
  • S: User-defined scratch type, initialized via scratch_init closure

Key methods:

| Method | Cost | Description |
| --- | --- | --- |
| spawn_local(task) | fetch_add + deque push | Enqueue to own deque; best cache locality |
| spawn_global(task) | fetch_add + injector push + unpark | Enqueue to global injector; higher contention |
| handle() | Arc clone | Get an ExecutorHandle for external spawning |

Wake-on-hoard: After WAKE_ON_HOARD_THRESHOLD (32) consecutive local spawns, spawn_local proactively wakes a sibling to prevent one worker from hoarding work while others sleep.


Shared<T> (internal)

Shared state behind Arc, accessed by all workers and the executor owner.

struct Shared<T> {
    injector: Injector<T>,          // Global MPMC queue
    stealers: Vec<Stealer<T>>,      // Per-worker steal handles
    state: AtomicUsize,             // Combined (count << 1) | accepting
    done: AtomicBool,               // Monotonic stop flag
    unparkers: Vec<Unparker>,       // Per-worker wakeup handles
    next_unpark: AtomicUsize,       // Round-robin wakeup counter
    panic: Mutex<Option<Box<...>>>, // First captured panic
}

Source: executor.rs

Invariants:

  • stealers.len() == unparkers.len() == config.workers
  • state encodes both accepting flag and in-flight count
  • done is monotonic: once true, never cleared
  • panic captures only the first panic; subsequent panics are discarded

Core Types (executor_core.rs)

WorkerCtxLike<T, S> trait

Abstraction over worker context that allows both the production WorkerCtx and the simulation harness to share the same worker_step logic.

Source: executor_core.rs

WorkerStepResult<Tag>

Outcome of a single worker step:

| Variant | Meaning |
| --- | --- |
| RanTask { tag, source, victim } | Executed one task; source is Local, Injector, or Steal |
| NoWork | No task found; idle hooks decided to continue spinning |
| ShouldPark { timeout } | No task found; idle hooks request parking |
| ExitDone | Executor is done; worker should exit |
| ExitPanicked | Worker caught a panic and recorded it |

Source: executor_core.rs

PopSource

Where a task was obtained from: Local, Injector, or Steal.

Source: executor_core.rs

IdleHooks / IdleAction

Trait for pluggable idle policies. The production implementation (TieredIdle) uses spin → yield → park. Simulation can use a no-op or deterministic equivalent.

Source: executor_core.rs

TraceHooks<T> / NoopTrace

Optional trace hooks for the step engine. Production uses NoopTrace (zero overhead). Simulation can plug in a recorder for deterministic replay.

Source: executor_core.rs


Task Lifecycle

sequenceDiagram
    participant External as External Producer
    participant Injector as Global Injector
    participant Worker as Worker Thread
    participant Local as Local Deque

    Note over External: Discovery thread or I/O handler

    External->>Injector: CAS: check accepting + increment count
    External->>Injector: injector.push(task)
    External->>Worker: unpark_one() (round-robin)

    Worker->>Local: 1. pop_local() [LIFO, O(1)]
    alt local empty
        Worker->>Injector: 2. steal_batch_and_pop() [batch, O(batch)]
        alt injector empty
            Worker->>Worker: 3. steal_from_victim(random) [FIFO, O(steal_tries)]
        end
    end

    Worker->>Worker: Execute runner(task, &mut ctx)
    Worker->>Worker: fetch_sub(COUNT_UNIT) — decrement in-flight

    opt Task spawns children
        Worker->>Local: spawn_local(child) [fast path]
        Note over Worker,Local: Wake-on-hoard every 32 local spawns
    end

    opt count==0 && !accepting
        Worker->>Worker: initiate_done()
    end

Pop Priority Order

The pop_task function (executor_core.rs) tries sources in this order:

  1. Local deque (LIFO) — Zero contention, best cache locality. Tasks spawned locally likely operate on data still in L1/L2.
  2. Global injector (batch steal) — steal_batch_and_pop moves multiple tasks to the local deque, amortizing global queue access cost.
  3. Random victim (FIFO steal) — Last resort. Randomized victim selection distributes steal attempts uniformly, avoiding thundering-herd on a single hot worker.
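The priority order can be modeled with plain standard-library queues. The toy below only demonstrates the ordering; the real pop_task operates on crossbeam's Worker, Injector, and Stealer types, and the injector step batch-steals into the local deque rather than popping a single task:

```rust
use std::collections::VecDeque;

// Where a task was obtained from, mirroring the PopSource enum in the doc.
#[derive(Debug, PartialEq)]
enum PopSource {
    Local,
    Injector,
    Steal,
}

// Toy model of the pop priority order: LIFO on the own deque, then the
// global injector, then a FIFO steal from a victim's deque.
fn pop_task(
    local: &mut VecDeque<u32>,
    injector: &mut VecDeque<u32>,
    victim: &mut VecDeque<u32>,
) -> Option<(u32, PopSource)> {
    // 1. Own deque, LIFO end: most recently spawned task first.
    if let Some(t) = local.pop_back() {
        return Some((t, PopSource::Local));
    }
    // 2. Global injector (real code uses steal_batch_and_pop here).
    if let Some(t) = injector.pop_front() {
        return Some((t, PopSource::Injector));
    }
    // 3. Victim's deque, FIFO end: oldest task, preserving fairness.
    victim.pop_front().map(|t| (t, PopSource::Steal))
}
```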

The victim selection formula ensures a worker never steals from itself:

victim = rng.next_usize(n - 1)
if victim >= self_id { victim += 1 }
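As a runnable sketch, with a raw random value standing in for the XorShift64 draw (the real code calls rng.next_usize(n - 1)):

```rust
// Self-excluding victim selection: draw an index in 0..n-1, then shift
// past self_id so the worker never picks itself. `rand` stands in for the
// XorShift64 output used in the real code.
fn pick_victim(rand: usize, n_workers: usize, self_id: usize) -> usize {
    debug_assert!(n_workers > 1 && self_id < n_workers);
    let mut victim = rand % (n_workers - 1);
    if victim >= self_id {
        victim += 1;
    }
    victim
}
```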

Combined Atomic State

The executor uses a single AtomicUsize to encode both the accepting flag and the in-flight task count:

  63                              1   0
 ┌─────────────────────────────┬─────┐
 │      in_flight_count        │  A  │
 └─────────────────────────────┴─────┘
                                 │
                                 └── accepting bit (1=open, 0=closed)

Source: executor_core.rs, executor.rs

Why Combined State?

A naive implementation with separate atomics for "accepting" and "count" has a TOCTOU race:

// BROKEN:
if accepting.load() {       // (1) check — gate still open
    count.fetch_add(1);     // (2) increment
    // Another thread calls join() between (1) and (2)
    // → Task spawned after join "closed the gate"
}

The combined state eliminates this by making the check-and-increment atomic via CAS.
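A sketch of the fixed check-and-increment, following the documented encoding (the real CAS loop lives in ExecutorHandle::spawn and may differ in ordering details):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const ACCEPTING_BIT: usize = 1;
const COUNT_UNIT: usize = 2;

// Combined-state spawn check: the accepting test and the count increment
// happen in one CAS, so a concurrent close_gate() can never slip between
// them. Returns false if the gate is closed (caller keeps the task).
fn try_acquire_spawn(state: &AtomicUsize) -> bool {
    let mut cur = state.load(Ordering::Acquire);
    loop {
        if cur & ACCEPTING_BIT == 0 {
            return false; // gate closed
        }
        match state.compare_exchange_weak(
            cur,
            cur + COUNT_UNIT, // count++ in bits 1+
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => return true,
            Err(actual) => cur = actual, // lost the race; re-check
        }
    }
}
```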

State Operations

| Operation | Atomic Op | Effect |
| --- | --- | --- |
| Init | store(ACCEPTING_BIT) | state = 1 (accepting, count = 0) |
| External spawn | CAS loop | If accepting: count++ |
| Internal spawn | fetch_add(COUNT_UNIT) | count++ (workers only run while the executor is open) |
| Completion | fetch_sub(COUNT_UNIT) | count--; if the previous count was 1 and !accepting: trigger done |
| Join/close | fetch_and(!ACCEPTING_BIT) | Clear accepting bit |
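The state word decodes with two shifts/masks. This sketch follows the in_flight()/is_accepting() helpers named in the source-of-truth table, under the bit layout above:

```rust
const ACCEPTING_BIT: usize = 1;

// In-flight count lives in bits 1+, so decoding is a single shift.
fn in_flight(state: usize) -> usize {
    state >> 1
}

// The LSB is the accepting flag: 1 = open, 0 = gate closed.
fn is_accepting(state: usize) -> bool {
    state & ACCEPTING_BIT != 0
}
```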

Loom Verification

The state machine is verified under all interleavings using loom tests in executor_core.rs:

  • spawn_vs_join_gate_atomicity — CAS spawn racing against gate close
  • completion_vs_join_termination — Exactly one observer triggers termination
  • concurrent_spawns_no_lost_count — Two concurrent spawns produce count=2
  • three_way_spawn_complete_join — Three-way race with correct terminal state

Shutdown Protocol

stateDiagram-v2
    [*] --> Accepting: init (state=0x01)

    Accepting --> Accepting: spawn (CAS: count++)
    Accepting --> Accepting: completion (fetch_sub)

    Accepting --> Draining: join() or shutdown()<br/>fetch_and(!ACCEPTING_BIT)

    Draining --> Draining: completion (count--)
    Draining --> Terminal: last completion<br/>(count 1→0, !accepting)

    Accepting --> Terminal: join() when count==0

    Terminal --> [*]: initiate_done()<br/>unpark_all → workers exit

Executor::join(self) Sequence

  1. Close gate: fetch_and(!ACCEPTING_BIT) — atomically clears accepting bit, returns previous state.
  2. Check terminal: If in_flight(prev_state) == 0, call initiate_done() immediately.
  3. Join threads: Pop each JoinHandle, collect WorkerMetricsLocal from each worker.
  4. Merge metrics: Aggregate into a single MetricsSnapshot.
  5. Propagate panic: If any worker panicked, resume_unwind on the calling thread.

ExecutorHandle::shutdown()

Cooperative stop for abandonment scenarios (e.g., lease loss):

  1. close_gate(&self.shared.state) — reject future spawns
  2. self.shared.initiate_done() — set done=true, unpark all workers

In-flight tasks may be dropped. This is not a graceful drain.

Panic Isolation

  • Worker task execution is wrapped in panic::catch_unwind.
  • On panic: decrement in-flight count, record panic via Shared::record_panic, signal done.
  • Only the first panic is captured; join() re-throws it after all threads have been joined.

Work Stealing Algorithm

Tiered Idle Strategy

When no work is found, workers use graduated backoff:

graph LR
    A[Work Found] -->|on_work| B["idle_rounds = 0"]
    C[No Work] -->|"idle_rounds <= spin_iters"| D["spin_loop() hint"]
    C -->|"idle_rounds > spin_iters"| E{"every 16th iter?"}
    E -->|yes| F["yield_now()"]
    E -->|no| G["park_timeout(200µs)"]
    F --> G

Source: executor.rs

| Phase | Condition | Action | Cost |
| --- | --- | --- | --- |
| Spin | idle_rounds <= spin_iters (200) | spin_loop() hint | CPU busy-wait |
| Yield | Every 16th iteration past spin threshold | thread::yield_now() | 1 syscall |
| Park | After spin threshold | parker.park_timeout(200µs) | Sleep until unparked or timeout |
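The decision logic can be sketched as a pure function of the idle round counter, using the documented thresholds (spin_iters = 200, yield on every 16th round via the & 0xF bitmask; the production TieredIdle may structure this differently):

```rust
use std::time::Duration;

// What an idle worker should do next, mirroring IdleAction in the doc.
#[derive(Debug)]
enum IdleAction {
    Spin,
    Yield,
    Park(Duration),
}

// Tiered idle decision: spin while under the spin threshold, then yield on
// every 16th round (bitmask & 0xF) and park with a timeout otherwise.
fn idle_action(idle_rounds: u32, spin_iters: u32, park_timeout: Duration) -> IdleAction {
    if idle_rounds <= spin_iters {
        IdleAction::Spin
    } else if idle_rounds & 0xF == 0 {
        IdleAction::Yield
    } else {
        IdleAction::Park(park_timeout)
    }
}
```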

Wake-on-Hoard Optimization

Without this heuristic, a worker rapidly spawning local tasks accumulates thousands in its deque while siblings sleep. After WAKE_ON_HOARD_THRESHOLD (32) consecutive local spawns, the worker calls unpark_one() to wake a sibling for stealing.

Source: executor_core.rs, executor.rs
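The counter logic amounts to a saturate-and-reset check on each local spawn. A sketch, assuming the counter resets after each triggered wakeup (the real code keeps it in WorkerCtx::local_spawns_since_wake):

```rust
const WAKE_ON_HOARD_THRESHOLD: u32 = 32;

// Returns true when a spawn_local call should also call unpark_one() to
// wake a sibling; the counter resets after each triggered wakeup.
fn should_wake_sibling(local_spawns_since_wake: &mut u32) -> bool {
    *local_spawns_since_wake += 1;
    if *local_spawns_since_wake >= WAKE_ON_HOARD_THRESHOLD {
        *local_spawns_since_wake = 0;
        true
    } else {
        false
    }
}
```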

| Threshold | Wakeup Rate | Overhead | Tail Latency |
| --- | --- | --- | --- |
| 8 | High | ~12.5% of spawns trigger syscall | Low |
| 32 (default) | Medium | ~3% of spawns trigger syscall | Medium |
| 128 | Low | ~0.8% of spawns trigger syscall | Higher |

Round-Robin Wakeups

unpark_one() uses a round-robin counter (next_unpark) to distribute wakeup load across workers. For power-of-two worker counts, bitmask is used instead of modulo to avoid division on ARM.

Source: executor.rs
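A sketch of the target selection, showing the bitmask fast path for power-of-two worker counts (the real code presumably decides the strategy once at construction rather than per call):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Round-robin wakeup target: fetch_add gives each caller a fresh ticket,
// reduced modulo the worker count. For power-of-two counts a bitmask
// replaces the modulo, avoiding an integer division.
fn next_unpark_target(counter: &AtomicUsize, n_workers: usize) -> usize {
    let ticket = counter.fetch_add(1, Ordering::Relaxed);
    if n_workers.is_power_of_two() {
        ticket & (n_workers - 1)
    } else {
        ticket % n_workers
    }
}
```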


Integration Points

local_fs_owner.rs — Local Filesystem Scanning

The primary consumer. scan_local_with_progress creates an Executor<FileTask> where:

  • Task type: FileTask (contains FileId and PathBuf)
  • Scratch type: LocalScratch (engine scratch, buffer pool handle, finding buffers, archive state)
  • Runner: process_file::<E> — opens the file, reads chunks, scans with the engine, emits findings
scan_local_with_progress()
    │
    ├─ Executor::new(ExecutorConfig, scratch_init, process_file)
    │      └─ Workers start; each gets LocalScratch with:
    │           engine, pool, scan_scratch, pending Vec, archive state
    │
    ├─ Discovery loop (main thread):
    │      for file in source.next_file():
    │          budget.acquire(1)          ← backpressure
    │          batch.push(FileTask { file_id, path })
    │          ex.spawn_external_batch(batch)
    │
    └─ ex.join() → MetricsSnapshot

Source: local_fs_owner.rs

The ExecutorConfig is constructed from LocalConfig:

ExecutorConfig {
    workers: cfg.workers,
    seed: cfg.seed,
    pin_threads: cfg.pin_threads,
    ..ExecutorConfig::default()     // steal_tries=4, spin_iters=200, park_timeout=200µs
}

parallel_scan.rs — High-Level Directory Scanning

Wraps scan_local with directory walking (gitignore, symlinks, hidden files). The ParallelScanConfig is converted to LocalConfig via to_local_config(), which in turn configures the executor.

parallel_scan_dir(root, engine, config)
    │
    ├─ IterWalker::new(root, &config)     ← single-threaded discovery
    │
    └─ scan_local(engine, walker, config.to_local_config())
           └─ Executor<FileTask> (as above)

Source: parallel_scan.rs

worker_id.rs — Thread-Local Worker Identification

Provides set_current_worker_id / current_worker_id TLS for per-worker fast-path routing in the buffer pool. Workers set their ID at startup; non-worker threads see None.

Source: worker_id.rs

yield_policy.rs — Cooperative Task Yielding

Provides deterministic yield policies (EveryN, NeverYield, AlwaysYield, AdaptiveYield, GitYieldPolicy) for long-running tasks to voluntarily yield back to the executor. Tasks use spawn_local to re-enqueue themselves with updated cursor state.

Source: yield_policy.rs


Constants and Tuning

| Constant | Value | Location | Purpose |
| --- | --- | --- | --- |
| ACCEPTING_BIT | 1 | executor_core.rs | LSB in combined state word; 1 when accepting external spawns |
| COUNT_UNIT | 2 | executor_core.rs | Increment unit for in-flight count (count stored in bits 1+) |
| WAKE_ON_HOARD_THRESHOLD | 32 | executor_core.rs | Local spawns before proactively waking a sibling |

ExecutorConfig Defaults

| Parameter | Default | Source |
| --- | --- | --- |
| workers | 1 | executor.rs |
| seed | 0x853c49e6748fea9b | executor.rs |
| steal_tries | 4 | executor.rs |
| spin_iters | 200 | executor.rs |
| park_timeout | 200µs | executor.rs |
| pin_threads | true on Linux, false elsewhere | executor.rs |

Tiered Idle Constants

| Constant | Value | Source | Purpose |
| --- | --- | --- | --- |
| Yield frequency | Every 16th idle iteration past spin threshold | executor.rs | Bitmask & 0xF; heuristic to avoid monopolizing the core |

ParallelScanConfig Defaults (caller-facing)

| Parameter | Default | Source |
| --- | --- | --- |
| workers | num_cpus::get() | parallel_scan.rs |
| chunk_size | 256 KiB | parallel_scan.rs |
| pool_buffers | 4 × workers | parallel_scan.rs |
| max_in_flight_objects | 1024 | parallel_scan.rs |
| seed | 0x853c49e6748fea9b | parallel_scan.rs |

Correctness Invariants

| Invariant | Mechanism | Verified By |
| --- | --- | --- |
| Work-conserving (once spawned, a task will execute) | Combined state prevents lost spawns after gate close | — |
| Termination detection | in_flight counter in combined state | Loom tests in executor_core.rs |
| No lost wakeups | Parker/Unparker pattern | Crossbeam guarantee: unpark before park becomes no-op |
| Panic isolation | catch_unwind around runner | Test panic_in_task_decrements_count (executor.rs) |
| No TOCTOU on shutdown | Combined atomic state word | Loom test spawn_vs_join_gate_atomicity |
| Exactly-once termination | prev_count == 1 && !accepting check | Loom test completion_vs_join_termination |
| Counter bounded (in_flight never underflows) | assert!(prev_count > 0) in worker_step | executor_core.rs |

Performance Characteristics

| Aspect | Cost |
| --- | --- |
| External spawn (CAS loop) | 1–3 CAS iterations typical |
| Injector push | O(1) amortized (crossbeam MPMC) |
| Unpark | 1 condvar notify (pthread on POSIX) |
| Local pop | O(1), zero contention |
| Batch steal from injector | O(batch_size) |
| Victim steal | O(steal_tries) worst case per idle cycle |
| Termination check | O(1) via combined atomic state |

Source of Truth

| Item | File |
| --- | --- |
| ExecutorConfig | executor.rs |
| ExecutorHandle<T> | executor.rs |
| Shared<T> | executor.rs |
| WorkerCtx<T, S> | executor.rs |
| Executor<T> | executor.rs |
| worker_loop | executor.rs |
| TieredIdle | executor.rs |
| ACCEPTING_BIT / COUNT_UNIT | executor_core.rs |
| WAKE_ON_HOARD_THRESHOLD | executor_core.rs |
| PopSource | executor_core.rs |
| WorkerStepResult<Tag> | executor_core.rs |
| ExecTraceEvent<Tag> | executor_core.rs |
| TraceHooks<T> / NoopTrace | executor_core.rs |
| IdleAction / IdleHooks | executor_core.rs |
| WorkerCtxLike<T, S> | executor_core.rs |
| in_flight() / is_accepting() / close_gate() / increment_count() | executor_core.rs |
| pop_task() | executor_core.rs |
| worker_step() | executor_core.rs |
| Loom tests | executor_core.rs |
| XorShift64 (RNG) | rng.rs |
| CoreAssigner / default_pin_threads | affinity.rs |
| set_current_worker_id / current_worker_id | worker_id.rs |
| YieldPolicy / EveryN / NeverYield / AdaptiveYield | yield_policy.rs |
| ParallelScanConfig | parallel_scan.rs |
| scan_local_with_progress (executor construction) | local_fs_owner.rs |
| WorkerMetricsLocal | metrics.rs |
| MetricsSnapshot | metrics.rs |

All paths are relative to crates/scanner-scheduler/src/scheduler/.