File: crates/scanner-scheduler/src/scheduler/engine_impl.rs
This module implements the adapter layer connecting the scheduler's abstract engine traits to the production scanning engine. It enables the real Engine to work seamlessly with the scheduler's chunked scanning architecture.
The engine_impl module provides trait implementations that bridge the gap between two different worlds:
- Scheduler side: Expects generic traits (ScanEngine, EngineScratch, FindingRecord, FindingWithHashRecord)
- Engine side: Has concrete types (crate::engine::Engine, ScanScratch, FindingRec, NormHash)
This decoupling allows:
- The scheduler to remain engine-agnostic (can work with mock or real engine)
- The engine implementation to evolve independently
- Testing to proceed in parallel (mock engine in crates/scanner-scheduler/src/scheduler/engine_stub.rs)
- Clear contract documentation through traits
Key principle: The scheduler doesn't directly call Engine::scan_chunk_into(). Instead, it calls through the ScanEngine trait, which is implemented here to forward to the real engine.
| Concept | Scheduler Trait | Real Engine Type | Wrapper |
|---|---|---|---|
| Finding | FindingRecord | crate::api::FindingRec | Direct impl (no wrapper) |
| Finding+Hash | FindingWithHashRecord | FindingWithHash<ApiFindingRec> | FindingWithHash carrier |
| Scratch | EngineScratch | crate::engine::ScanScratch | RealEngineScratch |
| Engine | ScanEngine | crate::engine::Engine | Direct impl |
For FindingRecord:
impl FindingRecord for ApiFindingRec {
fn rule_id(&self) -> u32 { self.rule_id }
fn root_hint_start(&self) -> u64 { self.root_hint_start }
fn root_hint_end(&self) -> u64 { self.root_hint_end }
fn span_start(&self) -> u64 { u64::from(self.span_start) }
fn span_end(&self) -> u64 { u64::from(self.span_end) }
fn confidence_score(&self) -> i8 { self.confidence_score }
}
This is straightforward because FindingRec fields map directly to trait methods. The span_start/end conversion from u32 to u64 handles the real engine's compact representation.
For ScanEngine:
impl ScanEngine for Engine {
type Scratch = RealEngineScratch;
fn required_overlap(&self) -> usize { self.required_overlap() }
fn new_scratch(&self) -> Self::Scratch {
let scratch = self.new_scratch();
RealEngineScratch::new(scratch, Engine::max_findings_per_chunk(self))
}
fn scan_chunk_into(&self, data, file_id, base_offset, scratch) {
self.scan_chunk_into(data, file_id, base_offset, scratch.inner_mut())
}
fn rule_name(&self, rule_id) -> &str { self.rule_name(rule_id) }
}
The key insight: RealEngineScratch wraps the engine's native ScanScratch and adds the findings buffer for extraction.
The real engine's scratch reset methods require an &Engine reference:
impl ScanScratch {
fn reset_for_scan_after_prefilter(&mut self, engine: &Engine) {
// Resets per-scan state using engine data while preserving prefilter hits
}
}
But the trait method clear() doesn't have access to the engine:
pub trait EngineScratch {
fn clear(&mut self); // No engine reference available!
}
Instead of fighting the design, the module implements a lazy reset approach:
- In RealEngineScratch::clear():
  - Only clear the temporary drain buffers (findings_buf and norm_hash_buf)
  - Do NOT reset the underlying engine scratch state
  - This is a no-op for the real scratch's internal state
- In Engine::scan_chunk_into():
  - The real engine owns scratch lifecycle internally (capacity validation + per-scan state reset on scan path)
  - This happens in the real engine code, not in the trait adapter
  - The engine has the &Engine reference needed for its reset methods
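The lazy reset split can be sketched in isolation. This is a minimal illustration under stated assumptions, not the module's code: the Engine, ScanScratch, and RealEngineScratch types below are simplified stand-ins, and rule_count, per_rule_hits, and reset_for_scan are hypothetical names. The "rule" is a toy that reports every `!` byte.

```rust
/// Hypothetical stand-in for the real engine; only `rule_count` matters here.
struct Engine {
    rule_count: usize,
}

/// Hypothetical stand-in for the engine's native scratch.
struct ScanScratch {
    per_rule_hits: Vec<u32>, // per-scan state, sized from engine data
    findings: Vec<u64>,      // findings accumulated by the most recent scan
}

impl ScanScratch {
    /// Needs `&Engine`, so it cannot be called from a `clear(&mut self)`.
    fn reset_for_scan(&mut self, engine: &Engine) {
        self.per_rule_hits.clear();
        self.per_rule_hits.resize(engine.rule_count, 0);
        self.findings.clear();
    }
}

impl Engine {
    fn scan_chunk_into(&self, data: &[u8], scratch: &mut ScanScratch) {
        // The engine resets its own scratch at the start of every scan.
        scratch.reset_for_scan(self);
        // Toy rule 0: record the offset of every b'!' byte.
        for (offset, byte) in data.iter().enumerate() {
            if *byte == b'!' {
                if let Some(hits) = scratch.per_rule_hits.first_mut() {
                    *hits += 1;
                }
                scratch.findings.push(offset as u64);
            }
        }
    }
}

/// Adapter-side wrapper: the native scratch plus a local drain buffer.
struct RealEngineScratch {
    scratch: ScanScratch,
    findings_buf: Vec<u64>,
}

impl RealEngineScratch {
    /// Lazy reset: touch only the adapter's buffer, never the engine state.
    fn clear(&mut self) {
        self.findings_buf.clear();
    }
}
```

In this sketch, calling clear() between scans leaves the engine-side findings untouched; they are only reset when the next scan_chunk_into() runs with the engine reference in hand.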
Worker Thread Creation:
├─ Engine::new_scratch() → RealScanScratch
├─ RealEngineScratch::new(scratch, max_findings)
│ ├─ Creates findings_buf with capacity
│ └─ Creates norm_hash_buf with capacity
└─ Stored in WorkerCtx (thread-local)
For Each File:
├─ Optional: scratch.clear() → clears findings_buf + norm_hash_buf only
└─ For Each Chunk:
├─ scan_chunk_into()
│ ├─ Engine handles internal scratch reset/validation as part of scan
│ ├─ Scans chunk, appends findings + norm_hashes to real ScanScratch
│ └─ Findings and hashes now in scratch.scratch
├─ drop_prefix_findings() → delegates to real scratch
└─ drain_findings_into()
├─ Real scratch.drain_findings_with_hashes() → findings_buf + norm_hash_buf
├─ Zip findings with aligned hashes → FindingWithHash carriers
└─ Append FindingWithHash values → out vector
- Zero adapter-level resets: Scratch reset logic stays inside the engine scan path
- Clean trait interface: The scheduler doesn't need to know engine internals
- Low overhead: The drain buffer is reused across chunks to avoid repeated allocations
This module is a classic Adapter pattern (also called Wrapper):
┌─────────────────────────────────────────────────────────────┐
│ Scheduler (uses ScanEngine trait) │
│ - new_scratch() │
│ - scan_chunk_into() │
│ - required_overlap() │
└───────────────────────┬─────────────────────────────────────┘
│ implements
│ ScanEngine trait
▼
┌──────────────────────────────────────┐
│ Engine (with ScanEngine impl) │
│ from engine_impl.rs │
└────────┬─────────────────────────────┘
│ forwards to
▼
┌──────────────────────────────────────┐
│ crate::engine::Engine (real impl) │
│ - new_scratch() │
│ - scan_chunk_into() │
│ - required_overlap() │
└──────────────────────────────────────┘
| Problem | Solution |
|---|---|
| Trait's clear() takes only &mut self, but engine reset methods need &Engine | Lazy reset pattern |
| Finding types differ slightly (u32 vs u64 for span_start/span_end) | Type conversion in trait impl |
| Engine returns ScanScratch, trait expects RealEngineScratch | RealEngineScratch::new() wrapper |
| Findings need extraction from engine scratch | drain_findings_into() buffer pattern |
| Findings and hashes stored in parallel vectors in engine scratch | drain_findings_with_hashes() + zip into FindingWithHash carriers |
- Flexibility: Scheduler works with any ScanEngine impl (mock or real)
- Maintainability: Engine changes don't require scheduler changes
- Testability: Mock engine can have different scratch behavior
- Clarity: Trait defines the contract explicitly
pub struct RealEngineScratch {
scratch: RealScanScratch, // The real engine's native scratch
findings_buf: Vec<ApiFindingRec>, // Temporary drain buffer (findings)
norm_hash_buf: Vec<NormHash>, // Temporary drain buffer (hashes)
}
Purpose: Wraps the real ScanScratch and provides trait compliance.
Fields:
- scratch: Stores pattern match state, findings lists, etc. from the real engine
- findings_buf: A pre-allocated buffer used to extract findings from the real scratch
  - Reused across chunk drains to minimize allocations
  - Sized to max_findings_per_chunk at creation
- norm_hash_buf: A pre-allocated buffer for extracting NormHash values aligned with findings_buf
  - Sized identically to findings_buf at creation
  - Drained in lockstep with findings to construct FindingWithHash carriers
Methods:
- new(scratch, max_findings): Wraps a real scratch with findings and hash buffers
- inner_mut(): Provides mutable access for the engine's scan_chunk_into()
ScanScratch is not defined in this module, but conceptually it:
- Holds internal Vectorscan pattern state
- Accumulates findings during scanning
- Provides reset/validation helpers (reset_for_scan_after_prefilter(), ensure_capacity()), plus finding APIs (drop_prefix_findings(), drain_findings())
The real engine's finding type:
pub struct FindingRec {
pub rule_id: u32,
pub root_hint_start: u64,
pub root_hint_end: u64,
pub span_start: u32, // Note: u32, not u64
pub span_end: u32,
pub file_id: FileId,
pub step_id: StepId,
pub confidence_score: i8,
}
Maps directly to the FindingRecord trait methods.
The Pattern:
fn drain_findings_into(&mut self, out: &mut Vec<Self::Finding>) {
self.findings_buf.clear();
self.norm_hash_buf.clear();
let pending = self.scratch.pending_findings_len();
// Buffers were pre-sized to max_findings_per_chunk at startup.
// The engine enforces this cap, so pending must fit. Allocating
// here would violate the zero-alloc-after-init invariant.
debug_assert!(self.findings_buf.capacity() >= pending,
"findings_buf capacity < pending; engine exceeded max_findings_per_chunk");
debug_assert!(self.norm_hash_buf.capacity() >= pending,
"norm_hash_buf capacity < pending; engine exceeded max_findings_per_chunk");
// Drain findings and hashes into separate staging buffers
self.scratch.drain_findings_with_hashes(
&mut self.findings_buf, &mut self.norm_hash_buf,
);
debug_assert_eq!(self.findings_buf.len(), self.norm_hash_buf.len(),
"finding/hash drain length mismatch");
// `out` is the per-worker pending vec, pre-sized at worker init.
// After clear() by the caller, capacity is guaranteed sufficient.
debug_assert!(out.capacity() - out.len() >= self.findings_buf.len(),
"output vec remaining capacity < drained findings");
// Zip and append as FindingWithHash carriers
for (finding, norm_hash) in self.findings_buf.drain(..)
.zip(self.norm_hash_buf.drain(..))
{
out.push(FindingWithHash::new(finding, norm_hash));
}
}
How it works:
- drain_findings_with_hashes() moves findings and hashes into separate buffers
- Zip iterator pairs each finding with its aligned hash
- Each pair is wrapped in a FindingWithHash carrier and appended to out
- Both buffers retain capacity for reuse across chunks
Cost: O(n) moves for drain_findings_with_hashes() + O(n) FindingWithHash construction, typically without extra allocations once buffers are sized
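The drain-and-zip flow can be tried standalone. The sketch below is illustrative only: Finding, NormHash, FindingWithHash, Scratch, and Adapter are simplified hypothetical types (the real NormHash layout is not shown in this document, so a 4-byte array is assumed), and the engine-side drain is modeled with Vec::append.

```rust
/// Hypothetical simplified finding and hash types (not the crate's own).
#[derive(Debug, PartialEq)]
struct Finding {
    rule_id: u32,
    span_start: u32,
    span_end: u32,
}
type NormHash = [u8; 4]; // assumed layout, for illustration only

#[derive(Debug, PartialEq)]
struct FindingWithHash {
    finding: Finding,
    norm_hash: NormHash,
}

/// Stand-in for the engine scratch: findings and hashes in parallel vectors.
struct Scratch {
    findings: Vec<Finding>,
    hashes: Vec<NormHash>,
}

impl Scratch {
    /// Moves both vectors' contents out, leaving the scratch empty.
    fn drain_findings_with_hashes(&mut self, f: &mut Vec<Finding>, h: &mut Vec<NormHash>) {
        f.append(&mut self.findings);
        h.append(&mut self.hashes);
    }
}

struct Adapter {
    scratch: Scratch,
    findings_buf: Vec<Finding>,
    norm_hash_buf: Vec<NormHash>,
}

impl Adapter {
    fn drain_findings_into(&mut self, out: &mut Vec<FindingWithHash>) {
        // clear() keeps the allocation, so the buffers are reused as-is.
        self.findings_buf.clear();
        self.norm_hash_buf.clear();
        self.scratch
            .drain_findings_with_hashes(&mut self.findings_buf, &mut self.norm_hash_buf);
        debug_assert_eq!(self.findings_buf.len(), self.norm_hash_buf.len());
        // drain(..) empties each buffer but leaves its capacity in place.
        out.extend(
            self.findings_buf
                .drain(..)
                .zip(self.norm_hash_buf.drain(..))
                .map(|(finding, norm_hash)| FindingWithHash { finding, norm_hash }),
        );
    }
}
```

As long as the staging buffers and the output vector were created with enough capacity, a drain in this sketch moves elements without reallocating, which is the property the debug assertions in the original method guard.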
Key insight: Each worker gets its own RealEngineScratch
Advantages:
- No lock contention (scratch is never shared)
- No cache line bouncing
- CPU-friendly access patterns (thread-local memory)
- Pattern state tends to stay warm in the owning core's caches
Safety note:
- Raw pointers in VsScratch (Vectorscan handles) are !Send by default
- But they're safe to transfer once (at worker startup) because:
  - Each scratch pinned to exactly one thread
  - Only one thread accesses the Vectorscan handles
  - Vectorscan doesn't require thread-safe handles, just exclusive access
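The "transfer once, then exclusive access" argument can be illustrated with a small sketch. Everything here is hypothetical: VsScratch is modeled as a struct holding a raw pointer to a counter, WorkerScratch and run_worker are invented names, and the unsafe impl Send is justified only by the stated assumption that a single worker thread ever touches the handle.

```rust
use std::thread;

/// Hypothetical stand-in for a scratch holding a raw Vectorscan-style handle.
/// The raw pointer makes this type `!Send` by default.
struct VsScratch {
    handle: *mut u64,
}

/// Wrapper asserting that moving the scratch to another thread is sound.
struct WorkerScratch(VsScratch);

// SAFETY (assumption of this sketch): the wrapper is moved to exactly one
// worker thread, and only that thread ever dereferences `handle`.
unsafe impl Send for WorkerScratch {}

/// Moves a scratch into one worker, "scans" the chunks, returns total bytes.
fn run_worker(chunks: Vec<Vec<u8>>) -> u64 {
    let handle = Box::into_raw(Box::new(0u64));
    let scratch = WorkerScratch(VsScratch { handle });
    let worker = thread::spawn(move || {
        // Rebind so the closure captures the whole `Send` wrapper; with
        // edition 2021 disjoint captures it would otherwise capture only
        // the raw pointer field, which is `!Send`.
        let scratch = scratch;
        for chunk in &chunks {
            // Exclusive access: no other thread holds this handle.
            unsafe { *scratch.0.handle += chunk.len() as u64 };
        }
        // Reclaim the allocation and return the final counter value.
        unsafe { *Box::from_raw(scratch.0.handle) }
    });
    worker.join().expect("worker panicked")
}
```

The design choice mirrors the list above: no locking is needed because ownership moves to the worker once and is never shared afterwards.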
- findings_buf is pre-allocated and reused across chunks
  - Vec::with_capacity() avoids repeated allocations
  - Vec::clear() preserves capacity for next scan
  - No per-chunk allocation if findings count stays under capacity
- drop_prefix_findings() filters findings by offset (O(n) scan, but n is small)
  - Happens after each chunk scan to keep per-worker state bounded
  - Prevents memory growth over many chunks
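One way such a prefix filter could look is sketched below. The exact drop rule used by the real scratch is not stated in this document; the sketch assumes a finding is dropped when its root hint ends at or before new_bytes_start, and the Finding struct is a hypothetical reduction of the real record.

```rust
/// Hypothetical minimal finding: only the root-hint range matters here.
#[derive(Debug, PartialEq)]
struct Finding {
    rule_id: u32,
    root_hint_start: u64,
    root_hint_end: u64,
}

/// Drops findings that end at or before `new_bytes_start` (assumed rule).
/// `Vec::retain` makes a single O(n) pass, preserves order, and does not
/// allocate, so the vector's capacity is kept for the next chunk.
fn drop_prefix_findings(findings: &mut Vec<Finding>, new_bytes_start: u64) {
    findings.retain(|f| f.root_hint_end > new_bytes_start);
}
```

Under this assumed rule, a finding that straddles the boundary is kept, since part of it lies in the new bytes; only findings wholly inside the overlap prefix are removed.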
| Aspect | engine_impl.rs (Real) | engine_stub.rs (Mock) |
|---|---|---|
| Engine Type | crate::engine::Engine | MockEngine |
| Scratch Type | RealEngineScratch wrapping ScanScratch | ScanScratch directly |
| Finding Type | crate::api::FindingRec (u32 spans) | FindingRec (u64 spans) |
| Scanning | Vectorscan regex engine with anchors, transforms, validators | Simple substring search |
| Performance | Highly optimized, Vectorscan-backed | Slow but deterministic |
| Use Case | Production scanning | Testing scheduler logic |
Both modules implement the same traits:
// engine_impl.rs
impl FindingRecord for crate::api::FindingRec { ... }
impl ScanEngine for crate::engine::Engine { ... }
impl EngineScratch for RealEngineScratch {
type Finding = FindingWithHash<ApiFindingRec>;
...
}
// engine_stub.rs
impl FindingRecord for FindingRec { ... }
impl ScanEngine for MockEngine { ... }
impl EngineScratch for ScanScratch {
type Finding = FindingWithHash<FindingRec>;
...
}
Mock is simpler because:
- ScanScratch directly contains findings (no intermediate extraction buffer needed)
- Finding types are simple and match trait exactly
- No engine reference complexity (no lazy reset needed)
Real is more complex because:
- Must wrap ScanScratch in RealEngineScratch
- Must handle lazy reset pattern for engine compliance
- Type conversions for finding fields
The scheduler code is identical for both:
fn scan_one_chunk<E: ScanEngine>(
engine: &E,
data: &[u8],
file_id: FileId,
offset: u64,
new_bytes_start: u64,
out: &mut Vec<<E::Scratch as EngineScratch>::Finding>,
) {
let mut scratch = engine.new_scratch();
engine.scan_chunk_into(data, file_id, offset, &mut scratch);
scratch.drop_prefix_findings(new_bytes_start);
scratch.drain_findings_into(out);
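}
This is the power of the adapter pattern: the same generic function compiles against either engine, and only the type parameter changes. (The snippet creates a scratch inline for brevity; per the lifecycle described earlier, each worker creates one scratch at startup and reuses it across chunks.)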
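To see the generic call path run end to end, here is a self-contained sketch with a toy engine. The trait definitions are reduced, assumed versions of ScanEngine and EngineScratch (no file_id, no rule names), and ByteEngine and ByteScratch are hypothetical types invented for the example. Unlike the snippet above, the scratch is passed in so it can be reused across chunks.

```rust
/// Reduced, assumed versions of the scheduler traits.
trait EngineScratch {
    type Finding;
    fn drop_prefix_findings(&mut self, new_bytes_start: u64);
    fn drain_findings_into(&mut self, out: &mut Vec<Self::Finding>);
}

trait ScanEngine {
    type Scratch: EngineScratch;
    fn new_scratch(&self) -> Self::Scratch;
    fn scan_chunk_into(&self, data: &[u8], base_offset: u64, scratch: &mut Self::Scratch);
}

/// Toy engine: reports the absolute offset of every `needle` byte.
struct ByteEngine {
    needle: u8,
}

struct ByteScratch {
    offsets: Vec<u64>,
}

impl EngineScratch for ByteScratch {
    type Finding = u64;
    fn drop_prefix_findings(&mut self, new_bytes_start: u64) {
        // Findings before the new bytes were already reported by the
        // previous, overlapping chunk.
        self.offsets.retain(|&o| o >= new_bytes_start);
    }
    fn drain_findings_into(&mut self, out: &mut Vec<u64>) {
        out.append(&mut self.offsets);
    }
}

impl ScanEngine for ByteEngine {
    type Scratch = ByteScratch;
    fn new_scratch(&self) -> ByteScratch {
        ByteScratch { offsets: Vec::new() }
    }
    fn scan_chunk_into(&self, data: &[u8], base_offset: u64, scratch: &mut ByteScratch) {
        scratch.offsets.clear(); // the engine resets its own scratch
        for (i, b) in data.iter().enumerate() {
            if *b == self.needle {
                scratch.offsets.push(base_offset + i as u64);
            }
        }
    }
}

/// Generic over any engine; knows nothing about `ByteEngine`.
fn scan_one_chunk<E: ScanEngine>(
    engine: &E,
    scratch: &mut E::Scratch,
    data: &[u8],
    offset: u64,
    new_bytes_start: u64,
    out: &mut Vec<<E::Scratch as EngineScratch>::Finding>,
) {
    engine.scan_chunk_into(data, offset, scratch);
    scratch.drop_prefix_findings(new_bytes_start);
    scratch.drain_findings_into(out);
}
```

Scanning the buffer `ab!cd!ef` as two overlapping chunks (bytes 0..5 at offset 0, then bytes 2..8 at offset 2 with new_bytes_start 5) reports each `!` once: the second chunk sees the match at offset 2 again, but the prefix drop removes it.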
The module includes tests that verify:
real_engine_implements_scan_engine():
- Trait methods work through the adapter
- Findings are correctly extracted
- Rule names resolve properly
drop_prefix_findings_works():
- Overlap deduplication works
- Findings in the overlap region are dropped
These tests confirm that the real engine works correctly through the trait interface.
engine_impl.rs is a small but crucial module that:
- Enables production use of the scheduler with the real scanning engine
- Maintains trait abstraction so scheduler stays engine-agnostic
- Solves impedance mismatches through lazy reset and buffer patterns
- Preserves performance with buffer reuse and thread-local scratch
- Provides parallel testability by coexisting with the mock engine
- Bridges hash plumbing by zipping findings and norm_hashes into FindingWithHash carriers for persistence
The adapter pattern keeps concerns separated: the engine focuses on pattern matching, the scheduler focuses on chunking and deduplication, and this module keeps them talking.