The scheduler::engine_trait module defines a trait-based abstraction layer that decouples the scheduler from specific detection engine implementations. This abstraction enables the scheduler to work seamlessly with both mock engines (for testing) and real production engines (for actual secret scanning).
The module exports four core traits and one carrier type:
- ScanEngine - The primary scanning interface
- EngineScratch - Per-worker scratch state management
- FindingRecord - Finding representation abstraction
- FindingWithHashRecord - Extension of FindingRecord that carries normalized secret hash bytes
- FindingWithHash<F> - Generic carrier type that bundles a finding with its NormHash
This design allows the scheduler logic to remain engine-agnostic while supporting different underlying implementations with varying data types and behaviors.
ScanEngine defines the primary interface that the scheduler uses to perform chunk-based scanning operations. It abstracts the core functionality without coupling to a specific engine implementation.
- Stateless Design: The engine itself is immutable and can be safely shared across all worker threads (Send + Sync)
- Per-Worker State: All mutable state is isolated in the associated Scratch type, ensuring thread safety without synchronization overhead
- Overlapping Chunks: The engine declares a required_overlap() that the scheduler must respect when dividing work
Purpose: Returns the minimum byte overlap required between consecutive chunks.
Contract: The scheduler guarantees that if chunk N spans [base, base + len), chunk N+1 will start no later than base + len - overlap, ensuring no findings are missed at boundaries.
Example: If a rule needs to match across boundaries, it might require 100 bytes of overlap to capture patterns that span chunk edges.
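The overlap contract can be illustrated with a small sketch. The chunk_ranges helper below is hypothetical and not part of the module; it assumes fixed-size chunks and an overlap strictly smaller than the chunk length.

```rust
/// Hypothetical helper (not part of the module): splits a file of `file_len`
/// bytes into chunks of at most `chunk_len` bytes, where each chunk re-reads
/// the last `overlap` bytes of its predecessor.
/// Assumes `overlap < chunk_len` so that every chunk makes progress.
pub fn chunk_ranges(file_len: u64, chunk_len: u64, overlap: u64) -> Vec<(u64, u64)> {
    assert!(overlap < chunk_len, "overlap must be smaller than chunk_len");
    let mut ranges = Vec::new();
    let mut base = 0u64;
    while base < file_len {
        let end = (base + chunk_len).min(file_len);
        ranges.push((base, end));
        if end == file_len {
            break;
        }
        // The next chunk starts at base + len - overlap, which satisfies
        // the "no later than base + len - overlap" contract.
        base = end - overlap;
    }
    ranges
}
```

With a 1800-byte file, 1000-byte chunks, and 100 bytes of overlap, this yields the ranges [0, 1000) and [900, 1800).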
Purpose: Creates a fresh Scratch instance for a worker thread.
Contract: Called once per worker at startup. The returned scratch is reused across all chunks processed by that worker, avoiding repeated allocations.
Usage: The scheduler calls this during worker thread initialization to set up per-thread state.
Purpose: Scans a data buffer and appends findings to the scratch space.
Parameters:
- data: The chunk to scan
- file_id: Identifier of the file being scanned (for attribution)
- base_offset: Absolute byte offset of data[0] in the original file
- scratch: The worker's per-thread scratch space to accumulate findings
Contract: Findings are reported with absolute byte offsets (not relative to the chunk). The scheduler is responsible for deduplication via scratch.drop_prefix_findings().
Archive note: When scanning archive entries, the scheduler supplies virtual FileId values
in a dedicated high‑bit namespace. This ensures per-entry engine state isolation and prevents
collisions with real filesystem file IDs.
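One way to picture a high-bit namespace is sketched below. Everything here is an assumption for illustration: the width of FileId, the choice of bit 31, and the helper names virtual_file_id and is_virtual are not taken from the module.

```rust
/// Assumed shape of the identifier; the real `FileId` may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FileId(pub u32);

/// Hypothetical namespace bit: IDs with the top bit set are treated as virtual.
const VIRTUAL_BIT: u32 = 1 << 31;

/// Hypothetical helper: maps an archive entry index into the virtual namespace.
pub fn virtual_file_id(entry_index: u32) -> FileId {
    assert!(
        entry_index < VIRTUAL_BIT,
        "entry index would overflow into the namespace bit"
    );
    FileId(entry_index | VIRTUAL_BIT)
}

/// Hypothetical helper: true when the ID lives in the virtual namespace.
pub fn is_virtual(id: FileId) -> bool {
    (id.0 & VIRTUAL_BIT) != 0
}
```

Because real filesystem IDs never set the reserved bit in this sketch, a virtual ID for entry 7 cannot collide with the real FileId(7).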
Purpose: Retrieves the human-readable name of a rule by its ID.
Contract: Returns the rule name on success; returns "<unknown-rule>" for invalid IDs. Used for output formatting and reporting.
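The fallback behaviour can be sketched as follows. The RuleTable type, its field, and the method name rule_name are hypothetical stand-ins for however the engine stores and exposes rule metadata; only the "<unknown-rule>" sentinel comes from the contract above.

```rust
/// Hypothetical rule table; the real engine's storage is not shown here.
pub struct RuleTable {
    pub names: Vec<String>,
}

impl RuleTable {
    /// Returns the rule name, or the sentinel for an out-of-range ID.
    /// The method name is an assumption made for this sketch.
    pub fn rule_name(&self, rule_id: u32) -> &str {
        self.names
            .get(rule_id as usize)
            .map(String::as_str)
            .unwrap_or("<unknown-rule>")
    }
}
```

Returning a sentinel rather than an Option keeps output formatting code free of error handling for stale or invalid IDs.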
Purpose: Returns the stable 32-byte BLAKE3 fingerprint for a rule.
Contract: The fingerprint is precomputed at engine construction from the rule name via BLAKE3 derive-key with the "gossip/rule/v1" domain constant. Position-independent: the same rule name always produces the same fingerprint regardless of compilation order. Returns all-zeros for invalid IDs.
Usage: Used for durable finding-identity derivation instead of the positional rule_id. Callers wrap the returned bytes in RuleFingerprint::from_bytes when crossing the boundary into gossip-contracts. Returned as raw [u8; 32] because the ScanEngine trait lives in scanner-scheduler, which does not depend on gossip-contracts.
Purpose: Returns the maximum number of findings retained per chunk scan.
Contract: Required method (no default). The real engine returns the configured tuning.max_findings_per_chunk value; the mock engine returns the capacity set at construction. Used by the scheduler for capacity planning and by dropped_findings() accounting.
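To make the shape of the scanning interface concrete, here is a reduced sketch. ScanEngineSketch is a hypothetical, cut-down trait with assumed signatures (three methods only, file_id as a plain u32); LiteralEngine and LiteralScratch are toy types invented for the example and do not reflect the real or mock engines.

```rust
/// Reduced, hypothetical version of the trait: only three methods, with
/// assumed signatures. The real trait has more methods and stricter bounds.
pub trait ScanEngineSketch: Send + Sync {
    type Scratch;
    fn required_overlap(&self) -> usize;
    fn new_scratch(&self) -> Self::Scratch;
    fn scan_chunk_into(
        &self,
        data: &[u8],
        file_id: u32,
        base_offset: u64,
        scratch: &mut Self::Scratch,
    );
}

/// Toy engine that reports every occurrence of one literal needle.
pub struct LiteralEngine {
    pub needle: Vec<u8>,
}

/// Toy scratch: (file_id, absolute start, absolute end) per finding.
#[derive(Default)]
pub struct LiteralScratch {
    pub findings: Vec<(u32, u64, u64)>,
}

impl ScanEngineSketch for LiteralEngine {
    type Scratch = LiteralScratch;

    fn required_overlap(&self) -> usize {
        // A match can straddle a boundary by at most needle.len() - 1 bytes.
        self.needle.len().saturating_sub(1)
    }

    fn new_scratch(&self) -> LiteralScratch {
        LiteralScratch::default()
    }

    fn scan_chunk_into(
        &self,
        data: &[u8],
        file_id: u32,
        base_offset: u64,
        scratch: &mut LiteralScratch,
    ) {
        if self.needle.is_empty() {
            return; // `windows(0)` would panic, so an empty needle matches nothing.
        }
        for (i, window) in data.windows(self.needle.len()).enumerate() {
            if window == self.needle.as_slice() {
                // Offsets are absolute: chunk-relative index plus base_offset.
                let start = base_offset + i as u64;
                scratch
                    .findings
                    .push((file_id, start, start + self.needle.len() as u64));
            }
        }
    }
}
```

Note that the engine takes &self everywhere and writes only through the scratch argument, which is what allows one engine value to be shared across workers.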
EngineScratch abstracts the per-worker scratch memory used to accumulate findings during scanning. Each worker thread has its own scratch instance, ensuring thread-safe finding collection without locks.
- Thread-Local: One instance per worker, never shared across threads
- Reusable: Scratch is reused across chunks to minimize allocations
- Deduplication-Aware: Provides methods to manage findings at chunk boundaries
Specifies the finding type produced by this scratch. The bound was elevated from FindingRecord to FindingWithHashRecord to ensure every finding carries a normalized secret hash for persistence plumbing. This associated type allows different engines to use their own finding representations while maintaining a common trait interface that includes hash access.
Purpose: Clears all accumulated findings, preparing the scratch for a new scan.
Contract: After calling clear(), drain_findings_into() yields no findings.
Typical Usage: Available as an implementation reset hook. Current scheduler scan loops call scan_chunk_into(), drop_prefix_findings(), and drain_findings_into() per chunk.
Purpose: Implements overlap-based deduplication by removing findings that "belong" to the previous chunk.
Parameters:
new_bytes_start: Absolute byte offset where "new" (non-overlapping) bytes begin
Semantics: A finding is dropped if finding.root_hint_end() < new_bytes_start because:
- That finding will already be detected by the chunk that covers those bytes
- Keeping it would result in duplicate reports
Example: If chunk 1 spans bytes [0, 1000) and chunk 2 spans [900, 1800), then chunk 2's new bytes begin at new_bytes_start = 1000:
- Overlap region: [900, 1000)
- Findings with root_hint_end < 1000 are dropped (duplicates already reported by chunk 1)
- Findings with root_hint_end >= 1000 are kept (they extend into chunk 2's new bytes)
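The drop rule itself is a one-line filter. The sketch below assumes a scratch backed by a plain Vec; the Finding and VecScratch types are hypothetical and carry only the fields the rule needs.

```rust
/// Hypothetical minimal finding: only the fields the drop rule needs.
#[derive(Clone, Debug, PartialEq)]
pub struct Finding {
    pub rule_id: u32,
    pub root_hint_start: u64,
    pub root_hint_end: u64,
}

/// Hypothetical scratch backed by a plain Vec.
#[derive(Default)]
pub struct VecScratch {
    pub findings: Vec<Finding>,
}

impl VecScratch {
    /// Drops findings whose root hint ends before `new_bytes_start`;
    /// under the contract, the previous chunk already reported them.
    pub fn drop_prefix_findings(&mut self, new_bytes_start: u64) {
        self.findings
            .retain(|f| f.root_hint_end >= new_bytes_start);
    }
}
```

Vec::retain preserves the order of the surviving findings, so later stages see them in the order the engine emitted them.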
Purpose: Transfers all remaining findings from the scratch to an output vector.
Contract: Findings are appended to out; the caller is responsible for clearing out beforehand if a fresh batch is desired. The scratch's internal finding buffer is empty after this call. Implementations should transfer ownership without extra allocation when possible (e.g., Vec::append or drain(..) into out).
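A minimal sketch of the append-style transfer, assuming a Vec-backed scratch. BufferScratch and its push and pending helpers are hypothetical; only drain_findings_into mirrors a method named in this document.

```rust
/// Hypothetical Vec-backed scratch, generic over the finding type.
pub struct BufferScratch<F> {
    findings: Vec<F>,
}

impl<F> BufferScratch<F> {
    pub fn new() -> Self {
        Self { findings: Vec::new() }
    }

    /// Test helper for this sketch: buffers one finding.
    pub fn push(&mut self, finding: F) {
        self.findings.push(finding);
    }

    /// Moves every buffered finding to the end of `out`.
    /// `Vec::append` leaves the internal buffer empty but keeps its
    /// capacity, so the next chunk can reuse the allocation.
    pub fn drain_findings_into(&mut self, out: &mut Vec<F>) {
        out.append(&mut self.findings);
    }

    /// Number of findings currently buffered.
    pub fn pending(&self) -> usize {
        self.findings.len()
    }
}
```

Existing elements of out are preserved, which matches the contract that the caller clears out when a fresh batch is wanted.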
Usage: Called after processing each chunk to extract findings for output or further processing.
Purpose: Returns the count of findings dropped by the engine due to per-scan capacity limits (e.g., max_findings_per_chunk tuning).
Contract: Default implementation returns 0. Used for run-level loss accounting in persistence backends via FsRunLoss.
Purpose: Returns the number of findings currently buffered in the scratch.
Contract: Default implementation returns 0. Used by the scheduler to check if drain is needed before reuse.
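An engine that enforces a per-chunk cap needs to override these defaults. The sketch below shows one plausible accounting scheme; CappedScratch, its push and pending methods, and the usize return types are assumptions, and findings are reduced to bare offsets for brevity.

```rust
/// Hypothetical scratch that enforces a per-chunk cap and counts overflow.
pub struct CappedScratch {
    findings: Vec<u64>,
    max_findings: usize,
    dropped: usize,
}

impl CappedScratch {
    pub fn new(max_findings: usize) -> Self {
        Self {
            findings: Vec::with_capacity(max_findings),
            max_findings,
            dropped: 0,
        }
    }

    /// Buffers the finding if there is room, otherwise records the loss.
    pub fn push(&mut self, finding: u64) {
        if self.findings.len() < self.max_findings {
            self.findings.push(finding);
        } else {
            self.dropped += 1;
        }
    }

    /// Findings lost to the capacity limit so far.
    pub fn dropped_findings(&self) -> usize {
        self.dropped
    }

    /// Findings currently buffered.
    pub fn pending(&self) -> usize {
        self.findings.len()
    }
}
```

Counting drops instead of silently discarding them lets run-level accounting report that results are incomplete.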
FindingRecord abstracts the representation of a single finding (a matched secret/pattern), allowing different engines to use their own finding types. It defines the common interface for querying finding metadata.
- Clone: Findings must be cloneable for efficient buffer accumulation
- Send: Findings must be sendable across thread boundaries (though used thread-locally)
- 'static: No borrowed data; findings are self-contained
Purpose: Returns the ID of the rule that matched.
Returns: A u32 rule ID (normalized across different engine types)
Purpose: Returns the start byte offset of the finding in the original buffer.
Usage: Used for cross-chunk deduplication. Under the trait contract, findings with root_hint_end < new_bytes_start are dropped.
Purpose: Returns the end byte offset (exclusive) of the finding's "root hint" region.
Semantics: This is the critical deduplication boundary. Findings are deduplicated using root_hint_end < new_bytes_start.
Purpose: Returns the start of the full match span (the actual matched content).
Usage: The span may differ from the root hint:
- Root hint: The region used for deduplication
- Span: The actual matched content (may be wider for context)
Purpose: Returns the end (exclusive) of the full match span.
Contract: Typically span_end >= root_hint_end to capture the full matched region.
Purpose: Returns whether span coordinates should contribute to within-chunk deduplication key computation.
Contract: Required method (no default). When true, two findings at the same root_hint with different spans are considered distinct. When false, span coordinates are zeroed in the dedup key, collapsing different spans to the same identity. Used by push_finding_with_drop_hint to decide whether span and UTF-16 endianness contribute to the dedupe identity.
Purpose: Returns the confidence score assigned to this finding.
Contract: Required method (no default). Values map to enum-level confidence (High/Medium/Low). Used by persistence layers for prioritization and filtering.
Findings use a two-level deduplication strategy:
- Cross-Chunk Deduplication (root_hint fields):
  - Findings with root_hint_end < new_bytes_start are dropped
  - Prevents reporting the same finding multiple times across overlapping chunks
- Within-Chunk Uniqueness (span + norm_hash fields):
  - When dedupe_with_span() returns true, two findings with the same root_hint but different spans are distinct
  - When dedupe_with_span() returns false, span coordinates are zeroed in the dedup key, so only root_hint + norm_hash matter
  - Two findings at the same span but with different norm_hash values are preserved (different secrets at the same location)
  - Allows multiple matches or transformed variants of the same secret
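The within-chunk level can be sketched as a key function plus a set. This is an illustration under stated assumptions: the Finding struct is hypothetical, dedupe_with_span is modelled as a field rather than a trait method, and including rule_id in the key is a guess that the text above does not confirm.

```rust
use std::collections::HashSet;

/// Hypothetical finding with only the fields relevant to the dedup key.
#[derive(Clone, Debug)]
pub struct Finding {
    pub rule_id: u32,
    pub root_hint: (u64, u64),
    pub span: (u64, u64),
    pub norm_hash: [u8; 32],
    pub dedupe_with_span: bool,
}

/// Assumed key layout: rule, root hint, (possibly zeroed) span, hash.
pub type DedupKey = (u32, (u64, u64), (u64, u64), [u8; 32]);

pub fn dedup_key(f: &Finding) -> DedupKey {
    // When span dedup is off, zero the span so it cannot distinguish findings.
    let span = if f.dedupe_with_span { f.span } else { (0, 0) };
    (f.rule_id, f.root_hint, span, f.norm_hash)
}

/// Keeps the first finding for each key and preserves input order.
pub fn dedup_within_chunk(findings: Vec<Finding>) -> Vec<Finding> {
    let mut seen = HashSet::new();
    findings
        .into_iter()
        .filter(|f| seen.insert(dedup_key(f)))
        .collect()
}
```

With dedupe_with_span false, two findings that differ only in span collapse to one, while a finding with a different norm_hash at the same location survives.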
FindingWithHashRecord extends FindingRecord to carry normalized secret hash bytes alongside finding metadata. This trait enables persistence backends to deduplicate findings across runs without storing raw secret bytes.
Inherits all constraints from FindingRecord (Clone, Send, 'static).
Purpose: Returns the BLAKE3 digest of the raw secret bytes extracted after gate validation.
Semantics: Two findings with the same norm_hash matched the same logical secret, even if their byte spans differ due to surrounding context or transform chains.
Usage: Used by within-chunk dedup (as an additional dedup key), persistence batch construction, and cross-run deduplication in the store backend.
FindingWithHash<F: FindingRecord> is a generic wrapper that bundles a finding record with its normalized secret hash. It implements both FindingRecord (delegating to the inner finding) and FindingWithHashRecord (returning the hash).
pub struct FindingWithHash<F: FindingRecord> {
    pub finding: F,
    pub norm_hash: NormHash, // [u8; 32]
}

The engine computes a normalized hash of the matched secret at scan time (inside scan_chunk_into). This hash must travel with the finding through overlap dedup, within-chunk dedup, and final emission. Storing them in separate parallel vectors is fragile: any sort, filter, or drain would require coordinating two collections. Bundling into a single value type makes the 1:1 alignment structural and impossible to violate.
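The delegation pattern can be sketched with cut-down traits. The two-method FindingRecord, the norm_hash signature returning a reference, and the ToyFinding type are assumptions made for this example; the real traits expose more methods.

```rust
pub type NormHash = [u8; 32];

/// Cut-down stand-in for the real trait (assumed method set).
pub trait FindingRecord: Clone + Send + 'static {
    fn rule_id(&self) -> u32;
    fn root_hint_end(&self) -> u64;
}

/// Cut-down stand-in; the return type is an assumption.
pub trait FindingWithHashRecord: FindingRecord {
    fn norm_hash(&self) -> &NormHash;
}

#[derive(Clone)]
pub struct FindingWithHash<F: FindingRecord> {
    pub finding: F,
    pub norm_hash: NormHash,
}

// Metadata queries are forwarded to the inner finding unchanged.
impl<F: FindingRecord> FindingRecord for FindingWithHash<F> {
    fn rule_id(&self) -> u32 {
        self.finding.rule_id()
    }
    fn root_hint_end(&self) -> u64 {
        self.finding.root_hint_end()
    }
}

// The hash is served from the wrapper itself.
impl<F: FindingRecord> FindingWithHashRecord for FindingWithHash<F> {
    fn norm_hash(&self) -> &NormHash {
        &self.norm_hash
    }
}

/// Hypothetical inner finding used only to exercise the wrapper.
#[derive(Clone)]
pub struct ToyFinding {
    pub rule: u32,
    pub end: u64,
}

impl FindingRecord for ToyFinding {
    fn rule_id(&self) -> u32 {
        self.rule
    }
    fn root_hint_end(&self) -> u64 {
        self.end
    }
}
```

Because the wrapper is a single value, sorting, filtering, or draining a Vec of wrapped findings moves the hash along with its finding automatically.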
Both the real engine adapter and mock engine produce FindingWithHash<F> values:
| Engine | F type | Hash source |
|---|---|---|
| Real (engine_impl) | api::FindingRec | Engine-computed BLAKE3 of normalized secret |
| Mock (engine_stub) | FindingRec | Deterministic placeholder hash |
The trait abstraction enables mock implementations for testing:
- Mock engine and scratch provide deterministic, controllable behavior
- Scheduler logic can be tested without real scanning engines
- No need for expensive file I/O or secret detection in unit tests
Example: engine_stub::MockEngine and engine_stub::ScanScratch provide test implementations with minimal overhead.
Different engines can provide their own optimizations:
- Mock engine: Simple in-memory finding accumulation
- Real engine: Optimized SIMD scanning, specialized memory layouts
- Both work seamlessly with the same scheduler code
The traits bridge type differences between implementations:
| Aspect | Mock Engine | Real Engine |
|---|---|---|
| Rule ID | RuleId(u16) | u32 |
| Span Offsets | u64 | u32 (exposed as u64 via trait) |
| Root Hint Offsets | u64 | u64 |
| Finding Type | FindingWithHash<FindingRec> | FindingWithHash<api::FindingRec> |
| Norm Hash | Deterministic placeholder | Engine-computed BLAKE3 |
| File ID in finding record | Not stored | FileId |
| Decode step/provenance | Not stored | StepId |
The traits normalize these via their method signatures (all return u32 for rule IDs, u64 for offsets).
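The widening can be illustrated with two toy finding types that store different integer widths behind one trait. FindingView, MockFinding, RealFinding, and describe are hypothetical names for this sketch; the field layouts loosely follow the table above but are not the real structs.

```rust
/// Hypothetical trait exposing normalized widths only.
pub trait FindingView {
    fn rule_id(&self) -> u32;
    fn span_start(&self) -> u64;
    fn span_end(&self) -> u64;
}

/// Toy mock-style finding: u16 rule ID, u64 span offsets.
pub struct MockFinding {
    pub rule: u16,
    pub span: (u64, u64),
}

/// Toy real-style finding: u32 rule ID, u32 span offsets.
pub struct RealFinding {
    pub rule: u32,
    pub span: (u32, u32),
}

impl FindingView for MockFinding {
    fn rule_id(&self) -> u32 {
        u32::from(self.rule) // lossless widening from u16
    }
    fn span_start(&self) -> u64 {
        self.span.0
    }
    fn span_end(&self) -> u64 {
        self.span.1
    }
}

impl FindingView for RealFinding {
    fn rule_id(&self) -> u32 {
        self.rule
    }
    fn span_start(&self) -> u64 {
        u64::from(self.span.0) // lossless widening from u32
    }
    fn span_end(&self) -> u64 {
        u64::from(self.span.1)
    }
}

/// Generic consumer: sees only the normalized view.
pub fn describe<F: FindingView>(f: &F) -> (u32, u64, u64) {
    (f.rule_id(), f.span_start(), f.span_end())
}
```

Using From conversions rather than as casts documents that every conversion here is widening and cannot truncate.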
The design separates concerns:
- ScanEngine is Sync: Can be safely shared across threads
- EngineScratch is thread-local: No synchronization needed
- No mutex/atomic operations in the hot path
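The sharing model can be sketched with scoped threads from the standard library. CountingEngine and scan_in_parallel are hypothetical; the toy engine records the offsets of a single byte value, and each worker handles one input, which is a simplification of the real scheduler.

```rust
use std::thread;

/// Toy engine: immutable after construction, so it is Sync automatically.
pub struct CountingEngine {
    pub needle: u8,
}

impl CountingEngine {
    /// Fresh per-worker scratch (here just a Vec of absolute offsets).
    pub fn new_scratch(&self) -> Vec<u64> {
        Vec::new()
    }

    /// Records the absolute offset of every byte equal to `needle`.
    pub fn scan_chunk_into(&self, data: &[u8], base_offset: u64, scratch: &mut Vec<u64>) {
        for (i, byte) in data.iter().enumerate() {
            if *byte == self.needle {
                scratch.push(base_offset + i as u64);
            }
        }
    }
}

/// Hypothetical driver: one worker per input, one shared engine reference.
pub fn scan_in_parallel(engine: &CountingEngine, inputs: &[&[u8]]) -> Vec<Vec<u64>> {
    thread::scope(|s| {
        let handles: Vec<_> = inputs
            .iter()
            .copied()
            .map(|data| {
                s.spawn(move || {
                    // Each worker owns its scratch; only `engine` is shared.
                    let mut scratch = engine.new_scratch();
                    engine.scan_chunk_into(data, 0, &mut scratch);
                    scratch
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

No Mutex or atomic appears anywhere: the shared engine is read-only and all writes go to worker-owned scratch values.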
The scheduler doesn't need to know:
- How the engine represents findings internally
- What data structures the engine uses
- Engine-specific optimization details
The scheduler only cares about the trait contract.
The scheduler and engine collaborate to ensure no findings are missed:
Chunk 1: [0 ────────── 1000)
Chunk 2: [900 ────────── 1800)
└─ overlap ─┘
Findings with root_hint_end < 1000 → dropped (dedup; new_bytes_start = 1000)
Findings with root_hint_end >= 1000 → kept (new in chunk 2)
// Schematic per-worker loop; file_id, offset, and new_bytes_start come from
// the scheduler's chunking state.
for file in files {
    for chunk in file.chunks() {
        engine.scan_chunk_into(&chunk, file_id, offset, &mut scratch);
        scratch.drop_prefix_findings(new_bytes_start); // Dedup
        scratch.drain_findings_into(&mut output);      // Extract findings
    }
}

This pattern achieves:
- Single allocation per worker (scratch reused)
- Cheap drain operations (ownership transfer via append or buffer swap, with no per-finding allocation)
- Automatic deduplication at chunk boundaries
// tests use MockEngine which:
// - Uses simple substring matching
// - Requires no production engine wiring
// - Enables deterministic scheduler tests
let engine = MockEngine::new(vec![/* MockRule values */], 16);
let mut scratch = engine.new_scratch();
engine.scan_chunk_into(b"SECRET123", FileId(0), 0, &mut scratch);

// Production uses crate::engine::Engine with trait impls from engine_impl:
// - impl ScanEngine for Engine
// - impl EngineScratch for RealEngineScratch
// - impl FindingRecord for api::FindingRec
let engine = crate::engine::Engine::new(rules, transforms, tuning);
// Same scheduler code, different backend

Different engines can customize deduplication by varying root_hint fields:
- Strict dedup: Make root_hint == span to deduplicate similar matches
- Lenient dedup: Use wider root_hint to allow overlapping matches
- Context-aware: Adjust dedup based on transformed content
Trait methods enable targeted testing:
#[test]
fn test_overlap_deduplication() {
    let engine = MockEngine::new(vec![], 16);
    let mut scratch = engine.new_scratch();
    // Nothing was scanned, so the scratch must still be empty after the drop.
    scratch.drop_prefix_findings(900);
    let mut out = Vec::new();
    scratch.drain_findings_into(&mut out);
    assert!(out.is_empty());
}

The trait design allows engines to optimize:
- Memory layout: Engine chooses finding struct layout
- Copying strategy: drain_findings_into can swap buffers
- Dedup performance: Engine optimizes drop_prefix_findings logic
New engines can be added without modifying:
- Scheduler logic
- Worker thread code
- Deduplication logic
- Output formatting
Only a new trait implementation is needed.
┌──────────────────────────────────────────────────────┐
│ ScanEngine (Sync, Shared) │
│ (created once, used by all workers) │
└──────────────────────────────────────────────────────┘
│ │ │
┌──────────┴─────────┼─────────┴──────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Worker 0 │ │ Worker 1 │ │ Worker N │
│ │ │ │ │ │
│ Scratch 0 │ │ Scratch 1 │ │ Scratch N │
│(thread-loc) │ │(thread-loc) │ │(thread-loc) │
└─────────────┘ └─────────────┘ └─────────────┘
Key Properties:
- Engine is immutable and safely shared (no synchronization)
- Each worker has isolated scratch (no contention)
- Findings are accumulated locally, extracted after each chunk
- No data is shared between workers during scanning
The engine trait abstraction provides:
- Decoupling: Scheduler is independent of engine type
- Testability: Mock implementations enable unit testing
- Flexibility: Different engines with different optimizations
- Type Normalization: Bridges implementation-specific types
- Thread Safety: Lock-free design with per-worker state
- Performance: Efficient finding accumulation and deduplication
- Extensibility: New engines without modifying core logic