This document describes the scan runner subsystem: how git scanning is orchestrated from configuration through execution, scheduling, engine integration, and finalization.
The runner subsystem is the top-level orchestrator for git scans. It owns the end-to-end pipeline from repository open through finalize and optional persistence. The subsystem is split across seven source files:

| Module | Location | Responsibility |
|---|---|---|
| `runner` | `crates/scanner-git/src/runner.rs` | Public types, configuration, and `run_git_scan` entry |
| `runner_odb_blob` | `crates/scanner-git/src/runner_odb_blob.rs` | ODB-blob fast-path mode pipeline |
| `runner_diff_history` | `crates/scanner-git/src/runner_diff_history.rs` | Diff-history mode pipeline |
| `runner_exec` | `crates/scanner-git/src/runner_exec.rs` | Shared pack execution helpers and scheduler dispatch |
| `engine_adapter` | `crates/scanner-git/src/engine_adapter.rs` | Bridge from decoded blobs to the core `Engine` |
| `finalize` | `crates/scanner-git/src/finalize.rs` | Deterministic write-op builder for persistence |
| `persist` | `crates/scanner-git/src/persist.rs` | Atomic finalize commit and incremental seen-bitmap ops |

The runner dispatches to one of two mode-specific pipelines after shared setup. Both the shared runner and the mode-specific pipelines also receive a caller-owned cooperative abort flag that can stop the scan before finalize:
- ODB-blob fast path (`runner_odb_blob`) -- walks the unique blob set from the commit graph, then scans in pack order with streaming plan generation.
- Diff-history (`runner_diff_history`) -- walks commits, diffs trees, spills/dedupes candidates, then batch-plans and executes pack decode + scan.

Both pipelines produce a ScanModeOutput that the runner finalizes and
optionally persists. The mode output also carries a completed-pack bitmap
indexed by MIDX pack_id; the runner thread marks a bit only after the
scheduler has reassembled that plan's final PackExecReport, so sharded
execution flips the bit only after every shard succeeds without error-class
skips. Seen-bitmap persistence depends on the caller-supplied
SeenBitmapPersister: when no persistence store is provided, run_git_scan
uses a NullSeenBitmapPersister that discards deltas, so the seen bitmap is
persisted only via the atomic finalize batch. When a PersistenceStore is
present, spill-stage seen-bitmap deltas are written incrementally during mode
execution in addition to the finalize batch.
flowchart TD
Cancel["abort flag"] --> A["run_git_scan()"]
A --> B["repo_open"]
B --> C["acquire_midx"]
C --> D{"artifacts_unchanged?"}
D -- No --> ERR["ConcurrentMaintenance"]
D -- Yes --> E["acquire_commit_graph"]
E --> F["introduced_by_plan"]
F --> G["CommitGraphIndex + AtomicBitSet"]
G --> H{"scan_mode"}
H -- OdbBlobFast --> I["run_odb_blob"]
H -- DiffHistory --> J["run_diff_history"]
I --> X{"abort?"}
J --> X
X -- Yes --> ABORT["TreeDiffError::Aborted"]
X -- No --> K{"artifacts_unchanged?"}
K -- No --> ERR
K -- Yes --> L["build_finalize_ops"]
L --> M{"persist_store?"}
M -- Some --> N["persist_finalize_output"]
M -- None --> O["Return GitScanResult"]
N --> O
- Repo open (`runner.rs`) -- resolve repo layout, start set, ref watermarks, and artifact lock paths via `repo_open`.
- MIDX acquisition (`runner.rs`) -- build the multi-pack-index in memory via `acquire_midx`. An immediate artifact stability check follows.
- Commit-graph acquisition (`runner.rs`) -- build the commit graph (optionally with identity enrichment). When `enrich_identities` is enabled, `acquire_commit_graph_with_identities` also produces an `IdentityInterner` whose dictionary is emitted before any `CommitMeta` events.
- Commit plan (`runner.rs`) -- `introduced_by_plan` computes the `(watermark, tip]` topo-ordered commit list.
- Commit-graph index + bitset (`runner.rs`) -- `CommitGraphIndex` is built for OID/timestamp resolution; `AtomicBitSet` gates exactly-once `CommitMeta` emission across workers.
- Mode dispatch (`runner.rs`) -- delegates to `run_odb_blob` or `run_diff_history`. Both receive the shared engine, seen store, optional seen-bitmap persister, commit plan, caller-provided abort flag, and `CommitMetaContext`.
- Post-execution stability check (`runner.rs`) -- verifies pack artifacts have not changed during execution (detects concurrent `git gc`).
- Cancellation gate (`runner.rs`) -- a final abort check runs after the post-execution artifact validation. Aborted scans stop here and do not build finalize ops or persist partial progress.
- Finalize (`runner.rs`) -- `build_finalize_ops` transforms scan results into deterministic write operations.
- Persist (`runner.rs`) -- optional two-phase atomic write via `persist_finalize_output`.
- Report assembly (`runner.rs`) -- perf counters are snapshotted, and the final `GitScanReport` is returned.

The runner treats cancellation as a first-class control flow path rather than a best-effort hint:
- `run_git_scan` checks the abort flag before repo open work continues, after identity-dictionary emission, and after the post-mode artifact stability check.
- `run_odb_blob` and `run_diff_history` check the same flag at commit boundaries, before pack execution, before loose-object scanning, and inside their hottest tree-walk helpers.
- The scheduler bridge in `runner_exec.rs` observes the external abort flag in addition to its internal error latch before launching each pack-exec task.
- Aborted scans return `GitScanError::TreeDiff(TreeDiffError::Aborted)` and skip finalize persistence entirely.

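The polling pattern described above can be sketched as follows. This is a minimal illustration, not the runner's code: `ScanError`, `check_abort`, and `walk_commits` are hypothetical names, and the sketch assumes the caller-owned flag is a plain `AtomicBool`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the scan error taxonomy.
#[derive(Debug, PartialEq)]
pub enum ScanError {
    Aborted,
}

/// Returns `Err(ScanError::Aborted)` once the caller has raised the flag.
pub fn check_abort(abort: &AtomicBool) -> Result<(), ScanError> {
    if abort.load(Ordering::Relaxed) {
        Err(ScanError::Aborted)
    } else {
        Ok(())
    }
}

/// Walks `commits` and polls the flag at every commit boundary.
/// Returns the number of commits processed when the walk is not aborted.
pub fn walk_commits(commits: &[u32], abort: &AtomicBool) -> Result<usize, ScanError> {
    let mut processed = 0;
    for _commit in commits {
        // An aborted walk returns early, so no later stage ever runs.
        check_abort(abort)?;
        processed += 1;
    }
    Ok(processed)
}
```

Because the abort surfaces as an ordinary `Err`, the `?` operator carries it up the call chain and the caller never reaches a finalize or persist step.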

| Type | Description |
|---|---|
| `GitScanConfig` | Full scan configuration: mode, limits, worker counts, spill dir |
| `GitScanMode` | Enum: `DiffHistory` or `OdbBlobFast` (default) |
| `GitScanResult` | Newtype wrapper around `GitScanReport` |
| `GitScanReport` | Summary report with per-stage stats, findings, and finalize output |
| `GitScanError` | Error taxonomy organized by pipeline stage |
| `ScanModeOutput` | Common output struct produced by both mode pipelines, including completed-pack state |
| `GitScanStageNanos` | Per-stage nanosecond timings (populated with `git-perf` feature) |
| `GitScanAllocStats` | Allocation deltas for hot stages |
| `PackMmapLimits` | Pack file mmap count and byte budget |
| `CandidateSkipReason` | Enum of reasons a candidate blob was skipped |
| `SkippedCandidate` | Blob OID paired with its skip reason |


| Type | Description |
|---|---|
| `PackExecStrategy` | Enum: `Serial`, `PackParallel`, or `IntraPackSharded` |
| `SpillCandidateSink` | Bridges `CandidateSink` to `Spiller::push` for diff-history mode |
| `SchedulerPackExecOutput` | Output from one scheduler-dispatched pack-plan task |
| `SchedulerPackTask` | Enum: `ExecPlan { seq }` or `ExecShard { plan_idx, shard_idx }` |
| `SchedulerPackScratch` | Per-worker reusable scratch (cache + decode workspace + runtime) |
| `SchedulerPackWorkerRuntime` | Heavy per-worker state (MIDX + `PackIo` + `EngineAdapter`) with custom `Drop` |
| `SchedulerPackShared` | Immutable state shared across all scheduler worker threads |
| `SchedulerShardMeta` | Pre-computed sharding metadata for one pack plan |
| `PlanCostHint` | Stats-free structural cost hint for strategy selection |
| `LocalityPressure` | Cross-shard locality pressure estimate |


| Type | Description |
|---|---|
| `EngineAdapter<'a>` | Primary blob scanner; implements `PackObjectSink` |
| `EngineAdapterConfig` | Chunk window size and binary scan policy |
| `CommitMetaContext` | Bundle of event sink + commit-graph + bitset for `CommitMeta` |
| `FindingKey` | Normalized finding identity `(start, end, rule_id, norm_hash)` |
| `ScoredFinding` | Finding key with confidence score |
| `FindingSpan` | Range into the shared findings arena for a single blob |
| `ScannedBlob` | Blob result with OID, context, and findings span |
| `ScannedBlobs` | Collected scan results with shared findings arena |
| `GitScanCommonMetrics` | Always-on counters: objects/chunks/bytes scanned, findings emitted |
| `RingChunker` | Fixed-size ring buffer for overlap-safe chunk streaming |
| `ChunkView` | Single chunk window with base offset and first-window flag |


| Type | Description |
|---|---|
| `FinalizeInput` | Input bundle: `repo_id`, blobs, findings arena, refs, skip OIDs |
| `FinalizeOutput` | Separated data ops + watermark ops for two-phase persistence |
| `FinalizeOutcome` | Enum: `Complete` (watermarks safe) or `Partial` (watermarks skipped) |
| `FinalizeStats` | Unique blobs, total findings, dedup counts |
| `WriteOp` | Single key-value write operation for the persistence layer |
| `RefEntry` | Ref name + tip OID from the start set |
| `NamespaceCounts` | Per-namespace operation counts for diagnostics |


| Type | Description |
|---|---|
| `PersistenceStore` (trait) | Atomic finalize commit interface; extends `SeenBitmapPersister` for incremental scope updates |
| `InMemoryPersistenceStore` | Test-only in-memory store for inspection and scoped seen-bitmap tests |

Finalize-time sb\0 ops remain SeenBitmapDelta payloads. When a store folds
those deltas into the durable scope snapshot, RoaringSeenBitmap keeps the
sorted OID index flat-packed in memory and applies a roaring bitmap over
positions in that table. The write ordering and delta encoding stay unchanged.
The EngineAdapter (engine_adapter.rs) bridges decoded git blob bytes
into the core detection Engine. It implements PackObjectSink so pack
execution can feed it decoded blobs directly.
Each blob flows through four stages inside the adapter:
- Classify (`scan_blob_into_buf`, `engine_adapter.rs`) -- content policy classifies the blob as text, extractable binary (`.class`, `.pyc`), or opaque binary. Opaque binaries are skipped. Extractable formats have their text extracted first.
- Scan -- blob bytes are fed through overlap-safe chunk windows. Blobs that fit in a single chunk (`<= chunk_bytes`) take a fast path that skips the ring buffer memcpy entirely (`engine_adapter.rs`). Larger blobs stream through `RingChunker` (`engine_adapter.rs`).
- Stream (`stream_findings`, `engine_adapter.rs`) -- findings are emitted to the structured `EventSink` with the blob OID wire-encoded into each `FindingEvent`. A `CommitMeta` event is emitted at most once per commit via `AtomicBitSet::test_and_set`.
- Record (`record_findings`, `engine_adapter.rs`) -- findings are appended to the shared arena and the resulting `FindingSpan` is attached to the `ScannedBlob`.

The RingChunker (engine_adapter.rs) streams blob bytes into fixed
windows with configurable overlap (from Engine::required_overlap()). After
scanning each chunk, findings wholly within the overlap prefix are dropped
via ScanScratch::drop_prefix_findings to avoid cross-chunk duplication.
Findings are converted to FindingKey values (no raw secret bytes), then
sorted + deduped per blob for deterministic ordering.
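
To make the overlap rule concrete, here is a minimal sketch of overlapping windows with prefix-finding suppression. It is not the `RingChunker` implementation: it slices an in-memory buffer instead of using a ring buffer, uses a literal byte needle in place of the detection engine, and the names `windows` and `scan_chunked` are hypothetical. It assumes `overlap >= needle.len() - 1`, which plays the role of `Engine::required_overlap()`.

```rust
/// Computes `(base, end)` byte ranges so that each window after the first
/// starts `overlap` bytes before the previous window's end.
pub fn windows(len: usize, chunk_bytes: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(overlap < chunk_bytes, "overlap must be smaller than the chunk size");
    let mut out = Vec::new();
    let mut base = 0;
    loop {
        let end = (base + chunk_bytes).min(len);
        out.push((base, end));
        if end == len {
            break;
        }
        base = end - overlap;
    }
    out
}

/// Scans `data` window by window and returns absolute `(start, end)` matches.
/// Matches lying wholly inside the overlap prefix of a non-first window were
/// already reported by the previous window, so they are dropped.
pub fn scan_chunked(
    data: &[u8],
    needle: &[u8],
    chunk_bytes: usize,
    overlap: usize,
) -> Vec<(usize, usize)> {
    assert!(!needle.is_empty(), "needle must be non-empty");
    let mut found = Vec::new();
    for (i, (base, end)) in windows(data.len(), chunk_bytes, overlap).into_iter().enumerate() {
        let window = &data[base..end];
        for (off, candidate) in window.windows(needle.len()).enumerate() {
            if candidate != needle {
                continue;
            }
            let (start, stop) = (base + off, base + off + needle.len());
            let wholly_in_prefix = i > 0 && stop <= base + overlap;
            if !wholly_in_prefix {
                found.push((start, stop));
            }
        }
    }
    found
}
```

A match can appear in two consecutive windows only if it fits entirely inside the shared overlap region, which is exactly the case the prefix check removes.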
When findings are non-empty and the commit_id falls within the commit-graph
range, commit_meta_seen.test_and_set(commit_id) returns true exactly once
across all adapter instances sharing the same Arc<AtomicBitSet>. This ensures
at most one CommitMeta event per commit even under parallel pack-exec workers.
Cross-worker stream ordering is intentionally non-deterministic.
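
The exactly-once gate can be illustrated with a small atomic bitset. This is a sketch under assumptions, not the project's `AtomicBitSet`: the `BitSet` type, its word layout, and the `count_winners` helper are hypothetical, and it assumes `test_and_set` returns `true` only for the call that flips the bit from 0 to 1.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical fixed-size atomic bitset backed by 64-bit words.
pub struct BitSet {
    words: Vec<AtomicU64>,
}

impl BitSet {
    pub fn new(bits: usize) -> Self {
        let word_count = (bits + 63) / 64;
        Self {
            words: (0..word_count).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Sets bit `idx` and returns `true` only if this call changed it from 0 to 1.
    pub fn test_and_set(&self, idx: usize) -> bool {
        let mask = 1u64 << (idx % 64);
        // `fetch_or` returns the previous word, so exactly one caller sees the bit clear.
        let previous = self.words[idx / 64].fetch_or(mask, Ordering::AcqRel);
        previous & mask == 0
    }
}

/// Races `threads` workers on the same bit and counts how many of them "won".
pub fn count_winners(bits: &BitSet, idx: usize, threads: usize) -> usize {
    std::thread::scope(|s| {
        let handles: Vec<_> = (0..threads)
            .map(|_| s.spawn(move || bits.test_and_set(idx)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .filter(|won| *won)
            .count()
    })
}
```

Only the winning worker would emit the `CommitMeta` event. Which worker wins is not deterministic, which matches the note above about cross-worker stream ordering.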
select_pack_exec_strategy (runner_exec.rs) chooses one of three
execution strategies from the worker count and per-plan structural hints:

| Strategy | Condition | Behavior |
|---|---|---|
| `Serial` | `workers <= 1`, no plans, or total need `< 512` | Single-threaded execution |
| `PackParallel` | `plan_count >= workers` | One plan per worker, deterministic reassembly |
| `IntraPackSharded` | Fewer plans than workers | Large plans split into index-range shards |

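
The decision table can be read as a short cascade. The sketch below mirrors the three rows and nothing more. The function name `select_strategy`, the constant name, and the reading of "total need" as the sum of per-plan need counts are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackExecStrategy {
    Serial,
    PackParallel,
    IntraPackSharded,
}

/// Hypothetical name for the 512-need threshold from the table.
const MIN_PARALLEL_NEED: usize = 512;

/// Picks a strategy from the worker count and the need count of each plan.
pub fn select_strategy(workers: usize, plan_need_counts: &[usize]) -> PackExecStrategy {
    // Assumption: "total need" is the sum of need counts over all plans.
    let total_need: usize = plan_need_counts.iter().sum();
    if workers <= 1 || plan_need_counts.is_empty() || total_need < MIN_PARALLEL_NEED {
        PackExecStrategy::Serial
    } else if plan_need_counts.len() >= workers {
        PackExecStrategy::PackParallel
    } else {
        PackExecStrategy::IntraPackSharded
    }
}
```

For example, four workers with two plans of 4,000 needs each would fall through to `IntraPackSharded`, since there are fewer plans than workers.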
For `IntraPackSharded`, each plan's shard count is the minimum of five
independent caps (`runner_exec.rs`):
- Worker count -- never more shards than available workers.
- `need_count / 1024` -- avoids oversharding tiny plans.
- `span_bytes / 4 MiB` -- avoids splitting narrow byte ranges.
- Dependency pressure -- if more than half the need offsets have forward or external deps, shard count is capped to 2.
- Locality pressure (`apply_locality_shard_cap`, `runner_exec.rs`) -- if projected shard boundaries split too many offset-based delta deps, fan-out is reduced iteratively until cross-shard pressure falls below 55%.

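
A minimal sketch of the first four caps is shown below. The iterative locality cap is left out because it depends on projected shard boundaries that are not described here. `PlanHint`, `shard_count`, the constant names, and the floor of one shard are all assumptions made for illustration.

```rust
const NEEDS_PER_SHARD: usize = 1024;
const SPAN_BYTES_PER_SHARD: u64 = 4 * 1024 * 1024; // 4 MiB

/// Hypothetical structural hints for one pack plan.
pub struct PlanHint {
    pub need_count: usize,
    pub span_bytes: u64,
    /// Needs whose delta base is a forward or external dependency.
    pub dep_heavy_needs: usize,
}

/// Shard count as the minimum of the worker, need, span, and dependency caps.
pub fn shard_count(workers: usize, hint: &PlanHint) -> usize {
    let by_need = hint.need_count / NEEDS_PER_SHARD;
    let by_span = usize::try_from(hint.span_bytes / SPAN_BYTES_PER_SHARD).unwrap_or(usize::MAX);
    // More than half of the needs are dependency-heavy: cap fan-out at 2.
    let by_deps = if hint.dep_heavy_needs * 2 > hint.need_count {
        2
    } else {
        usize::MAX
    };
    // Assumption: a plan always gets at least one shard.
    workers.min(by_need).min(by_span).min(by_deps).max(1)
}
```

Taking the minimum means any single signal (few needs, a narrow byte span, heavy dependencies) is enough to hold fan-out down, regardless of how many workers are idle.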
Pack cache size is computed through a layered heuristic
(`per_worker_cache_bytes`, `runner_exec.rs`):
- Raw estimate: `total_mapped_bytes / 16` (~6.25% of pack data).
- Per-worker cap: `16 GiB / workers` (aggregate memory bound).
- Per-worker floor: 32 MiB (functional minimum).
- Per-worker ceiling: 2 GiB hard cap.

The floor can exceed the per-worker cap intentionally: starving a worker of cache hurts hit rate more than marginally exceeding the aggregate target does.
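
The layering can be sketched as a chain of `min`/`max` steps. The function name `sketch_cache_bytes` is hypothetical, and the order in which the floor and ceiling are applied is an assumption inferred from the description above, not taken from the source.

```rust
const MIB: u64 = 1024 * 1024;
const GIB: u64 = 1024 * MIB;

/// Per-worker pack cache budget in bytes, following the four layers above.
pub fn sketch_cache_bytes(total_mapped_bytes: u64, workers: u64) -> u64 {
    let raw = total_mapped_bytes / 16; // ~6.25% of mapped pack data
    let per_worker_cap = 16 * GIB / workers.max(1); // aggregate memory bound
    raw.min(per_worker_cap) // respect the aggregate target first
        .max(32 * MIB) // then the floor, which may exceed the cap
        .min(2 * GIB) // and finally the hard ceiling
}
```

With 1 GiB of mapped packs and 4 workers this yields 64 MiB per worker. With 1,024 workers the aggregate cap alone would give 16 MiB, and the floor lifts it back to 32 MiB.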
auto_pack_exec_workers_for_in_pack (runner.rs) selects pack-exec
workers by repository size tier:

| In-pack objects | Multiplier | Example (12 cores) |
|---|---|---|
| `< 100,000` | 1x cores | 12 workers |
| `< 2,000,000` | 3x cores | 36 workers |
| `>= 2,000,000` | 6x cores | 72 workers |

Workers are capped at MAX_PACK_EXEC_WORKERS (128, runner.rs) to
prevent excessive per-worker memory (Decompress ~37 KiB, scratch buffers,
PackCache 32 MiB floor each).
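
The tier table and the cap combine into a few lines of arithmetic. This sketch uses a hypothetical function name, `auto_workers`, and assumes a lower bound of one worker, which the text does not state.

```rust
const MAX_PACK_EXEC_WORKERS: usize = 128;

/// Picks a pack-exec worker count from the in-pack object count and core count.
pub fn auto_workers(in_pack_objects: u64, cores: usize) -> usize {
    let multiplier = if in_pack_objects < 100_000 {
        1
    } else if in_pack_objects < 2_000_000 {
        3
    } else {
        6
    };
    // Assumption: at least one worker, and never more than the documented cap.
    cores.saturating_mul(multiplier).clamp(1, MAX_PACK_EXEC_WORKERS)
}
```

On a 12-core machine this reproduces the 12, 36, and 72 worker examples from the table. On a 64-core machine the largest tier would ask for 384 workers and is clamped to 128.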
The scheduler implementation lives in runner_exec.rs. It wraps
the scanner_scheduler::Executor work-queue:
- `execute_pack_plans_with_scheduler` (`runner_exec.rs`) selects a `PackExecStrategy` and delegates to either `execute_plan_tasks` or `execute_sharded_tasks`.
- Plan tasks (`runner_exec.rs`) -- one `SchedulerPackTask::ExecPlan` per plan, dispatched as a batch. Outputs are stored in sequence-indexed mutex slots for deterministic reassembly.
- Shard tasks (`runner_exec.rs`) -- `build_shard_dispatch_plan` pre-computes `SchedulerShardMeta` (execution plan, hot deps, candidate ranges, shard ranges) for each plan. Tasks are `SchedulerPackTask::ExecShard { plan_idx, shard_idx }`. Outputs are stored in flattened shard slots and merged per-plan after join.
- Error handling -- on the first worker error, an `AtomicBool` abort flag prevents new tasks from starting. After `ex.join()`, the first error is returned and all successful outputs are discarded.
- Per-worker scratch -- each worker thread creates a `SchedulerPackScratch` with a `PackCache`, `PackExecScratch`, and a lazily-initialized `SchedulerPackWorkerRuntime` (parsed MIDX + `PackIo` + `EngineAdapter`) that is reused across all tasks on that thread.

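
The slot-and-latch pattern from the list above can be sketched with standard-library threads. This is not the `scanner_scheduler::Executor` API: the sketch swaps in `std::thread::scope` with a shared atomic counter as the work queue, and `run_plans` plus its `u64` output type are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Mutex;

/// Runs `exec(seq)` for every plan index on `workers` threads.
/// Outputs are written to sequence-indexed slots, so the result order does not
/// depend on which thread finished first. Any error discards all outputs.
pub fn run_plans<F>(plan_count: usize, workers: usize, exec: F) -> Result<Vec<u64>, String>
where
    F: Fn(usize) -> Result<u64, String> + Sync,
{
    let slots: Vec<Mutex<Option<u64>>> = (0..plan_count).map(|_| Mutex::new(None)).collect();
    let first_error: Mutex<Option<String>> = Mutex::new(None);
    let abort = AtomicBool::new(false);
    let next = AtomicUsize::new(0);

    std::thread::scope(|s| {
        for _ in 0..workers.max(1) {
            s.spawn(|| loop {
                let seq = next.fetch_add(1, Ordering::Relaxed);
                // Stop when the queue is drained or another worker has failed.
                if seq >= plan_count || abort.load(Ordering::Acquire) {
                    break;
                }
                match exec(seq) {
                    Ok(out) => {
                        *slots[seq].lock().unwrap() = Some(out);
                    }
                    Err(e) => {
                        abort.store(true, Ordering::Release);
                        let mut guard = first_error.lock().unwrap();
                        if guard.is_none() {
                            *guard = Some(e);
                        }
                        break;
                    }
                }
            });
        }
    });

    if let Some(e) = first_error.into_inner().unwrap() {
        return Err(e);
    }
    Ok(slots
        .into_iter()
        .map(|slot| slot.into_inner().unwrap().expect("every slot is filled on success"))
        .collect())
}
```

Indexing the slots by sequence number is what makes reassembly deterministic: a worker only ever writes the slot for the task it claimed, and the final collection walks the slots in order.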
SchedulerPackWorkerRuntime (runner_exec.rs) caches expensive per-worker
setup: a parsed MidxView in a PackIo and an EngineAdapter. These hold
transmuted 'static references that actually borrow from Arc<Engine> and
BytesView fields stored in the same struct. A custom Drop impl
(runner_exec.rs) drops borrowers (adapter, external) before their
backing storage (_engine, _midx_bytes). The struct is #[repr(C)] with
compile-time assertions (runner_exec.rs) to enforce field ordering
soundness.
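
The drop-order requirement can be demonstrated without any `unsafe`. The sketch below is only an illustration of "borrowers before backing storage": it uses `Option::take` inside a custom `Drop` instead of transmuted `'static` references and `#[repr(C)]` layout assertions, and every name in it (`Tracked`, `WorkerRuntime`, `observed_drop_order`) is hypothetical.

```rust
use std::cell::RefCell;
use std::rc::Rc;

type DropLog = Rc<RefCell<Vec<&'static str>>>;

/// Records its name in a shared log when dropped.
struct Tracked {
    name: &'static str,
    log: DropLog,
}

impl Drop for Tracked {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Hypothetical runtime: `adapter` stands in for a borrower of `engine`.
pub struct WorkerRuntime {
    engine: Option<Tracked>,  // backing storage, declared first on purpose
    adapter: Option<Tracked>, // borrower
}

impl Drop for WorkerRuntime {
    fn drop(&mut self) {
        // Without this impl, fields would drop in declaration order (engine first).
        drop(self.adapter.take());
        drop(self.engine.take());
    }
}

/// Builds a runtime, drops it, and returns the order in which fields were dropped.
pub fn observed_drop_order() -> Vec<&'static str> {
    let log: DropLog = Rc::new(RefCell::new(Vec::new()));
    let runtime = WorkerRuntime {
        engine: Some(Tracked { name: "engine", log: Rc::clone(&log) }),
        adapter: Some(Tracked { name: "adapter", log: Rc::clone(&log) }),
    };
    drop(runtime);
    let order = log.borrow().clone();
    order
}
```

The explicit `Drop` makes the teardown order independent of how the fields happen to be declared, which is the property the real struct has to guarantee for its self-borrowing fields.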
build_finalize_ops (finalize.rs) is a pure function -- no I/O, no side
effects. It transforms scan results into stably-ordered write operations.
- Sort blobs by OID, refs by name (`finalize.rs`).
- Group blobs by OID. For each group:
  - Select the canonical context (minimum under a strict total order: `commit_id`, path bytes, `parent_idx`, `change_kind`, `ctx_flags`, `cand_flags`).
  - Emit a `blob_ctx` (`bc\0`) write op with the encoded context.
  - Gather findings across all contexts for this OID, sort + dedupe by identity `(start, end, rule_id, norm_hash)`, emit `finding` (`fn\0`) ops.
  - Accumulate this OID into the scope-scoped seen-bitmap delta (`sb\0`).
- Assemble data ops in namespace order: `bc\0` < `fn\0` < `sb\0`.
- If the run is complete (no skipped candidates), emit ref watermark (`rw`) ops. If partial, watermark ops are empty.

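
The grouping step can be sketched as a pure function over simplified types. This is an illustration under assumptions: the two-byte OID, the two-field `Ctx`, and the names `BlobOps` and `group_blobs` are hypothetical, and the real context order has more fields than shown here.

```rust
/// Simplified context; derived `Ord` compares `commit_id` first, then `path`.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Ctx {
    pub commit_id: u32,
    pub path: Vec<u8>,
}

/// Finding identity; derived `Ord` follows the field order shown.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct FindingKey {
    pub start: u32,
    pub end: u32,
    pub rule_id: u32,
    pub norm_hash: u64,
}

pub struct ScannedBlob {
    pub oid: [u8; 2],
    pub ctx: Ctx,
    pub findings: Vec<FindingKey>,
}

#[derive(Debug, PartialEq, Eq)]
pub struct BlobOps {
    pub oid: [u8; 2],
    pub canonical: Ctx,
    pub findings: Vec<FindingKey>,
}

/// Sorts by (OID, context), groups by OID, and dedupes findings per group.
pub fn group_blobs(mut blobs: Vec<ScannedBlob>) -> Vec<BlobOps> {
    blobs.sort_by(|a, b| (a.oid, &a.ctx).cmp(&(b.oid, &b.ctx)));
    let mut out: Vec<BlobOps> = Vec::new();
    for blob in blobs {
        let same_group = out.last().map_or(false, |last| last.oid == blob.oid);
        if same_group {
            // Later entries of a group only contribute findings.
            out.last_mut().unwrap().findings.extend(blob.findings);
        } else {
            // First entry after sorting carries the minimum (canonical) context.
            out.push(BlobOps { oid: blob.oid, canonical: blob.ctx, findings: blob.findings });
        }
    }
    for ops in &mut out {
        ops.findings.sort();
        ops.findings.dedup();
    }
    out
}
```

Because the function sorts its input before grouping, the output is the same for any arrival order of the scanned blobs, which is the determinism property the finalize stage relies on.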

| Prefix | Namespace | Description | Written by |
|---|---|---|---|
| `bc\0` | blob_ctx | Canonical context per scanned blob | `build_finalize_ops` |
| `fn\0` | finding | Individual finding records | `build_finalize_ops` |
| `sb\0` | seen_blob | Scope-scoped seen-bitmap delta | `build_finalize_ops` |
| `so\0` | seen_ordinal | Persisted MIDX ordinal cache | Persistence backend |
| `rw` | ref_watermark | Ref tip watermarks (complete only) | `build_finalize_ops` |

The so\0 ordinal cache key is not part of build_finalize_ops output.
It is written by the persistence backends (RocksDbStore::commit_finalize,
GitPersistenceAdapter::commit_finalize) after merging finalize ops with
the current MIDX ordinal state.
All keys use big-endian numeric fields to preserve lexicographic ordering. Ref watermark keys are null-terminated for prefix-safe scans.
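
The reason for big-endian fields is that byte-wise key comparison then agrees with numeric comparison. The sketch below shows this with a hypothetical single-field key. The helpers `be_key` and `le_key` are illustrative and do not reflect the actual key layout.

```rust
/// Builds `prefix || n` with `n` encoded big-endian.
pub fn be_key(prefix: &[u8], n: u32) -> Vec<u8> {
    let mut key = prefix.to_vec();
    key.extend_from_slice(&n.to_be_bytes());
    key
}

/// Same layout with a little-endian field, shown only for contrast.
pub fn le_key(prefix: &[u8], n: u32) -> Vec<u8> {
    let mut key = prefix.to_vec();
    key.extend_from_slice(&n.to_le_bytes());
    key
}

/// Returns `values` ordered by the byte-wise order of their big-endian keys.
pub fn sorted_by_be_key(prefix: &[u8], values: &[u32]) -> Vec<u32> {
    let mut pairs: Vec<(Vec<u8>, u32)> = values.iter().map(|&v| (be_key(prefix, v), v)).collect();
    pairs.sort();
    pairs.into_iter().map(|(_, v)| v).collect()
}
```

With big-endian encoding, 1 sorts before 256 because the most significant byte comes first. The little-endian keys for the same two values compare the other way around, which would break ordered range scans in a key-value store.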
- `Complete` -- all candidates scanned; watermark ops are populated and safe to write. Persistence advances ref tips.
- `Partial` -- some candidates were skipped (decode failure, budget exceeded, corrupt objects). Watermark ops are empty; ref tips are NOT advanced, ensuring the next scan re-visits unscanned content.

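
The outcome rule amounts to a single guard on the skipped-candidate count. The sketch below is a simplified illustration: `plan_watermarks` and `WatermarkOp` are hypothetical, and a `u64` stands in for the tip OID.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum FinalizeOutcome {
    Complete,
    Partial,
}

/// Hypothetical watermark op: ref name paired with the tip it would advance to.
pub type WatermarkOp = (String, u64);

/// Emits watermark ops only when no candidate was skipped.
pub fn plan_watermarks(
    refs: &[(&str, u64)],
    skipped_candidates: usize,
) -> (FinalizeOutcome, Vec<WatermarkOp>) {
    if skipped_candidates > 0 {
        // Partial run: leave ref tips alone so the next scan revisits the gap.
        return (FinalizeOutcome::Partial, Vec::new());
    }
    let mut ops: Vec<WatermarkOp> = refs
        .iter()
        .map(|(name, tip)| ((*name).to_string(), *tip))
        .collect();
    ops.sort(); // refs are ordered by name for deterministic output
    (FinalizeOutcome::Complete, ops)
}
```

Returning an empty op list for a partial run keeps the persistence layer simple: it can write whatever it receives without re-checking the outcome.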
During spill flushing, the spiller forwards sorted unseen-OID batches to the
SeenBitmapPersister (which may be the persistence store or a no-op) so the
scope bitmap is warmed before finalize. persist_finalize_output (persist.rs) then forwards the
FinalizeOutput to the same store. The finalize commit must write data_ops
and (when complete) watermark_ops atomically, so readers never observe
watermarks without corresponding data writes.

| Concept | File |
|---|---|
| `run_git_scan` entry point | `runner.rs` |
| `GitScanConfig` definition | `runner.rs` |
| `GitScanMode` enum | `runner.rs` |
| `GitScanError` taxonomy | `runner.rs` |
| `GitScanReport` definition | `runner.rs` |
| `ScanModeOutput` definition | `runner.rs` |
| Worker auto-sizing | `runner.rs` |
| `PackExecStrategy` selection | `runner_exec.rs` |
| Pack cache sizing | `runner_exec.rs` |
| Mmap management | `runner_exec.rs` |
| Loose candidate scanning | `runner_exec.rs` |
| Scheduler dispatch entry | `runner_exec.rs` |
| Plan task execution | `runner_exec.rs` |
| Shard task execution | `runner_exec.rs` |
| Shard dispatch plan builder | `runner_exec.rs` |
| `EngineAdapter` definition | `engine_adapter.rs` |
| Per-blob scan pipeline | `engine_adapter.rs` |
| Exactly-once `CommitMeta` | `engine_adapter.rs` |
| `RingChunker` implementation | `engine_adapter.rs` |
| `build_finalize_ops` | `finalize.rs` |
| `FinalizeOutput` definition | `finalize.rs` |
| `FinalizeOutcome` enum | `finalize.rs` |
| Key namespace constants | `finalize.rs` |
| `PersistenceStore` trait | `persist.rs` |
| `persist_finalize_output` | `persist.rs` |
| Worker runtime safety | `runner_exec.rs` |
| Scheduler task execution | `runner_exec.rs` |
| `SeenBitmapDelta` (finalize-time delta) | `roaring_seen.rs` |
| `RoaringSeenBitmap` (durable scope snapshot) | `roaring_seen.rs` |

- `docs/scanner-git/git-scanning.md` -- end-to-end pipeline overview
- `docs/scanner-git/git-pack-execution.md` -- pack decode + execution details
- `docs/scanner-engine/detection-engine.md` -- core detection engine