- Status: Accepted
- Date: 2026-03-31
Git repository scanning uses a repo-native execution path, not the ordered-content page loop. The repository already has most of the building blocks for that path:
| Existing building block | Current location | Why it matters |
|---|---|---|
| `ShardStatus::{Active, Done, Split, Parked}` | `crates/gossip-coordination/src/record.rs` | The outer coordinator lifecycle is already fixed and persisted. |
| `LeaseUncertainty::{DeadlineElapsed, AdvanceStaleFence, AdvanceLeaseExpired}` | `crates/gossip-scanner-runtime/src/distributed.rs` | Lease-loss semantics already exist at runtime level. |
| `CheckpointBoundaryKind::RepoFrontier` | `crates/gossip-contracts/src/persistence/page_commit.rs` | Repo-frontier progress is already a first-class boundary kind. |
| `CompletedUnit::repo_frontier(...)` | `crates/gossip-scanner-runtime/src/commit_model.rs` | Repo-frontier units already fit the shared runtime commit vocabulary. |
| `CheckpointAggregatorInput` | `crates/gossip-scanner-runtime/src/commit_model.rs` | Outer progress already advances from durable receipts only. |
| `PersistenceStore::commit_finalize(&FinalizeOutput)` | `crates/scanner-git/src/persist.rs` | Inner Git execution already has an atomic finalize seam. |
| `FinalizeOutcome::{Complete, Partial}` with suppressed `watermark_ops` on `Partial` | `crates/scanner-git/src/finalize.rs` | Partial runs already prevent ref-frontier advancement. |
| `SeenBlobStore`, `seen_blob` markers, and MIDX ordinal caches | `crates/scanner-git/src/seen_store.rs`, `crates/scanner-git/src/finalize.rs`, `crates/scanner-git/src/ordinal_seen.rs` | Replay already has an idempotent dedupe surface plus a fingerprint-scoped acceleration cache for configured snapshots. |
| `GitRepoDiscoverySource`, `GitMirrorManager`, `GitRepoExecutor`, `GitRunError` | `crates/gossip-contracts/src/connector/git.rs` | The repo-family contract surface already exists. |
This document locks how those pieces compose. It does not introduce a second coordinator state machine, a second checkpoint path, or a compatibility layer.
- Repo execution is a worker-local substate machine inside `ShardStatus::Active`.
- Inner Git durability hands off to outer repo-frontier checkpointing through a sequential durable-receipt seam.
- The connector-level Git runtime error model is ternary: `Retryable`, `Permanent`, and `StaleOwner`.
- The first shipped scope stays narrow: one normalized repo target per shard, deterministic mirror and selection inputs, no repo-internal coordination state, and no repo-internal shard splitting.
The coordinator keeps its current coarse shard lifecycle. Repo execution phases
are worker-local substates inside ShardStatus::Active; they do not add new
persisted ShardStatus variants.
stateDiagram-v2
direction LR
[*] --> Active : worker acquires shard
state "ShardStatus::Active" as Active {
[*] --> Claimed
Claimed --> MirrorSync : start mirror work
MirrorSync --> Executing : mirror ready
Executing --> InnerDurable : inner finalize durable
InnerDurable --> OuterCheckpointed : repo-frontier receipt durably checkpointed
MirrorSync --> MirrorSync : retryable error
Executing --> Executing : retryable error
Claimed --> LeaseLost : lease loss
MirrorSync --> LeaseLost : lease loss
Executing --> LeaseLost : lease loss
InnerDurable --> LeaseLost : lease loss
Claimed --> PermanentFailure : permanent error
MirrorSync --> PermanentFailure : permanent error
Executing --> PermanentFailure : permanent error
InnerDurable --> PermanentFailure : permanent error
}
Active --> Done : completion after OuterCheckpointed
Active --> Parked : coordinator accepts park request
Done --> [*]
Parked --> [*]
LeaseLost is terminal only for the current worker attempt. The worker stops
immediately, discards in-flight local state, and leaves the shard Active so a
new owner can replay from the last durable frontier. PermanentFailure is a
worker-local decision to request parking; the actual persisted transition remains
the existing outer Active -> Parked coordination transition.
- `ShardStatus` stays exactly `Active | Done | Split | Parked`.
- `Claimed`, `MirrorSync`, `Executing`, `InnerDurable`, and `OuterCheckpointed` are invisible to coordination storage.
- Retryable errors in `MirrorSync` or `Executing` may loop inside the same lease, bounded by the lease deadline.
- Lease loss from any inner phase stops execution immediately and never produces a new checkpoint or terminal transition.
- Permanent failure from any inner phase requests `Parked`.
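
The substate rules above can be sketched as a small in-memory state machine. This is a minimal illustration, not the runtime's actual code: `RepoPhase`, `RepoEvent`, `OuterRequest`, `step`, and `outer_request` are hypothetical names, and the handling of a retryable error outside `MirrorSync` or `Executing` (stay in place) is an assumption, since the diagram does not define it.

```rust
/// Worker-local phases (hypothetical names mirroring the diagram).
/// None of these are persisted; the coordinator only ever sees `Active`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RepoPhase {
    Claimed,
    MirrorSync,
    Executing,
    InnerDurable,
    OuterCheckpointed,
    LeaseLost,
    PermanentFailure,
}

/// Events a worker may observe while it holds the shard (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RepoEvent {
    Advance,
    RetryableError,
    PermanentError,
    LeaseLoss,
}

/// What the worker asks of the outer coordinator for a given phase.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OuterRequest {
    Complete,    // Active -> Done
    Park,        // Active -> Parked
    LeaveActive, // no outer transition (still running, or lease lost)
}

pub fn step(phase: RepoPhase, event: RepoEvent) -> RepoPhase {
    use RepoEvent::*;
    use RepoPhase::*;
    match (phase, event) {
        // Worker-local terminal phases absorb every later event.
        (OuterCheckpointed | LeaseLost | PermanentFailure, _) => phase,
        // Lease loss and permanent errors apply from any inner phase.
        (_, LeaseLoss) => LeaseLost,
        (_, PermanentError) => PermanentFailure,
        // Retryable errors loop in place (assumed for Claimed/InnerDurable too).
        (_, RetryableError) => phase,
        (Claimed, Advance) => MirrorSync,
        (MirrorSync, Advance) => Executing,
        (Executing, Advance) => InnerDurable,
        (InnerDurable, Advance) => OuterCheckpointed,
    }
}

pub fn outer_request(phase: RepoPhase) -> OuterRequest {
    match phase {
        RepoPhase::OuterCheckpointed => OuterRequest::Complete,
        RepoPhase::PermanentFailure => OuterRequest::Park,
        _ => OuterRequest::LeaveActive,
    }
}
```

Because `LeaseLost` maps to `LeaveActive`, the sketch never produces a checkpoint or terminal transition after ownership is lost, which is the property the list above requires.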
- Track repo execution in a second persisted state machine. Rejected because it duplicates lease-loss handling and creates desynchronization risk between the worker view and the coordinator view.
- Add repo-specific outer shard states. Rejected because coordination only needs coarse ownership and terminal-state information, and `ShardStatus` discriminants are already persisted.
Inner Git persistence owns repo-data durability. The outer runtime owns shard frontier durability. The handoff is sequential: inner persistence commits first, then the runtime translates that durable outcome into the shared receipt chain that drives repo-frontier checkpointing.
flowchart TD
A["MirrorSync + Executing"] --> B["scanner-git builds FinalizeOutput"]
B --> C["PersistenceStore::commit_finalize(output)"]
C --> D{"FinalizeOutcome"}
D -->|Complete| E["data_ops + watermark_ops durable"]
D -->|Partial| F["data_ops durable; watermark_ops suppressed"]
E --> G["Translate repo result + captured findings<br/>via translate_git_item_result"]
F --> G
G --> H["ResultCommitter -> ItemCommitReceipt"]
H --> I["CompletedUnit::repo_frontier + ItemCommitReceipt -> UnitCommitReceipt"]
I --> J["CheckpointAggregatorInput(RepoFrontier)"]
J --> K["PrefixCheckpointAggregator.prepare_checkpoint()"]
K --> L["CheckpointCommitReceipt"]
L --> M["Coordinator checkpoint / complete transition"]
C -. crash after inner durability .-> N["Safe replay: blob_ctx, finding, and seen_blob keys are deterministic; seen_blob markers suppress duplicate blob work; Partial never writes watermark_ops; outer repo frontier has not advanced yet"]
The seam is deliberately not a two-phase commit. The inner store already has the atomic contract it needs:
- `data_ops` are always safe to write.
- `watermark_ops` are written only for `FinalizeOutcome::Complete`.
- `commit_finalize` writes the chosen operation set atomically.
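
The inner contract can be pictured as choosing one operation batch per finalize call. The sketch below is a simplified stand-in, not the real `scanner-git` types: `SketchFinalizeOutput`, its key/value tuple representation, and `ops_to_commit` are assumptions made for illustration.

```rust
/// Local stand-in for the outcome named in this ADR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FinalizeOutcome {
    Complete,
    Partial,
}

/// Hypothetical, simplified shape; the real `FinalizeOutput` may differ.
#[derive(Debug, Clone)]
pub struct SketchFinalizeOutput {
    pub outcome: FinalizeOutcome,
    pub data_ops: Vec<(String, String)>,
    pub watermark_ops: Vec<(String, String)>,
}

/// Builds the single batch that an atomic write would persist.
/// `data_ops` are always included; `watermark_ops` only on `Complete`.
pub fn ops_to_commit(output: &SketchFinalizeOutput) -> Vec<(String, String)> {
    let mut batch = output.data_ops.clone();
    if output.outcome == FinalizeOutcome::Complete {
        batch.extend(output.watermark_ops.iter().cloned());
    }
    batch
}
```

Selecting the batch before the write keeps the atomicity question in one place: either the whole chosen set lands, or none of it does.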
The outer runtime already has the receipt-only rule it needs:
- `CheckpointAggregatorInput` accepts only `UnitCommitReceipt`.
- `PrefixCheckpointAggregator` advances only from durable contiguous prefixes.
- `RepoFrontier` remains a normal checkpoint-boundary kind in the shared aggregator rather than a Git-only side channel.
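
A minimal sketch of the contiguous-prefix rule follows. It is not the real `PrefixCheckpointAggregator`: `PrefixSketch`, `UnitReceipt`, and the assumption that units carry dense sequence numbers starting at 0 are all hypothetical.

```rust
use std::collections::BTreeSet;

/// Stand-in for a durable unit receipt; only its sequence number matters here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UnitReceipt {
    pub seq: u64,
}

#[derive(Debug, Default)]
pub struct PrefixSketch {
    /// Lowest sequence number not yet covered by the contiguous prefix.
    next: u64,
    /// Durable receipts that arrived ahead of the prefix.
    pending: BTreeSet<u64>,
}

impl PrefixSketch {
    /// Records a durable receipt and extends the prefix as far as it is contiguous.
    pub fn record(&mut self, receipt: UnitReceipt) {
        if receipt.seq >= self.next {
            self.pending.insert(receipt.seq);
        }
        while self.pending.remove(&self.next) {
            self.next += 1;
        }
    }

    /// Count of units covered by the prefix; the frontier may advance only this far.
    pub fn frontier(&self) -> u64 {
        self.next
    }
}
```

A receipt that arrives out of order is held in `pending` and does not move the frontier until the gap before it is filled, and a duplicate receipt below the frontier is ignored.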
The translation step now includes Git findings themselves, not only finalize
outcome metadata. Repo-frontier workers capture emitted finding payloads during
scan execution, normalize them behind PersistenceFinding, and commit them
through the same findings-first, done-ledger-second ResultCommitter path used
by ordered-content scans before they synthesize the outer checkpoint receipt.
The only interesting crash window is after commit_finalize succeeds but before
the outer checkpoint receipt is durably acknowledged.
That window is safe because:
- inner writes are keyed deterministically, so reissuing them converges on the same persisted state;
- `seen_blob` markers provide replay-time dedupe for already-scanned blobs;
- configured MIDX ordinal caches are keyed by the repo artifact fingerprint, so replay only reuses them when the snapshot metadata still matches the committed seen scope;
- `FinalizeOutcome::Partial` suppresses `watermark_ops`, so partial replays never advance ref-frontier watermarks; and
- outer repo-frontier advancement still has not happened, because only the receipt path through `CheckpointAggregatorInput` and `CheckpointCommitReceipt` may move the shard frontier.
The result is a single-writer outbox-style handoff: inner persistence proves the repo-local writes are durable, and the outer runtime turns that proof into the family-neutral checkpoint protocol.
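
The replay argument can be illustrated with a toy in-memory store. Everything here is an assumption for illustration: `InnerStoreSketch`, the `finding/<blob id>` key scheme, and the placeholder finding value do not reflect the real `scanner-git` key layout.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct InnerStoreSketch {
    pub findings: BTreeMap<String, String>,
    pub seen_blobs: BTreeSet<String>,
    pub scans_performed: u32,
}

impl InnerStoreSketch {
    /// Scans one blob unless a seen marker already exists.
    /// The finding key is derived only from the blob id, so a rewrite after a
    /// crash between the two inserts would land on the same key and value.
    pub fn scan_blob(&mut self, blob_id: &str, content: &str) {
        if self.seen_blobs.contains(blob_id) {
            return; // replay-time dedupe: no duplicate blob work
        }
        self.scans_performed += 1;
        let key = format!("finding/{}", blob_id);
        self.findings.insert(key, format!("len={}", content.len()));
        self.seen_blobs.insert(blob_id.to_string());
    }
}

/// One worker attempt over a fixed, deterministic blob list.
pub fn run_attempt(store: &mut InnerStoreSketch, blobs: &[(&str, &str)]) {
    for (id, content) in blobs {
        store.scan_blob(id, content);
    }
}
```

Running `run_attempt` twice over the same input leaves the store unchanged after the first pass, which is the convergence property a new owner relies on when it replays from the last durable frontier.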
- Use a two-phase commit spanning scanner-git persistence and the outer runtime. Rejected because it adds a large coordination refactor without improving the safety properties that deterministic replay already provides.
- Advance the outer frontier from scan completion, queue drain, or another non-durable signal. Rejected because the shared runtime model already forbids raw completion signals from acting as authoritative progress.
The Git repo-runtime contract needs a third connector-level error class: `StaleOwner`.
Three distinct ErrorClass enums exist today:
| Layer | Current location | Current shape | This ADR changes it? |
|---|---|---|---|
| Connector level | `crates/gossip-contracts/src/connector/api.rs` | `Retryable \| Permanent` | Yes, conceptually reserve `StaleOwner` here. |
| Scheduler level | `crates/scanner-scheduler/src/scheduler/failure.rs` | Binary with reason subtypes | No. |
| Remote backend level | `crates/scanner-scheduler/src/scheduler/remote.rs` | `Retryable \| Permanent` | No. |
This ADR changes only the connector-level meaning. Scheduler and remote error taxonomies remain separate until they need their own explicit ownership-loss mapping.
| Class | Meaning | Worker action | Example |
|---|---|---|---|
| `Retryable` | The current owner may retry within the same lease. | Retry locally while the lease is still authoritative. | transient mirror sync failure, temporary rate limit, concurrent maintenance retry surface |
| `Permanent` | The current owner cannot make progress without an external change. | Request `Parked`. | permission denied, repository missing, unsupported selection or malformed configuration |
| `StaleOwner` | The work has become non-authoritative because lease ownership is no longer trustworthy. | Stop immediately, drop in-flight local state, leave the shard `Active` for reassignment. | deadline elapsed, stale fence rejection, lease-expired rejection, mirror or persistence detecting superseded ownership |
StaleOwner is not a retry request and not a park request. It means "the work
may be valid, but this worker is no longer allowed to finish it."
When the connector-level enum grows StaleOwner, every binary
if err.is_retryable() { ... } else { ... } call site must be audited. The
current is_retryable() helper is intentionally binary; without an audit it
would collapse StaleOwner into the non-retryable branch and silently treat
ownership loss like a permanent failure. Future code must branch on class() or
an equivalent three-way helper.
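
The audit concern can be made concrete with a short sketch. The enum below is a local stand-in for the proposed connector-level shape, and `WorkerAction`, `action_for`, and `binary_action` are hypothetical names; the real constructor surface may land differently.

```rust
/// Local stand-in for the proposed ternary connector-level class.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorClass {
    Retryable,
    Permanent,
    StaleOwner,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkerAction {
    RetryWithinLease,
    RequestPark,
    StopAndLeaveActive,
}

/// Three-way branch: the exhaustive match forces every call site to decide
/// what ownership loss means, and fails to compile if a class is added later.
pub fn action_for(class: ErrorClass) -> WorkerAction {
    match class {
        ErrorClass::Retryable => WorkerAction::RetryWithinLease,
        ErrorClass::Permanent => WorkerAction::RequestPark,
        ErrorClass::StaleOwner => WorkerAction::StopAndLeaveActive,
    }
}

/// The binary shape this ADR warns about: anything non-retryable parks.
pub fn binary_action(is_retryable: bool) -> WorkerAction {
    if is_retryable {
        WorkerAction::RetryWithinLease
    } else {
        WorkerAction::RequestPark
    }
}
```

With the binary helper, a `StaleOwner` error reports `is_retryable == false` and is parked, while `action_for(ErrorClass::StaleOwner)` leaves the shard `Active` for reassignment.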
- Keep the connector layer binary and model ownership loss only through `LeaseUncertainty` in the runtime. Rejected because ownership loss may be detected inside mirror management or persistence code that is naturally exposed through `GitRunError`.
- Add `StaleOwner` to every error enum immediately. Rejected because the connector, scheduler, and remote layers have different responsibilities and should not be forced into lockstep without a separate design decision.
The first shipped shape is intentionally constrained.
| Topic | Locked decision | Reason | Deferred expansion |
|---|---|---|---|
| Shard granularity | One normalized repo target maps to one shard. | Control-plane state scales with repo targets instead of repo-internal objects. | Multi-repo packing and shard-level packing heuristics. |
| Shard splitting | No repo-internal split points in this shape. A repo shard is either replayed, parked, or completed as one repo target. | Split-point discovery inside a repository adds coordination and replay complexity immediately. | Split/pack strategies that operate on multi-repo shards. |
| Coordinator visibility | The coordinator stores only coarse shard state, repo-frontier progress, and terminal outcomes. | Repo-internal commits, trees, blobs, and mirror details are worker-local execution state. | Repo-internal coordination state, if a later design proves it is needed. |
| Mirror lifecycle | Mirror location and refresh behavior are deterministic per repo target. The runtime implementation lives in `gossip-scanner-runtime::git_mirror::LocalMirrorManager`. | Reassignment and replay need a stable local execution surface. | Shared mirror pools, eviction policy, and provider-specific mirror orchestration. |
| Explicit commit selection | Explicit commit inputs lower to stable synthetic refs before execution. | Identical inputs must normalize to the same ordered repo targets and checkpoint identity. | Additional selection UX and provider-specific aliases. |
| `StaleOwner` rollout | The semantic is locked now, but the connector-level enum, constructor surface, and consumers may land in a separate change. | The behavior must be documented before wiring spreads through the runtime. | Scheduler-level and remote-level ownership-loss mappings. |
- Pack multiple repositories into a single shard from the start. Rejected because it immediately couples checkpointing, split policy, and replay policy.
- Introduce repo-internal coordinator state early. Rejected because it increases storage and protocol surface before the coarse repo-frontier path is proven.
Every downstream implementation task, design note, and test plan for this execution model keeps these invariants explicit:
- A stale fence token is never accepted again.
- Loss of lease stops repo execution quickly.
- The outer repo frontier never advances before inner durable persistence completes.
- Control-plane state scales with shards, not with objects inside repositories.
- Identical request inputs normalize to the same ordered repo targets.
- Identical explicit commit inputs lower to the same synthetic ref.
- Logs, metrics, and traces never contain raw secret bytes, repo paths, refs, or tokens.
- Duplicate submission and worker replay are idempotent.
Operational telemetry follows the same rule. Stage-oriented Git observability may emit scalar timings, shard digests, and closed-set retry or lease-loss labels, but never raw mirror roots, repo locators, refs, commit IDs, or connector tokens.
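
One way to keep stage telemetry redaction-safe is to give the event type no field that can hold a raw string. The sketch below assumes hypothetical names (`StageLabel`, `OutcomeLabel`, `StageEvent`) and assumes the shard digest is computed elsewhere and passed in as an opaque `u64`.

```rust
/// Closed set of stage labels (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StageLabel {
    MirrorSync,
    Execute,
    Finalize,
    Checkpoint,
}

/// Closed set of outcome labels (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutcomeLabel {
    Ok,
    Retry,
    LeaseLost,
    Parked,
}

impl StageLabel {
    pub fn as_str(self) -> &'static str {
        match self {
            StageLabel::MirrorSync => "mirror_sync",
            StageLabel::Execute => "execute",
            StageLabel::Finalize => "finalize",
            StageLabel::Checkpoint => "checkpoint",
        }
    }
}

impl OutcomeLabel {
    pub fn as_str(self) -> &'static str {
        match self {
            OutcomeLabel::Ok => "ok",
            OutcomeLabel::Retry => "retry",
            OutcomeLabel::LeaseLost => "lease_lost",
            OutcomeLabel::Parked => "parked",
        }
    }
}

/// Only closed-set labels, a digest, and a scalar timing; no repo paths,
/// refs, commit IDs, or tokens can be attached to this type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StageEvent {
    pub stage: StageLabel,
    pub outcome: OutcomeLabel,
    pub shard_digest: u64,
    pub elapsed_ms: u64,
}

impl StageEvent {
    pub fn render(&self) -> String {
        format!(
            "stage={} outcome={} shard={:016x} elapsed_ms={}",
            self.stage.as_str(),
            self.outcome.as_str(),
            self.shard_digest,
            self.elapsed_ms
        )
    }
}
```

Because labels come from fixed enums, metric cardinality stays bounded by the product of the two label sets regardless of how many repositories are scanned.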
- `ShardStatus` remains the only persisted outer lifecycle for repo work.
- `RepoFrontier` remains part of the shared receipt and checkpoint model rather than a Git-only completion path.
- Lease loss is a first-class outcome that stops work without parking the shard.
- Git stage telemetry stays low-cardinality and redaction-safe: worker logs and recorder events emit digests plus scalar timings rather than raw repository identifiers.
- The first scope optimizes for deterministic replay and narrow control-plane state, not for packing density.
- `docs/source-families.md` describes the family split between ordered-content and repo-native Git execution.
- `docs/gossip-coordination/coordination-error-model.md` complements this ADR with the coordination-layer error taxonomy and lease/fence validation model.
- `diagrams/05-shard-and-run-state-machines.md` describes the outer shard and run lifecycle that this ADR intentionally preserves.
- `diagrams/10-failure-modes-and-recovery.md` describes the failure and replay model that this ADR relies on for ownership loss and durable replay.