Deterministic simulation infrastructure for the coordination subsystem, inspired by FoundationDB's simulation framework, TigerBeetle's VOPR, and sled's simulation harness.
Execution integration note: the runtime exposes family-oriented execution nouns
through gossip-contracts::connector and gossip-scanner-runtime. The
simulation harness remains focused on coordination correctness and
intentionally does not depend on scanner-engine or scheduler internals.
Persistence note: crates/gossip-coordination-etcd/ provides two entrypoints —
EtcdCoordinator (sync wrapper) and AsyncEtcdCoordinator (async core) —
both persisting coordination state directly to etcd via CAS transactions.
Deterministic simulation continues to model the InMemoryCoordinator reference
backend because the sim harness requires synchronous &mut self control over
logical time and fault injection, which the in-memory backend provides without
I/O or async runtime dependencies.
The simulation is built in five layers, each composing the one below it:
Owns the single ChaCha8Rng PRNG and a monotonic LogicalTime clock.
Everything random in a simulation run flows through this one RNG instance.
Per-worker bookkeeping that tracks:
- Lease claims -- which shards the worker believes it holds (may diverge from coordinator truth after silent lease expiry).
- Op-ID generation -- partitioned by worker ID (
worker_id * 1_000_000) to guarantee cross-worker uniqueness without coordination. - Pause state -- models GC pauses, network partitions, or VM migrations.
- Cursor progress -- last checkpoint cursor per
(RunId, ShardId)for forward-only cursor generation.
An external observer that verifies nine safety properties against coordinator ground truth at every simulation step. It never trusts worker-side bookkeeping. See the invariant table below.
Deterministic scripted stress workloads that exercise cooldown and capacity
behavior. Defines scenario descriptors (OverloadKind, OverloadScenario),
lightweight telemetry (GoodputTracker, D1Observation, OverloadReport),
and operation burst generators consumed by CoordinationSim::run_overload.
The top-level driver. Runs a three-stage simulation (zombie preamble, safety, then liveness) with weighted random op generation, fault injection, and full invariant checking after every step.
Test-support builds also expose a composition extension around that core
harness. sim/composition.rs couples InMemoryCoordinator with
InMemoryDoneLedger in a single deterministic loop, sim/scan_driver_sim.rs
generates synthetic scan outcomes for that loop, and sim/shared.rs holds the
borrow-friendly helper aliases and utilities reused by both harnesses.
For split-replace operations, the harness uses the contracts-owned planner
helper (gossip_contracts::coordination::split::plan_split_replace_at_points_initial_cursor)
so simulation planning behavior stays aligned across boundary consumers.
All randomness flows through a single ChaCha8Rng seeded from a u64.
Why ChaCha8Rng? ChaCha8Rng is algorithm-specified and value-stable
across patch releases, unlike StdRng which is explicitly non-portable.
Cargo.lock provides sufficient pinning for cross-platform reproducibility.
PRNG call ordering matters. The RNG is a single sequential stream. Any
change to the order or number of rng() calls changes all subsequent random
decisions for a given seed. When adding new randomized logic, append calls
rather than inserting them between existing ones.
Integer PPM fault rates. Fault probabilities use integer parts-per-million
(PPM) instead of f64 to eliminate IEEE 754 rounding variance across
platforms and prevent invalid probability construction.
| Label | Name | Rule |
|---|---|---|
| S1 | MutualExclusion | At most one worker holds a non-expired lease per shard. |
| S2 | FenceMonotonicity | fence_epoch never decreases for a given (RunId, ShardId). |
| S3 | TerminalIrreversibility | Terminal states (Done, Split, Parked) never revert, except Parked->Active (unpark) which requires a fence bump. |
| S4 | RecordInvariant | SimIntrospection::validate_record_invariants() returns Ok (delegates to ShardRecord::assert_invariants() with backend-owned storage context). |
| S5 | CursorMonotonicity | cursor.last_key() never decreases per shard. |
| S6 | CursorBounds | Non-initial cursors remain within shard spec key range. |
| S7 | SplitCoverage | Split-parent's spawned children exist and reference the correct parent. |
| S8 | RunTerminalIrreversibility | Terminal run states (Done, Failed, Cancelled) never revert. |
| S9 | CooldownViolation | A worker must not successfully claim twice within the configured cooldown. |
All nine invariants are checked after every operation (both successful and rejected).
S9 push-style pattern: Unlike S1–S8 which are validated during
InvariantChecker::check_all()'s pass over coordinator state, S9 uses a
push-style check. The harness calls record_claim_success(worker, now)
when ClaimNext succeeds, and violations are buffered internally. The
buffered violations are then drained and appended to the result vector at
the end of check_all, keeping all invariant reporting in one place.
CoordinationSim::run(safety_ops, liveness_ops) executes three stages:
Before the main safety phase, the harness seeds initial leases and then immediately expires them, creating "zombie" workers that hold stale lease references. This exercises the fence-based rejection paths from the very first operation of the run, ensuring that stale-lease handling is tested even before normal operation begins.
Runs safety_ops random operations under fault injection. The first few
operations (warmup) suppress faults to let the system reach a healthy
baseline. After warmup, time jumps, worker pauses, lease expiry, split
operations, OpId replays, zombie checkpoints, and run-terminal transitions
(complete/fail/cancel) are all exercised at weighted probabilities.
Goal: Verify that no invariant (S1-S9) is ever violated regardless of operation ordering, timing, or fault injection.
Runs liveness_ops operations biased toward acquire and complete. No faults
are injected.
Goal: Verify that the system converges -- all shards reach a terminal state (Done, Split, or Parked).
CoordinationSim::run_overload(warmup_ops, scenario, recovery_ops) provides a
targeted stress path for cooldown/capacity behavior.
The run has three phases:
- Warmup random operations to establish baseline leases.
- Scripted overload rounds (
OverloadScenario):BurstClaim: all workers issueClaimNext.CapacityDrop: pause half the workers, then force a time jump.BurstShards: issueSplitReplaceon all currently held shards.
- Recovery with periodic
ClaimNextinjection and liveness-biased ops.
The overload report includes:
- Standard simulation fields (
ops_executed,violations,event_counts,seed,end_time). overload_goodput(completion ratio during overload rounds).- D1 diagnostics (
d1_observations) comparing reportedcount_available_for_runvalues with coordinator-derived ground truth. D1 samples are captured after each overload round. A mismatch betweenreportedandground_truthindicates the coordinator's fast-path availability counter drifted from the full-scan result — a bug in the bookkeeping optimization, not in the protocol itself. - L1 liveness sentinel (
l1_any_completed).
| Level | Lease Expiry | Worker Pause | Time Jump | Use Case |
|---|---|---|---|---|
| SunnyDay | 0% | 0% | 0% | Happy-path correctness |
| Stormy | 10% | 5% | 10% | Moderate fault coverage |
| Radioactive | 20% | 10% | 20% | Stress/edge-case search |
Pause durations and time-jump magnitudes scale with the level. See
FaultConfig::for_level for exact PPM values.
The harness exercises two distinct zombie-rejection paths:
-
Local bookkeeping cleanup -- When Worker B acquires a shard previously held by Worker A (whose lease expired), Worker A's local bookkeeping is cleared. A subsequent checkpoint attempt by Worker A is rejected with
NotLeasedbefore reaching the coordinator. -
Coordinator fence-based StaleFence -- Stale leases are saved when bookkeeping cleanup supersedes them.
ZombieCheckpointops use these saved stale leases to bypass local cleanup entirely, exercising the coordinator'sStaleFenceerror path directly.
Every SimReport includes the seed field. To reproduce:
let report = CoordinationSim::new(FAILING_SEED, FaultLevel::Stormy)
.with_workers_and_shards(3, 5)
.run(500, 200);
assert!(report.violations.is_empty());The simulation is fully deterministic: same seed, same fault level, same worker/shard counts, same op counts produce identical results on any platform.
-
Define the violation variant in
InvariantViolation(insim/invariants.rs). -
Add the check inside
InvariantChecker::check_all(). Follow the single-pass pattern: accumulate state during the shard iteration, then validate after the loop if needed (like S1's post-pass duplicate check).Alternative: push-style checks. If the invariant depends on information not available in coordinator state (e.g., S9 depends on knowing when a claim succeeded, which is transient), use the push-style pattern: add a public method that the harness calls at the relevant event, buffer violations internally, and drain them in
check_all. Seerecord_claim_success(S9) as a reference. -
Add a negative test in
invariants::teststhat constructs a coordinator state violating the new invariant and asserts the checker detects it. Useseed_shard_uncheckedfor states thatseed_shardwould reject. -
Update the invariant table in this document and in the module doc comment at the top of
sim/mod.rs. -
Run the full simulation across multiple seeds to verify the new check does not produce false positives:
cargo test --all-features -p gossip-coordination -- sim
The simulation harness exports several types from sim/mod.rs and its
submodules that are part of the crate's public API.
| Type | Purpose |
|---|---|
CoordinationSim |
Main simulation driver; configures workers, shards, fault level, and runs operations |
SimReport |
Simulation output: ops executed, violations, event counts, seed, timing |
SimOp |
Enum of 17 operation variants the harness can execute (Acquire, Renew, Checkpoint, Complete, Park, SplitReplace, SplitResidual, ReplayCheckpoint, ConflictCheckpoint, ZombieCheckpoint, ClaimNext, AdvanceTime, PauseWorker, ResumeWorker, SessionLifecycle, Unpark, TerminateRun) |
SimEvent |
Enum of 20 outcome variants recording what happened (AcquireOk, RenewOk, CheckpointOk, CompleteOk, ParkOk, SplitReplaceOk, SplitResidualOk, ReplayedOk, ClaimOk, ClaimNoneAvailable, ClaimThrottled, Rejected, TimeAdvanced, WorkerPaused, WorkerResumed, Skipped, SessionLifecycleOk, SessionLifecyclePartial, UnparkOk, RunTerminalOk) |
SimEventKind |
Discriminant-only enum matching SimEvent variants, used for event counting |
RejectionKind |
Enum of 24 rejection reasons (NotLeased, StaleFence, LeaseExpired, TerminalState, CursorRegression, CursorOutOfBounds, etc.) used by SimEvent::Rejected |
| Type/Trait | Purpose |
|---|---|
SimIntrospection |
GAT-based trait for read-only coordinator inspection; provides shards(), runs(), shard_count(), shard_lookup(), cursor_last_key(), spec_bounds(), validate_record_invariants(), spawned_children(), release_record_fields() |
SimulationBackend |
Blanket trait combining CoordinationFacade + SimIntrospection; any type implementing both automatically satisfies this |
| Type | Purpose |
|---|---|
FaultInjectingIntrospector<B> |
Wraps a SimIntrospection backend and injects synthetic shard records before invariant checking. Synthetic records appear in shards() iteration but are invisible to shard_lookup(). Used to test invariant checker robustness against corrupted state. |
| Type | Purpose |
|---|---|
InvariantChecker |
Runs the coordination safety invariants in a single pass over coordinator state |
InvariantViolation |
Enum of S1-S9 safety violations detected by the checker |
SplitCoverageDetail |
Detail record for S7 (split coverage) violations, containing parent/child spec information |
| Type | Purpose |
|---|---|
OverloadKind |
Scripted overload workload selector (BurstClaim, CapacityDrop, BurstShards) |
OverloadScenario |
Scenario descriptor configuring the overload phase |
GoodputTracker |
Tracks completion ratio during overload rounds for goodput reporting |
D1Observation |
Snapshot comparing reported availability to full-scan ground truth |
OverloadReport |
Extended run report with overload goodput, D1 samples, and the L1 recovery sentinel |
| Type | Purpose |
|---|---|
CompositionSim |
Test-support harness that drives coordinator and done-ledger state in one deterministic loop |
CompositionSimOp |
Operation enum for composition steps: coordinator passthrough, scan lifecycle, and ledger fault injection |
CompositionSimEvent |
Outcome enum for composition operations |
CompositionSimViolation |
Wrapper over coordination and persistence invariant violations |
CompositionInvariantChecker |
Cross-component invariant checker for coordinator ↔ done-ledger provenance (C1–C4) |
CompositionFaultConfig |
Cross-boundary fault-rate configuration for composition simulation |
CrossComponentViolation |
Cross-boundary invariant violation enum for orphaned provenance, fence drift, and post-terminal writes |
DoneLedgerFaultOp |
Imperative done-ledger fault injection operations used by the composition loop |
ProvenanceEntry |
Bridge record linking shard completion to done-ledger writes for cross-component checking |
ScanOutcome |
Synthetic scan result: done-ledger records plus cursor bytes for complete() |