Accuracy measurement tool for scanner-rs. Compares scanner findings against labeled ground-truth corpora to compute precision, recall, Average Precision (AP), and related metrics. Designed for regression gating in CI and iterative rule development.
Secret scanners need quantitative accuracy tracking. Without measurement, rule changes might improve detection of one secret type while silently breaking another. The eval harness answers concrete questions:
- What is the scanner's precision and recall against known corpora? The position-based pipeline computes these from byte-level overlap between scanner findings and ground-truth annotations.
- Did this rule change make things worse? Baseline comparison with CI-gated regression verdicts catches regressions before they ship.
- Which rules produce the most false positives? The error book surfaces the top FP/FN by rule for targeted debugging.
- How confident are we in the AP estimate? Bootstrap confidence intervals quantify sampling uncertainty.
Use the harness after modifying detection rules, before releases, in CI pipelines, when adding new secret types, or when tuning confidence thresholds.
The harness lives in tools/eval-harness/ as a standalone crate with its own Cargo.toml. It depends on scanner-engine for the detection engine, scanner-scheduler for pipeline types, and scanner-git for git scanning utilities.
| Module | Role |
|---|---|
| `main.rs` | CLI entry point, subcommand dispatch, pipeline orchestration |
| `types.rs` | Core domain types: `NormalizedFinding`, `TruthItem`, `TruthLabel`, `ClassifiedFinding`, `FindingClass` |
| `creddata.rs` | CredData CSV truth loader |
| `synthetic.rs` | Synthetic JSON manifest truth loader |
| `leaky_repo.rs` | LeakyRepo CSV expectations and count-based comparison |
| `finding_parser.rs` | Scanner JSONL output parser, deduplication |
| `line_index.rs` | Byte-offset to line-number conversion (O(log n) binary search) |
| `fs_walk.rs` | Recursive file collection (internal utility) |
| `matching.rs` | Position-based finding-to-truth matching (greedy, confidence-sorted) |
| `metrics.rs` | Precision, recall, F1, F2, AP, P@R, R@P, bootstrap CI, per-rule breakdown |
| `provenance.rs` | BLAKE3 corpus/binary/ruleset hashing for reproducibility |
| `regression.rs` | Baseline comparison with CI overlap gating and two-tier verdicts |
| `report.rs` | Report assembly, JSON/table rendering, error book generation |
| `pipeline.rs` | `DedupMode`, `EvalPipelineConfig` — cross-rule dedup configuration for evaluation runs |
| `lib.rs` | Module re-exports |
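The byte-offset to line-number conversion performed by `line_index.rs` can be sketched as follows. This is a minimal illustration with assumed names, not the crate's actual code:

```rust
/// Precompute the byte offset at which each line starts, then answer
/// byte -> line queries with a binary search (O(log n) per lookup).
struct LineIndex {
    line_starts: Vec<usize>,
}

impl LineIndex {
    fn new(content: &[u8]) -> Self {
        let mut line_starts = vec![0];
        for (i, b) in content.iter().enumerate() {
            if *b == b'\n' {
                line_starts.push(i + 1);
            }
        }
        LineIndex { line_starts }
    }

    /// 1-indexed line number containing the given byte offset.
    fn line_of(&self, byte: usize) -> usize {
        match self.line_starts.binary_search(&byte) {
            Ok(i) => i + 1, // offset is exactly a line start
            Err(i) => i,    // offset falls inside line i (1-indexed)
        }
    }
}
```

Building the index once per file and binary-searching per finding keeps byte-to-line conversion cheap even for large corpus files.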
Both pipelines follow a load-match-measure-report pattern:

Position-based (creddata / synthetic):

```
truth loader   ──► TruthItem[] ────────────────┐
finding source ──► NormalizedFinding[]         │
                   ──► dedup pipeline ─────────┤  (identity, optional cross-rule)
corpus files   ──► HashMap<path, bytes> ───────┤
                                               ├─► match_findings       ──► ClassifiedFinding[]
                                               ├─► compute_metrics      ──► EvalMetrics
                                               ├─► hash_corpus_snapshot ──► Provenance
                                               ├─► bootstrap_ap_ci      ──► CI
                                               ├─► check_regression     ──► Verdict
                                               └─► build_error_book [JSON only] ──► ErrorBook
                                                     └─► EvalReport ──► JSON / table
```

Count-based (leaky-repo):

```
expectations CSV ──► FileExpectation[] ───┐
findings JSONL   ──► NormalizedFinding[] ─┤
                                          ├─► compare_counts ──► per-file TP/FP/FN
                                          └─► aggregate ──► EvalMetrics
                                                └─► EvalReport ──► JSON / table
```
NormalizedFinding — A scanner finding normalized for comparison. Identity is (path, byte_start, byte_end, rule); confidence is excluded from equality so duplicates at the same location collapse correctly. Byte offsets use half-open [byte_start, byte_end) convention.
TruthItem — A ground-truth annotation using 1-indexed inclusive line numbers (line_start, line_end). Each item carries a TruthLabel (Positive, Negative, or Placeholder) and a rule name.
ClassifiedFinding — A finding paired with its FindingClass (TruePositive, FalsePositive, or Unlabeled). FalseNegative is intentionally not a FindingClass variant because false negatives are truth-derived, not finding-derived.
EvalMetrics — Aggregate metrics: AP, precision, recall, F1, F2, baseline AP, P@R targets, R@P targets, bootstrap CI, per-rule breakdown (BTreeMap<String, RuleMetrics>).
EvalReport — Top-level serializable artifact combining required EvalMetrics, Provenance, and required pipeline_config, plus optional RegressionResult and optional ErrorBook.
EvalPipelineConfig — Pipeline semantics used for this run. Currently includes cross_rule_dedup (default false). Legacy baseline JSON without this field deserializes with cross_rule_dedup = false.
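The `NormalizedFinding` identity rule described above can be sketched like this. Field names mirror the description, but this is an illustrative sketch rather than the `types.rs` definition:

```rust
use std::collections::HashSet;

/// Identity of a finding: path + half-open byte span + rule.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct FindingKey {
    path: String,
    byte_start: usize, // half-open [byte_start, byte_end)
    byte_end: usize,
    rule: String,
}

#[derive(Clone, Debug)]
struct NormalizedFinding {
    key: FindingKey,
    confidence: f64, // deliberately outside the identity
}

/// Identity dedup: duplicates at the same (path, span, rule) collapse even
/// when their confidence scores differ, keeping the first occurrence.
fn dedup_identity(findings: Vec<NormalizedFinding>) -> Vec<NormalizedFinding> {
    let mut seen = HashSet::new();
    findings
        .into_iter()
        .filter(|f| seen.insert(f.key.clone()))
        .collect()
}
```

Keeping confidence out of `Eq`/`Hash` is what lets two reports of the same span with different scores collapse to one finding.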
The matching layer uses confidence-sorted greedy matching, the same strategy as COCO and PASCAL VOC evaluation protocols:
- Sort findings by confidence descending (deterministic tiebreak via `NormalizedFinding::Ord`).
- Group truth items by file path; sort each group by `line_start`.
- For each finding (highest confidence first):
  - Convert byte range to line range via `LineIndex`.
  - Binary search for overlapping truth items in the same file.
  - Apply label priority: Positive > Negative > Placeholder.
  - Consume matched Positive truths (one-to-one TP counting).
  - Negative truths are not consumed (multiple findings at a negative region all classify as FP).
  - Placeholder truths are not consumed (multiple findings at an ignore region all classify as Unlabeled).
Greedy matching is required for valid PRC-AUC: it produces nested TP sets across confidence thresholds, ensuring recall is monotonically non-decreasing.
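A single-file sketch of this greedy protocol follows. It simplifies the real `matching.rs`: there is no per-path grouping or binary search, and two behaviors are assumptions rather than documented facts (a duplicate hit on an already-consumed Positive counts as FP, and a finding overlapping no annotation is left Unlabeled):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Label { Positive, Negative, Placeholder }

#[derive(Clone, Copy, Debug, PartialEq)]
enum Class { TruePositive, FalsePositive, Unlabeled }

struct Truth { line_start: u32, line_end: u32, label: Label, consumed: bool }
struct Finding { line_start: u32, line_end: u32, confidence: f64 }

// Label priority, with consumed positives ranked last (assumption: a second
// finding on an already-matched positive is treated as a duplicate).
fn rank(t: &Truth) -> u8 {
    match (t.label, t.consumed) {
        (Label::Positive, false) => 0,
        (Label::Negative, _) => 1,
        (Label::Placeholder, _) => 2,
        (Label::Positive, true) => 3,
    }
}

fn classify(mut findings: Vec<Finding>, truths: &mut [Truth]) -> Vec<Class> {
    // Highest confidence first, so TP sets nest across thresholds.
    findings.sort_by(|a, b| b.confidence.partial_cmp(&a.confidence).unwrap());
    findings
        .iter()
        .map(|f| {
            // Pick the best overlapping truth by label priority.
            let mut best: Option<usize> = None;
            for (i, t) in truths.iter().enumerate() {
                let overlaps = f.line_start <= t.line_end && t.line_start <= f.line_end;
                if overlaps && best.map_or(true, |j| rank(t) < rank(&truths[j])) {
                    best = Some(i);
                }
            }
            match best {
                Some(i) if truths[i].label == Label::Positive && !truths[i].consumed => {
                    truths[i].consumed = true; // one-to-one TP counting
                    Class::TruePositive
                }
                Some(i) if truths[i].label == Label::Negative => Class::FalsePositive,
                Some(i) if truths[i].label == Label::Placeholder => Class::Unlabeled,
                Some(_) => Class::FalsePositive, // duplicate hit on a consumed positive (assumption)
                None => Class::Unlabeled,        // no annotation at all (assumption)
            }
        })
        .collect()
}
```

Because classification happens in confidence order and Positives are consumed exactly once, lowering the confidence threshold can only add TPs, which is what makes the resulting precision-recall curve valid for AP.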
| Metric | Description |
|---|---|
| Average Precision (AP) | Step-function AP with tie collapsing (matches sklearn's average_precision_score) |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| F1 | Harmonic mean of precision and recall |
| F2 | Recall-weighted F-score (β=2); weights recall 4x more than precision |
| Baseline AP | Class prevalence among scored items (tp / (tp + fp)); the expected AP of a random ranker |
| P@R | Precision at fixed recall targets (default: 0.80, 0.90, 0.95) |
| R@P | Recall at fixed precision targets (default: 0.95) |
| Bootstrap CI | Percentile-based confidence interval for AP via stratified resampling (default: 1000 iterations, α=0.05, seed=42) |
| Per-rule breakdown | TP, FP, and precision per detection rule |
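The step-function AP with tie collapsing from the table above can be sketched as follows; a simplified stand-in for `metrics.rs`, not its actual code:

```rust
/// Step-function AP over (confidence, is_true_positive) pairs, collapsing
/// confidence ties so tied findings enter the curve together.
fn average_precision(mut scored: Vec<(f64, bool)>) -> f64 {
    let total_pos = scored.iter().filter(|(_, tp)| *tp).count() as f64;
    if total_pos == 0.0 {
        return 0.0;
    }
    // Rank by confidence, highest first.
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    let (mut tp, mut fp, mut prev_recall, mut ap) = (0.0_f64, 0.0_f64, 0.0_f64, 0.0_f64);
    let mut i = 0;
    while i < scored.len() {
        let threshold = scored[i].0;
        // Tie collapsing: consume every finding at this confidence before
        // updating the precision-recall curve.
        while i < scored.len() && scored[i].0 == threshold {
            if scored[i].1 { tp += 1.0 } else { fp += 1.0 }
            i += 1;
        }
        let precision = tp / (tp + fp);
        let recall = tp / total_pos;
        ap += (recall - prev_recall) * precision; // step-function area
        prev_recall = recall;
    }
    ap
}
```

This is the same Riemann-sum formulation sklearn's `average_precision_score` uses, which avoids the optimism of trapezoidal interpolation on precision-recall curves.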
The harness provides three subcommands, each targeting a different corpus format:
```sh
eval-harness creddata \
  --meta-dir <DIR> \        # CredData CSV directory with ground-truth annotations
  --corpus-root <DIR> \     # Path normalization root (stripped from finding/truth paths)
  --findings <JSONL> \      # Pre-computed findings JSONL file ─┐ mutually
  --scan-corpus <DIR> \     # OR: directory to live-scan       ─┘ exclusive
  --cross-rule-dedup \      # Optional: collapse same-span findings across rules
  --format <json|table> \   # Output format (default: json)
  --output <PATH> \         # Write JSON to file instead of stdout
  --baseline <JSON>         # Baseline report for regression comparison
```
```sh
eval-harness synthetic \
  --manifest <JSON> \       # Synthetic corpus JSON manifest
  --corpus-root <DIR> \     # Path normalization root
  --findings <JSONL> \      # Pre-computed findings JSONL file ─┐ mutually
  --scan-corpus <DIR> \     # OR: directory to live-scan       ─┘ exclusive
  --cross-rule-dedup \      # Optional: collapse same-span findings across rules
  --format <json|table> \   # Output format (default: json)
  --output <PATH> \         # Write JSON to file instead of stdout
  --baseline <JSON>         # Baseline report for regression comparison
```
```sh
eval-harness leaky-repo \
  --secrets-csv <CSV> \     # LeakyRepo secrets CSV (per-file expected counts)
  --findings <JSONL> \      # Pre-computed findings JSONL file
  --corpus-root <DIR> \     # Path normalization root
  --format <json|table> \   # Output format (default: json)
  --output <PATH>           # Write JSON to file instead of stdout
```
Position-based subcommands (creddata, synthetic) accept findings from two mutually exclusive sources:
- `--findings <path>` — Pre-computed JSONL from a previous scanner run. Finding lines use scanner wire fields: `type: "finding"`, `path`, `start`, `end`, `rule`, and optional `confidence_score`.
- `--scan-corpus <dir>` — Live-scan a directory using the embedded `scanner_engine::demo_engine()` with the default ruleset. Findings are collected via an in-memory event sink and re-parsed through the same JSONL path. Intended for quick iteration during rule development on small-to-medium corpora.
The leaky-repo subcommand only supports --findings (no live scan).
When --cross-rule-dedup is enabled, the harness keeps one finding per
(path, start, end) span across all rules before matching. Winner selection
is deterministic: highest confidence wins, and equal-confidence ties resolve to
the lexicographically smaller rule name.
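The winner-selection rule can be sketched like this; field and function names are illustrative, not the crate's actual API:

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct Finding {
    path: String,
    start: usize,
    end: usize,
    rule: String,
    confidence: f64,
}

/// Keep one finding per (path, start, end) span: highest confidence wins,
/// and equal-confidence ties go to the lexicographically smaller rule name.
fn cross_rule_dedup(findings: Vec<Finding>) -> Vec<Finding> {
    let mut best: HashMap<(String, usize, usize), Finding> = HashMap::new();
    for f in findings {
        match best.entry((f.path.clone(), f.start, f.end)) {
            Entry::Vacant(slot) => {
                slot.insert(f);
            }
            Entry::Occupied(mut slot) => {
                let cur = slot.get();
                let wins = f.confidence > cur.confidence
                    || (f.confidence == cur.confidence && f.rule < cur.rule);
                if wins {
                    slot.insert(f);
                }
            }
        }
    }
    // Deterministic output order for stable reports.
    let mut out: Vec<Finding> = best.into_values().collect();
    out.sort_by(|a, b| (&a.path, a.start, a.end).cmp(&(&b.path, b.start, b.end)));
    out
}
```

The lexicographic tiebreak matters for reproducibility: without it, equal-confidence winners would depend on input order.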
- `--format json` (default) — Machine-readable pretty-printed JSON. Supports `--output <path>` for file persistence. This is the format used for baseline comparison.
- `--format table` — Fixed-width ASCII table for terminal display. Shows aggregate metrics, per-rule breakdown, and regression verdict. Cannot be combined with `--output` (rejected at validation time).
| Code | Meaning |
|---|---|
| 0 | Pass or Warn — metrics meet thresholds (or no baseline provided) |
| 1 | Block — regression detected against --baseline report |
| 2 | Argument or runtime error |
| Dataset | Granularity | Best For | Corpus Size |
|---|---|---|---|
| CredData | Position-based (file + lines) | Comprehensive accuracy measurement, release benchmarks, cross-scanner comparison | 73,842 annotations from 297 real repos |
| Synthetic | Position-based (file + lines) | Rule development iteration, CI regression gates for specific secret types | Hand-crafted, typically 10–100 items |
| LeakyRepo | Count-based (file + total) | Corpora with count-only annotations, no positional ground truth available | Varies |
- CredData is the standard benchmark for secret scanners. Use it for measuring overall scanner quality and comparing against published results from other tools. Large and real-world, but requires downloading source files (~11,000 files from GitHub).
- Synthetic manifests give precise control over what's tested. Each item specifies an exact file, line range, label, and rule. Use for fast, deterministic regression detection on specific rules during development.
- LeakyRepo uses per-file expected counts rather than line positions. Coarser than the position-based pipelines (no confidence-aware metrics, no error book), but useful when positional ground truth is unavailable.
Minimal examples for each subcommand. See Dataset Guides for full setup instructions.
```sh
# CredData — position-based evaluation with live scan
eval-harness creddata \
  --meta-dir /path/to/CredData/meta \
  --corpus-root /path/to/CredData \
  --scan-corpus /path/to/CredData/data

# Synthetic — position-based evaluation with live scan
eval-harness synthetic \
  --manifest tests/synthetic/manifest.json \
  --corpus-root tests/synthetic/corpus \
  --scan-corpus tests/synthetic/corpus

# LeakyRepo — count-based evaluation with pre-computed findings
eval-harness leaky-repo \
  --secrets-csv /path/to/secrets.csv \
  --findings findings.jsonl \
  --corpus-root /path/to/corpus
```

Samsung CredData is a 73,842-annotation ground-truth corpus drawn from 297 real GitHub repositories. It contains 4,583 true positives (6.2% positive rate) and is the standard benchmark for secret scanners.
The meta/ directory (332 CSV annotation files) is checked into the CredData git repository. The data/ directory (~11,000 source files) is not in git — it must be generated by downloading pinned repository snapshots.
```sh
git clone https://github.com/Samsung/CredData /path/to/CredData
cd /path/to/CredData
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python download_data.py --data_dir data --jobs $(nproc)
```

Requirements: Python 3.10+. Some source repositories may have been deleted or made private since the snapshot was pinned; partial corpus download is expected and does not prevent evaluation (rows referencing missing files are skipped during matching).
```
CredData/
  meta/          # 332 CSV annotation files (ground truth) — in git
  data/          # ~11,000 source files — generated by download_data.py
  snapshot.json  # Pinned repo commits for reproducibility
```
Each CSV in meta/ has 13 PascalCase columns. The harness uses 5:

| Column | Type | Description |
|---|---|---|
| `FilePath` | string | Relative path starting with `data/` (e.g., `data/00408ef6/src/config.py`) |
| `LineStart` | integer | 1-indexed inclusive start line, or -1 for unknown location |
| `LineEnd` | integer | 1-indexed inclusive end line, or -1 for unknown location |
| `GroundTruth` | string | `T` (true positive), `F` (false positive), or `X` (excluded from scoring) |
| `Category` | string | CredSweeper rule name (e.g., `Password`, `Token`, `AWS Multi`); used for per-rule breakdown |
Rows with GroundTruth = X are loaded as Placeholder labels and excluded from precision/recall calculation. Rows with LineStart = -1 or LineEnd = -1 (unknown location) are skipped during loading.
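These row-handling rules can be sketched as a small helper; an assumed illustration, not the `creddata.rs` loader itself:

```rust
#[derive(Debug, PartialEq)]
enum TruthLabel {
    Positive,    // GroundTruth = T
    Negative,    // GroundTruth = F
    Placeholder, // GroundTruth = X (excluded from precision/recall)
}

/// Map one CSV row's fields to a truth label and line range.
/// Returns None for rows that are skipped during loading.
fn load_row(ground_truth: &str, line_start: i64, line_end: i64) -> Option<(TruthLabel, u32, u32)> {
    if line_start < 1 || line_end < 1 {
        return None; // -1 marks an unknown location
    }
    let label = match ground_truth {
        "T" => TruthLabel::Positive,
        "F" => TruthLabel::Negative,
        "X" => TruthLabel::Placeholder,
        _ => return None,
    };
    Some((label, line_start as u32, line_end as u32))
}
```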
```sh
# Live scan — the harness scans data/ using the embedded demo engine
eval-harness creddata \
  --meta-dir /path/to/CredData/meta \
  --corpus-root /path/to/CredData \
  --scan-corpus /path/to/CredData/data

# Pre-computed findings — supply JSONL from a previous scanner run
eval-harness creddata \
  --meta-dir /path/to/CredData/meta \
  --corpus-root /path/to/CredData \
  --findings findings.jsonl

# Table output for quick terminal review
eval-harness creddata \
  --meta-dir /path/to/CredData/meta \
  --corpus-root /path/to/CredData \
  --scan-corpus /path/to/CredData/data \
  --format table
```

CSV `FilePath` values start with the `data/` prefix (e.g., `data/00408ef6/src/config.py`). The harness normalizes both finding paths and truth paths by stripping the `--corpus-root` prefix. When `--corpus-root` points to the CredData repo root:
- Truth path `data/00408ef6/src/config.py` stays as-is (already relative to the repo root).
- Finding path `/path/to/CredData/data/00408ef6/src/config.py` is stripped to `data/00408ef6/src/config.py`.
Both resolve to the same key, enabling correct matching. If --corpus-root pointed to data/ instead, truth paths would retain the data/ prefix while finding paths would not, causing zero matches.
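The normalization step can be sketched as follows; a minimal illustration with assumed names, which ignores platform path separators and other details the real harness may handle:

```rust
/// Strip the corpus root so finding paths (absolute) and truth paths
/// (already relative) resolve to the same lookup key.
fn normalize(path: &str, corpus_root: &str) -> String {
    let root = corpus_root.trim_end_matches('/');
    match path.strip_prefix(root) {
        Some(rest) => rest.trim_start_matches('/').to_string(),
        None => path.to_string(), // no prefix: assume already relative to the root
    }
}
```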
```sh
# Save a baseline report
eval-harness creddata \
  --meta-dir /path/to/CredData/meta \
  --corpus-root /path/to/CredData \
  --findings baseline-findings.jsonl \
  --output baseline.json

# Compare a new run against the baseline
eval-harness creddata \
  --meta-dir /path/to/CredData/meta \
  --corpus-root /path/to/CredData \
  --findings new-findings.jsonl \
  --baseline baseline.json

# Exit code 0 = pass/warn, 1 = block (regression detected)
echo "Exit code: $?"
```

If the current and baseline pipeline configs differ (for example, one run enabled `--cross-rule-dedup` and the other did not), the harness emits a warning and also records machine-readable comparability metadata in the regression JSON.
Samsung published cross-scanner benchmarks on CredData (results may vary by CredData version and scanner configuration):
| Scanner | Precision | Recall |
|---|---|---|
| CredSweeper (ML) | 91.7% | 80.8% |
| Gitleaks | 52.6% | 24.4% |
| truffleHog (v2) | 25.0% | 0.9% |
| truffleHog3 (v3) | 15.0% | 54.7% |
| detect-secrets | 14.2% | 38.1% |
CredSweeper's numbers include ML-based filtering and were evaluated on combined training and test data, so its metrics reflect partial evaluation on its own training set. Other scanners primarily use pattern-based detection (regex and entropy). scanner-rs results will depend on the current ruleset.
- Obfuscated values: CredData replaces real credential values with synthetic substitutes. Rules that rely on high-entropy detection may behave differently than on live repositories.
- Category names: The `Category` column uses CredSweeper rule names (e.g., `Password`, `Token`, `AWS Multi`), which do not match scanner-rs rule names. This is expected — the harness does not require rule-name matching for TP/FP classification (matching is position-based). Category names appear in the per-rule breakdown for analysis.
- Partial downloads: Some pinned repositories may be unavailable. The harness handles this gracefully — truth rows referencing files that do not exist on disk produce no findings to match (counted as FN).
A synthetic corpus is a hand-crafted JSON manifest paired with a directory of source files. Each manifest entry specifies an exact file, line range, label, and rule. Synthetic corpora are ideal for testing specific rules and catching regressions on known secret patterns.
The manifest is a JSON array of objects:
```json
[
  {
    "path": "src/config.py",
    "line_start": 12,
    "line_end": 12,
    "label": "positive",
    "rule": "generic-api-key"
  },
  {
    "path": "src/config.py",
    "line_start": 20,
    "line_end": 21,
    "label": "negative",
    "rule": "generic-api-key"
  }
]
```

| Field | Type | Required | Description |
|---|---|---|---|
| `path` | string | yes | File path relative to `--corpus-root` |
| `line_start` | integer | yes | 1-indexed inclusive start line |
| `line_end` | integer | yes | 1-indexed inclusive end line |
| `label` | string | yes | `"positive"`, `"negative"`, or `"placeholder"` |
| `rule` | string | yes* | Rule name for per-rule breakdown |
| `category` | string | no | Alias for `rule` (if both present, `rule` takes precedence) |

\* Supplied either directly or via the `category` alias.
Validation is fail-fast: the first invalid entry halts loading with an error that identifies the entry index and reason. Manifests must be under 16 MB.
```sh
# Live scan — harness scans the corpus directory
eval-harness synthetic \
  --manifest tests/synthetic/manifest.json \
  --corpus-root tests/synthetic/corpus \
  --scan-corpus tests/synthetic/corpus \
  --cross-rule-dedup

# Pre-computed findings
eval-harness synthetic \
  --manifest tests/synthetic/manifest.json \
  --corpus-root tests/synthetic/corpus \
  --findings findings.jsonl

# With baseline regression check
eval-harness synthetic \
  --manifest tests/synthetic/manifest.json \
  --corpus-root tests/synthetic/corpus \
  --scan-corpus tests/synthetic/corpus \
  --cross-rule-dedup \
  --baseline baseline.json
```

- Rule development: After modifying a detection rule, run the synthetic eval to verify the rule still detects its target patterns and does not fire on known negatives.
- CI regression gates: Commit a synthetic manifest alongside rule changes. The harness exits with code 1 if a `--baseline` comparison detects regression.
- Targeted testing: Unlike CredData (broad coverage, many categories), synthetic manifests test exactly the patterns you specify. Use them to cover edge cases that may not appear in real-world corpora.
LeakyRepo uses count-based evaluation: each entry specifies a file path and the expected number of secrets, without line-level positions. This is coarser than position-based evaluation but useful when positional ground truth is unavailable.
Three columns, no header row. Lines starting with # are comments.
```
# file_path,num_risk,num_informative
.bash_profile,6,5
.bashrc,3,3
.docker/.dockercfg,2,2
src/app/config.py,1,0
```
| Column | Type | Description |
|---|---|---|
| file path | string | Relative to --corpus-root; may contain commas (parsed via right-split) |
| num_risk | integer | Count of high-risk secrets |
| num_informative | integer | Count of informative secrets |
The expected count per file is num_risk + num_informative. Duplicate paths are rejected (fail-fast). The parser handles UTF-8 BOM and CRLF line endings.
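The right-split parse described above can be sketched like this; a simplified helper with assumed names, which omits the real parser's BOM handling and duplicate-path rejection:

```rust
/// Parse one expectations line by splitting from the right, so commas in the
/// file path survive. Comment and blank lines yield None.
fn parse_line(line: &str) -> Option<(String, u32, u32)> {
    let line = line.trim(); // also drops a trailing CR from CRLF input
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // The last two comma-separated fields are the counts; everything before
    // them is the path, commas and all.
    let (rest, informative) = line.rsplit_once(',')?;
    let (path, risk) = rest.rsplit_once(',')?;
    Some((path.to_string(), risk.trim().parse().ok()?, informative.trim().parse().ok()?))
}
```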
```sh
eval-harness leaky-repo \
  --secrets-csv /path/to/secrets.csv \
  --findings findings.jsonl \
  --corpus-root /path/to/corpus
```

LeakyRepo only accepts `--findings` (no `--scan-corpus` support) and does not support `--baseline` regression comparison. The evaluation verdict exit code is always 0 (no regression gate); argument or runtime errors still produce exit code 2 (see Exit Codes).
For each file in the expectations CSV:
- TP = `min(expected, actual)` (correctly detected secrets)
- FP = `max(0, actual - expected)` (excess findings beyond expectation)
- FN = `max(0, expected - actual)` (missing secrets)
Findings for files not listed in the CSV are emitted with expected = 0 (all counted as FP).
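The per-file formulas above translate directly into code; a minimal sketch, not the `leaky_repo.rs` implementation:

```rust
/// Per-file count comparison: expected vs. actually-detected secret counts.
fn compare_counts(expected: u32, actual: u32) -> (u32, u32, u32) {
    let tp = expected.min(actual);                  // correctly detected
    let fp = actual.saturating_sub(expected);       // excess findings
    let fn_count = expected.saturating_sub(actual); // missing secrets
    (tp, fp, fn_count)
}
```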
- No confidence-aware metrics: AP is set to precision as a stand-in since count-based comparison cannot produce a ranked precision-recall curve.
- No error book: without positional annotations, individual FP/FN cannot be identified.
- No regression checking: the `--baseline` flag is not available.
Average Precision (AP) is the primary ranking-quality metric. It measures how well the scanner's confidence scores separate true secrets from false positives:
- AP = 1.0 — Perfect ranking: all true positives are ranked above all false positives.
- AP = baseline_ap — Confidence scores add no value; ranking is equivalent to random.
- AP < baseline_ap — Confidence scores are anti-correlated with truth (unusual in practice).
For secret scanning, AP above 0.90 is generally strong. Compare AP against baseline_ap to gauge whether confidence ranking adds value beyond the raw detection rate.
Secret scanning typically prioritizes recall (missed secrets are dangerous) over precision (false positives are annoying but not dangerous). The F2 score reflects this: it weights recall 4x more than precision.
- High precision, low recall — The scanner is conservative: it reports few false positives but misses real secrets. Consider lowering confidence thresholds or relaxing rule patterns.
- Low precision, high recall — The scanner is aggressive: it catches most secrets but generates noise. Consider tightening rule patterns or raising confidence thresholds.
- P@R and R@P targets — Check `precision_at_recall` to see what precision is achievable at 80%/90%/95% recall. Check `recall_at_precision` to see what recall is achievable at 95% precision.
When a --baseline is provided, the harness compares current metrics against the baseline:
| Verdict | Meaning | Default Threshold |
|---|---|---|
| Pass | No meaningful regression detected | AP drop < 0.5pp AND precision drop < 0.5pp |
| Warn | Small regression detected (non-blocking) | AP drop 0.5pp–2pp OR precision drop 0.5pp–2pp |
| Block | Significant regression detected | AP drop ≥ 2pp OR precision drop ≥ 2pp |
The CI overlap gate (enabled by default) provides a safety valve: if the baseline AP falls within the current run's bootstrap confidence interval, the drop is attributable to sampling noise and the check returns Pass regardless of the raw delta.
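The gate and the two-tier thresholds can be sketched as follows. This simplification considers only the AP delta (the real check also gates on precision) and uses the default thresholds from the table above:

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Warn,
    Block,
}

/// Two-tier regression verdict with a CI overlap gate (AP-only sketch).
fn verdict(current_ap: f64, baseline_ap: f64, ci: (f64, f64)) -> Verdict {
    // CI overlap gate: a drop explained by sampling noise passes outright.
    if baseline_ap >= ci.0 && baseline_ap <= ci.1 {
        return Verdict::Pass;
    }
    let drop = baseline_ap - current_ap;
    if drop >= 0.02 {
        Verdict::Block // >= 2pp drop
    } else if drop >= 0.005 {
        Verdict::Warn // 0.5pp-2pp drop
    } else {
        Verdict::Pass
    }
}
```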
In JSON output mode, the error_book field lists the top false positives and false negatives grouped by rule, sorted by frequency descending. Each FP entry may include a context window around the detection span (optionally BLAKE3-redacted). With the current CLI defaults, redacted_context is usually null unless context output is explicitly enabled. Use this to identify:
- Recurring FP patterns — Rules that consistently fire on non-secrets (e.g., placeholder tokens, test fixtures).
- Missing detections — Truth items that no finding matched, indicating gaps in rule coverage.
The bootstrap CI quantifies uncertainty in the AP estimate from finite-sample effects. A wide CI (e.g., [0.82, 0.96]) means the AP estimate is unstable — small changes to the corpus could shift it substantially. A narrow CI (e.g., [0.93, 0.95]) means the estimate is reliable.
When comparing runs, check whether the CIs overlap. Overlapping CIs suggest the difference may not be statistically meaningful, which is why the regression gate uses CI overlap as a safety valve.
Default configuration: 1000 iterations, α=0.05 (95% CI), seed=42 (deterministic). For publication-quality intervals, increase iterations to 10,000.
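The percentile-bootstrap mechanics can be sketched like this. The statistic here is the sample mean rather than AP, and the resampling is not stratified over TP/FP as the harness's is; the seeded RNG illustrates how a fixed seed makes the interval deterministic:

```rust
/// Minimal xorshift64 PRNG so the sketch is dependency-free and seedable.
fn xorshift(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Percentile bootstrap: resample with replacement, recompute the statistic
/// (here, the mean), and read the CI off the sorted resampled statistics.
fn bootstrap_ci(data: &[f64], iters: usize, alpha: f64, seed: u64) -> (f64, f64) {
    let mut state = seed.max(1); // xorshift state must be nonzero
    let n = data.len();
    let mut stats: Vec<f64> = (0..iters)
        .map(|_| {
            let sum: f64 = (0..n)
                .map(|_| data[(xorshift(&mut state) as usize) % n])
                .sum();
            sum / n as f64
        })
        .collect();
    stats.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let lo = ((alpha / 2.0) * iters as f64) as usize;
    let hi = (((1.0 - alpha / 2.0) * iters as f64) as usize).min(iters - 1);
    (stats[lo], stats[hi])
}
```

With a fixed seed the interval is bit-for-bit reproducible across runs, which is what lets the regression gate compare CIs between CI jobs.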