The regex2anchor module (crates/scanner-engine/src/regex2anchor.rs) extracts literal byte substrings ("anchors") from regex patterns for use as Vectorscan prefilters. This enables a two-pass scanning architecture:
- Fast pass: Vectorscan scans the haystack for anchor literals (hardware-accelerated multi-pattern matching).
- Slow pass: The full regex engine runs only on candidate regions where an anchor was found.
Without anchors, every regex rule must run against the entire input. With anchors, the engine skips regions that cannot possibly match, producing large throughput gains on real-world workloads where most of the input does not contain secrets.
Anchor extraction is the critical bridge between regex rules (authored by humans for readability) and Vectorscan databases (optimized for throughput). A pattern like ghp_[A-Za-z0-9]{36} compiles to the anchor "ghp_", so Vectorscan only reports positions containing that 4-byte literal. The full regex then confirms the trailing 36-character alphanumeric suffix in a narrow window.
The single non-negotiable correctness rule:
For any pattern P and haystack H:
if regex_matches(P, H),
then H must contain at least one derived anchor.
If this invariant is violated, the scanner produces false negatives (missed secrets). The algorithm is therefore conservative: it returns errors (Unanchorable, OnlyWeakAnchors) when it cannot safely produce anchors, rather than risk missing a match.
Key corollaries:
- Patterns that can match the empty string are always rejected (
regex2anchor.rs). An empty match has no bytes for an anchor to match against. - Anchors are never filtered by length post-derivation for alternations. If an alternation branch produces a short anchor, the entire set is rejected as
OnlyWeakAnchorsrather than silently dropping the short branch (regex2anchor.rs). - Information is only ever weakened, never strengthened, as it propagates up the HIR tree (
regex2anchor.rs).
Regex string
│
▼
┌────────────────────────────────┐
│ Parse via regex-syntax │
│ ParserBuilder → HIR │
│ (handles flags, case folding) │
└────────────┬───────────────────┘
│
▼
┌────────────────────────────────┐
│ analyze(hir, cfg) │
│ Structural induction on HIR │
│ Returns Info { exact, pf } │
└────────────┬───────────────────┘
│
▼
┌────────────────────────────────┐
│ choose_anchors(info, cfg) │
│ Enforce min_anchor_len │
│ Reject empty-matching sets │
└────────────┬───────────────────┘
│
▼
┌────────────────────────────────┐
│ TriggerPlan │
│ Anchored / Residue / │
│ Unfilterable │
└────────────────────────────────┘
The module uses regex-syntax::ParserBuilder to parse the pattern into a High-level Intermediate Representation (HIR). The AnchorDeriveConfig::utf8 flag controls whether UTF-8 or byte-oriented parsing is used — this must match the regex engine that will perform the final match (regex2anchor.rs).
HIR handles flags ((?i), (?-u)), case folding, and character class normalization. For example, (?i:ab) is lowered to [aA][bB], which the analyzer handles as two character classes without needing case folding logic.
The analyze function (regex2anchor.rs) performs structural induction on each HIR node, returning an Info value that tracks:
exact: An explicit, finite set of byte strings the node can match (orNoneif the set is too large or unbounded).pf: APrefiltersummary (a required substring or set of substrings) that must appear in any match.
Each HIR node type has specific analysis rules:
| HIR Node | Analysis Rule | Reference |
|---|---|---|
Empty |
Exact set {b""} |
regex2anchor.rs |
Literal(bytes) |
Exact singleton {bytes} |
regex2anchor.rs |
Class |
Expand to exact set if small ASCII/byte; otherwise All |
regex2anchor.rs |
Look |
Treated as empty (zero-width; consumes no bytes) | regex2anchor.rs |
Repetition |
Delegated to analyze_repetition |
regex2anchor.rs |
Capture |
Transparent; analyze inner sub-expression | regex2anchor.rs |
Concat |
Delegated to analyze_concat |
regex2anchor.rs |
Alternation |
Delegated to analyze_alternation |
regex2anchor.rs |
Concatenation (analyze_concat, regex2anchor.rs):
- Try full cross-product of all children's exact sets. If the result fits within config caps, return the exact set.
- Otherwise, find the best contiguous exact window via
best_exact_window_prefilter— this searches every contiguous run of children with exact sets, cross-products them, and picks the highest-scoring window (regex2anchor.rs). - Fall back to the single most selective child prefilter.
Alternation (analyze_alternation, regex2anchor.rs):
- If all branches have exact sets, union them into one exact set.
- Otherwise, collect all branch anchors into an
AnyOfset. If any branch isAll(unanchorable) or can match the empty string, the entire alternation degrades toAll.
Repetition (analyze_repetition, regex2anchor.rs):
min == 0: Required substrings may disappear. Special cases:{0}returns exact{b""}, and?(max=1) returns{b""} ∪ sub_exact. Otherwise degrades toAll.min > 0: Cross-product the sub's exact setmintimes. For exact repetition (max == min), the result stays exact. For open-ended repetition, the mandatorymincopies produce a prefilter.
choose_anchors (regex2anchor.rs) enforces global constraints on the final Info:
- Empty strings in the exact set yield
Unanchorable. - Any anchor shorter than
min_anchor_lenyieldsOnlyWeakAnchors(no partial filtering — that would be unsound). - Empty
AnyOfsets yieldUnanchorable.
compile_trigger_plan (regex2anchor.rs) orchestrates the full pipeline:
- Reject empty-matching patterns upfront.
- Attempt anchor derivation (fastest gate).
- If anchors fail, attempt a residue gate (run-length or k-gram).
- If all fail, return
Unfilterablewith the reason.
Configuration controlling all derivation limits:
| Field | Default | Purpose |
|---|---|---|
min_anchor_len |
3 | Minimum anchor byte length; shorter anchors are rejected |
max_exact_set |
64 | Maximum elements in an exact set before degrading |
max_exact_string_len |
256 | Maximum byte length of any single exact string |
max_class_expansion |
16 | Maximum character class expansion size |
utf8 |
true | Whether to parse with UTF-8 semantics |
kgram_k |
4 | K-gram prefix length (feature-gated) |
max_kgram_set |
4096 | Maximum k-gram enumeration count (feature-gated) |
max_kgram_alphabet |
32 | Maximum per-position alphabet for k-grams (feature-gated) |
Internal per-node summary. Not public — used only during HIR traversal:
exact: Option<Vec<Vec<u8>>>— Finite set of byte strings the node can match, orNone.pf: Prefilter— A prefilter summary preserved whenexactis unavailable.
pub enum Prefilter {
All, // No useful anchor; matches too broadly
AnyOf(Vec<Vec<u8>>), // Any of these substrings must be present
Substring(Vec<u8>), // This specific substring must be present
}The output of compile_trigger_plan:
Anchored { anchors, confirm_all }— At least one anchor must appear in any match.confirm_allis an optional AND filter of mandatory literal islands used to reduce false positives after an anchor hit.Residue { gate }— Conservative ASCII-only gate (run-length or k-gram) when anchors are unavailable.Unfilterable { reason }— No safe gate exists; caller must run the full regex.
pub enum AnchorDeriveError {
InvalidPattern(String), // Regex failed to parse
Unanchorable, // Pattern matches too broadly (e.g., `.*`)
OnlyWeakAnchors, // All anchors shorter than min_anchor_len
}Fallback gate when anchors are unavailable:
RunLength(RunLengthGate)— Scan for a byte-class run of[min_len, max_len]bytes. Only applies to patterns that reduce to a single consuming ASCII atom.KGrams(KGramGate)— Enumerated prefix k-grams (feature-gated viakgram-gate).Or(Vec<ResidueGatePlan>)— Logical OR for alternations.
A 256-bit byte mask with min/max run length, optional ASCII word boundary, and optional UTF-16 variant scanning.
K-length prefix grams packed as u64 values (little-endian for k <= 8, FNV-1a hash for k > 8). Feature-gated behind kgram-gate.
| Construct | Example | Anchor Result |
|---|---|---|
| Literal | foo |
["foo"] |
| Concatenation | foobar |
["foobar"] |
| Alternation | foo|bar |
["foo", "bar"] |
| Small character class | [ab]cd |
["acd", "bcd"] |
| Exact repetition | a{3} |
["aaa"] |
| Bounded repetition (min>0) | a{2,4} |
["aa"] (minimum copies) |
| Plus repetition | a{3,} |
["aaa"] (minimum copies) |
Optional (?) with exact sub |
a? (in concat context) |
Exact {b"", b"a"} |
| Capture groups | (foo)(bar) |
["foobar"] (transparent) |
Anchors (^, $) |
^foo$ |
["foo"] (zero-width) |
| Word boundaries | \bfoo\b |
["foo"] (zero-width) |
| Case-insensitive flag | (?i:ab) |
["AB", "Ab", "aB", "ab"] (via HIR lowering) |
| Non-ASCII literals | 日本 |
[UTF-8 bytes of "日本"] |
| Byte-mode hex | (?-u)\xFF (with utf8=false) |
[b"\xFF"] |
| Nested alternation | (a|b)|(c|d) |
["a", "b", "c", "d"] |
| Overlapping alternatives | foo|foobar |
["foo", "foobar"] |
| Construct | Example | Behavior |
|---|---|---|
| Large character class | [a-z] (>16 elements) |
Class degrades to All; may still extract anchor from surrounding concat context |
| Hex character class | [A-F0-9]{4} |
Run-length gate (4 bytes of hex mask) |
| Dot (any char) | . |
All (too broad) |
| Star repetition (min=0) | a* |
All (can match empty) |
| Unicode character class | \p{L} |
All (non-ASCII expansion rejected) |
| Construct | Example | Reason |
|---|---|---|
| Empty-matching patterns | a*, a?, |a |
MatchesEmptyString |
| Pure wildcard | .* |
Unanchorable |
| Mixed-length alternation with short branch | ab|abcdef (min=3) |
OnlyWeakAnchors |
| All-wildcard alternation branch | .*|foo |
Unanchorable (one branch is All) |
Pattern a?bcd with min_anchor_len=3:
a?is analyzed as{b"", b"a"}(exact set with empty).- The concatenation cross-product with
bcdyields{b"bcd", b"abcd"}. - The exact window search finds the best window. If the full cross-product exceeds caps, the algorithm falls back to the best contiguous exact window, which still includes
b"bcd".
Pattern ((a|b)|(c|d)) with min_anchor_len=1:
- Inner alternations produce exact sets
{a, b}and{c, d}. - The outer alternation unions them:
{a, b, c, d}. - All branches have anchors, so the result is sound.
Pattern foo| or |foo:
- The empty branch makes the pattern match the empty string.
hir_matches_emptyreturns true, socompile_trigger_planreturnsUnfilterable { reason: MatchesEmptyString }before anchor derivation is attempted.derive_anchors_from_patternalso catches this: the exact set containsb"", which triggersUnanchorable.
Pattern (?i)foo:
- HIR lowering converts this to
[fF][oO][oO]. - Character class expansion produces 2 options per position.
- Cross-product yields 2^3 = 8 anchors:
["FOO", "FOo", "FoO", "Foo", "fOO", "fOo", "foO", "foo"].
Pattern "api" [_-] "key" "=" [0-9]+:
The algorithm searches every contiguous window of exact children, computes the cross-product for each window, and selects the best by a scoring heuristic. The detailed example in the code (regex2anchor.rs) shows how ["api_key=", "api-key="] (score 63) beats the single-child fallback of "api" or "key" (score 24).
Pattern [ab]{2} with min_anchor_len=2:
[ab]expands to exact set{b"a", b"b"}.{2}cross-products twice:{b"aa", b"ab", b"ba", b"bb"}.- All 4 anchors are returned.
Pattern [abc][def][ghi] with max_exact_set=64:
- Each class has 3 elements: 3 * 3 * 3 = 27 combinations.
- 27 <= 64, so the full cross-product succeeds.
- If the product exceeds
max_exact_set, the algorithm falls back to the best window or child prefilter.
Pattern foo\d+bar:
compile_trigger_planextracts anchors (e.g.,"foo"or"bar") and then callscollect_confirm_all_literalsto find mandatory fixed-literal islands from the top-level concatenation.- Both
"foo"and"bar"are mandatory islands. The one chosen as the primary anchor is excluded fromconfirm_all; the other is kept as an AND filter to reduce false positive candidate windows (regex2anchor.rs).
Anchors flow from regex2anchor into the engine via core.rs:
-
Rule compilation (
core.rs):compile_trigger_plan(rule.re, &derive_cfg)is called for each rule. -
Anchor registration (
core.rs): ForTriggerPlan::Anchored, each anchor is registered as:- A raw-byte pattern (Variant::Raw)
- A UTF-16LE encoded pattern (Variant::Utf16Le)
- A UTF-16BE encoded pattern (Variant::Utf16Be)
-
Database compilation (
vectorscan_prefilter.rs): Anchor byte sequences are encoded as\xNNVectorscan patterns and compiled into block-mode or stream-mode databases alongside raw rule patterns. -
Runtime scanning: Vectorscan reports anchor hits via callbacks. Each hit seeds a candidate window using the anchor's position plus configurable seed radius. Only these windows are passed to the full regex engine.
-
Residue/Unfilterable handling (
core.rs): Rules withResidueorUnfilterableplans fall back to manual anchor patterns (if provided by the rule author) or are flagged as unfilterable.
┌──────────────┐ ┌──────────────────┐ ┌───────────────────┐
│ Rule spec │───▶│ compile_trigger │───▶│ TriggerPlan │
│ (regex str) │ │ _plan() │ │ │
└──────────────┘ └──────────────────┘ └─────────┬─────────┘
│
┌─────────────────────────────────┐│┌──────────────────────┐
│ Anchored │││ Unfilterable │
▼ │▼│ ▼
┌───────────────────┐ ┌─────┴──────┐ ┌──────────────────┐
│ Register anchors │ │ Residue │ │ Fall back to │
│ Raw + UTF16LE/BE │ │ gate plan │ │ manual anchors │
│ into pat_map │ └────────────┘ │ or skip prefilter│
└────────┬──────────┘ └──────────────────┘
│
▼
┌───────────────────┐
│ VsPrefilterDb │
│ compile patterns │
│ into Vectorscan │
└────────┬──────────┘
│
▼
┌───────────────────┐
│ Runtime: scan │
│ haystack, seed │
│ candidate windows │
└───────────────────┘
When multiple anchor candidates exist (e.g., in concatenation), the algorithm selects the most selective option using a scoring heuristic (regex2anchor.rs):
score = min_len * 8 - ceil_log2(set_size)
- Longer anchors score higher (8 bits of "selectivity" per byte).
- Large
AnyOfsets incur a log2 penalty. - Example: min_len=4, set_size=16 → score = 32 - 4 = 28.
- This favors a small set of longer anchors over a large set of short ones.
Tie-breaking (regex2anchor.rs): when scores are equal, prefer higher min_len, then fewer set elements, then higher max_len.
When anchors are unavailable, the algorithm attempts conservative fallback gates:
Applies to patterns that reduce to a single consuming ASCII atom (literal or class) with optional repetition and word boundaries. The gate specifies:
- A 256-bit byte mask of allowed bytes
- Min/max run length in bytes
- Optional ASCII word boundary requirement (only when both leading and trailing boundaries are present)
- Optional UTF-16 variant scanning
Example: [A-F0-9]{4} → RunLengthGate with hex byte mask, min=4, max=4.
Enumerates all possible k-length prefixes of the pattern. Each position contributes an ASCII alphabet; the cross-product of alphabets produces k-gram hashes stored for fast membership checking. Only applies when the prefix is fixed-length and enumerable.
For alternations where individual branches have residue gates, an Or(Vec<ResidueGatePlan>) is produced. All branch gates must exist; if any branch has no gate, the entire plan fails.
Targeted tests covering:
- Basic functionality: literals, concatenation, alternation, repetition, character classes (
regex2anchor_tests.rs) - Bug-hunting tests for specific failure modes: min-anchor-length post-filtering, optional prefix handling, cross-product overflow, empty alternation branches, Unicode vs. bytes (
regex2anchor_tests.rs) - Property-based tests using proptest with correlated pattern/haystack pairs (
regex2anchor_tests.rs) - Real-world pattern tests: AWS access keys, GitHub tokens, JWTs, Slack tokens (
regex2anchor_tests.rs)
Exhaustive and property-based soundness verification:
- Exhaustive domain tests (
regex2anchor_soundness.rs): Enumerate all strings of length 0..6 over alphabet{a,b,c,d}(5,461 strings). For each pattern, verify that every regex match contains at least one anchor. - Property-based tests (
regex2anchor_soundness.rs): 250-500 cases with simple and complex generated patterns, tested against exhaustive small domains and random byte haystacks. - Metamorphic tests (
regex2anchor_soundness.rs): Verify that wrapping a pattern in a capture group does not change the derived anchors. - Edge case tests: nested alternations, repetition bounds, concatenation with classes, case-insensitive flags, byte-mode hex escapes, NUL bytes, class expansion limits, exact set size limits.
- TriggerPlan tests (
regex2anchor_soundness.rs): Verify thatcompile_trigger_planproduces the correct plan type (Anchored, Residue, Unfilterable) for various patterns, including run-length gates, OR gates for alternations, and k-gram gates (feature-gated).
The soundness tests implement the invariant check directly:
// For every haystack in the domain:
if regex.is_match(&haystack) && !anchors_cover_match(&haystack, &anchors) {
// Soundness violation found — this is a bug
}This provides a mathematical proof of correctness over the bounded domain and statistical confidence via property-based testing over larger inputs.
| Component | File |
|---|---|
| Module implementation | crates/scanner-engine/src/regex2anchor.rs |
AnchorDeriveConfig |
crates/scanner-engine/src/regex2anchor.rs |
AnchorDeriveError |
crates/scanner-engine/src/regex2anchor.rs |
Prefilter enum |
crates/scanner-engine/src/regex2anchor.rs |
TriggerPlan enum |
crates/scanner-engine/src/regex2anchor.rs |
UnfilterableReason |
crates/scanner-engine/src/regex2anchor.rs |
ResidueGatePlan / RunLengthGate / KGramGate |
crates/scanner-engine/src/regex2anchor.rs |
Info (internal) |
crates/scanner-engine/src/regex2anchor.rs |
class_to_small_byte_set |
crates/scanner-engine/src/regex2anchor.rs |
cross_product |
crates/scanner-engine/src/regex2anchor.rs |
analyze (core HIR dispatch) |
crates/scanner-engine/src/regex2anchor.rs |
analyze_repetition |
crates/scanner-engine/src/regex2anchor.rs |
best_exact_window_prefilter |
crates/scanner-engine/src/regex2anchor.rs |
analyze_concat |
crates/scanner-engine/src/regex2anchor.rs |
analyze_alternation |
crates/scanner-engine/src/regex2anchor.rs |
derive_residue_gate_plan |
crates/scanner-engine/src/regex2anchor.rs |
derive_run_length_gate |
crates/scanner-engine/src/regex2anchor.rs |
extract_core_with_boundary |
crates/scanner-engine/src/regex2anchor.rs |
derive_kgram_gate (feature-gated) |
crates/scanner-engine/src/regex2anchor.rs |
choose_anchors |
crates/scanner-engine/src/regex2anchor.rs |
collect_confirm_all_literals |
crates/scanner-engine/src/regex2anchor.rs |
compile_trigger_plan (public entry) |
crates/scanner-engine/src/regex2anchor.rs |
derive_anchors_from_pattern (public entry) |
crates/scanner-engine/src/regex2anchor.rs |
| Engine integration (caller) | crates/scanner-engine/src/engine/core.rs |
| Vectorscan prefilter module | crates/scanner-engine/src/engine/vectorscan_prefilter.rs |
| Unit tests | crates/scanner-engine/src/regex2anchor_tests.rs |
| Soundness property tests | crates/scanner-engine-integration-tests/tests/property/regex2anchor_soundness.rs |