Content classification, format-specific text extraction, base64 pre-decode gating, and cache-line-friendly CLOCK caching for the scanner engine.
Located in:
- `crates/scanner-engine/src/content_policy/` (classification and extraction)
- `crates/scanner-engine/src/b64_yara_gate.rs` (base64 gate)
- `crates/scanner-engine/src/lsm/set_associative_cache.rs` (CLOCK cache)
These three subsystems serve different stages of the scan pipeline:
- Content Policy: Decides whether incoming file content should be scanned as text, skipped as binary, or have text extracted from a known binary format. The primary entry point is `classify_content`, which combines a NUL-byte heuristic with file extension matching.
- YARA Base64 Gate: A fast pre-filter that checks whether any base64-encoded form of known anchor patterns appears in a buffer. This gates downstream base64 decoding: if no anchor permutation is found, decoding is skipped entirely.
- Set-Associative Cache: A compact, cache-line-friendly CLOCK Nth-Chance cache used by the LSM subsystem for hot data. Tags provide fast mismatch filtering; SIMD-accelerated tag search handles the common case in a single instruction.
flowchart TD
Input["File content + path"]
NUL{"NUL byte in<br/>first 8192 bytes?"}
TextExt{"Extension is<br/>.ipynb or .env?"}
BinExt{"Extension is<br/>.class/.jar/.war/.pyc<br/>.ipynb/.env?"}
Text["ContentVerdict::Text"]
Binary["ContentVerdict::Binary"]
Extract["ContentVerdict::BinaryExtractable"]
Input --> NUL
NUL -- No --> TextExt
NUL -- Yes --> BinExt
TextExt -- Yes --> Extract
TextExt -- No --> Text
BinExt -- Yes --> Extract
BinExt -- No --> Binary
style Text fill:#e8f5e9
style Binary fill:#ffebee
style Extract fill:#fff3e0
classify_content(data, path, check_len) is the single entry point. It operates in two phases:
- NUL-byte heuristic: `is_likely_binary` scans the first `CHECK_LEN` (8192) bytes for a NUL byte using `memchr` (SIMD-accelerated). This mirrors Git's `buffer_is_binary` technique. Empty data is not considered binary.
- Extension matching: `match_extractable_format` checks the file path for known extractable formats. Matching is case-insensitive on ASCII bytes. Dotenv files are matched by basename (`.env`, `.env.*`); other formats are matched by extension (`.ipynb`, `.class`, `.jar`, `.war`, `.pyc`).
The interaction between these two phases has an intentional asymmetry:
- Text-extractable formats (`.ipynb`, `.env`): Classified as `BinaryExtractable` even when no NUL bytes are present. These are text files that benefit from structured extraction (e.g., stripping JSON noise from notebooks, normalizing dotenv quoting).
- Binary-only formats (`.class`, `.jar`, `.war`, `.pyc`): Classified as `BinaryExtractable` only when NUL bytes are present. A NUL-free file with a `.class` extension is likely a misnamed text file and should be scanned as-is rather than passed through the class file parser.
| Verdict | Meaning | Action |
|---|---|---|
| `Text` | Content appears to be text | Scan normally |
| `Binary` | Content appears to be binary, no known extractor | Skip unless `--scan-binary` is set |
| `BinaryExtractable(fmt)` | Known format with text extraction strategy | Run format-specific extractor, then scan extracted text |
- Dotenv: Matches basename `.env` exactly, or `.env.` followed by any suffix (e.g., `.env.local`, `.env.production`). Names like `.envrc` do not match. Matching uses `file_name()`, which splits on both `/` and `\` separators.
- Other formats: Matched by the file extension after the last `.` in the filename component. Dotfiles (leading `.` with no other `.`) return no extension.
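The two phases and their asymmetry can be sketched as follows. This is a minimal illustration, not the crate's implementation: the type and function names mirror this document, but the bodies are assumptions, and `contains` from the standard library stands in for `memchr`.

```rust
// Sketch only: names follow the document, bodies are assumptions.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ExtractableFormat { DotEnv, Ipynb, JavaClass, JarWar, Pyc }

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ContentVerdict { Text, Binary, BinaryExtractable(ExtractableFormat) }

/// NUL-byte heuristic over the first `check_len` bytes; empty data is not binary.
pub fn is_likely_binary(data: &[u8], check_len: usize) -> bool {
    data[..data.len().min(check_len)].contains(&0)
}

/// Basename and extension matching, case-insensitive on ASCII.
pub fn match_extractable_format(path: &str) -> Option<ExtractableFormat> {
    // Split on both separators to isolate the filename component.
    let name = path.rsplit(|c: char| c == '/' || c == '\\').next().unwrap_or(path);
    let lower = name.to_ascii_lowercase();
    if lower == ".env" || lower.starts_with(".env.") {
        return Some(ExtractableFormat::DotEnv);
    }
    let (stem, ext) = lower.rsplit_once('.')?;
    if stem.is_empty() {
        return None; // dotfile with no other '.', so no extension
    }
    match ext {
        "ipynb" => Some(ExtractableFormat::Ipynb),
        "class" => Some(ExtractableFormat::JavaClass),
        "jar" | "war" => Some(ExtractableFormat::JarWar),
        "pyc" => Some(ExtractableFormat::Pyc),
        _ => None,
    }
}

pub fn classify_content(data: &[u8], path: &str, check_len: usize) -> ContentVerdict {
    let binary = is_likely_binary(data, check_len);
    match (match_extractable_format(path), binary) {
        // Text-extractable formats are extracted with or without NUL bytes.
        (Some(f @ (ExtractableFormat::DotEnv | ExtractableFormat::Ipynb)), _) => {
            ContentVerdict::BinaryExtractable(f)
        }
        // Binary-only formats are extracted only when NUL bytes are present.
        (Some(f), true) => ContentVerdict::BinaryExtractable(f),
        (_, true) => ContentVerdict::Binary,
        (_, false) => ContentVerdict::Text,
    }
}
```

Under these assumptions, a NUL-free `Foo.class` falls through to `Text`, while `.env.local` is always routed to extraction.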
All format handlers implement the Extractor trait:

    pub trait Extractor {
        fn extract(&self, data: &[u8], out: &mut Vec<u8>, scratch: &mut Vec<u8>) -> ExtractResult;
    }

Contract:
- Implementations append to `out`; they must not clear it.
- On `ParseError` or `Empty`, `out` must be left unchanged from its state at entry.
- Implementations must not panic on arbitrary input.
- `scratch` is a caller-provided temporary workspace (e.g., for decompressing JAR entries). Extractors that don't need it ignore it.
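A toy implementation helps make the contract concrete. `Utf8LinesExtractor` below is hypothetical and exists only for illustration; it copies non-empty lines and rolls `out` back to its entry length when it meets invalid UTF-8.

```rust
#[derive(Debug, PartialEq)]
pub enum ExtractResult { Ok, Empty, ParseError }

pub trait Extractor {
    fn extract(&self, data: &[u8], out: &mut Vec<u8>, scratch: &mut Vec<u8>) -> ExtractResult;
}

/// Hypothetical extractor used only to illustrate the contract.
pub struct Utf8LinesExtractor;

impl Extractor for Utf8LinesExtractor {
    fn extract(&self, data: &[u8], out: &mut Vec<u8>, _scratch: &mut Vec<u8>) -> ExtractResult {
        // Remember the entry state; existing bytes in `out` are never cleared.
        let checkpoint = out.len();
        for line in data.split(|&b| b == b'\n') {
            if line.is_empty() {
                continue;
            }
            if std::str::from_utf8(line).is_err() {
                // Roll back everything appended by this call.
                out.truncate(checkpoint);
                return ExtractResult::ParseError;
            }
            out.extend_from_slice(line);
            out.push(b'\n');
        }
        if out.len() == checkpoint {
            ExtractResult::Empty
        } else {
            ExtractResult::Ok
        }
    }
}
```

The checkpoint pattern keeps the "unchanged on `ParseError` or `Empty`" rule local to each implementation, so callers can chain extractors into one buffer.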
extract_content(format, data, out, scratch) clears out before dispatching to the correct extractor. Production code should use this function — it owns the clear-before-dispatch invariant.

    pub fn extract_content(format: ExtractableFormat, data: &[u8],
                           out: &mut Vec<u8>, scratch: &mut Vec<u8>) -> ExtractResult;

| Constant | Value | Purpose |
|---|---|---|
| `EXTRACT_INPUT_CAP` | 64 KiB | Pre-allocation hint for input buffers |
| `EXTRACT_OUTPUT_CAP` | 32 KiB | Output budget cap; extractors stop writing beyond this |
| `JAR_ENTRY_CAP` | 32 KiB | Pre-allocation hint for JAR entry scratch |
Extractor: DotEnvExtractor
Produces normalized KEY=VALUE\n lines from dotenv content, stripping comments, quoting, escape sequences, and export prefixes.
Supported syntax:
| Form | Behavior |
|---|---|
| `KEY=VALUE` | Bare assignment |
| `export KEY=VALUE` | Lowercase `export` prefix stripped |
| Empty lines, `# ...` | Skipped entirely |
| Unquoted value | Leading/trailing whitespace trimmed; `#` starts inline comment at position 0 or after whitespace |
| `'...'` | Single-quoted: literal bytes, no escape processing |
| `"..."` | Double-quoted: `\n`, `\r`, `\t`, `\\`, `\"` unescaped; unknown `\x` keeps `x` |
| Quoted multiline | Spans physical lines until closing quote |
Algorithm (single forward pass):
- Advance through lines, skipping blanks and comments.
- Strip optional `export` prefix, then locate the `=` separator.
- Dispatch value to a quoting-specific parser (unquoted, single-quoted, double-quoted). Quoted parsers consume continuation lines via a shared `pos` cursor.
- Each `KEY=value\n` entry is written atomically using checkpoint/rollback: if the value is malformed or the output budget is exhausted, `out` is truncated back to its pre-entry length.
Robustness features:
- Unterminated quotes trigger a fallback: the first physical line's raw value is emitted as unquoted, and the position cursor is rewound so continuation lines are re-evaluated as independent entries.
- `MAX_MULTILINE_LINES` (1024) caps continuation lines to prevent unbounded scanning from unclosed quotes.
- Line scanning and delimiter searches use `memchr`/`memchr2` for SIMD-accelerated matching.
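The unquoted-value rules are the easiest part of the format to show in isolation. The sketch below covers only single-line, unquoted entries; quoted and multiline values, the output budget, and rollback are deliberately left out. The function names are hypothetical, and operating on `&str` rather than raw bytes is a simplification.

```rust
/// Trim an unquoted value and cut an inline comment. A `#` only starts a
/// comment at position 0 of the value or directly after whitespace.
pub fn parse_unquoted(raw: &str) -> &str {
    let bytes = raw.as_bytes();
    let mut end = bytes.len();
    for (i, &b) in bytes.iter().enumerate() {
        if b == b'#' && (i == 0 || bytes[i - 1].is_ascii_whitespace()) {
            end = i;
            break;
        }
    }
    raw[..end].trim()
}

/// Normalize one physical line to `(key, value)`, or `None` if it is blank,
/// a comment, or has no `=` separator.
pub fn normalize_line(line: &str) -> Option<(&str, &str)> {
    let line = line.trim_start();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // Only the lowercase `export` prefix is stripped.
    let line = line.strip_prefix("export ").unwrap_or(line);
    let (key, value) = line.split_once('=')?;
    let key = key.trim();
    if key.is_empty() {
        return None;
    }
    Some((key, parse_unquoted(value)))
}

/// Append normalized `KEY=VALUE\n` lines to `out`.
pub fn extract_unquoted(input: &str, out: &mut Vec<u8>) {
    for line in input.lines() {
        if let Some((key, value)) = normalize_line(line) {
            out.extend_from_slice(key.as_bytes());
            out.push(b'=');
            out.extend_from_slice(value.as_bytes());
            out.push(b'\n');
        }
    }
}
```

Note that a `#` inside a URL fragment survives because it is not preceded by whitespace.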
Extractor: IpynbExtractor
Parses the JSON structure and extracts source lines from all cell types — code, markdown, and raw. Markdown cells are included because they frequently contain secrets in code fences, inline snippets, or configuration examples.
Extraction strategy:
- Each cell's `source` array is concatenated verbatim (lines already contain their own `\n` terminators).
- A trailing `\n` is appended after each non-empty cell to ensure cell boundaries are visible to line-oriented scan rules.
- `outputs` blobs are ignored: they are large, noisy, and rarely contain secrets.
- Uses a minimal typed `Notebook`/`Cell` struct for deserialization via serde, avoiding `serde_json::Value` allocations.
Extractor: JavaClassExtractor
Parses the constant pool of a compiled Java class file to extract CONSTANT_Utf8 entries (tag 1). These contain string literals, class names, method signatures, and other text that may hold secrets.
Format:
Bytes 0-3: 0xCAFEBABE (magic)
Bytes 4-7: Minor/major version
Bytes 8-9: Constant pool count (u16)
Bytes 10+: Constant pool entries
Parsing:
- Only `CONSTANT_Utf8` entries are extracted, each emitted as a newline-terminated string.
- All other constant types are skipped by their known fixed sizes per JVM spec §4.4.
- `CONSTANT_Long` and `CONSTANT_Double` consume two constant pool slots per §4.4.5.
- Parsing stops at the end of the constant pool: fields, methods, and attributes are not descended into because the constant pool already contains every string literal referenced in the class.
- Unknown tags cause immediate bail-out (cannot determine their size).
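The parsing rules above amount to a single loop over tagged entries. The following is a sketch under stated assumptions: the function names are hypothetical, the tag sizes follow the JVM class file format, and a simple `bool` return with rollback on failure stands in for the real `ExtractResult` handling.

```rust
/// Walk the constant pool, appending each CONSTANT_Utf8 payload plus '\n'.
/// Returns None on bad magic, truncation, or an unknown tag.
fn walk_pool(data: &[u8], out: &mut Vec<u8>) -> Option<()> {
    if !data.starts_with(&[0xCA, 0xFE, 0xBA, 0xBE]) {
        return None;
    }
    let count = u16::from_be_bytes([*data.get(8)?, *data.get(9)?]) as usize;
    let mut pos = 10;
    let mut index = 1; // constant pool indices start at 1
    while index < count {
        let tag = *data.get(pos)?;
        pos += 1;
        // (payload bytes to skip, pool slots consumed)
        let (skip, slots) = match tag {
            1 => {
                // CONSTANT_Utf8: u16 big-endian length, then the bytes.
                let len = u16::from_be_bytes([*data.get(pos)?, *data.get(pos + 1)?]) as usize;
                out.extend_from_slice(data.get(pos + 2..pos + 2 + len)?);
                out.push(b'\n');
                (2 + len, 1)
            }
            3 | 4 => (4, 1),                 // Integer, Float
            5 | 6 => (8, 2),                 // Long, Double take two slots
            7 | 8 | 16 | 19 | 20 => (2, 1),  // Class, String, MethodType, Module, Package
            9..=12 | 17 | 18 => (4, 1),      // refs, NameAndType, Dynamic, InvokeDynamic
            15 => (3, 1),                    // MethodHandle
            _ => return None,                // unknown tag: size cannot be determined
        };
        pos += skip;
        index += slots;
    }
    if pos <= data.len() { Some(()) } else { None }
}

/// Hypothetical wrapper: restores `out` to its entry length on failure.
pub fn extract_utf8_constants(data: &[u8], out: &mut Vec<u8>) -> bool {
    let checkpoint = out.len();
    let ok = walk_pool(data, out).is_some();
    if !ok {
        out.truncate(checkpoint);
    }
    ok
}
```

Every read goes through `get`, so arbitrary input cannot cause an out-of-bounds panic in this sketch.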
Extractor: JarWarExtractor
Opens a ZIP archive and iterates over .class entries, delegating each to JavaClassExtractor. Only .class files are processed — other text files inside JARs are scannable individually if encountered separately.
Budget limits:
| Limit | Value | Purpose |
|---|---|---|
| `MAX_ENTRIES` | 10,000 | Cap on ZIP central directory iteration |
| `MAX_CLASS_SIZE` | 10 MiB | Skips anomalously large `.class` files |
| `MAX_TOTAL_EXTRACTED` | 100 MiB | Hard ceiling on output buffer size |
Per-entry limits skip the oversized entry. The total budget stops extraction entirely. In both cases, text already appended is kept.
The caller-provided scratch buffer is reused across entries to hold decompressed .class bytes, avoiding per-entry allocation. Extension matching is case-insensitive (e.g., .CLASS, .Class both match).
Extractor: PycExtractor
Parses the marshal data in a compiled Python 3.6+ file to extract string constants.
Extracted string types:
- `TYPE_STRING` (`s`), `TYPE_UNICODE` (`u`), `TYPE_INTERNED` (`t`): u32 LE length prefix
- `TYPE_SHORT_ASCII` (`z`), `TYPE_SHORT_ASCII_INTERNED` (`Z`): u8 length prefix
- `TYPE_ASCII` (`a`), `TYPE_ASCII_INTERNED` (`A`): u32 LE length prefix
Header format:
Bytes 0-3: magic (2 bytes) + \r\n (2 bytes)
Python 3.6: Bytes 4-11: timestamp + source size → header_size = 12
Python 3.7+: Bytes 4-15: flags + hash/timestamp → header_size = 16
Header size is selected from the CPython magic number: values below 3390 use 12 bytes (pre-3.7), 3390+ use 16 bytes (PEP 552).
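Header selection reduces to one comparison on the little-endian magic number. A minimal sketch, assuming a hypothetical function name `pyc_header_size` and that a buffer shorter than its own header counts as malformed:

```rust
/// Return the .pyc header size, or None if the header is malformed or truncated.
pub fn pyc_header_size(data: &[u8]) -> Option<usize> {
    let head = data.get(..4)?;
    // Bytes 2-3 of every CPython magic are "\r\n".
    if head[2] != b'\r' || head[3] != b'\n' {
        return None;
    }
    // Bytes 0-1 are the magic number, little-endian.
    let magic = u16::from_le_bytes([head[0], head[1]]);
    // Values below 3390 use the 12-byte header; 3390 and above use 16 bytes (PEP 552).
    let size = if magic < 3390 { 12 } else { 16 };
    if data.len() >= size { Some(size) } else { None }
}
```

The marshal stream then starts at `data[size..]`.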
Marshal walk strategy (linear sweep):
- String types: content appended to output, one line per string.
- Fixed-size types (`int`, `ref`, `True`, etc.): skipped by known byte sizes. `float` and `complex` are variable-length (length-prefixed).
- Container types (`tuple`, `list`, `set`, `frozenset`, `dict`): only the element-count prefix is skipped; contained objects follow inline.
- `TYPE_CODE`: skips the 24-byte numeric prefix (Python 3.8+ layout: 6 × u32). Sub-fields are marshal objects parsed in subsequent iterations.
- `FLAG_REF` (0x80): masked off before matching so ref-tagged objects are handled identically.
- Unknown opcodes: cause immediate bail-out.
Base64 encoding groups bytes into 3-byte blocks. A raw anchor string can start at offsets 0, 1, or 2 within a block, yielding three base64 permutations. Leading and trailing base64 characters can be unstable because they depend on adjacent (unknown) bytes. YARA documents this for the string "This program cannot"; Sigma's base64offset modifier describes the same phenomenon.
For each anchor A and each offset o in {0, 1, 2}:
- Prefix `A` with `o` zero bytes to model unknown preceding bytes.
- Encode with the standard base64 alphabet.
- Drop the unstable prefix: `o=0 → 0 chars`, `o=1 → 2 chars`, `o=2 → 3 chars`.
- Drop the unstable suffix based on `(len(prefix+A) % 3)`: `rem=0 → 0`, `rem=1 → 3`, `rem=2 → 2`.
- Keep the remaining substring if it is at least `min_pattern_len` characters.
Generated patterns are deduplicated across anchors and offsets using a BTreeSet for stable ordering.
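The generation steps can be reproduced with the standard library alone. This is an illustrative sketch rather than the gate's code: `b64_encode` and `permutations` are hypothetical names, and the encoder is a plain textbook implementation.

```rust
use std::collections::BTreeSet;

const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 with '=' padding.
fn b64_encode(input: &[u8]) -> String {
    let mut out = String::new();
    for chunk in input.chunks(3) {
        let b = [chunk[0], *chunk.get(1).unwrap_or(&0), *chunk.get(2).unwrap_or(&0)];
        let n = (b[0] as u32) << 16 | (b[1] as u32) << 8 | (b[2] as u32);
        for i in 0..4usize {
            if i <= chunk.len() {
                out.push(ALPHABET[((n >> (18 - 6 * i)) & 63) as usize] as char);
            } else {
                out.push('=');
            }
        }
    }
    out
}

/// Stable base64 substrings of `anchor` for block offsets 0, 1, and 2.
pub fn permutations(anchor: &[u8], min_pattern_len: usize) -> BTreeSet<String> {
    let mut set = BTreeSet::new();
    for o in 0..3usize {
        // Zero bytes model the unknown bytes preceding the anchor.
        let mut padded = vec![0u8; o];
        padded.extend_from_slice(anchor);
        let encoded = b64_encode(&padded);
        let drop_prefix = [0, 2, 3][o];
        let drop_suffix = [0, 3, 2][padded.len() % 3];
        if encoded.len() < drop_prefix + drop_suffix {
            continue;
        }
        let pattern = &encoded[drop_prefix..encoded.len() - drop_suffix];
        if !pattern.is_empty() && pattern.len() >= min_pattern_len {
            set.insert(pattern.to_string());
        }
    }
    set
}
```

Collecting into a `BTreeSet` gives both the deduplication and the stable ordering described above.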
Example — "This program cannot" produces three permutations:
| Offset | Pattern |
|---|---|
| 0 | VGhpcyBwcm9ncmFtIGNhbm5vd |
| 1 | RoaXMgcHJvZ3JhbSBjYW5ub3 |
| 2 | UaGlzIHByb2dyYW0gY2Fubm90 |
The compiled patterns are assembled into a dense Aho-Corasick automaton over a 64-symbol alphabet (one symbol per base64 character). Scan time is O(1) per byte — a single table lookup with no branches.
Input canonicalization:
- Whitespace is ignored per policy (`Rfc4648`: SP/TAB/CR/LF only; `AsciiWhitespace`: includes VT and FF).
- URL-safe characters are mapped: `'-'→'+'`, `'_'→'/'`.
- `'='` is handled per `PaddingPolicy`: `StopAndHalt` halts scanning (sticky across chunks); `ResetAndContinue` resets to root state and continues.
- Any other byte resets the automaton to the root state, so matches never span invalid bytes.
A 256-entry lookup table (input_lut) classifies each input byte into one of: a base64 symbol (0–63), whitespace tag (TAG_WS = 0xFE), padding tag (TAG_PAD = 0xFD), or invalid tag (TAG_INVALID = 0xFF). The LUT is built at compile time.
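One way to build such a table is a `const fn`, which lets the compiler evaluate it ahead of time. The sketch below is an assumption about the shape of the table, not the gate's source: the tag values come from the text above, while `build_input_lut` and the boolean whitespace switch are hypothetical.

```rust
pub const TAG_PAD: u8 = 0xFD;
pub const TAG_WS: u8 = 0xFE;
pub const TAG_INVALID: u8 = 0xFF;

const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Classify every possible input byte as a symbol (0..=63) or a tag.
pub const fn build_input_lut(ascii_whitespace: bool) -> [u8; 256] {
    let mut lut = [TAG_INVALID; 256];
    let mut i = 0;
    while i < 64 {
        lut[ALPHABET[i] as usize] = i as u8;
        i += 1;
    }
    // URL-safe characters share symbols with '+' and '/'.
    lut[b'-' as usize] = 62;
    lut[b'_' as usize] = 63;
    lut[b'=' as usize] = TAG_PAD;
    lut[b' ' as usize] = TAG_WS;
    lut[b'\t' as usize] = TAG_WS;
    lut[b'\r' as usize] = TAG_WS;
    lut[b'\n' as usize] = TAG_WS;
    if ascii_whitespace {
        lut[0x0B] = TAG_WS; // VT
        lut[0x0C] = TAG_WS; // FF
    }
    lut
}

/// Table for the RFC 4648 whitespace policy, evaluated at compile time.
pub const RFC4648_LUT: [u8; 256] = build_input_lut(false);
```

With the table in place, the scan loop needs only one indexed load per input byte to decide between stepping the automaton, skipping, or resetting.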
| Method | Behavior | Use Case |
|---|---|---|
| `hits(encoded)` | One-shot scan; respects `PaddingPolicy` | Per-span scanning with precise boundaries |
| `hits_anywhere(encoded)` | Always treats `'='` as boundary (resets), continues scanning | Whole-buffer prefiltering |
| `scan_with_state(encoded, &mut GateState)` | Incremental scan across chunk boundaries; caller resets state between spans | Streaming input |
- False positives expected: this is a gate, not a validator.
- False negatives possible if short patterns are dropped via `min_pattern_len`.
- Memory: O(states × 64) for the dense automaton, trading space for predictable latency.
Ac64 is a dense Aho-Corasick automaton:
- `next: Box<[u32]>` is the dense transition table: `state * 64 + sym → next state`. Every state has a transition for every symbol (no branches on lookup).
- `out: Box<[u8]>` is the output marker per state (0/1). Output propagation through fail links ensures suffix matches are reported.
- The automaton is reference-counted (`Arc<Ac64>`) so cloning `Base64YaraGate` is cheap.
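For readers unfamiliar with the dense form, here is a generic textbook construction with the same table layout. It is not the crate's `Ac64`: `DenseAc` is a hypothetical name, it uses `Vec` instead of boxed slices, and it assumes patterns are already mapped to symbols in `0..64`.

```rust
use std::collections::VecDeque;

/// Dense Aho-Corasick over 64 symbols. Precondition: every symbol is < 64.
pub struct DenseAc {
    next: Vec<u32>, // state * 64 + sym -> next state
    out: Vec<u8>,   // 1 if any pattern ends at (or is a suffix of) this state
}

impl DenseAc {
    pub fn build(patterns: &[&[u8]]) -> Self {
        const NONE: u32 = u32::MAX;
        // Phase 1: build the trie, with NONE marking a missing edge.
        let mut next = vec![NONE; 64];
        let mut out = vec![0u8];
        for pat in patterns {
            let mut s = 0usize;
            for &sym in pat.iter() {
                let slot = s * 64 + sym as usize;
                if next[slot] == NONE {
                    next[slot] = out.len() as u32;
                    next.extend(std::iter::repeat(NONE).take(64));
                    out.push(0);
                }
                s = next[slot] as usize;
            }
            out[s] = 1;
        }
        // Phase 2: breadth-first pass that resolves fail links into the table.
        let mut fail = vec![0u32; out.len()];
        let mut queue = VecDeque::new();
        for sym in 0..64 {
            if next[sym] == NONE { next[sym] = 0; } else { queue.push_back(next[sym] as usize); }
        }
        while let Some(s) = queue.pop_front() {
            let f = fail[s] as usize;
            let inherited = out[f];
            out[s] |= inherited; // propagate outputs so suffix matches are reported
            for sym in 0..64 {
                let t = next[s * 64 + sym];
                if t == NONE {
                    next[s * 64 + sym] = next[f * 64 + sym];
                } else {
                    fail[t as usize] = next[f * 64 + sym];
                    queue.push_back(t as usize);
                }
            }
        }
        DenseAc { next, out }
    }

    /// One table lookup per symbol; returns true on the first match.
    pub fn hits(&self, syms: &[u8]) -> bool {
        let mut s = 0usize;
        for &sym in syms {
            s = self.next[s * 64 + sym as usize] as usize;
            if self.out[s] != 0 {
                return true;
            }
        }
        false
    }
}
```

After construction the fail links are no longer needed: they are baked into `next`, which is what makes the per-byte step branch-free.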
The scan engine (engine/core.rs) builds a Base64YaraGate during engine initialization. For rules with Gate::AnchorsInDecoded on base64 transforms, the gate is checked against encoded bytes before invoking the full decode pipeline. If the gate does not hit, base64 decoding is skipped entirely, saving significant work on buffers that don't contain the target patterns.
Provides a compact, cache-line-friendly cache for LSM hot data. Designed for high-throughput lookups where both hit-rate and latency matter.

    ┌─────────────────────────────────────────────┐
    │              Hash + fast_range              │
    │            key → set index + tag            │
    └──────────────────┬──────────────────────────┘
                       │
             ┌─────────▼─────────┐
             │ Set (WAYS slots)  │
             ├───┬───┬───┬───┬───┤
             │tag│tag│tag│tag│...│ ← SIMD-compared
             ├───┼───┼───┼───┼───┤
             │cnt│cnt│cnt│cnt│...│ ← packed CLOCK counts
             ├───┼───┼───┼───┼───┤
             │val│val│val│val│...│ ← aligned value buffer
             └───┴───┴───┴───┴───┘
                       ↑
                  clock hand

- Hash the key to 64-bit entropy via `C::hash(key)`.
- Select set with `fast_range(entropy, sets)` and derive tag via `Tag::truncate(entropy)`.
- Search tags: SIMD-accelerated comparison of the tag against all `WAYS` tags in the set. Returns a bitmask of matching ways.
  - aarch64: NEON intrinsics (`vceqq_u8`/`u16`).
  - x86_64: SSE2 baseline (`_mm_cmpeq_epi8`/`16`); AVX2 (`_mm256_cmpeq_epi16`) for 16-way u16 tags.
  - Miri / other: Scalar fallback.
- Confirm keys: For each tag-matching, occupied slot (count > 0), extract the key from the stored value and compare against the lookup key.
- On hit: increment the count (saturating, capped at `max_count`).
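The lookup path can be sketched with the scalar fallback in place of SIMD. Everything here is illustrative: the function names are hypothetical, keys are stored directly instead of being extracted from values, and deriving the 8-bit tag from the low bits of the entropy is an assumption about `Tag::truncate`.

```rust
/// Map 64-bit entropy to [0, n) with a multiply and shift instead of a modulo.
pub fn fast_range(entropy: u64, n: u64) -> u64 {
    ((entropy as u128 * n as u128) >> 64) as u64
}

/// Assumed tag derivation: keep the low 8 bits of the entropy.
pub fn truncate_tag(entropy: u64) -> u8 {
    entropy as u8
}

/// Scalar tag search: bit `i` of the result is set when way `i` holds `tag`.
pub fn search_tags(tags: &[u8], tag: u8) -> u16 {
    let mut mask = 0u16;
    for (way, &t) in tags.iter().enumerate().take(16) {
        if t == tag {
            mask |= 1 << way;
        }
    }
    mask
}

/// Confirm tag matches against occupancy (count > 0) and the full key,
/// bumping the count on a hit (saturating at `max_count`).
pub fn lookup(
    tags: &[u8],
    counts: &mut [u8],
    keys: &[u64],
    tag: u8,
    key: u64,
    max_count: u8,
) -> Option<usize> {
    let mut mask = search_tags(tags, tag);
    while mask != 0 {
        let way = mask.trailing_zeros() as usize;
        if counts[way] > 0 && keys[way] == key {
            counts[way] = (counts[way] + 1).min(max_count);
            return Some(way);
        }
        mask &= mask - 1; // clear the lowest set bit and try the next candidate
    }
    None
}
```

A stale tag in an empty slot never produces a hit, because occupancy is decided by the count and not by the tag.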
On insertion when no existing entry matches:
- Start at the set's clock hand position.
- For each way (wrapping around):
  - If count is 0: insert here.
  - Otherwise: decrement count. If it reaches 0, evict and insert here.
- Advance the clock hand to the next way after the insertion point.
Worst-case iterations: `WAYS × (max_count - 1)`. With `CLOCK_BITS=2` (so `max_count = 3`) and `WAYS=16`, this is 16 × 2 = 32 iterations before guaranteed eviction.
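A single-set version of the eviction scan fits in a few lines. This sketch uses a plain `u8` slice for the counts instead of the packed array, the name `clock_select` is hypothetical, and giving a fresh entry an initial count of 1 is an assumption.

```rust
/// Choose the way to fill in one set. Returns `(way, evicted)`, where
/// `evicted` is true when a live entry had to be displaced.
/// Precondition: `counts` is non-empty.
pub fn clock_select(counts: &mut [u8], hand: &mut usize) -> (usize, bool) {
    let ways = counts.len();
    let mut way = *hand % ways;
    let evicted = loop {
        if counts[way] == 0 {
            break false; // empty slot: insert without evicting
        }
        counts[way] -= 1;
        if counts[way] == 0 {
            break true; // last chance used up: evict this entry
        }
        way = (way + 1) % ways;
    };
    counts[way] = 1; // assumption: a new entry starts with count 1
    *hand = (way + 1) % ways; // the hand moves past the insertion point
    (way, evicted)
}
```

With four ways all at count 3, the scan makes two full decrementing passes (8 iterations, matching `WAYS × (max_count - 1)`) and evicts on the next step.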
The algorithm is similar to "RRIParoo" from "Kangaroo: Caching Billions of Tiny Objects on Flash" — each slot gets multiple chances before eviction, controlled by the CLOCK_BITS parameter.
Tags (Vec<TagT>) — Short partial hashes stored separately from values to keep hot metadata compact. 8-bit or 16-bit. Tag matches are advisory; a full key comparison is authoritative.
Values (AlignedBuf<C::Value>) — Heap-allocated buffer with custom alignment for cache-line-aligned storage. Elements are MaybeUninit<T>; occupancy is tracked by counts, not the value buffer. Destructors are not run (intended for Copy types).
Counts (PackedUnsignedIntegerArray<CLOCK_BITS>) — Per-slot access counts packed into 64-bit words. CLOCK_BITS can be 1, 2, or 4. A count of 0 means the slot is empty.
Clock hands (PackedUnsignedIntegerArray<CLOCK_HAND_BITS>) — Per-set clock position packed similarly. CLOCK_HAND_BITS = log2(WAYS).
`PackedUnsignedIntegerArray` stores small unsigned integers densely in little-endian 64-bit words. `BITS` must be a power of two less than 8. Operations are O(1):

    // get: extract the value at `index`
    let word = words[index / uints_per_word];
    let shift = (index % uints_per_word) * BITS;
    let value = (word >> shift) & mask;

    // set: clear the slot's bits, then write `value` at `index`
    let mask_shifted = mask << shift;
    words[index / uints_per_word] = (word & !mask_shifted) | (value << shift);

| Parameter | Valid Values | Purpose |
|---|---|---|
| `WAYS` | 2, 4, 16 | Associativity (slots per set) |
| `CLOCK_BITS` | 1, 2, 4 | Bits per access count |
| `CACHE_LINE_SIZE` | Power of 2 | Target cache line size (typically 64) |
| `VALUE_ALIGNMENT` | Power of 2 (0 = type default) | Alignment for value buffer |
| `CLOCK_HAND_BITS` | `log2(WAYS)` | Bits per clock hand |
VALUE_COUNT_MAX_MULTIPLE is computed at compile time to ensure all arrays stay cache-line aligned. The init constructor enforces all invariants with compile-time and runtime assertions.
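The packed-integer layout used for counts and clock hands can be tried out with a small standalone type. `PackedArray` below is a hypothetical stand-in for `PackedUnsignedIntegerArray`: it uses a `Vec<u64>` and a runtime assertion, whereas the real type is described as enforcing its invariants at compile time where possible.

```rust
/// Dense array of BITS-wide unsigned integers packed into 64-bit words.
pub struct PackedArray<const BITS: usize> {
    words: Vec<u64>,
}

impl<const BITS: usize> PackedArray<BITS> {
    const UINTS_PER_WORD: usize = 64 / BITS;
    const MASK: u64 = (1u64 << BITS) - 1;

    /// Allocate room for `len` zeroed entries.
    pub fn new(len: usize) -> Self {
        assert!(BITS.is_power_of_two() && BITS < 8);
        let words = (len + Self::UINTS_PER_WORD - 1) / Self::UINTS_PER_WORD;
        Self { words: vec![0; words] }
    }

    /// Extract the value at `index`.
    pub fn get(&self, index: usize) -> u8 {
        let word = self.words[index / Self::UINTS_PER_WORD];
        let shift = (index % Self::UINTS_PER_WORD) * BITS;
        ((word >> shift) & Self::MASK) as u8
    }

    /// Write `value` at `index`; bits above BITS are masked off.
    pub fn set(&mut self, index: usize, value: u8) {
        let shift = (index % Self::UINTS_PER_WORD) * BITS;
        let word = &mut self.words[index / Self::UINTS_PER_WORD];
        let bits = (value as u64) & Self::MASK;
        *word = (*word & !(Self::MASK << shift)) | (bits << shift);
    }
}
```

Because `BITS` is a power of two, entries never straddle a word boundary, which is what keeps both operations to one load and a shift.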
| Operation | Behavior |
|---|---|
| `get(key)` | Returns pointer to cached value; increments count on hit |
| `get_index(key)` | Returns slot index; updates metrics (under `perf-stats` feature) |
| `upsert(value)` | Inserts or replaces; returns `UpsertResult` with evicted value if any |
| `remove(key)` | Zeros count; marks slot as uninitialized |
| `demote(key)` | Sets count to 1 (hinting low priority) |
| `reset()` | Clears tags, counts, clocks, metrics. Values are left untouched. |
The cache is not thread-safe. Interior mutability (UnsafeCell, Cell) is used for counts, clocks, and metrics without synchronization. Callers must synchronize shared access externally.
| Type | Kind | Purpose |
|---|---|---|
| `ContentVerdict` | Enum | Classification result: `Text`, `Binary`, `BinaryExtractable(fmt)` |
| `ExtractableFormat` | Enum | Variants: `DotEnv`, `Ipynb`, `JavaClass`, `JarWar`, `Pyc` |
| `ExtractResult` | Enum | Extraction outcome: `Ok`, `Empty`, `ParseError` |
| `Extractor` | Trait | `fn extract(&self, data, out, scratch) -> ExtractResult` |
| `DotEnvExtractor` | Struct (unit) | Dotenv format handler |
| `IpynbExtractor` | Struct (unit) | Jupyter Notebook handler |
| `JavaClassExtractor` | Struct (unit) | Java class file handler |
| `JarWarExtractor` | Struct (unit) | JAR/WAR archive handler |
| `PycExtractor` | Struct (unit) | Python bytecode handler |

| Type | Kind | Purpose |
|---|---|---|
| `Base64YaraGate` | Struct | Immutable gate backed by `Arc<Ac64>`; cheap to clone |
| `Base64YaraGateConfig` | Struct | Config: `min_pattern_len`, `padding_policy`, `whitespace_policy` |
| `GateState` | Struct | Streaming state: `state: u32` (automaton position), `halted: bool` |
| `PaddingPolicy` | Enum | `StopAndHalt` or `ResetAndContinue` |
| `WhitespacePolicy` | Enum | `Rfc4648` (SP/TAB/CR/LF) or `AsciiWhitespace` (+ VT/FF) |
| `Ac64` | Struct (private) | Dense automaton: `next: Box<[u32]>`, `out: Box<[u8]>` |

| Type | Kind | Purpose |
|---|---|---|
| `SetAssociativeCache<C, TagT, WAYS, ...>` | Struct | Main cache with const-generic parameters |
| `SetAssociativeCacheContext` | Trait | Defines `Key`, `Value`, `key_from_value`, `hash` |
| `Tag` | Trait | Partial hash type; impls for `u8` (8-bit) and `u16` (16-bit) |
| `PackedUnsignedIntegerArray<BITS>` | Struct | Dense packed integers in 64-bit words |
| `AlignedBuf<T>` | Struct | Heap buffer with custom alignment; `MaybeUninit<T>` elements |
| `Metrics` | Struct | Hit/miss counters via `Cell<u64>` |
| `Options` | Struct | `name: &str` for diagnostics |
| `UpsertResult<V>` | Struct | `index`, `updated` (`Update`/`Insert`), `evicted: Option<V>` |
| `UpdateOrInsert` | Enum | `Update` (replaced existing) or `Insert` (new entry) |

| Operation | Complexity | Notes |
|---|---|---|
| `classify_content` | O(min(n, 8192)) | `memchr` SIMD scan for NUL + extension match |
| `is_likely_binary` | O(min(n, check_len)) | Single `memchr` call |
| DotEnv extraction | O(n) | Single-pass; `memchr`/`memchr2` for delimiters |
| Ipynb extraction | O(n) | JSON parse via serde, then linear cell concatenation |
| Java class extraction | O(constant pool) | Single pass through constant pool entries |
| JAR/WAR extraction | O(entries × class size) | ZIP iteration + per-entry class extraction |
| Pyc extraction | O(marshal stream) | Linear sweep of marshal opcodes |
| Base64 gate build | O(anchors × 3) | Pattern generation + Aho-Corasick construction |
| Base64 gate scan | O(n) total, O(1) per byte | Dense automaton lookup; no branches per byte |
| Cache `get`/`get_index` | O(1) amortized | Hash + `fast_range` + SIMD tag search + key compare |
| Cache `upsert` | O(WAYS) worst case | CLOCK scan; bounded by WAYS × (max_count - 1) |
Memory:
- Content policy: No heap allocation in classification. Extraction allocates only via the caller-provided `out` and `scratch` buffers.
- Base64 gate: O(states × 64) for the dense automaton transition table + O(states) for output markers.
- Cache: Tags + values + packed counts + packed clocks, all cache-line-aligned.
| Component | File |
|---|---|
| Content classification | content_policy/mod.rs |
| Extraction framework | content_policy/extract.rs |
| DotEnv extractor | content_policy/dotenv.rs |
| Ipynb extractor | content_policy/ipynb.rs |
| Java class extractor | content_policy/java_class.rs |
| JAR/WAR extractor | content_policy/jar_war.rs |
| Pyc extractor | content_policy/pyc.rs |
| Base64 gate | b64_yara_gate.rs |
| Gate config/policies | b64_yara_gate.rs |
| Dense automaton | b64_yara_gate.rs |
| Set-associative cache | lsm/set_associative_cache.rs |
| Cache context trait | lsm/set_associative_cache.rs |
| Packed int array | lsm/set_associative_cache.rs |
| Aligned buffer | lsm/set_associative_cache.rs |
| Cache metrics | lsm/set_associative_cache.rs |
| Engine integration | engine/core.rs |
- `engine/core.rs`: Builds `Base64YaraGate` during engine initialization; uses it as a pre-filter before base64 decode.
- `engine/transform.rs`: Streaming decode for URL percent-encoding and base64 payloads (documented in `docs/scanner-engine/engine-transforms.md`).
- `lsm/mod.rs`: LSM tree module that uses the set-associative cache.
- `stdx`: Re-exports `gossip_stdx` utilities including `fastrange` (used for set selection in the cache).