Epic: Compile-time code generation for structured trace extractors
Scope revision (based on review feedback, 2026-04-23): the original proposal here treated this as a single implementation. That was too broad. Review found (a) the per-field fallback depended on ontology-level validation that doesn't exist today, (b) the runtime execution target was underspecified — current extraction runs inside BigQuery as server-side AI.GENERATE, not client-side Python, and emitting Python bundles is an architectural choice, not an implementation detail, (c) the proposal overgeneralized the paper's wins on structured workflow tasks to free-text extraction from arbitrary LLM_RESPONSE prose, and (d) the "every trace → LLM call" framing was imprecise — ontology_graph.py already aggregates events per session before the LLM call, only context_graph.py's BizNode path is row-level. Reworking as an epic with scoped phases, prerequisites, and an explicit Phase 1 limited to compiling deterministic extractors from known structured event schemas (where the paper's evidence maps cleanly). Runtime AI.GENERATE stays as the semantic fallback until Phase 1 has measured precision/recall on real traces.
Motivation
arXiv:2604.05150 — Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation (Trooskens et al., 2026, submitted April 6, 2026) treats the LLM as a compile-time code generator whose output is constrained to fill validated templates. Reported results on comparable tasks: 96% completion with zero execution tokens on function-calling (BFCL, n=400), 57× token reduction at 1,000 transactions, 80.4% accuracy on structured document extraction (DocILE, n=5,680).
The paper's evidence is strongest on structured-to-structured transformations — the target schema is known, the input has exploitable shape, and the task repeats at scale. That's exactly the profile of the SDK's structured event extractors (structured_extraction.py). It is not the profile of free-text semantic extraction from LLM_RESPONSE prose, which this issue no longer tries to compile in Phase 1.
Current architecture (accurate characterization)
Three distinct extraction paths, with different costs:
context_graph.py:299 — row-level BQ-native AI.GENERATE inside SQL/MERGE, invoked per row to pull business entities from LLM_RESPONSE text. This is the expensive, open-ended, free-text path.
ontology_graph.py:100, 631 — session-aggregated BQ-native AI.GENERATE. Events for a session are assembled into a transcript inside SQL, then a single AI.GENERATE call produces JSON shaped by the compiled ontology schema. One LLM call per session, not per row — already amortized. This path is what the paper's evidence most closely matches, but it's also the path where the validator and runtime-target prerequisites bite.
structured_extraction.py — pure-Python structured extractors. A registry of typed extractors (e.g., extract_bka_decision_event) that convert specific event shapes into ExtractedNode / ExtractedEdge without calling an LLM at all. Already deterministic. This is the natural Phase 1 expansion point.
The SDK isn't uniformly "one LLM call per trace." It's a three-tier hierarchy — deterministic registry → session-aggregated LLM → row-level LLM — and the right compilation target is the middle tier, only after the prerequisites below are in place.
Prerequisites (Phase 0) — must land before any compilation work
P0.1 — Ontology-aware validator (tracked as its own prerequisite issue)
extracted_models.py:18 defines ExtractedProperty.value: Any. ExtractedGraph validates container shape (nodes/edges are lists, keys present) but not ontology correctness: unknown field names survive, type mismatches pass, unknown entity types aren't rejected. ontology_materializer.py:263 silently drops unknown fields and lets missing edge keys become empty strings.
Ship a real validate_extracted_graph(spec: OntologySpec, graph: ExtractedGraph) -> ValidationReport first. It must check:
- Every ExtractedNode.entity_name matches a declared entity in the spec.
- Every ExtractedProperty.name on a node or edge exists on that entity/relationship in the spec, and value satisfies the declared type (per ontology v0: scalar-ish types plus json; arrays and structs are explicitly deferred in docs/ontology/ontology.md:293 and should be modeled as separate entities + relationships, not as nested properties).
- Every edge's from_node_id and to_node_id resolve to nodes in the graph or to external node-refs matching the declared endpoint entity.
- Required keys are present. Per the current ontology model, "required" means entity primary/alternate keys and edge endpoint keys only — not every declared property. A property that isn't a key may legitimately be absent (partial extraction is valid and common). If the ontology model later grows an explicit required: bool on non-key properties, the validator extends to cover it; until then, non-key properties are optional.
ValidationReport must classify failures by fallback granularity
The report returns a list of issues, each tagged with its fallback scope, because the runtime needs to know the smallest safe unit to replace:
- field — property type mismatch or enum miss on an otherwise well-formed node. Safe fallback unit: re-extract that one property. Compiled path keeps the rest of the node.
- node — missing key, malformed node_id, unknown entity_name. Safe fallback unit: re-extract the whole node (and any edges referencing it by node_id). Cannot recover at field level because the node's identity is broken.
- edge — unresolved from_node_id / to_node_id, missing endpoint key, wrong endpoint entity type. Safe fallback unit: re-extract the whole edge. Cannot recover at field level because the endpoints define the edge's identity.
- event — the compiled extractor's entire output for this event is structurally invalid (zero nodes when the event type guarantees at least one, or every emitted node/edge failed). Safe fallback unit: re-run the entire event through the fallback extractor.
Runtime consumes ValidationReport.failures by scope tag and invokes the smallest-unit fallback. Per-field fallback on a type mismatch does not trigger a whole-event LLM call; a missing node's key does.
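A minimal sketch of how the scope-tagged report could be modeled, assuming Python dataclasses; FallbackScope, ValidationIssue, and plan_fallbacks are illustrative names, not existing SDK API:

```python
from dataclasses import dataclass, field
from enum import Enum


class FallbackScope(str, Enum):
    FIELD = "field"
    NODE = "node"
    EDGE = "edge"
    EVENT = "event"


@dataclass
class ValidationIssue:
    scope: FallbackScope
    message: str
    node_id: str | None = None        # set for node- and field-scope issues
    property_name: str | None = None  # set for field-scope issues only


@dataclass
class ValidationReport:
    failures: list[ValidationIssue] = field(default_factory=list)

    def by_scope(self, scope: FallbackScope) -> list[ValidationIssue]:
        return [f for f in self.failures if f.scope == scope]


def plan_fallbacks(report: ValidationReport) -> dict[FallbackScope, list[ValidationIssue]]:
    """Group issues by the smallest safe unit to replace. An event-scope
    failure subsumes everything narrower, so it short-circuits."""
    event_issues = report.by_scope(FallbackScope.EVENT)
    if event_issues:
        return {FallbackScope.EVENT: event_issues}
    return {s: report.by_scope(s) for s in FallbackScope if report.by_scope(s)}
```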
Why this is a prerequisite: the per-field fallback model in the original proposal assumed "Pydantic validation miss → LLM fallback for that field." With value: Any, Pydantic never misses. There's nothing to fall back on. The validator is what makes per-field fallback meaningful — and, per the granularity model above, also what makes per-node / per-edge / per-event fallback meaningful. The validator is useful independently (validating any extraction, LLM or deterministic, against the ontology). P0.1 ships as its own issue before any compilation work starts.
P0.2 — Runtime execution target decision
The current AI.GENERATE path runs inside BigQuery as server-side SQL. Any compiled extractor has to execute somewhere. Three options, with real tradeoffs:
| Option | Execution location | Latency | Cost model | Complexity |
| --- | --- | --- | --- | --- |
| A. Client-side Python | SDK process pulls events back, runs extractor, re-writes results | BQ round-trip per extraction batch | No BQ LLM cost; client compute time | Lowest — plain Python |
| B. BigQuery Remote Function | Extractor wrapped as a deployed Cloud Run endpoint; called from SQL | In-SQL, one HTTP hop per row/batch | Cloud Run cost + deploy surface | Highest — deploy pipeline, IAM, versioning |
| C. BigQuery SQL / JavaScript UDF | Generated extractor compiled to SQL + UDFs; runs entirely in BQ | In-SQL, no network hop | Slot time only | Middle — UDF translation layer + SQL testability |
This is a design decision the epic must settle before Phase 1 begins, because the "compile target language" changes fundamentally across the three. Recommendation, subject to review:
- Phase 1: Option A (client-side Python). Fastest to build, easiest to test, matches the pattern already used by structured_extraction.py. Accepts a round-trip cost because Phase 1 traces are already being fetched client-side for materialization anyway.
- Phase 2 (if Phase 1 validates): evaluate Option C for the BQ-native path, keeping Option A as the default. Option C unlocks running compiled extractors inside the same SQL that produces the graph tables — matches the current AI.GENERATE-in-SQL pattern and removes the round-trip.
- Option B stays off the table unless there's a concrete user need for it — the deploy surface is disproportionate to the problem.
Why this is a prerequisite: "emit a Python bundle" is only one of three plausible answers and the original proposal silently chose it. The tradeoffs are large enough that the choice needs to be explicit.
Phase 1 — compile deterministic structured extractors from known event schemas
Narrowed from the original scope. Not touching ontology_graph.py's session-aggregated path or context_graph.py's free-text LLM_RESPONSE extraction yet. Phase 1 targets the existing structured_extraction.py registry as the expansion point.
Phase 1 value proposition
Phase 1 does not reduce runtime token cost compared to the current hand-written registry, because both are already zero-token deterministic Python. The AI.GENERATE baseline is only relevant as a ground-truth comparison and as the Phase-0 fallback when no hand-written extractor exists.
What Phase 1 actually delivers:
- Authoring scale — writing a new structured extractor goes from "write + review a Python function" to "declare the event-payload shape and extraction rules, let the compiler emit the function." The LLM's productivity benefit moves from runtime to author time, and stays one-off rather than per-event.
- Safety checks applied uniformly — every compiled extractor runs through the AST validator + ontology validator before acceptance. Hand-written extractors today rely on reviewer diligence; compiled extractors can't skip the gate.
- Reproducibility and fingerprinted provenance — compiled bundles carry a deterministic hash of their inputs (see the Fingerprint section). Two compile runs on the same inputs produce byte-identical code; drift is detected, not assumed.
- Shared validation infrastructure — the smoke-test harness, the ontology validator, and the revalidation job built for Phase 1 are reused verbatim in Phase 2. Phase 1's infrastructure investment is what makes Phase 2 tractable.
Token-cost reporting is still wired in, because it's the measurement that matters for the AI.GENERATE fallback path and for Phase 2's session-aggregated target. Phase 1 just doesn't lead with a token-cost claim.
What gets compiled
The structured extractors that today live as hand-written Python functions (extract_bka_decision_event and siblings). The input is a well-defined event payload shape (fields in content, attributes, content_parts carrying typed values). The output is typed ExtractedNode / ExtractedEdge instances with ontology-valid property values. This is the profile the paper's evidence covers cleanly.
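For concreteness, a hedged sketch of what a compiler-emitted extractor could look like. The payload field names ("decision", "confidence") are hypothetical examples of per-field template fills, and the model classes below are stand-ins; the real ExtractedNode / ExtractedProperty definitions live in extracted_models.py and may differ:

```python
from dataclasses import dataclass
from typing import Any


# Illustrative stand-ins for the SDK's extracted_models types.
@dataclass
class ExtractedProperty:
    name: str
    value: Any


@dataclass
class ExtractedNode:
    entity_name: str
    node_id: str
    properties: list[ExtractedProperty]


def compiled_extract_bka_decision_event(event: dict[str, Any]) -> list[ExtractedNode]:
    """Each property emission is one field-level template fill: a field
    path into the payload plus a cast to the declared scalar type."""
    content = event.get("content") or {}
    props = []
    if (decision := content.get("decision")) is not None:
        props.append(ExtractedProperty("decision", str(decision)))
    if (confidence := content.get("confidence")) is not None:
        props.append(ExtractedProperty("confidence", float(confidence)))
    return [ExtractedNode(
        entity_name="BkaDecision",                    # declared in the ontology
        node_id=f"bka_decision:{event['event_id']}",  # keyed off the event id
        properties=props,
    )]
```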
Compile phase (runs once per (ontology, binding, event_schema, extraction_rules, compiler_version))
1. Input: the ontology YAML, binding YAML, event schema (which event_type values and their expected payload shapes), per-event extraction rules (event_type X → entity Y with properties from fields {...}), and a sample of ≥ 100 real events per covered event_type.
2. LLM step: prompt the LLM to fill validated Jinja templates — one template per field-kind supported by ontology v0: scalar (string, int64, float64, bool, date, datetime, timestamp, time, bytes, numeric, decimal), json (opaque structured blob), and reference (to another entity's key). Arrays and structs are explicitly deferred in ontology v0 (docs/ontology/ontology.md:293) and are modeled as separate entities + relationships, not as nested properties — so Phase 1 templates do not cover them. Enums are represented via json plus an in-template membership check against a declared value list; if the ontology model later grows a first-class enum type, it gets its own template then. Generation is constrained; the LLM cannot emit free-form code, only field-level template fills.
The distinction between the ontology model's field kinds (constrained by ontology v0) and the event payload shape (which may have richer structure, e.g., arrays of tool-result objects in a TOOL_COMPLETED event's content) is handled separately. Event-payload arrays are flattened into repeated ExtractedEdge or ExtractedNode emissions by the extraction rules, not carried through as array properties. event_schema and extraction_rules are the place to declare this shape; they are a separate input to the compiler, not part of the ontology.
3. Static check: AST-validate the emitted code (see the sketch after this list). Reject anything that's not pure-Python, reads from unexpected fields, calls out-of-allowlist functions, or has side effects.
4. Smoke test: run the generated extractor on the sample set. For each event, compare the generated extractor's output to a reference output produced by (a) the hand-written extractor, if one exists, or (b) AI.GENERATE with the same prompt as Phase-0 fallback. Accept iff field-level F1 ≥ threshold (threshold TBD per extractor type; proposed start: 0.95).
5. Ontology validation: run the full validate_extracted_graph from P0.1 over the smoke-test outputs. Any validator failure blocks acceptance.
If any of stages 3–5 fails, the compile run fails and the hand-written or AI.GENERATE extractor continues to be used.
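A minimal sketch of the stage-3 static check, assuming compiled extractors are emitted as plain Python. The allowlists are illustrative; a production check would also vet attribute calls (e.g., .get targets) and assignments:

```python
import ast

# Illustrative allowlists; the real sets come from the extraction rules.
ALLOWED_CALLS = {"str", "int", "float", "bool", "len",
                 "ExtractedNode", "ExtractedEdge", "ExtractedProperty"}
ALLOWED_EVENT_FIELDS = {"event_id", "event_type", "content", "attributes",
                        "content_parts"}


def ast_validate(source: str) -> list[str]:
    """Return a list of problems; an empty list means the code passes stage 3."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        # No imports: the emitted function must be pure and self-contained.
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append("import statement not allowed")
        # Only allowlisted bare-name calls (casts + extracted-model constructors).
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id not in ALLOWED_CALLS:
                problems.append(f"call to '{node.func.id}' not in allowlist")
        # Top-level reads from the event payload must hit declared fields only.
        elif (isinstance(node, ast.Subscript)
              and isinstance(node.value, ast.Name) and node.value.id == "event"
              and isinstance(node.slice, ast.Constant)
              and node.slice.value not in ALLOWED_EVENT_FIELDS):
            problems.append(f"read of undeclared field '{node.slice.value}'")
    return problems
```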
Runtime phase
run_structured_extractors() loads compiled bundles from compiled_extractors/<fingerprint>/ if a bundle matches the active (ontology, binding, event_schema, compiler_version). Otherwise falls back to the hand-written registry entries.
- Per-field fallback (now meaningful because P0.1 ships): if a compiled extractor produces a node where validate_extracted_graph flags a field, that field's value is replaced by the hand-written or LLM-based extraction result for that field only. Logged with a trace-shape signature so the next compile run can cover it.
- No impact on ontology_graph.py or context_graph.py in this phase.
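A sketch of that bundle-resolution order, under the assumption that bundles are plain Python modules exposing an extract entry point; the paths and loader are illustrative, not current SDK API:

```python
import importlib.util
from pathlib import Path
from typing import Callable


def resolve_extractor(event_type: str, fingerprint: str,
                      registry: dict[str, Callable[[dict], list]],
                      root: Path = Path("compiled_extractors")) -> Callable | None:
    """Compiled bundle first, hand-written registry second. A None result
    means the caller falls through to the AI.GENERATE fallback."""
    module_path = root / fingerprint / f"{event_type}.py"
    if module_path.exists():
        spec = importlib.util.spec_from_file_location(event_type, module_path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # bundle passed the stage-3 AST check
        return getattr(module, "extract", None)  # "extract" entry point assumed
    return registry.get(event_type)
```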
Fingerprint (expanded per review)
sha256(ontology, binding, event_schema, event_allowlist, transcript_builder_version, content_serialization_rules, extraction_rules, template_version, compiler_package_version)
Changes in the trace-shape dependencies invalidate the bundle just as surely as changes in the ontology do. Stale bundles refuse to load and the runtime falls back to hand-written / LLM extraction until gm compile-extractors reruns.
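One way the fingerprint could be computed deterministically: canonicalize the inputs to JSON before hashing, so key ordering can't change the digest. Input names follow the sha256(...) expression above; the values below are placeholders:

```python
import hashlib
import json


def bundle_fingerprint(**inputs: object) -> str:
    """Canonical-JSON the inputs before hashing so two compile runs on the
    same inputs always produce the same digest."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"),
                           default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative call with placeholder values for the nine inputs named above.
fp = bundle_fingerprint(
    ontology="<ontology.yaml contents>",
    binding="<binding.yaml contents>",
    event_schema="<event schema contents>",
    event_allowlist=["BKA_DECISION", "TOOL_COMPLETED"],  # hypothetical
    transcript_builder_version="<version>",
    content_serialization_rules="<rules contents>",
    extraction_rules="<rules contents>",
    template_version="<version>",
    compiler_package_version="<version>",
)
```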
Measured outcomes before Phase 2 proceeds
Phase 1 must produce, against a reference ontology + real trace corpus:
- Field-level F1 per compiled extractor, vs hand-written and vs AI.GENERATE.
- Token-cost delta per 1,000 events: compiled (zero LLM tokens) vs hand-written (zero LLM tokens) vs AI.GENERATE (today's baseline).
- Per-event latency delta.
- Rate of per-field fallbacks actually triggered on a holdout trace set.
If F1 < 0.95 or fallback rate > 10%, Phase 2 does not proceed. The structured-extractor compilation has to beat or match the hand-written baseline before taking on the harder free-text case.
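A sketch of the field-level F1 used as the accept gate, assuming extractor outputs are flattened to (node_id, property_name, value) triples; exact-equality matching is an assumption, and real scoring may need value normalization:

```python
def field_f1(compiled: set[tuple], reference: set[tuple]) -> float:
    """F1 over (node_id, property_name, value) triples from one event."""
    if not compiled and not reference:
        return 1.0  # both paths agree the event yields nothing
    tp = len(compiled & reference)
    precision = tp / len(compiled) if compiled else 0.0
    recall = tp / len(reference) if reference else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```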
Phase 2 — compile session-aggregated ontology-graph extractors (only after Phase 1 validates)
Only begun after Phase 1's measurements show the compile→validate loop works. Targets ontology_graph.extract_graph()'s session-aggregated AI.GENERATE path. This is the tier where the paper's "57× token reduction" has the most direct mapping, but it also has more open-ended inputs (the session transcript is not a known schema the way an individual event payload is).
Phase 2 scope is intentionally deferred until Phase 1 data is in. If Phase 1 shows deterministic extractors can match AI.GENERATE on structured events with > 0.95 F1 and < 10% fallback, Phase 2 can reuse the compiler infrastructure. If not, Phase 2 doesn't happen and the epic's scope contracts to Phase 1 only.
The runtime-target decision from P0.2 is re-evaluated here. Option A (client-side Python) may no longer be acceptable at session scale; Option C (SQL UDF) becomes the likely target, since session-aggregated extraction currently runs in SQL.
Phase 3 — free-text extraction (explicitly out of scope)
context_graph.py's row-level AI.GENERATE extraction of business entities from LLM_RESPONSE text is not proposed for compilation in this epic. The paper's evidence doesn't cover this profile, and the review correctly flagged that deterministic generated code is much less convincing there.
A separate issue may propose structured-NLU approaches (entity recognizers, intent classifiers) for that path later. This epic does not commit to it.
Revalidation harness (shared across phases once compilation lands)
Scheduled or on-demand job:
- Sample N recent events matching a covered event schema.
- Run both the compiled extractor and a reference path (hand-written if one exists, else AI.GENERATE).
- Report agreement rate + per-field disagreement table.
- When agreement drops below threshold, surface a recompile recommendation in the SDK's health check: "Compiled extractor for event_type X agreement dropped to 87% over the last 500 events. Recompile recommended."
No auto-recompile — the compile-time LLM call is the trust boundary. Human decision to re-run, backed by the measurement.
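A minimal shape of that job, with the two extraction paths passed in as callables; the callables and exact-equality agreement are stand-ins for whatever the SDK actually exposes:

```python
from typing import Callable, Iterable


def revalidate(event_type: str,
               events: Iterable[dict],
               compiled: Callable[[dict], set],
               reference: Callable[[dict], set],
               threshold: float = 0.95) -> str | None:
    """Return a doctor-check message when agreement drops below threshold,
    None when the compiled extractor is healthy."""
    events = list(events)
    if not events:
        return None
    agree = sum(1 for e in events if compiled(e) == reference(e))
    rate = agree / len(events)
    if rate >= threshold:
        return None
    return (f"Compiled extractor for event_type {event_type} agreement "
            f"dropped to {rate:.0%} over the last {len(events)} events. "
            f"Recompile recommended.")
```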
Risks + mitigations (revised)
- Novel event-payload shapes. Addressed per-field, now meaningful because of P0.1. Revalidation harness catches distributional drift.
- Fingerprint too narrow. Expanded per review to cover trace-shape dependencies; stale bundles fail loud.
- Runtime target changes between phases. P0.2 decision is reviewed again at Phase 2. Client-side Python for Phase 1 keeps the commitment small.
- Phase 1 result doesn't support Phase 2. Epic scope contracts to Phase 1 only. No sunk-cost pressure to force compilation into the session-aggregated or free-text paths.
- Compile cost. Front-loaded, amortized. Uses the unified token-budget config from #69.
- Debugging. Generated bundles are Python; stack traces point at the template that produced the bad extractor. Checked into the repo alongside the ontology or stored as a versioned sidecar dataset.
Open questions
- P0.1 first, or in parallel with Phase 1 scaffolding? Proposal: P0.1 first as a standalone PR, because it's useful independently (validates any extraction output) and unblocks meaningful discussion of the Phase 1 fallback semantics.
- Phase 1 F1 threshold. Proposed 0.95 as the accept bar. Too strict if hand-written baselines themselves don't hit 0.95 on production traces? Worth measuring hand-written baselines first.
- Which event types get compiled first in Phase 1. Today structured_extraction.py has exactly one hand-written extractor: extract_bka_decision_event at structured_extraction.py:120. The first implementation slice is therefore "compile extract_bka_decision_event first," with the hand-written version as the F1 ground truth for its own compiled replacement. Before promoting compilation to a general solution, add one or two more hand-written extractors (likely TOOL_COMPLETED result shapes and a HITL event) as hand-written baselines, so the smoke-test harness has multi-extractor coverage and the F1 metric isn't a single-point measurement.
- Where compiled bundles live. Checked into the SDK-using repo next to ontology.yaml (auditable, reviewable), emitted as a versioned BQ table (runtime-discoverable), or both. Leaning both — in-repo file is source of truth, BQ-table mirror is for runtime discovery.
- Revalidation cadence. Scheduled (daily / weekly) vs on-demand vs triggered by the SDK's doctor check?
Related work in-repo
- #57 — SKOS import support. Phase 1 compiled extractors must cover SKOS-derived abstract entities + skos_-prefixed relationships as part of the template set.
- #58 — Runtime entity-resolution primitives. Shares the fingerprint-versioned-artifact pattern with this epic; compiled extractor bundles should live under the same provenance contract (compile-id in a sidecar table).
- #69 — LLM judger improvements. The unified token-budget config proposed there applies to the compile-time LLM call here. The "compile rubrics into deterministic sub-checks" direction for judges is a parallel application of the same idea, not blocked by this epic.
Reference
- Trooskens G., Karlsberg A., Sharma A., De Brouwer L., Van Puyvelde M., Young M., Thickstun J., Alterovitz G., De Brouwer W. A. Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation. arXiv:2604.05150 (2026). https://arxiv.org/abs/2604.05150