Feat: Ontology-aware validate_extracted_graph(spec, graph) with fallback-granularity classification
Status: Prerequisite issue, split out from epic #75. Independently useful — ships a validator the SDK needs regardless of whether the epic proceeds.
Goal
Ship a real ontology-aware validator that checks an ExtractedGraph against a ResolvedGraph (the runtime-facing spec surface), not just container shape. Return a structured ValidationReport that classifies each failure by the smallest safe unit of replacement (field / node / edge / event), so downstream consumers can fall back at the right granularity.
Motivation
What exists today
- extracted_models.py:18 defines ExtractedProperty.value: Any, with no ontology-level type check.
- ExtractedGraph validates container shape (nodes/edges are lists, fields present) but not ontology correctness: unknown field names survive, type mismatches pass, unknown entity types aren't rejected.
- ontology_materializer.py:263 silently drops unknown fields and lets missing edge keys become empty strings.
The practical consequence: an LLM or hand-written extractor can produce output that "validates" through Pydantic but maps to nonsense rows in the property-graph DDL — empty-string edge keys, orphan edges pointing at node_ids the extractor never emitted, properties with string values on int fields.
Why this is a prerequisite for epic #75
The epic's per-field / per-node / per-edge / per-event fallback model depends on a validator that can actually detect each failure class and classify it correctly. Without this issue, the fallback logic has nothing to trigger on.
Why it's independently useful
The validator is useful outside the epic for:
- Validating any extraction output, whether LLM (AI.GENERATE) or hand-written (structured_extraction.py), before materialization. Catches silent data-quality issues today.
- Testing ontology authoring: a unit test asserting that a golden ExtractedGraph fixture validates cleanly gives ontology authors fast feedback.
- CI gate on generated fixtures — new ontologies get a "does a sample extraction validate?" check for free.
API
The validator's primary input surface is ResolvedGraph (from resolved_spec.py:100) — the runtime-facing view that extraction, materialization, DDL compilation, and GQL generation all already consume. ResolvedGraph carries physical column names, SDK-normalized types, and primary-key metadata already mapped through the binding, so the validator doesn't need Ontology + Binding separately.
def validate_extracted_graph(
    spec: ResolvedGraph,
    graph: ExtractedGraph,
) -> ValidationReport: ...
For users who hold upstream Ontology + Binding models instead of a ResolvedGraph (e.g., authoring-time validation before binding is finalized), ship a thin adapter:
def validate_extracted_graph_from_ontology(
    ontology: Ontology,
    binding: Binding,
    graph: ExtractedGraph,
) -> ValidationReport:
    """Adapter: resolve(ontology, binding), then delegate."""
    return validate_extracted_graph(resolve(ontology, binding), graph)
One validator implementation, one spec surface, one set of tests — with a thin convenience for the Ontology + Binding case.
ValidationReport carries a list of typed failures. Each failure is tagged with a fallback scope so runtime callers (including the compiled-extractor runtime proposed in #75) know the smallest safe unit of replacement.
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class FallbackScope(str, Enum):
    FIELD = "field"  # one property on one node/edge
    NODE = "node"    # whole node + any edges referencing it
    EDGE = "edge"    # whole edge
    EVENT = "event"  # whole extractor output for this event


@dataclass(frozen=True)
class ValidationFailure:
    scope: FallbackScope
    code: str  # e.g. "unknown_entity", "type_mismatch", "missing_key"
    path: str  # e.g. "nodes[3].properties[1].value"
    node_id: Optional[str] = None
    edge_id: Optional[str] = None
    event_id: Optional[str] = None  # populated when the caller tracks extractor-event boundaries
    detail: str = ""
    observed: Any = None
    expected: Any = None


@dataclass(frozen=True)
class ValidationReport:
    failures: list[ValidationFailure]

    @property
    def ok(self) -> bool:
        return not self.failures

    def by_scope(self, scope: FallbackScope) -> list[ValidationFailure]:
        return [f for f in self.failures if f.scope is scope]
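One way a caller might consume the report is to keep only the coarsest failure per node, so that a node already slated for replacement is not also patched field by field. The sketch below is illustrative only: `plan_fallback` is a hypothetical helper that is not part of the proposed API, and the trimmed `ValidationFailure` stand-in carries just the fields the example needs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FallbackScope(str, Enum):
    FIELD = "field"
    NODE = "node"
    EDGE = "edge"
    EVENT = "event"


@dataclass(frozen=True)
class ValidationFailure:
    """Trimmed stand-in for the dataclass above (illustration only)."""
    scope: FallbackScope
    code: str
    path: str
    node_id: Optional[str] = None
    edge_id: Optional[str] = None


def plan_fallback(failures):
    """Hypothetical helper: list (scope, path) actions, coarsest scope wins."""
    # An event-level failure replaces the whole extractor output.
    if any(f.scope is FallbackScope.EVENT for f in failures):
        return [("event", "")]
    replaced_nodes = {
        f.node_id for f in failures
        if f.scope is FallbackScope.NODE and f.node_id
    }
    actions = []
    for f in failures:
        # A field repair on a node that is being replaced is redundant.
        if f.scope is FallbackScope.FIELD and f.node_id in replaced_nodes:
            continue
        actions.append((f.scope.value, f.path))
    return actions


failures = [
    ValidationFailure(FallbackScope.NODE, "missing_key", "nodes[0]", node_id="n0"),
    ValidationFailure(FallbackScope.FIELD, "type_mismatch",
                      "nodes[0].properties[1].value", node_id="n0"),
    ValidationFailure(FallbackScope.FIELD, "unknown_property",
                      "nodes[1].properties[0].name", node_id="n1"),
]
plan = plan_fallback(failures)
```

Here the `type_mismatch` on `n0` is dropped from the plan because the whole node is already being replaced, while the field-level failure on `n1` survives.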
Validation rules
Entity-level
- Every ExtractedNode.entity_name must match a declared entity in the spec.
  Failure → scope=NODE, code="unknown_entity".
- node_id must be non-empty and unique within the graph.
  Failure → scope=NODE, code="missing_node_id" or "duplicate_node_id".
- Every physical column listed in ResolvedEntity.key_columns must appear as a property on the node with a non-empty value.
  Failure → scope=NODE, code="missing_key".
Key scope is primary-key-only in this issue. ResolvedEntity today carries key_columns + ontology_key_primary but does not carry resolved alternate-key metadata. Adding alternate-key validation first requires extending ResolvedEntity to surface alternate keys (resolved to physical columns through the binding); that extension is a separate prerequisite and is called out as a follow-up below. Partial extraction of non-key properties remains valid (see "Required vs optional" below).
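The entity-level rules can be sketched as a single pass over the nodes. This is a minimal illustration under stated assumptions, not the proposed implementation: `Node` is a stand-in for ExtractedNode with properties flattened to a dict, `check_nodes` and `key_columns_by_entity` are hypothetical names, property names are assumed to already be physical column names, and the `Customer` / `Invoice` entities are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for ExtractedNode (illustration only)."""
    node_id: str
    entity_name: str
    properties: dict = field(default_factory=dict)  # property name -> value


def check_nodes(key_columns_by_entity, nodes):
    """Return (code, path) pairs for the entity-level rules."""
    failures = []
    seen = set()
    for i, node in enumerate(nodes):
        path = f"nodes[{i}]"
        if not node.node_id:
            failures.append(("missing_node_id", path))
        elif node.node_id in seen:
            failures.append(("duplicate_node_id", path))
        else:
            seen.add(node.node_id)
        if node.entity_name not in key_columns_by_entity:
            failures.append(("unknown_entity", path))
            continue  # no declared keys to check against
        for column in key_columns_by_entity[node.entity_name]:
            value = node.properties.get(column)
            # Absent, None, and empty-string keys all count as missing.
            if value is None or value == "":
                failures.append(("missing_key", path))
    return failures


spec = {"Customer": ("customer_id",)}
nodes = [
    Node("n1", "Customer", {"customer_id": "c-1", "name": "Ada"}),
    Node("n1", "Customer", {"customer_id": ""}),
    Node("n3", "Invoice"),
]
result = check_nodes(spec, nodes)
```

Every code produced here would carry scope=NODE in the real report. The first node passes even though only its key is guaranteed to be present, which matches the partial-extraction rule.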
Property-level (applied to both node and edge properties)
- Property-name matching. ExtractedProperty.name is matched against ResolvedProperty.logical_name first (the ontology-level name an LLM extractor produces) and falls back to ResolvedProperty.column (the physical column name from the binding, per resolved_spec.py:29), so an extractor that emits physical column names directly is also accepted. If the name matches no property of the entity/relationship in the spec under either form:
  Failure → scope=FIELD, code="unknown_property". (Does not force scope=NODE; the rest of the node is recoverable.)
- value must satisfy the declared SDK-normalized type (from ResolvedProperty.sdk_type). The validator checks against the seven types that _PROPERTY_TYPE_TO_SDK actually emits (see resolved_spec.py:125):

  | sdk_type | Accepted Python value shapes | Notes |
  | --- | --- | --- |
  | string | str | Also receives values from lossy-normalized ontology time and json types. |
  | bytes | bytes | |
  | int64 | int (bool is explicitly rejected despite being an int subclass) | From ontology integer. |
  | double | int, float | From ontology double and numeric (lossy, numeric → double). |
  | boolean | bool | From ontology boolean. |
  | date | datetime.date or ISO-8601 YYYY-MM-DD string | |
  | timestamp | datetime.datetime (tz-aware) or ISO-8601 string | From ontology datetime and timestamp (lossy, datetime → timestamp). |

  Failure on any mismatch → scope=FIELD, code="type_mismatch".
- Arrays and structs are explicitly deferred in ontology v0; they are modeled as separate entities + relationships, not as nested properties. If sdk_type normalizes an ontology array/struct (it currently doesn't), the validator rejects list- and dict-valued properties as code="unsupported_type", scope=FIELD. Today this is a future-proofing check; it triggers only if upstream ontology v1 lands with composite types that the validator hasn't been updated for.
- Enum membership validation is deferred. Ontology v0 has no first-class enum type; enums are conventionally represented via string + a declared value list in ontology-level property metadata (annotations). ResolvedProperty today carries only column, logical_name, sdk_type, description (resolved_spec.py:29), with no annotations / value-list passthrough from the upstream ontology. Adding enum-miss detection requires extending ResolvedProperty with (at minimum) an annotations: dict[str, Any] or a typed enum_values: Optional[tuple[str, ...]] field populated by resolve(). That is a separate prerequisite and is called out in the Deferred section below. Not included in this issue's validator.
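The per-type scalar checks in the table above could look roughly like the following. Treat it as a sketch under assumptions rather than the validator itself: `check_value` and the helper names are hypothetical, rejecting bool for double and rejecting datetime.datetime for date are choices the issue does not spell out, mapping an unrecognized sdk_type to "unsupported_type" is likewise an assumption, and ISO-8601 timestamp strings are parsed with `datetime.fromisoformat`, whose accepted syntax varies by Python version.

```python
import datetime as dt
import re

_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def _is_date(value):
    if isinstance(value, dt.datetime):
        return False  # assumption: a datetime is not a plain date
    if isinstance(value, dt.date):
        return True
    if not isinstance(value, str) or not _DATE_RE.fullmatch(value):
        return False
    try:
        dt.date.fromisoformat(value)  # rejects impossible calendar dates
    except ValueError:
        return False
    return True


def _is_timestamp(value):
    if isinstance(value, dt.datetime):
        return value.utcoffset() is not None  # must be tz-aware
    if not isinstance(value, str):
        return False
    try:
        dt.datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


def _is_int(value):
    # bool is an int subclass, so it has to be excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "bytes": lambda v: isinstance(v, bytes),
    "int64": _is_int,
    "double": lambda v: _is_int(v) or isinstance(v, float),  # assumption: no bool
    "boolean": lambda v: isinstance(v, bool),
    "date": _is_date,
    "timestamp": _is_timestamp,
}


def check_value(sdk_type, value):
    """Return None when value fits sdk_type, otherwise a failure code."""
    check = _CHECKS.get(sdk_type)
    if check is None:
        return "unsupported_type"  # assumption: unrecognized sdk_type
    return None if check(value) else "type_mismatch"
```

For example, `check_value("int64", True)` yields `"type_mismatch"`, and a naive `datetime.datetime` is rejected for `timestamp` while a tz-aware one passes.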
Edge-level
- relationship_name must match a declared relationship in the spec.
  Failure → scope=EDGE, code="unknown_relationship".
- from_node_id and to_node_id must resolve to nodes in the graph (or to external node-refs matching the declared endpoint entity).
  Failure → scope=EDGE, code="unresolved_endpoint".
- Endpoint entity types must match the declared endpoint entities in the spec.
  Failure → scope=EDGE, code="wrong_endpoint_entity".
- Every physical column in ResolvedRelationship.from_columns and to_columns must be present on the edge (or on the endpoint nodes it references) with non-empty values.
  Failure → scope=EDGE, code="missing_endpoint_key".
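A rough sketch of the first three edge-level rules follows. It is illustrative only: `Edge` is a stand-in for an extracted edge, `check_edges` and its two lookup arguments are hypothetical, external node-refs and the endpoint-key rule are left out, and the `PLACED` / `Customer` / `Order` names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    """Stand-in for an extracted edge (illustration only)."""
    edge_id: str
    relationship_name: str
    from_node_id: str
    to_node_id: str


def check_edges(endpoints_by_relationship, entity_by_node_id, edges):
    """Return (code, path) pairs; every code here maps to scope=EDGE.

    endpoints_by_relationship: relationship name -> (from_entity, to_entity)
    entity_by_node_id: node_id -> entity name, for nodes present in the graph
    """
    failures = []
    for i, edge in enumerate(edges):
        path = f"edges[{i}]"
        declared = endpoints_by_relationship.get(edge.relationship_name)
        if declared is None:
            failures.append(("unknown_relationship", path))
            continue  # no declared endpoints to compare against
        ends = (
            ("from_node_id", edge.from_node_id, declared[0]),
            ("to_node_id", edge.to_node_id, declared[1]),
        )
        for attr, node_id, expected_entity in ends:
            actual_entity = entity_by_node_id.get(node_id)
            if actual_entity is None:
                failures.append(("unresolved_endpoint", f"{path}.{attr}"))
            elif actual_entity != expected_entity:
                failures.append(("wrong_endpoint_entity", f"{path}.{attr}"))
    return failures


relationships = {"PLACED": ("Customer", "Order")}
graph_nodes = {"c1": "Customer", "o1": "Order"}
edges = [
    Edge("e1", "PLACED", "c1", "o1"),  # valid
    Edge("e2", "PLACED", "o1", "o1"),  # from-endpoint has the wrong entity
    Edge("e3", "PLACED", "c1", "o9"),  # o9 was never emitted
    Edge("e4", "OWNS", "c1", "o1"),    # relationship not declared
]
result = check_edges(relationships, graph_nodes, edges)
```

The orphan-edge case from the motivation section (an edge pointing at a node_id the extractor never emitted) surfaces here as `unresolved_endpoint` on `edges[2].to_node_id`.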
Required vs optional
"Required" means entity primary keys (from ResolvedEntity.key_columns) and edge endpoint keys (from ResolvedRelationship.from_columns / to_columns) only — not every declared property, and not alternate keys (deferred, see below). A non-key property that isn't present is a valid partial extraction and does not produce a failure. If the ontology model later grows an explicit required: bool on non-key properties, the validator extends to cover it; until then, non-key properties are optional by default.
This is a deliberate scope decision to avoid breaking the valid partial-extraction case that hand-written extractors depend on today (e.g., extract_bka_decision_event only populates a subset of the BkaDecision entity's declared properties).
Deferred: alternate-key validation
ResolvedEntity today exposes only primary-key metadata (key_columns + ontology_key_primary). Entity alternate keys are declared upstream as Ontology.Entity.keys.alternate — a list[list[str]], one inner list per alternate-key tuple (see ontology_models.py:109). (Note: keys.additional is the relationship-only uniqueness-without-primary construct and is not the entity alternate-key field.) Entity alternate keys are not resolved to physical columns on ResolvedEntity. Before alternate-key validation can land here, ResolvedEntity needs a new field — e.g., alternate_key_columns: tuple[tuple[str, ...], ...] — built by resolve() from Ontology.Entity.keys.alternate plus the binding's column mapping. That is a separate prerequisite to this issue. Tracked as a follow-up; not blocking this validator's first landing.
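To make the shape of that follow-up concrete, here is a minimal sketch of how alternate-key tuples might be mapped to physical columns. Everything in it is an assumption: `resolve_alternate_keys` is a hypothetical helper, `column_by_property` stands in for whatever column mapping the binding actually exposes, and the property and column names are invented.

```python
def resolve_alternate_keys(alternate, column_by_property):
    """Map each alternate-key tuple of logical names to physical columns.

    alternate: list[list[str]], one inner list per alternate-key tuple.
    column_by_property: logical property name -> physical column name.
    Raises KeyError if a key property has no bound column.
    """
    return tuple(
        tuple(column_by_property[name] for name in key_tuple)
        for key_tuple in alternate
    )


alternate_key_columns = resolve_alternate_keys(
    [["email"], ["tenant", "username"]],
    {"email": "email_addr", "tenant": "tenant_id", "username": "user_name"},
)
```

The result has the `tuple[tuple[str, ...], ...]` shape suggested above, so it could be stored directly on a frozen ResolvedEntity.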
Deferred: enum-value-list validation
ResolvedProperty carries no enum value list today. Adding enum-miss detection requires extending ResolvedProperty with either a generic annotations: dict[str, Any] channel (and a convention for where enum value lists live inside it) or a typed enum_values: Optional[tuple[str, ...]] field, populated by resolve() from the upstream ontology property metadata. Same prerequisite shape as alternate keys: a resolve() pass extension, tracked as a follow-up. Not blocking this validator's first landing.
Out of scope
- Semantic correctness — whether the extracted values are "right" given the source event. The validator checks structural / ontology conformance only. Semantic correctness is the LLM judge's job.
- Constraint checks beyond the ontology model — uniqueness across datasets, referential integrity to data outside the graph, business-rule validation. A separate layer.
- Auto-repair — the validator returns failures; it does not patch the graph in-place. Repair is the caller's decision (drop node, re-extract field, fall back to LLM).
Implementation sketch
- src/bigquery_agent_analytics/graph_validation.py (new): single module with validate_extracted_graph, ValidationReport, ValidationFailure, FallbackScope, and a set of per-type scalar validators.
- Pure Python, no BigQuery dependency. Runs client-side against models already loaded in memory.
- Test fixtures in tests/graph_validation/: one fixture per failure code, plus positive cases that should produce ok=True.
- Existing hand-written extractor output (extract_bka_decision_event) runs through the validator in a new integration test; it must produce ok=True against its declared entity today (regression guard on any future ontology changes).
Success criteria
- Every failure code listed under "Validation rules" is covered by unit tests (one positive and one negative case each).
- extract_bka_decision_event's current output validates clean against its declared entity, so the validator doesn't accidentally break existing code.
- Documentation in docs/ontology/validation.md (new) covering the rules, the ValidationReport shape, and how callers consume it.
- Public API exported from bigquery_agent_analytics/__init__.py so users can call it directly against their own extractor output.
Non-goals for this issue
- AI.GENERATE output post-processing (follow-up, once the validator exists).
Related
- Epic #75 — compile-time code generation for structured trace extractors. This issue is P0.1, the hard prerequisite.
- #58 (runtime entity-resolution primitives): ValidationReport.ok is a natural integration point for the resolver's strict-mode checks.