Feat: Ontology-aware validate_extracted_graph with fallback-scope classification (prerequisite for #75) #76

@caohy1988

Feat: Ontology-aware validate_extracted_graph(spec, graph) with fallback-granularity classification

Status: Prerequisite issue, split out from epic #75. Independently useful — ships a validator the SDK needs regardless of whether the epic proceeds.

Goal

Ship a real ontology-aware validator that checks an ExtractedGraph against a ResolvedGraph (the runtime-facing spec surface), not just container shape. Return a structured ValidationReport that classifies each failure by the smallest safe unit of replacement (field / node / edge / event), so downstream consumers can fall back at the right granularity.

Motivation

What exists today

  • extracted_models.py:18 defines ExtractedProperty.value: Any — no ontology-level type check.
  • ExtractedGraph validates container shape (nodes/edges are lists, fields present) but not ontology correctness: unknown field names survive, type mismatches pass, unknown entity types aren't rejected.
  • ontology_materializer.py:263 silently drops unknown fields and lets missing edge keys become empty strings.

The practical consequence: an LLM or hand-written extractor can produce output that "validates" through Pydantic but maps to nonsense rows in the property-graph DDL — empty-string edge keys, orphan edges pointing at node_ids the extractor never emitted, properties with string values on int fields.

Why this is a prerequisite for epic #75

The epic's per-field / per-node / per-edge / per-event fallback model depends on a validator that can actually detect each failure class and classify it correctly. Without this issue, the fallback logic has nothing to trigger on.

Why it's independently useful

The validator is useful outside the epic for:

  • Validating any extraction output, whether LLM (AI.GENERATE) or hand-written (structured_extraction.py), before materialization. Catches data-quality issues that pass silently today.
  • Testing ontology authoring: a unit test asserting that a golden ExtractedGraph fixture validates cleanly gives ontology authors fast feedback.
  • CI gate on generated fixtures: new ontologies get a "does a sample extraction validate?" check for free.

API

The validator's primary input surface is ResolvedGraph (from resolved_spec.py:100) — the runtime-facing view that extraction, materialization, DDL compilation, and GQL generation all already consume. ResolvedGraph carries physical column names, SDK-normalized types, and primary-key metadata already mapped through the binding, so the validator doesn't need Ontology + Binding separately.

def validate_extracted_graph(
    spec: ResolvedGraph,
    graph: ExtractedGraph,
) -> ValidationReport: ...

For users who hold upstream Ontology + Binding models instead of a ResolvedGraph (e.g., authoring-time validation before binding is finalized), ship a thin adapter:

def validate_extracted_graph_from_ontology(
    ontology: Ontology,
    binding: Binding,
    graph: ExtractedGraph,
) -> ValidationReport:
    """Adapter: resolve(ontology, binding), then delegate."""
    return validate_extracted_graph(resolve(ontology, binding), graph)

One validator implementation, one spec surface, one set of tests — with a thin convenience for the Ontology + Binding case.

ValidationReport carries a list of typed failures. Each failure is tagged with a fallback scope so runtime callers (including the compiled-extractor runtime proposed in #75) know the smallest safe unit of replacement.

from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional

class FallbackScope(str, Enum):
    FIELD = "field"   # one property on one node/edge
    NODE = "node"     # whole node + any edges referencing it
    EDGE = "edge"     # whole edge
    EVENT = "event"   # whole extractor output for this event

@dataclass(frozen=True)
class ValidationFailure:
    scope: FallbackScope
    code: str                           # e.g. "unknown_entity", "type_mismatch", "missing_key"
    path: str                           # e.g. "nodes[3].properties[1].value"
    node_id: Optional[str] = None
    edge_id: Optional[str] = None
    event_id: Optional[str] = None      # populated when the caller tracks extractor-event boundaries
    detail: str = ""
    observed: Any = None
    expected: Any = None

@dataclass(frozen=True)
class ValidationReport:
    failures: list[ValidationFailure]

    @property
    def ok(self) -> bool:
        return not self.failures

    def by_scope(self, scope: FallbackScope) -> list[ValidationFailure]:
        return [f for f in self.failures if f.scope is scope]
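
As an illustration of how a caller might consume the report, here is a minimal sketch that turns failures into a replacement plan at the smallest safe granularity. The three report types are copied from the definitions above so the block runs on its own; FallbackPlan and plan_fallback are hypothetical names used only for this example and are not part of the proposed API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class FallbackScope(str, Enum):
    FIELD = "field"
    NODE = "node"
    EDGE = "edge"
    EVENT = "event"


@dataclass(frozen=True)
class ValidationFailure:
    scope: FallbackScope
    code: str
    path: str
    node_id: Optional[str] = None
    edge_id: Optional[str] = None
    event_id: Optional[str] = None
    detail: str = ""
    observed: Any = None
    expected: Any = None


@dataclass(frozen=True)
class ValidationReport:
    failures: list[ValidationFailure]

    @property
    def ok(self) -> bool:
        return not self.failures

    def by_scope(self, scope: FallbackScope) -> list[ValidationFailure]:
        return [f for f in self.failures if f.scope is scope]


@dataclass
class FallbackPlan:  # hypothetical helper type, for illustration only
    replace_event: bool = False
    node_ids: set[str] = field(default_factory=set)
    edge_ids: set[str] = field(default_factory=set)
    field_paths: list[str] = field(default_factory=list)


def plan_fallback(report: ValidationReport) -> FallbackPlan:
    """Pick the coarsest scope that is actually required, and nothing coarser."""
    plan = FallbackPlan()
    if report.by_scope(FallbackScope.EVENT):
        # An event-level failure supersedes everything else.
        plan.replace_event = True
        return plan
    plan.node_ids = {
        f.node_id for f in report.by_scope(FallbackScope.NODE) if f.node_id
    }
    plan.edge_ids = {
        f.edge_id for f in report.by_scope(FallbackScope.EDGE) if f.edge_id
    }
    # Field failures on a node that is already being replaced are redundant.
    plan.field_paths = [
        f.path
        for f in report.by_scope(FallbackScope.FIELD)
        if f.node_id not in plan.node_ids
    ]
    return plan


report = ValidationReport(failures=[
    ValidationFailure(FallbackScope.NODE, "unknown_entity", "nodes[0]", node_id="n1"),
    ValidationFailure(FallbackScope.FIELD, "type_mismatch",
                      "nodes[0].properties[1].value", node_id="n1"),
    ValidationFailure(FallbackScope.FIELD, "type_mismatch",
                      "nodes[2].properties[0].value", node_id="n3"),
    ValidationFailure(FallbackScope.EDGE, "unresolved_endpoint", "edges[0]", edge_id="e1"),
])
plan = plan_fallback(report)
```

The sketch does not expand a NODE failure to the edges that reference the node; a real consumer would need the graph itself to do that.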

Validation rules

Entity-level

  • Every ExtractedNode.entity_name must match a declared entity in the spec.
    Failure → scope=NODE, code="unknown_entity".
  • node_id must be non-empty and unique within the graph.
    Failure → scope=NODE, code="missing_node_id" or "duplicate_node_id".
  • Every physical column listed in ResolvedEntity.key_columns must appear as a property on the node with a non-empty value.
    Failure → scope=NODE, code="missing_key".

Key scope is primary-key-only in this issue. ResolvedEntity today carries key_columns + ontology_key_primary but does not carry resolved alternate-key metadata. Adding alternate-key validation first requires extending ResolvedEntity to surface alternate keys (resolved to physical columns through the binding); that extension is a separate prerequisite and is called out as a follow-up below. Partial extraction of non-key properties remains valid (see "Required vs optional" below).
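
The entity-level rules above could be sketched roughly as follows. This is an illustration under simplifying assumptions, not the proposed implementation: Node is a hypothetical stand-in for ExtractedNode with properties flattened to a dict, and the spec is reduced to a mapping from entity name to its primary-key columns (standing in for ResolvedEntity.key_columns). The entity and column names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:  # hypothetical stand-in for ExtractedNode
    node_id: str
    entity_name: str
    properties: dict[str, Any] = field(default_factory=dict)


def check_entities(
    key_columns_by_entity: dict[str, tuple[str, ...]],
    nodes: list[Node],
) -> list[tuple[str, str]]:
    """Return (code, path) pairs; each would be reported with scope=NODE."""
    failures: list[tuple[str, str]] = []
    seen_ids: set[str] = set()
    for i, node in enumerate(nodes):
        path = f"nodes[{i}]"
        if not node.node_id:
            failures.append(("missing_node_id", path))
        elif node.node_id in seen_ids:
            failures.append(("duplicate_node_id", path))
        else:
            seen_ids.add(node.node_id)
        if node.entity_name not in key_columns_by_entity:
            failures.append(("unknown_entity", path))
            continue  # no declared key columns to check against
        for column in key_columns_by_entity[node.entity_name]:
            value = node.properties.get(column)
            # Only primary-key columns are required; absent non-key
            # properties are a valid partial extraction.
            if value is None or value == "":
                failures.append(("missing_key", path))
    return failures


spec_keys = {"Order": ("order_id",)}  # invented entity and key column
nodes = [
    Node("o1", "Order", {"order_id": "o1"}),
    Node("o1", "Order", {"order_id": ""}),
    Node("", "Mystery"),
]
failures = check_entities(spec_keys, nodes)
```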

Property-level (applied to both node and edge properties)

  • Property-name matching. ExtractedProperty.name is matched against ResolvedProperty.logical_name first (the ontology-level name an LLM extractor produces) and falls back to ResolvedProperty.column (the physical column name from the binding, per resolved_spec.py:29), so an extractor that emits physical column names directly is also accepted. If the name matches neither the logical_name nor the column of any property declared on the entity/relationship in the spec:
    Failure → scope=FIELD, code="unknown_property". (Does not force scope=NODE; the rest of the node is recoverable.)

  • value must satisfy the declared SDK-normalized type (from ResolvedProperty.sdk_type). The validator checks against the seven types that _PROPERTY_TYPE_TO_SDK actually emits (see resolved_spec.py:125):

    sdk_type  | Accepted Python value shapes                                     | Notes
    ----------|------------------------------------------------------------------|------
    string    | str                                                              | Also receives values from lossy-normalized ontology time and json types.
    bytes     | bytes                                                            |
    int64     | int (bool is explicitly rejected, despite being an int subclass) | From ontology integer.
    double    | int, float                                                       | From ontology double and numeric (lossy, numeric → double).
    boolean   | bool                                                             | From ontology boolean.
    date      | datetime.date or ISO-8601 YYYY-MM-DD string                      |
    timestamp | datetime.datetime (tz-aware) or ISO-8601 string                  | From ontology datetime and timestamp (lossy, datetime → timestamp).

    Failure on any mismatch → scope=FIELD, code="type_mismatch".

  • Arrays and structs are explicitly deferred in ontology v0 — modeled as separate entities + relationships, not as nested properties. If sdk_type normalizes an ontology array/struct (it currently doesn't), the validator rejects list- and dict-valued properties as code="unsupported_type", scope=FIELD. Today this is a future-proofing check; it triggers only if upstream ontology v1 lands with composite types that the validator hasn't been updated for.

  • Enum membership validation is deferred. Ontology v0 has no first-class enum type; enums are conventionally represented via string + a declared value list in ontology-level property metadata (annotations). ResolvedProperty today carries only column, logical_name, sdk_type, description (resolved_spec.py:29) — no annotations / value-list passthrough from the upstream ontology. Adding enum-miss detection requires extending ResolvedProperty with (at minimum) an annotations: dict[str, Any] or a typed enum_values: Optional[tuple[str, ...]] field populated by resolve(). That is a separate prerequisite and is called out in the Deferred section below. Not included in this issue's validator.
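
The per-type scalar checks in the table above might look something like the sketch below. check_value and its helpers are hypothetical names. Several choices are assumptions that go beyond what the table states: bool is rejected for double as well as int64, a datetime.datetime instance is rejected for date, ISO-8601 timestamp strings are not required to carry a timezone, and an unrecognized sdk_type is reported as unsupported_type.

```python
import datetime
import re
from typing import Any, Callable, Optional

_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def _is_date(value: Any) -> bool:
    if isinstance(value, datetime.datetime):
        return False  # assumption: a full datetime is not a date
    if isinstance(value, datetime.date):
        return True
    if isinstance(value, str) and _DATE_RE.fullmatch(value):
        try:
            datetime.date.fromisoformat(value)  # rejects e.g. 2024-02-30
        except ValueError:
            return False
        return True
    return False


def _is_timestamp(value: Any) -> bool:
    if isinstance(value, datetime.datetime):
        # tz-aware means utcoffset() returns a real offset.
        return value.tzinfo is not None and value.utcoffset() is not None
    if isinstance(value, str):
        # Older Pythons do not parse a trailing "Z", so normalize it.
        text = value[:-1] + "+00:00" if value.endswith("Z") else value
        try:
            datetime.datetime.fromisoformat(text)
        except ValueError:
            return False
        return True
    return False


_CHECKS: dict[str, Callable[[Any], bool]] = {
    "string": lambda v: isinstance(v, str),
    "bytes": lambda v: isinstance(v, bytes),
    "int64": lambda v: isinstance(v, int) and not isinstance(v, bool),
    # Assumption: bool is rejected here too, although the table only
    # says so explicitly for int64.
    "double": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "date": _is_date,
    "timestamp": _is_timestamp,
}


def check_value(sdk_type: str, value: Any) -> Optional[str]:
    """Return a failure code (scope=FIELD), or None if the value is accepted."""
    if isinstance(value, (list, dict)):
        return "unsupported_type"  # composite values are deferred in ontology v0
    check = _CHECKS.get(sdk_type)
    if check is None:
        return "unsupported_type"  # assumption: unknown sdk_type is not a mismatch
    return None if check(value) else "type_mismatch"
```

For example, check_value("int64", True) yields "type_mismatch", while check_value("double", 3) is accepted because the table allows int for double.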

Edge-level

  • relationship_name must match a declared relationship in the spec.
    Failure → scope=EDGE, code="unknown_relationship".
  • from_node_id and to_node_id must resolve to nodes in the graph (or to external node-refs matching the declared endpoint entity).
    Failure → scope=EDGE, code="unresolved_endpoint".
  • Endpoint entity types must match the declared endpoint entities in the spec.
    Failure → scope=EDGE, code="wrong_endpoint_entity".
  • Every physical column in ResolvedRelationship.from_columns and to_columns must be present on the edge (or on the endpoint nodes it references) with non-empty values.
    Failure → scope=EDGE, code="missing_endpoint_key".
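
A rough sketch of the first three edge-level rules, under simplifying assumptions: Edge is a hypothetical stand-in for the extracted edge model, the spec is reduced to a mapping from relationship name to its (from entity, to entity) pair, and nodes are reduced to a mapping from node_id to entity name. External node-refs and the missing_endpoint_key rule are left out. The relationship and entity names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Edge:  # hypothetical stand-in for the extracted edge model
    relationship_name: str
    from_node_id: str
    to_node_id: str


def check_edges(
    endpoints_by_relationship: dict[str, tuple[str, str]],
    entity_by_node_id: dict[str, str],
    edges: list[Edge],
) -> list[tuple[str, str]]:
    """Return (code, path) pairs; each would be reported with scope=EDGE."""
    failures: list[tuple[str, str]] = []
    for i, edge in enumerate(edges):
        path = f"edges[{i}]"
        declared = endpoints_by_relationship.get(edge.relationship_name)
        if declared is None:
            failures.append(("unknown_relationship", path))
            continue  # no declared endpoints to compare against
        sides = (
            ("from_node_id", edge.from_node_id, declared[0]),
            ("to_node_id", edge.to_node_id, declared[1]),
        )
        for attr, node_id, expected_entity in sides:
            actual_entity = entity_by_node_id.get(node_id)
            if actual_entity is None:
                # Orphan edge: points at a node the extractor never emitted.
                failures.append(("unresolved_endpoint", f"{path}.{attr}"))
            elif actual_entity != expected_entity:
                failures.append(("wrong_endpoint_entity", f"{path}.{attr}"))
    return failures


relationships = {"PLACED": ("Customer", "Order")}  # invented names
node_entities = {"c1": "Customer", "o1": "Order"}
edges = [
    Edge("PLACED", "c1", "o1"),
    Edge("PLACED", "c1", "ghost"),
    Edge("PLACED", "o1", "c1"),
    Edge("SHIPPED", "c1", "o1"),
]
edge_failures = check_edges(relationships, node_entities, edges)
```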

Required vs optional

"Required" means entity primary keys (from ResolvedEntity.key_columns) and edge endpoint keys (from ResolvedRelationship.from_columns / to_columns) only — not every declared property, and not alternate keys (deferred, see below). A non-key property that isn't present is a valid partial extraction and does not produce a failure. If the ontology model later grows an explicit required: bool on non-key properties, the validator extends to cover it; until then, non-key properties are optional by default.

This is a deliberate scope decision to avoid breaking the valid partial-extraction case that hand-written extractors depend on today (e.g., extract_bka_decision_event only populates a subset of the BkaDecision entity's declared properties).

Deferred: alternate-key validation

ResolvedEntity today exposes only primary-key metadata (key_columns + ontology_key_primary). Entity alternate keys are declared upstream as Ontology.Entity.keys.alternate — a list[list[str]], one inner list per alternate-key tuple (see ontology_models.py:109). (Note: keys.additional is the relationship-only uniqueness-without-primary construct and is not the entity alternate-key field.) Entity alternate keys are not resolved to physical columns on ResolvedEntity. Before alternate-key validation can land here, ResolvedEntity needs a new field — e.g., alternate_key_columns: tuple[tuple[str, ...], ...] — built by resolve() from Ontology.Entity.keys.alternate plus the binding's column mapping. That is a separate prerequisite to this issue. Tracked as a follow-up; not blocking this validator's first landing.

Deferred: enum-value-list validation

ResolvedProperty carries no enum value list today. Adding enum-miss detection requires extending ResolvedProperty with either a generic annotations: dict[str, Any] channel (and a convention for where enum value lists live inside it) or a typed enum_values: Optional[tuple[str, ...]] field, populated by resolve() from the upstream ontology property metadata. Same prerequisite shape as alternate keys: a resolve() pass extension, tracked as a follow-up. Not blocking this validator's first landing.

Out of scope

  • Semantic correctness — whether the extracted values are "right" given the source event. The validator checks structural / ontology conformance only. Semantic correctness is the LLM judge's job.
  • Constraint checks beyond the ontology model — uniqueness across datasets, referential integrity to data outside the graph, business-rule validation. A separate layer.
  • Auto-repair — the validator returns failures; it does not patch the graph in-place. Repair is the caller's decision (drop node, re-extract field, fall back to LLM).

Implementation sketch

  • src/bigquery_agent_analytics/graph_validation.py (new) — single module with validate_extracted_graph, ValidationReport, ValidationFailure, FallbackScope, and a set of per-type scalar validators.
  • Pure Python, no BigQuery dependency. Runs client-side against models already loaded in memory.
  • Test fixtures in tests/graph_validation/ — one fixture per failure code, plus positive cases that should produce ok=True.
  • Existing hand-written extractor output (extract_bka_decision_event) runs through the validator in a new integration test; it must produce ok=True against its declared entity today (regression guard on any future ontology changes).

Success criteria

  • All failure codes defined above (11 in the current rule set) covered by unit tests (one positive + one negative case each).
  • extract_bka_decision_event's current output validates clean against its declared entity — the validator doesn't accidentally break existing code.
  • Documentation in docs/ontology/validation.md (new) covering the rules, the ValidationReport shape, and how callers consume it.
  • Public API exported from bigquery_agent_analytics/__init__.py so users can call it directly against their own extractor output.

Non-goals for this issue

Related

  • Epic #75 — compile-time code generation for structured trace extractors. This issue is P0.1, the hard prerequisite.
  • #58 — runtime entity-resolution primitives. ValidationReport.ok is a natural integration point for the resolver's strict-mode checks.
