Feat: Ontology-aware validate_extracted_graph with fallback-scope classification (prerequisite for #75)

# Feat: Ontology-aware `validate_extracted_graph(spec, graph)` with fallback-granularity classification

> **Status**: Prerequisite issue, split out from epic [#75](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/issues/75). Independently useful — ships a validator the SDK needs regardless of whether the epic proceeds.

## Goal

Ship a real ontology-aware validator that checks an `ExtractedGraph` against a `ResolvedGraph` (the runtime-facing spec surface), not just container shape. Return a structured `ValidationReport` that classifies each failure by the smallest safe unit of replacement (field / node / edge / event), so downstream consumers can fall back at the right granularity.

## Motivation

### What exists today

- [`extracted_models.py:18`](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/blob/main/src/bigquery_agent_analytics/extracted_models.py) defines `ExtractedProperty.value: Any` — no ontology-level type check.
- `ExtractedGraph` validates container shape (nodes/edges are lists, fields present) but not ontology correctness: unknown field names survive, type mismatches pass, unknown entity types aren't rejected.
- [`ontology_materializer.py:263`](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/blob/main/src/bigquery_agent_analytics/ontology_materializer.py) silently drops unknown fields and lets missing edge keys become empty strings.

The practical consequence: an LLM or hand-written extractor can produce output that "validates" through Pydantic but maps to nonsense rows in the property-graph DDL — empty-string edge keys, orphan edges pointing at `node_id`s the extractor never emitted, properties with string values on int fields.

### Why this is a prerequisite for epic [#75](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/issues/75)

The epic's per-field / per-node / per-edge / per-event fallback model depends on a validator that can actually *detect* each failure class and classify it correctly. Without this issue, the fallback logic has nothing to trigger on.

### Why it's independently useful

The validator is useful outside the epic for:

- **Validating any extraction output** — LLM (`AI.GENERATE`) or hand-written (`structured_extraction.py`) — before materialization. Catches silent data-quality issues today.
- **Testing ontology authoring** — a unit test asserting that a golden `ExtractedGraph` fixture validates cleanly gives ontology authors fast feedback.
- **CI gate on generated fixtures** — new ontologies get a "does a sample extraction validate?" check for free.

## API

The validator's primary input surface is **`ResolvedGraph`** (from [`resolved_spec.py:100`](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/blob/main/src/bigquery_agent_analytics/resolved_spec.py#L100)) — the runtime-facing view that extraction, materialization, DDL compilation, and GQL generation all already consume. `ResolvedGraph` carries physical column names, SDK-normalized types, and primary-key metadata already mapped through the binding, so the validator doesn't need `Ontology` + `Binding` separately.

```python
def validate_extracted_graph(
    spec: ResolvedGraph,
    graph: ExtractedGraph,
) -> ValidationReport: ...
```

For users who hold upstream `Ontology` + `Binding` models instead of a `ResolvedGraph` (e.g., authoring-time validation before binding is finalized), ship a thin adapter:

```python
def validate_extracted_graph_from_ontology(
    ontology: Ontology,
    binding: Binding,
    graph: ExtractedGraph,
) -> ValidationReport:
    """Adapter: resolve(ontology, binding), then delegate."""
    return validate_extracted_graph(resolve(ontology, binding), graph)
```

One validator implementation, one spec surface, one set of tests — with a thin convenience for the `Ontology + Binding` case.

`ValidationReport` carries a list of typed failures. Each failure is tagged with a **fallback scope** so runtime callers (including the compiled-extractor runtime proposed in #75) know the smallest safe unit of replacement.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional

class FallbackScope(str, Enum):
    FIELD = "field"   # one property on one node/edge
    NODE = "node"     # whole node + any edges referencing it
    EDGE = "edge"     # whole edge
    EVENT = "event"   # whole extractor output for this event

@dataclass(frozen=True)
class ValidationFailure:
    scope: FallbackScope
    code: str                           # e.g. "unknown_entity", "type_mismatch", "missing_key"
    path: str                           # e.g. "nodes[3].properties[1].value"
    node_id: Optional[str] = None
    edge_id: Optional[str] = None
    event_id: Optional[str] = None      # populated when the caller tracks extractor-event boundaries
    detail: str = ""
    observed: Any = None
    expected: Any = None

@dataclass(frozen=True)
class ValidationReport:
    failures: list[ValidationFailure]

    @property
    def ok(self) -> bool:
        return not self.failures

    def by_scope(self, scope: FallbackScope) -> list[ValidationFailure]:
        return [f for f in self.failures if f.scope is scope]
```
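
As an illustration of how a runtime caller might consume the fallback scope, the sketch below groups failures into a replacement plan. The enum and failure class are trimmed copies of the definitions above so the block runs on its own. `FallbackPlan` and `plan_fallback` are hypothetical helpers for this example, not part of the proposed API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class FallbackScope(str, Enum):
    FIELD = "field"
    NODE = "node"
    EDGE = "edge"
    EVENT = "event"


@dataclass(frozen=True)
class ValidationFailure:  # trimmed copy of the proposed shape
    scope: FallbackScope
    code: str
    path: str
    node_id: Optional[str] = None
    edge_id: Optional[str] = None


@dataclass
class FallbackPlan:  # hypothetical, for illustration only
    redo_event: bool = False
    drop_nodes: set = field(default_factory=set)
    drop_edges: set = field(default_factory=set)
    redo_fields: list = field(default_factory=list)


def plan_fallback(failures):
    """Group failures by the smallest unit the caller has to replace."""
    plan = FallbackPlan()
    for f in failures:
        if f.scope is FallbackScope.EVENT:
            plan.redo_event = True
        elif f.scope is FallbackScope.NODE:
            plan.drop_nodes.add(f.node_id)
        elif f.scope is FallbackScope.EDGE:
            plan.drop_edges.add(f.edge_id)
        else:  # FallbackScope.FIELD
            plan.redo_fields.append(f.path)
    return plan


failures = [
    ValidationFailure(FallbackScope.FIELD, "type_mismatch",
                      "nodes[0].properties[1].value", node_id="n0"),
    ValidationFailure(FallbackScope.NODE, "unknown_entity", "nodes[3]", node_id="n3"),
    ValidationFailure(FallbackScope.EDGE, "unresolved_endpoint", "edges[2]", edge_id="e2"),
]
plan = plan_fallback(failures)
print(plan.drop_nodes, plan.drop_edges, plan.redo_fields)
```

What the caller does with each bucket (drop, re-extract, fall back to the LLM) stays outside the validator; the report only needs to carry enough identifiers to make that decision.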

## Validation rules

### Entity-level

- **Every `ExtractedNode.entity_name` must match a declared entity in the spec.**
  Failure → `scope=NODE`, `code="unknown_entity"`.
- **`node_id` must be non-empty and unique within the graph.**
  Failure → `scope=NODE`, `code="missing_node_id"` or `"duplicate_node_id"`.
- **Every physical column listed in `ResolvedEntity.key_columns` must appear as a property on the node with a non-empty value.**
  Failure → `scope=NODE`, `code="missing_key"`.

**Key scope is primary-key-only in this issue.** `ResolvedEntity` today carries `key_columns` + `ontology_key_primary` but does *not* carry resolved alternate-key metadata. Adding alternate-key validation first requires extending `ResolvedEntity` to surface alternate keys (resolved to physical columns through the binding); that extension is a separate prerequisite and is called out as a follow-up below. Partial extraction of non-key properties remains valid (see "Required vs optional" below).
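
A minimal sketch of the three entity-level checks, under simplifying assumptions: `Node` is a stand-in for `ExtractedNode` with properties flattened to a dict, the spec is reduced to a mapping from entity name to `key_columns`, key properties are matched by physical column name only, and "non-empty" is taken to mean neither `None` nor `""`. The `Customer` and `Order` names are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:  # simplified stand-in for ExtractedNode
    node_id: str
    entity_name: str
    properties: dict = field(default_factory=dict)  # property name -> value


def check_nodes(key_columns_by_entity, nodes):
    """Return (code, path) pairs for entity-level failures; all are NODE scope."""
    failures = []
    seen = set()
    for i, node in enumerate(nodes):
        path = f"nodes[{i}]"
        if not node.node_id:
            failures.append(("missing_node_id", path))
        elif node.node_id in seen:
            failures.append(("duplicate_node_id", path))
        else:
            seen.add(node.node_id)
        key_columns = key_columns_by_entity.get(node.entity_name)
        if key_columns is None:
            failures.append(("unknown_entity", path))
            continue  # no key metadata to check against
        for column in key_columns:
            if node.properties.get(column) in (None, ""):
                failures.append(("missing_key", path))
                break  # this sketch reports one missing_key per node
    return failures


spec = {"Customer": ("customer_id",)}
nodes = [
    Node("n1", "Customer", {"customer_id": "c-1"}),  # valid; non-key props may be absent
    Node("n1", "Customer", {"customer_id": "c-2"}),  # duplicate node_id
    Node("n2", "Customer", {"customer_id": ""}),     # empty primary key
    Node("n3", "Order", {}),                         # entity not in spec
    Node("", "Customer", {"customer_id": "c-9"}),    # empty node_id
]
result = check_nodes(spec, nodes)
```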

### Property-level (applied to both node and edge properties)

- **Property-name matching.** `ExtractedProperty.name` is matched against `ResolvedProperty.logical_name` first (the ontology-level name an LLM extractor produces) and falls back to `ResolvedProperty.column` (the physical column name from the binding, per [`resolved_spec.py:29`](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/blob/main/src/bigquery_agent_analytics/resolved_spec.py#L29)), so an extractor that emits physical column names directly is also accepted. If the name matches neither `logical_name` nor `column` on any property declared for the entity or relationship in the spec:
  Failure → `scope=FIELD`, `code="unknown_property"`. (Does not force `scope=NODE`; the rest of the node is recoverable.)

- **`value` must satisfy the declared SDK-normalized type** (from `ResolvedProperty.sdk_type`). The validator checks against the seven types that `_PROPERTY_TYPE_TO_SDK` actually emits (see [`resolved_spec.py:125`](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/blob/main/src/bigquery_agent_analytics/resolved_spec.py#L125)):

  | `sdk_type` | Accepted Python value shapes | Notes |
  |---|---|---|
  | `string` | `str` | Also receives values from lossy-normalized ontology `time` and `json` types. |
  | `bytes` | `bytes` | |
  | `int64` | `int` (`bool` is explicitly rejected, despite being an `int` subclass) | From ontology `integer`. |
  | `double` | `int`, `float` | From ontology `double` and `numeric` (lossy, `numeric → double`). |
  | `boolean` | `bool` | From ontology `boolean`. |
  | `date` | `datetime.date` or ISO-8601 `YYYY-MM-DD` string | |
  | `timestamp` | `datetime.datetime` (tz-aware) or ISO-8601 string | From ontology `datetime` and `timestamp` (lossy, `datetime → timestamp`). |

  Failure on any mismatch → `scope=FIELD`, `code="type_mismatch"`.

- **Arrays and structs** are [explicitly deferred in ontology v0](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/blob/main/docs/ontology/ontology.md) — modeled as separate entities + relationships, not as nested properties. If `sdk_type` normalizes an ontology array/struct (it currently doesn't), the validator rejects list- and dict-valued properties as `code="unsupported_type"`, `scope=FIELD`. Today this is a future-proofing check; it triggers only if upstream ontology v1 lands with composite types that the validator hasn't been updated for.

- **Enum membership validation is deferred.** Ontology v0 has no first-class enum type; enums are conventionally represented via `string` + a declared value list in ontology-level property metadata (annotations). `ResolvedProperty` today carries only `column`, `logical_name`, `sdk_type`, `description` ([`resolved_spec.py:29`](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/blob/main/src/bigquery_agent_analytics/resolved_spec.py#L29)) — no annotations / value-list passthrough from the upstream ontology. Adding enum-miss detection requires extending `ResolvedProperty` with (at minimum) an `annotations: dict[str, Any]` or a typed `enum_values: Optional[tuple[str, ...]]` field populated by `resolve()`. That is a separate prerequisite and is called out in the Deferred section below. Not included in this issue's validator.
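
The name-matching order and the type table above could be sketched roughly as follows. This is an illustration under stated assumptions, not the module's final code: resolved properties are plain dicts standing in for `ResolvedProperty`, and `match_property` and `value_matches` are hypothetical names. Three choices go beyond what the table states and are flagged in comments: rejecting `bool` for `double`, rejecting `datetime` values for `date`, and accepting any timestamp string that `datetime.fromisoformat` can parse.

```python
import datetime
import re

_DATE_RE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def _is_iso_date(text):
    if not _DATE_RE.fullmatch(text):
        return False
    try:
        datetime.date.fromisoformat(text)
    except ValueError:
        return False
    return True


def _is_iso_timestamp(text):
    if text.endswith("Z"):  # fromisoformat rejects "Z" before Python 3.11
        text = text[:-1] + "+00:00"
    try:
        datetime.datetime.fromisoformat(text)
    except ValueError:
        return False
    return True


def match_property(resolved_properties, name):
    """Try logical_name first, then the physical column; None if neither matches."""
    for key in ("logical_name", "column"):
        for prop in resolved_properties:
            if prop[key] == name:
                return prop
    return None  # caller reports unknown_property at FIELD scope


def value_matches(sdk_type, value):
    """Return True if value has an accepted shape for sdk_type."""
    is_bool = isinstance(value, bool)
    if sdk_type == "string":
        return isinstance(value, str)
    if sdk_type == "bytes":
        return isinstance(value, bytes)
    if sdk_type == "int64":
        return isinstance(value, int) and not is_bool
    if sdk_type == "double":
        # Assumption: bool is rejected here too; the table only says so for int64.
        return isinstance(value, (int, float)) and not is_bool
    if sdk_type == "boolean":
        return is_bool
    if sdk_type == "date":
        if isinstance(value, str):
            return _is_iso_date(value)
        # Assumption: a datetime (a date subclass) is not accepted as a date.
        return isinstance(value, datetime.date) and not isinstance(value, datetime.datetime)
    if sdk_type == "timestamp":
        if isinstance(value, str):
            return _is_iso_timestamp(value)
        return isinstance(value, datetime.datetime) and value.utcoffset() is not None
    raise ValueError(f"unexpected sdk_type: {sdk_type!r}")


props = [{"logical_name": "amount", "column": "amt", "sdk_type": "double"}]
prop = match_property(props, "amt")          # physical column name also matches
print(value_matches(prop["sdk_type"], 12))   # True
print(value_matches("int64", True))          # False: bool is rejected
```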

### Edge-level

- **`relationship_name` must match a declared relationship in the spec.**
  Failure → `scope=EDGE`, `code="unknown_relationship"`.
- **`from_node_id` and `to_node_id` must resolve to nodes in the graph (or to external node-refs matching the declared endpoint entity).**
  Failure → `scope=EDGE`, `code="unresolved_endpoint"`.
- **Endpoint entity types must match the declared endpoint entities in the spec.**
  Failure → `scope=EDGE`, `code="wrong_endpoint_entity"`.
- **Every physical column in `ResolvedRelationship.from_columns` and `to_columns` must be present on the edge (or on the endpoint nodes it references) with non-empty values.**
  Failure → `scope=EDGE`, `code="missing_endpoint_key"`.
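
A rough sketch of the first three edge-level checks, assuming a simplified `Edge` stand-in for the extracted edge model, a spec reduced to a mapping from relationship name to its declared `(from_entity, to_entity)` pair, and an index from `node_id` to entity name built from the graph's nodes. External node-refs and the `missing_endpoint_key` check are left out to keep the example short. The entity and relationship names are made up.

```python
from dataclasses import dataclass


@dataclass
class Edge:  # simplified stand-in for the extracted edge model
    edge_id: str
    relationship_name: str
    from_node_id: str
    to_node_id: str


def check_edges(endpoints_by_relationship, entity_by_node_id, edges):
    """Return (code, path) pairs for edge-level failures; all are EDGE scope."""
    failures = []
    for i, edge in enumerate(edges):
        path = f"edges[{i}]"
        declared = endpoints_by_relationship.get(edge.relationship_name)
        if declared is None:
            failures.append(("unknown_relationship", path))
            continue  # nothing to compare endpoints against
        actual = (
            entity_by_node_id.get(edge.from_node_id),
            entity_by_node_id.get(edge.to_node_id),
        )
        if None in actual:
            failures.append(("unresolved_endpoint", path))
        elif actual != declared:
            failures.append(("wrong_endpoint_entity", path))
    return failures


relationships = {"PLACED": ("Customer", "Order")}
entity_by_node_id = {"c1": "Customer", "o1": "Order"}
edges = [
    Edge("e0", "PLACED", "c1", "o1"),  # valid
    Edge("e1", "PLACED", "o1", "c1"),  # endpoints reversed
    Edge("e2", "PLACED", "c1", "o9"),  # o9 was never emitted
    Edge("e3", "OWNS", "c1", "o1"),    # relationship not in spec
]
edge_failures = check_edges(relationships, entity_by_node_id, edges)
```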

### Required vs optional

"Required" means **entity primary keys (from `ResolvedEntity.key_columns`) and edge endpoint keys (from `ResolvedRelationship.from_columns` / `to_columns`) only** — not every declared property, and not alternate keys (deferred, see below). A non-key property that isn't present is a valid partial extraction and does not produce a failure. If the ontology model later grows an explicit `required: bool` on non-key properties, the validator extends to cover it; until then, non-key properties are optional by default.

This is a deliberate scope decision to avoid breaking the valid partial-extraction case that hand-written extractors depend on today (e.g., `extract_bka_decision_event` only populates a subset of the `BkaDecision` entity's declared properties).

### Deferred: alternate-key validation

`ResolvedEntity` today exposes only primary-key metadata (`key_columns` + `ontology_key_primary`). Entity alternate keys are declared upstream as `Ontology.Entity.keys.alternate` — a `list[list[str]]`, one inner list per alternate-key tuple (see [`ontology_models.py:109`](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/blob/main/src/bigquery_ontology/ontology_models.py#L109)). (Note: `keys.additional` is the relationship-only uniqueness-without-primary construct and is not the entity alternate-key field.) Entity alternate keys are not resolved to physical columns on `ResolvedEntity`. Before alternate-key validation can land here, `ResolvedEntity` needs a new field — e.g., `alternate_key_columns: tuple[tuple[str, ...], ...]` — built by `resolve()` from `Ontology.Entity.keys.alternate` plus the binding's column mapping. That is a separate prerequisite to this issue. Tracked as a follow-up; not blocking this validator's first landing.
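
To make the shape of that follow-up concrete, here is one way the mapping from `keys.alternate` to physical columns might look. Everything in this sketch is hypothetical: `resolve_alternate_keys` does not exist today, `column_by_property` is a plain dict standing in for the binding's column mapping, and the property and column names are invented.

```python
def resolve_alternate_keys(alternate, column_by_property):
    """Map each alternate-key tuple of logical names to physical columns.

    Raises KeyError if a key property has no bound column.
    """
    return tuple(
        tuple(column_by_property[name] for name in key)
        for key in alternate
    )


# list[list[str]], one inner list per alternate-key tuple
alternate = [["email"], ["tenant", "external_ref"]]
column_by_property = {
    "email": "email_addr",
    "tenant": "tenant_id",
    "external_ref": "ext_ref",
}
alternate_key_columns = resolve_alternate_keys(alternate, column_by_property)
print(alternate_key_columns)
```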

### Deferred: enum-value-list validation

`ResolvedProperty` carries no enum value list today. Adding enum-miss detection requires extending `ResolvedProperty` with either a generic `annotations: dict[str, Any]` channel (and a convention for where enum value lists live inside it) or a typed `enum_values: Optional[tuple[str, ...]]` field, populated by `resolve()` from the upstream ontology property metadata. Same prerequisite shape as alternate keys: a `resolve()` pass extension, tracked as a follow-up. Not blocking this validator's first landing.

## Out of scope

- **Semantic correctness** — whether the extracted values are "right" given the source event. The validator checks structural / ontology conformance only. Semantic correctness is the LLM judge's job.
- **Constraint checks beyond the ontology model** — uniqueness across datasets, referential integrity to data outside the graph, business-rule validation. A separate layer.
- **Auto-repair** — the validator returns failures; it does not patch the graph in-place. Repair is the caller's decision (drop node, re-extract field, fall back to LLM).

## Implementation sketch

- `src/bigquery_agent_analytics/graph_validation.py` (new) — single module with `validate_extracted_graph`, `ValidationReport`, `ValidationFailure`, `FallbackScope`, and a set of per-type scalar validators.
- Pure Python, no BigQuery dependency. Runs client-side against models already loaded in memory.
- Test fixtures in `tests/graph_validation/` — one fixture per failure code, plus positive cases that should produce `ok=True`.
- Existing hand-written extractor output (`extract_bka_decision_event`) runs through the validator in a new integration test; it must produce `ok=True` against its declared entity today (regression guard on any future ontology changes).

## Success criteria

- All failure codes defined under Validation rules (11 as listed in this issue) covered by unit tests (one positive + one negative case each).
- `extract_bka_decision_event`'s current output validates clean against its declared entity — the validator doesn't accidentally break existing code.
- Documentation in `docs/ontology/validation.md` (new) covering the rules, the `ValidationReport` shape, and how callers consume it.
- Public API exported from `bigquery_agent_analytics/__init__.py` so users can call it directly against their own extractor output.

## Non-goals for this issue

- Compiled-extractor runtime (epic #75).
- Integration with `AI.GENERATE` output post-processing (follow-up, once the validator exists).
- Materializer-side repair logic (different concern; materializer can adopt the validator as a gate later).

## Related

- Epic [#75](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/issues/75) — compile-time code generation for structured trace extractors. This issue is P0.1, the hard prerequisite.
- [#58](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/issues/58) — runtime entity-resolution primitives. `ValidationReport.ok` is a natural integration point for the resolver's strict-mode checks.


