
Feat: Runtime entity resolution primitives — OntologyRuntime, concept index, EntityResolver protocol (design proposal — feedback wanted) #58

@caohy1988


Status: Design proposal. Not yet implemented. Comments welcome — especially on the "Open questions" section at the bottom and on the agent-vs-SDK boundary.

Goal

Today the SDK's ontology pipeline stops at DDL compilation (gm compile emits CREATE PROPERTY GRAPH + table scaffolding). Runtime — the point where an agent receives a user/client input like format_ids: ["display_static"] or geo: ["San Francisco-Stockton-Modesto"] and needs to resolve it against a declared ontology — is left entirely to the application layer.

Feedback from a production user building agentic media buying on top of this SDK quantified the gap: ~85% of brief-validation value for their use case sits at runtime, not schema time. They implemented a 5-layer resolver (notation match → lexical → token-set equality → Jaccard → Levenshtein) on top of ~10K lines of TTL (274 SKOS concepts, 942 synonyms, 210 GAM DMA display names). It works — but every vertical building on the SDK will rewrite some version of this, and today there is no supported runtime surface for them to build against.

This issue proposes a small, opinion-light set of runtime primitives that make resolution implementable in application code without pushing domain-specific matching logic into the SDK core.

Guiding principle: SDK provides, agent decides

The SDK and the agent layer make different kinds of claims:

  • SDK: knows what's declared in the ontology. Entities, relationships, synonyms, notations, concept schemes, taxonomy structure. Stable, typed, queryable.
  • Agent: knows what's intended. Which matcher to try first, what confidence threshold is safe for this domain, how to phrase a "did you mean" suggestion, whether fuzzy match on a free-text company_name is acceptable or dangerous.

Consequences for the runtime:

  1. The SDK exposes read access over loaded ontologies (annotations, synonyms, scheme membership, taxonomy edges). No matching logic.
  2. The SDK optionally materializes an ontology-derived concept index into BigQuery so agents can do SQL-native fuzzy match using BQ's existing EDIT_DISTANCE / SOUNDEX / UDFs.
  3. The SDK defines an EntityResolver protocol and ships two trivial references (ExactMatchResolver, SynonymResolver). Anything beyond exact-match lives outside core.
  4. Domain-specific resolvers (advertising, healthcare, finance) live in contrib/ or user code, never in the runtime's required surface.

The SDK stays general. Verticals get a contract to build against instead of reaching into YAML or reconstructing structure from BQ tables.

Current gaps

  1. No runtime accessor over loaded ontologies. load_ontology() returns Pydantic models, but there's no shape-agnostic API like rt.synonyms("DMA") or rt.annotation("DMA", "skos:notation"). Agents parse the model directly, which couples them to schema details the SDK otherwise hides.

  2. Annotations are not queryable at runtime. Issue #57 (Feat: SKOS import support alongside OWL) proposes persisting SKOS annotations (skos:definition, skos:notation, skos:prefLabel, etc.) through import. Nothing today reads those annotations at runtime. They live in the YAML and die there.

  3. No concept index. Synonyms and notations are scattered across per-entity YAML nodes. Agents that want to do SQL-level matching have to flatten this themselves at query time on every request.

  4. No resolver interface. Every SDK user writes their own resolution entry point, with their own return type, with their own "did you mean" shape. No convention, no reuse.

Proposed primitives

1. OntologyRuntime — read accessor over loaded ontology + binding

Small, stateless, zero external dependencies at read time. Built on top of existing load_ontology() + load_binding().

from bigquery_agent_analytics import OntologyRuntime

rt = OntologyRuntime.load(
    ontology_path="ontology.yaml",
    binding_path="binding.yaml",
)

rt.entities()                              # list[str]
rt.entity("DMA")                           # Entity with annotations + synonyms
rt.synonyms("DMA")                         # ["Designated Market Area", ...]
rt.annotation("DMA", "skos:notation")      # "807"
rt.in_scheme("NielsenDMA")                 # list[Entity] — all concepts in scheme
rt.broader("RetailBanking")                # list[Entity] — skos:broader targets
rt.narrower("Banking")                     # inverse
rt.related("Account")                      # skos:related abstract-relationship targets

Design notes:

  • Only reads. Never mutates ontology or binding.
  • Covers both concrete and abstract (SKOS-derived) entities and relationships. Abstract elements are first-class at the runtime layer — they're the whole reason users care about SKOS at runtime.
  • Works against the annotations produced by issue #57's SKOS import without coupling to SKOS specifically. rt.annotation(name, key) treats skos:definition, owl:equivalentClass, or a user's custom annotation identically.

Identity rules (important after #57 lands):

  • Entities are name-addressed. rt.entity(name), rt.synonyms(name), rt.annotation(name, key) are singular lookups — entity names remain globally unique.
  • Relationships are traversal-first, not name-addressed. Issue #57 relaxes relationship uniqueness to (name, from, to) for abstract relationships, so a single skos_broader can repeat across endpoint pairs. A hypothetical rt.relationship(name) would be unsafe because it has no single answer to return.
  • All relationship accessors take an entity and traverse. rt.broader(entity), rt.narrower(entity), rt.related(entity) return the set of entities reachable from the given starting point via the named predicate. That's a well-defined question regardless of how many skos_broader edges exist in the ontology.
  • If a relationship-by-name accessor is ever added, its contract must be compound identity (rt.relationship(name, from, to) -> Relationship | None) or list-returning (rt.relationships(name) -> list[Relationship]). Never singular-by-name.
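
A toy sketch of the traversal-first contract, with abstract relationships stored as the (name, from, to) triples issue #57 gives them. The edge data is invented for illustration:

```python
# Edges as (relationship_name, from_entity, to_entity) triples -- the
# compound identity abstract relationships get under issue #57. Toy data.
EDGES = [
    ("skos_broader", "RetailBanking", "Banking"),
    ("skos_broader", "InvestmentBanking", "Banking"),
    ("skos_related", "Account", "Customer"),
]


def broader(entity: str, edges=EDGES) -> list[str]:
    """Entities reachable from `entity` via skos_broader -- well-defined even
    though the relationship name repeats across endpoint pairs."""
    return [to for (name, frm, to) in edges if name == "skos_broader" and frm == entity]


def narrower(entity: str, edges=EDGES) -> list[str]:
    """Inverse traversal: entities whose skos_broader target is `entity`."""
    return [frm for (name, frm, to) in edges if name == "skos_broader" and to == entity]
```

Note that both accessors return lists: a singular relationship-by-name lookup has no well-defined answer, but traversal from a fixed entity always does.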

2. Concept index materialization (opt-in)

At gm compile time, optionally emit a BigQuery sidecar table:

CREATE TABLE `{dataset}.ontology_concept_index` (
  entity_name STRING NOT NULL,
  label STRING NOT NULL,         -- for label_kind='notation', this holds the notation value
  label_kind STRING NOT NULL,    -- 'name' | 'pref' | 'alt' | 'hidden' | 'synonym' | 'notation'
  notation STRING,               -- per-entity notation for display; repeats across rows of the same entity
  scheme STRING,                 -- concept scheme this row's entity belongs to;
                                 -- NULL means "entity is not a member of any scheme"
  language STRING,               -- ISO-639 tag; NULL means unspecified or N/A (notation rows)
  is_abstract BOOL NOT NULL,     -- TRUE for SKOS-derived informational entities
  compile_id STRING NOT NULL     -- pair-consistency tag; see "Provenance and compatibility contract"
);

notation is a first-class row kind. For every entity that has a skos:notation, the compiler emits a row with label_kind='notation' and label=<notation value> — so resolvers searching by label naturally catch notation matches without a separate OR notation = @input predicate. The notation column is kept as per-entity metadata that repeats across all rows of the same entity, for display convenience (a caller with a winning match can read the entity's notation directly from the candidate row without a separate lookup).

Row multiplicity contract:

  • One row per (entity_name, label, label_kind, language, scheme) membership tuple. A SKOS concept can legally belong to multiple skos:inScheme schemes (a DMA concept may be in both NielsenDMA and CensusMSA, a banking concept may be in both BankingTaxonomy and FinancialProductsTaxonomy). This is denormalized — a concept in 3 schemes × 5 labels produces 15 rows. Intentional; see below.
  • Entities that aren't members of any scheme produce rows with scheme IS NULL. They're still in the index; entity= resolution finds them, scheme= resolution skips them.
  • notation is per-entity (not per-scheme), so it repeats across membership rows for the same entity. Callers selecting a single notation per entity use DISTINCT notation or aggregate.

Why denormalized rather than ARRAY<STRING> scheme or a separate membership table:

  • WHERE scheme = @x stays a trivial clustered lookup — critical for the common scheme=<name> resolver path.
  • Predicate push-down into BQ clustering is straightforward; the clustering key (scheme, entity_name) stays usable.
  • ARRAY<STRING> forces WHERE @x IN UNNEST(scheme) on every scheme-scoped query, which is less indexable and harder for less-experienced SQL callers to write correctly.
  • A separate membership table adds a join to every resolver query, defeats the "one-table SQL lookup" simplicity that motivates the index.
  • Row multiplication is bounded: even for pathological multi-scheme ontologies, row count is linear in (concepts × labels × schemes), which stays tractable at BQ scale.

Agents do fuzzy match in SQL:

-- exact, scheme-scoped (the common case)
SELECT DISTINCT entity_name
FROM ontology_concept_index
WHERE scheme = @scheme AND LOWER(label) = LOWER(@input);

-- fuzzy fallback with BQ native functions
SELECT entity_name, MIN(EDIT_DISTANCE(LOWER(label), LOWER(@input))) AS dist
FROM ontology_concept_index
WHERE scheme = @scheme
  AND EDIT_DISTANCE(LOWER(label), LOWER(@input)) <= 3
GROUP BY entity_name
ORDER BY dist ASC
LIMIT 5;

The DISTINCT/GROUP BY on entity_name is how callers collapse the denormalized rows back to one result per matched concept.
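
The collapse can be illustrated in pure Python over an in-memory stand-in for the index rows (toy data; the real lookup is the SQL above). Note that the notation row is caught by an ordinary label search, as described earlier:

```python
# In-memory stand-in for ontology_concept_index rows (toy data):
# (entity_name, label, label_kind, scheme)
ROWS = [
    ("DMA", "DMA", "name", "NielsenDMA"),
    ("DMA", "Designated Market Area", "synonym", "NielsenDMA"),
    ("DMA", "807", "notation", "NielsenDMA"),
    ("BayAreaMetro", "Bay Area Metro", "pref", "NielsenDMA"),
    ("BayAreaMetro", "Bay Area Metro", "pref", "CensusMSA"),
]


def resolve_exact(rows, value: str, scheme: str) -> list[str]:
    """Scheme-scoped, case-insensitive exact match, collapsed to one result
    per entity -- the Python analogue of SELECT DISTINCT entity_name."""
    seen: set[str] = set()
    out: list[str] = []
    for entity, label, _kind, row_scheme in rows:
        if row_scheme == scheme and label.lower() == value.lower() and entity not in seen:
            seen.add(entity)
            out.append(entity)
    return out
```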

Matches the SDK's agent-native ethos: any action an agent takes in SQL is something a user or another tool can also take. No new Python-only runtime, no new service, no new matcher implementation to maintain.

Opt-in. v1 ships with a CLI flag only: gm compile --emit-concept-index. Default off for users who don't need it. A binding-side toggle (index: concept_index block on Binding) was considered but deferred — it requires schema and loader changes in bigquery_ontology.binding_models + binding_loader.py that are worth scoping as their own change once the CLI behavior is settled. If v2 adds it, the explicit precedence rule will be: CLI flag overrides binding setting; binding setting serves as the project default when the CLI flag is absent.

Index population contract

The existing DDL compiler (src/bigquery_ontology/graph_ddl_compiler.py) only emits schema SQL — CREATE TABLE / CREATE PROPERTY GRAPH. A concept index needs rows, which is a new kind of output. This subsection names who writes those rows and when.

Who writes the rows: the ontology compiler itself, in the same gm compile invocation that emits the DDL. The index is a deterministic function of both the ontology YAML and the binding — see "What's in the index" below. Treating it as a separate build step creates two sources of truth and a refresh-skew class of bugs that the SDK shouldn't inherit.

What's in the index (scope relative to binding): compile_concept_index(ontology, binding) takes both inputs because the index respects the binding's subset semantics. Since a binding may legally realize only a subset of the declared ontology (binding_models.py:147), the compiler needs a rule for which entities participate in the index. The rule is:

  • All abstract entities from the ontology, regardless of binding — they're informational-only and never bound by construction (issue #57's binding rejection rule). Their value is precisely in being available for runtime resolution even when the agent's BQ tables don't materialize them.
  • Only concrete entities that are bound in this binding. Concrete + unbound entities are deliberately excluded from this deployment's runtime surface; including them would let a resolver return matches the agent then can't query. That's worse than a miss.

In short: abstract: always. Concrete: iff bound. This matches the SDK-level invariant from the adapter design ("every element in GraphSpec is bindable and has data") while preserving the taxonomy-browse value that abstract SKOS entities add at runtime.

Consequence: two different bindings over the same ontology produce different indexes. A narrow deployment binding only Account and Customer emits a smaller index than a wide deployment binding all 40 concrete entities, but both share the same abstract skos_Banking / skos_FinancialProduct / etc. nodes. Abstract relationships between abstract entities are always in scope; abstract relationships touching an unbound concrete entity are included (they're informational metadata, not runtime operations).

The is_abstract column in the index row lets resolvers filter at query time: a resolver that wants only runtime-materializable matches does WHERE NOT is_abstract; a resolver producing taxonomy-aware "did you mean" suggestions keeps both.
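
The inclusion rule ("abstract: always; concrete: iff bound") reduces to a one-liner. A sketch with a hypothetical (name, is_abstract) representation of the ontology's entities:

```python
def index_entities(ontology_entities, bound_names):
    """Which entities participate in the concept index: abstract entities
    always; concrete entities only if this binding binds them.
    `ontology_entities` is a hypothetical list of (name, is_abstract) pairs."""
    return [
        name
        for (name, is_abstract) in ontology_entities
        if is_abstract or name in bound_names
    ]
```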

Table naming contract: because two bindings against the same ontology produce legitimately different indexes, a single global table name is unsafe — the second compile would silently overwrite the first. The output table name is therefore a required parameter, not a fixed convention:

def compile_concept_index(
    ontology: Ontology,
    binding: Binding,
    *,
    output_table: str,   # required — fully-qualified `project.dataset.table`
) -> str: ...

CLI:

gm compile --emit-concept-index \
           --concept-index-table my-project.my_dataset.ontology_concept_index__retail

Both library and CLI error cleanly if the name is missing when --emit-concept-index is set. No silent global default. Users with a single binding per dataset pick any unique name they like (ontology_concept_index is fine); users with multiple bindings per dataset pick distinct names per binding (ontology_concept_index__retail, ontology_concept_index__investment_bank, etc.).

Why required rather than auto-derived:

  • Bindings do carry a binding: str identifier (binding_models.py:159), but it isn't a safe or stable source for a BQ table name: it's an identity tag for the binding document, not a deployment-unique BQ-legal identifier. Using it would couple operational naming to a field authors rename for non-operational reasons, and would collide across environments (dev/stage/prod) that share the same binding identity.
  • Hash-derived defaults like ontology_concept_index__{sha1(binding)[:8]} are collision-free but unreadable and change on every trivial binding edit — bad ergonomics for a table name that appears in user-written resolver SQL.
  • Explicit naming forces the deployment-operator-level decision at compile time, where it belongs.

OntologyRuntime reads the index via the same name the caller passed at compile time — runtime construction takes a matching concept_index_table: str parameter (or reads it from configuration) so lookups target the right table. The name is not stored on the ontology or binding model; it's a runtime/deployment concern.

Provenance and compatibility contract: because the table name is caller-supplied and binding-scoped, nothing in the data columns alone would catch a mismatched wiring like OntologyRuntime.from_models(ontology_A, binding_B, concept_index_table=table_C) where table_C was actually compiled from a different (ontology, binding) pair. Plausible-but-wrong matches are worse than no matches — the agent gets confident answers against stale or unrelated data.

The compiler therefore emits a sibling metadata table named {output_table}__meta, written in the same gm compile invocation. One row per compile:

CREATE OR REPLACE TABLE `{output_table}__meta` AS
SELECT * FROM UNNEST([
  STRUCT(
    'retail' AS ontology_name,                         -- from Ontology.name
    'sha256:abc123...' AS ontology_fingerprint,        -- see "Fingerprint algorithm" below
    'sha256:def456...' AS binding_fingerprint,         -- same algorithm, over Binding model
    'my-project' AS target_project,                    -- from Binding.target.project
    'my_dataset' AS target_dataset,                    -- from Binding.target.dataset
    'gm-1.2.0' AS compiler_version,                    -- version of bigquery_ontology that compiled
    'a1b2c3d4e5f6' AS compile_id                       -- pair-consistency tag; deterministic from inputs
  )
]);

Sibling rather than embedded columns so the bulk of the index (the label/notation rows) stays lean.

Fingerprint algorithm: fingerprints are SHA-256 hashes over a canonical serialization of the validated Ontology / Binding Pydantic models — not over raw YAML text. Concretely:

  1. Load YAML → validated model (existing load_ontology() / load_binding() path). Validation normalizes optional fields, default values, and type coercion.
  2. Serialize the validated model to a canonical JSON form: keys sorted lexicographically at every nesting level, no extra whitespace, UTF-8, stable encoding of None / booleans / numbers, lists preserved in declaration order (list order is semantically meaningful in the ontology model — e.g., key columns).
  3. Hash the resulting bytes with SHA-256, prefix with sha256:.

The same approach is used for both ontology and binding fingerprints, with one difference: ontology fingerprinting covers every field of the Ontology model. Binding fingerprinting covers every field of the Binding model except ephemeral annotations (if any are introduced later) — the binding's identity for the purpose of "does this index correspond to this binding" is its declared structure, not its documentation metadata.

Why model-based and not YAML-text-based:

  • Two semantically identical YAML documents with different formatting, comment placement, or emitter behavior must produce the same fingerprint. A strict verification gate that rejects non-semantic edits would be a constant source of false positives and would push operators to disable verification — worse than no verification.
  • Pydantic-validated models are already the canonical in-memory form the SDK works with (src/bigquery_agent_analytics/runtime_spec.py:199 and adjacent). Hashing at that layer matches the layer where the rest of the SDK's determinism lives.
  • The existing compile contract is already model-based (compile_graph(ontology, binding) -> str takes models, not YAML strings). Keeping fingerprint input at the same layer maintains consistency across compile output and runtime verification.

Two bindings produced from the same source YAML by different emitters (e.g., one with trailing newlines, one without) fingerprint identically. Two bindings that disagree on any declared field — entity names, target dataset, property types — fingerprint differently and correctly fail strict verification.

Canonicalization rules in brief (formal spec in the implementation):

  • Keys sorted at every nesting level (stable across Python dict iteration).
  • Model fields serialized via Pydantic's model_dump(mode="json", by_alias=False, exclude_none=False) so defaults materialize consistently.
  • Enum values serialized as their canonical string form, not member name.
  • None / missing-but-defaulted fields serialized as explicit null to distinguish "absent" from "defaulted."
  • List order preserved; no reordering of entity/relationship/property lists (order is semantically load-bearing).
  • Output encoded as UTF-8 JSON with separators=(",", ":") (no extra whitespace).
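
A sketch of the fingerprint under those rules, with a plain dict standing in for Pydantic's model_dump(mode="json") output. json.dumps with sort_keys=True sorts keys at every nesting level and preserves list order, matching the canonicalization above:

```python
import hashlib
import json


def fingerprint(model_dump: dict) -> str:
    """Canonical-JSON SHA-256 fingerprint sketch. `model_dump` stands in for
    Pydantic's model_dump(mode="json") output; sort_keys=True gives
    lexicographic key order at every nesting level, list order is preserved."""
    canonical = json.dumps(
        model_dump, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()
```

Two dumps that differ only in key order fingerprint identically; any declared-field difference changes the hash.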

OntologyRuntime runtime verification:

  • At construction, OntologyRuntime.load(...) / .from_models(...) computes the same fingerprints on the loaded Ontology and Binding models.
  • On first access to the concept index (lazy — construction doesn't hit BQ), the runtime reads the __meta sibling and compares fingerprints.
  • Mismatch raises ConceptIndexMismatchError with a clear message naming the expected vs actual fingerprints and the table name involved. The runtime refuses to return matches from an index that doesn't correspond to the loaded models.
  • Missing __meta sibling (e.g., a manually-created index or one compiled with an older toolchain) raises a distinct ConceptIndexProvenanceMissing — caller can explicitly opt out with OntologyRuntime(..., verify_concept_index="off") for read-only dashboards or interactive exploration.
  • Verification re-checks on a configurable TTL, not once-per-lifetime. See "Long-lived runtime verification" below.

Long-lived runtime verification (strict is strict for the whole lifetime, not just the first call). A naive "verify once then cache forever" contract would let a long-lived service sail past an index refresh that swapped in a different (ontology, binding) pair — returning matches against the new index while still believing it was verified. That defeats the "plausible-but-wrong matches are worse than no matches" argument behind the strict default.

The contract:

  • After the first successful verification, OntologyRuntime caches the expected compile_id, ontology_fingerprint, and binding_fingerprint on the instance.
  • On each resolve / validate call, the runtime checks whether the cached verification is still fresh under a configurable TTL (verify_ttl_seconds, default 60). If the cache is fresh, the call proceeds without a BQ round-trip.
  • If the cache is stale, the runtime re-runs the full pair-consistency check plus a full-fingerprint freshness check, not just a single-table sentinel. Concretely:
    1. SELECT DISTINCT compile_id FROM {output_table} LIMIT 2 returns exactly one value.
    2. SELECT compile_id, ontology_fingerprint, binding_fingerprint FROM {output_table}__meta LIMIT 1 — read compile_id and the full fingerprints.
    3. Verify: main.compile_id == meta.compile_id (pair consistency).
    4. Verify: meta.compile_id == cached.compile_id AND meta.ontology_fingerprint == cached.ontology_fingerprint AND meta.binding_fingerprint == cached.binding_fingerprint (full-fingerprint freshness).
  • Outcomes:
    • All checks hold → refresh the cache timestamp and proceed.
    • Pair consistent but any cached value differs from meta → raise ConceptIndexRefreshed. Service operator recreates OntologyRuntime with updated models; new instance's full fingerprint verification catches whether the new index matches or not.
    • Main and meta disagree (refresh in progress) → one-shot 2s retry, then raise ConceptIndexInconsistentPair. Same contract as first-load.

Why the sentinel must read both tables, not just meta. An earlier draft checked only meta.compile_id. That has a correctness hole: the inline refresh order is "main first, meta second," so during the swap window a reader could see the old meta compile_id (matches cache, accepted), then query the new main table, and serve data from the refreshed index under stale verification. Reading both tables on TTL re-check closes that window — main's compile_id is authoritative for "which compile does the data belong to," and the meta comparison catches inconsistent pairs.

Why the freshness check compares full fingerprints, not just compile_id. The compile_id column is a 12-hex-char truncation of sha256(ontology_fingerprint || binding_fingerprint || compiler_version) — 48 bits of entropy, chosen to keep the per-row compile_id column short (storage efficiency on a column that repeats across every data row). That's enough for first-pass pair consistency: two tables with different compile_ids definitely belong to different compiles, and the number of distinct compiles a single output_table sees over a realistic deployment lifetime stays far below the birthday bound at which 48-bit collisions become likely.

But "comfortably below" is not "zero," and a strict verification contract shouldn't rely on it. The meta row carries the full ontology_fingerprint and binding_fingerprint (SHA-256, 256 bits each) — storing those in a single-row meta table costs nothing. The TTL re-check therefore compares all three (compile_id + both full fingerprints) against the cache. A hypothetical 48-bit collision where a legitimately-different (ontology, binding) pair happens to share a 12-char prefix is caught because the full fingerprints won't match.

Pair consistency between the two tables still runs on the short compile_id — it only needs to detect "are these from the same compile or different compiles," and 48 bits is overkill for that single-dataset comparison. The strict freshness check runs on the full 256-bit fingerprints where the safety story demands it.

The two reads are still cheap. Main's SELECT DISTINCT compile_id FROM {output_table} LIMIT 2 reads at most two rows from a clustered column; meta reads exactly one row (and always has, just with more columns than before). Per-TTL-window cost remains negligible even at the default 60s.

Configuration surface on OntologyRuntime construction:

  • verify_ttl_seconds: int = 60 — default 60. Balance between correctness-staleness window and re-verification cost.
  • verify_ttl_seconds=0 — check on every call. Useful for low-QPS services where correctness matters more than cost.
  • verify_ttl_seconds=None — snapshot-bound: verify once on first use, never again. Explicit opt-in for services that coordinate refresh out-of-band (e.g., rolling-restart on recompile). Matches the old "verify once" behavior for callers who want it.

Why TTL rather than check-every-call by default: the pair re-check is cheap but not free, and for high-QPS resolver workloads it adds up. A 60s staleness window matches typical service-refresh cadences while keeping per-call cost bounded at O(1) with no BQ hit in the common case.

Pair-consistency contract (the two tables must agree on the same compile). Because {output_table} and {output_table}__meta are written as two separate CREATE OR REPLACE TABLE statements, a reader interleaved with a refresh could otherwise observe:

  • new meta + old data → strict verification would pass against stale data (plausible-but-wrong matches).
  • new data + old meta → strict verification would raise an incorrect mismatch.

To make the pair coherent without requiring DDL-level transactions (which BigQuery doesn't offer for CREATE OR REPLACE TABLE), both tables carry a compile_id tag that is derived deterministically from compile inputs — not a per-run UUID or a timestamped value:

compile_id = sha256(ontology_fingerprint || binding_fingerprint || compiler_version)[:12]

(The first 12 hex chars are enough to make accidental collisions vanishingly unlikely while keeping the column short.)
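
The derivation is a few lines. One caveat: the proposal doesn't pin down the byte-level encoding of the || concatenation (separator or none), so this sketch simply concatenates the strings:

```python
import hashlib


def compile_id(ontology_fp: str, binding_fp: str, compiler_version: str) -> str:
    """First 12 hex chars of sha256 over the concatenated full fingerprints
    and compiler version -- deterministic: same inputs, same tag."""
    material = (ontology_fp + binding_fp + compiler_version).encode("utf-8")
    return hashlib.sha256(material).hexdigest()[:12]
```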

  • compile_id STRING NOT NULL column on the main table — every row of {output_table} shares the same value.
  • compile_id field on the single __meta row — same value.
  • Write order: main table first, meta second. Readers never see "new meta promising data that doesn't exist yet."

Why deterministic rather than per-run:

  • Preserves the byte-identical output contract on compile_concept_index() (see Compiler output contract below). Two compiles of the same ontology + binding + compiler version produce character-identical SQL.
  • Pair consistency still works: interleaved compiles with different inputs produce different compile_ids and the runtime check catches the inconsistency. Interleaved compiles with identical inputs produce identical compile_ids and the data is also identical — the worst case is wasted work, not wrong data.
  • Callers auditing compile output in code review can diff it against the previous compile and see only the changes caused by ontology/binding edits, not a new UUID every run.

compiled_at is deliberately not in the emitted SQL. An earlier draft included a compiled_at TIMESTAMP field in the meta row; that's been removed to preserve byte-identical output. Operators who want compile timestamp visibility can read it from INFORMATION_SCHEMA.TABLES.creation_time on the __meta table, which BigQuery maintains automatically. The tradeoff is deliberate: runtime correctness (deterministic compile output, reviewable diffs) over embedded operator metadata that BQ already provides.

Runtime pair-consistency check (first concept-index access):

  1. SELECT * FROM {output_table}__meta LIMIT 1 → get expected_compile_id, expected fingerprints.
  2. SELECT DISTINCT compile_id FROM {output_table} LIMIT 2 → verify exactly one compile_id is present and it equals expected_compile_id.
  3. If compile_id mismatches or multiple distinct compile_ids are observed (which would indicate a broken compile), retry once after a short backoff (default 2 seconds) — handles the narrow interleaving window during normal refresh.
  4. If the retry also fails, raise ConceptIndexInconsistentPair with both observed compile_ids. This is distinct from ConceptIndexMismatchError (which is a wiring/fingerprint error, not a timing one) so callers can handle them differently.
  5. Once pair-consistency is established, fingerprint verification proceeds against the meta row.

The retry is deliberately one-shot and small: a legitimately long refresh window indicates operator misbehavior (concurrent compiles against the same table) and should fail loudly.
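
The check-plus-one-shot-retry loop can be sketched as follows, with the two reads injected as callables (real code would run the two SELECTs above against BQ):

```python
import time


def check_pair(read_main_ids, read_meta_id, retry_delay: float = 2.0, sleep=time.sleep) -> str:
    """Pair-consistency check per steps 1-4 above: main table and __meta must
    agree on a single compile_id; one short retry covers a refresh in flight.
    `read_main_ids` / `read_meta_id` stand in for the two SELECTs."""
    main_ids = meta_id = None
    for attempt in (0, 1):
        main_ids = read_main_ids()  # SELECT DISTINCT compile_id ... LIMIT 2
        meta_id = read_meta_id()    # SELECT compile_id FROM {output_table}__meta
        if len(main_ids) == 1 and main_ids[0] == meta_id:
            return meta_id
        if attempt == 0:
            sleep(retry_delay)      # one-shot backoff, then fail loudly
    # Stand-in for ConceptIndexInconsistentPair from the contract above.
    raise RuntimeError(f"inconsistent pair: main={main_ids} meta={meta_id}")
```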

Compatibility flag vocabulary:

  • verify_concept_index="strict" (default) — fingerprint mismatch, missing meta, or persistent pair inconsistency all raise.
  • verify_concept_index="missing_ok" — fingerprint mismatch and pair inconsistency raise, missing meta warns and proceeds.
  • verify_concept_index="off" — no verification, purely caller-managed. Intended for explicit "I know what I'm doing" paths.

Rejected alternatives for pair consistency:

  • Transactional multi-statement BEGIN TRANSACTION; ... COMMIT;. BigQuery's transaction support doesn't cover CREATE OR REPLACE TABLE — DDL is generally non-transactional. Not a viable primitive.
  • Shadow-version both tables + atomic pointer-table swap. Three tables (main, meta, pointer) adds significant operational complexity for a narrow window. The compile_id approach gets the same correctness property with one extra column.
  • BQ table OPTIONS(description=...) tagging. Requires INFORMATION_SCHEMA lookups, has its own freshness semantics, more tooling surface. compile_id in a data column is simpler to query from plain SQL.

Rejected alternatives for the provenance storage shape:

  • Embed provenance as repeated columns on every index row. Wastes storage proportional to the number of rows; makes diff-based review noisier; still requires runtime verification logic. Sibling table is strictly better. (The compile_id column is a deliberate exception — it's a single short fixed-length tag needed for pair consistency, not full provenance.)
  • Encode provenance in BQ table OPTIONS(description=...). BQ-native and elegant, but INFORMATION_SCHEMA queries have their own cost and tooling constraints, and the sibling table approach is easier to inspect from plain SQL (SELECT * FROM concept_index__meta).
  • Caller-managed in v1 with no verification. Considered and rejected — shipping a primitive that silently produces wrong results under a plausible operator mistake is a feature bug the SDK shouldn't ship with. v1 ships strict verification on by default, with documented escape hatches.

How the rows reach BQ (atomic-swap semantics for runtime readers): the compiler emits a single CREATE OR REPLACE TABLE ... AS SELECT ... statement. In BigQuery this is atomic — concurrent readers see either the previous table's rows or the new table's rows, never an empty intermediate state. This is the critical difference from a DELETE + INSERT pair, which would expose a window where the index is queryable but empty. For a runtime lookup primitive, that window is a correctness hazard, not just a performance one.

-- gated on --emit-concept-index, name passed via --concept-index-table
-- write order: main table first, __meta second. compile_id ties the pair.
CREATE OR REPLACE TABLE `{output_table}` AS
SELECT * FROM UNNEST([
  STRUCT('DMA' AS entity_name, 'DMA' AS label, 'name' AS label_kind,
         '807' AS notation, 'NielsenDMA' AS scheme,
         CAST(NULL AS STRING) AS language,
         FALSE AS is_abstract,
         'a1b2c3d4e5f6' AS compile_id),
  STRUCT('DMA', 'Designated Market Area', 'synonym',
         '807', 'NielsenDMA', 'en', FALSE, 'a1b2c3d4e5f6'),
  STRUCT('DMA', 'Marché de diffusion désigné', 'pref',
         '807', 'NielsenDMA', 'fr', FALSE, 'a1b2c3d4e5f6'),
  -- first-class notation row: label holds the notation value, label_kind='notation'
  STRUCT('DMA', '807', 'notation',
         '807', 'NielsenDMA', CAST(NULL AS STRING), FALSE, 'a1b2c3d4e5f6'),
  -- multi-scheme example: same entity appearing in two schemes
  STRUCT('BayAreaMetro', 'Bay Area Metro', 'pref',
         CAST(NULL AS STRING), 'NielsenDMA', 'en', FALSE, 'a1b2c3d4e5f6'),
  STRUCT('BayAreaMetro', 'Bay Area Metro', 'pref',
         CAST(NULL AS STRING), 'CensusMSA', 'en', FALSE, 'a1b2c3d4e5f6'),
  -- abstract SKOS concept: informational, no scheme membership
  STRUCT('skos_Banking', 'Banking', 'pref',
         CAST(NULL AS STRING), CAST(NULL AS STRING), 'en', TRUE, 'a1b2c3d4e5f6'),
  ...
]);

The separate CREATE TABLE scaffold is not emitted — CREATE OR REPLACE TABLE creates the table on first run and atomically replaces it on every subsequent run. This collapses "ensure table exists" and "populate rows" into one statement, eliminating the empty-table intermediate state entirely.

For the motivating ontology sizes (Yahoo's YAMO example: 274 SKOS concepts × multiple labels ≈ ~1K-10K rows), inline UNNEST(ARRAY<STRUCT<...>>) is well within BigQuery's query-text limits. For ontologies above ~50K rows, the compiler emits a shadow-table swap pattern for both tables in the pair — the pair-consistency contract applies to the shadow path too:

-- both tables get a shadow; suffix is "_shadow" on each production name
CREATE OR REPLACE TABLE `{output_table}_shadow` (...);
INSERT INTO `{output_table}_shadow` VALUES (...);  -- batched, includes compile_id column
CREATE OR REPLACE TABLE `{output_table}__meta_shadow` AS
  SELECT * FROM UNNEST([STRUCT(... 'a1b2c3d4e5f6' AS compile_id)]);

-- swap order: data first, then meta (matches the inline-path write order)
DROP TABLE IF EXISTS `{output_table}`;
ALTER TABLE `{output_table}_shadow` RENAME TO <short name from output_table>;
DROP TABLE IF EXISTS `{output_table}__meta`;
ALTER TABLE `{output_table}__meta_shadow` RENAME TO <short name from output_table>__meta;

Two distinct non-atomicity windows exist on this path:

  1. Table-existence window: between DROP and RENAME on each table, that table name does not resolve. Readers get BigQuery's "table not found" error, which they must tolerate as transient.
  2. Pair-inconsistency window: between "main renamed" and "meta renamed", the main table carries the new compile_id while the meta row still carries the old one. Readers in this window see compile_id disagreement → ConceptIndexInconsistentPair on the pair-consistency check.

The pair-inconsistency window on the shadow path can exceed the inline path's one-shot 2-second retry budget, because large-ontology rename operations take longer than small-ontology CREATE OR REPLACE TABLE statements. This means strict verification will raise during a legitimate shadow-path refresh if it happens to sample during the swap. That's by-design, not a defect: strict verification correctly rejects an inconsistent pair even when the inconsistency is transient. The alternative (silently serving old data against a new meta, or vice versa) is the failure mode strict verification exists to prevent.

Operational contract for the shadow path:

  • Treat shadow-path refreshes as offline/admin operations. Pause reader traffic (or accept ConceptIndexInconsistentPair exceptions) during gm compile runs that hit the shadow path.
  • If traffic cannot be paused, the caller has two options, neither of which involves missing_ok (which per the verification-mode contract above still raises on pair inconsistency — transient or not):
    • Increase verify_ttl_seconds so the pair re-check samples less frequently. Reduces the probability of landing inside a swap window at the cost of a longer staleness tolerance.
    • Catch ConceptIndexInconsistentPair at the application layer and retry the call after a short delay. Cleanest at the service-mesh level where transient 5xx handling already exists.
  • For services where neither is acceptable: bind the main + meta pair under a higher-level indirection (a separate {output_table}__current pointer table that callers resolve through). Not shipped in v1 — out of scope as a third level of indirection, tracked as follow-up work if real users hit this constraint.
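The catch-and-retry option from the contract above can be sketched as a thin application-layer wrapper. This is a hypothetical helper (resolve_with_retry is not part of the proposal); ConceptIndexInconsistentPair stands in for the SDK exception of the same name, and the attempt/delay values are illustrative:

```python
import time

class ConceptIndexInconsistentPair(Exception):
    """Stand-in for the SDK exception of the same name."""

def resolve_with_retry(resolver, value, *, scheme, attempts=3, delay_s=2.0):
    # Retry a resolve() call that may sample inside a shadow-path swap window.
    # Each retry waits long enough for a typical main+meta rename pair to finish.
    for attempt in range(attempts):
        try:
            return resolver.resolve(value, scheme=scheme)
        except ConceptIndexInconsistentPair:
            if attempt == attempts - 1:
                raise  # persistent inconsistency: surface it, don't mask it
            time.sleep(delay_s)
```

As noted above, the same pattern is often cleanest at the service-mesh level where transient-failure handling already exists; this sketch is the in-process equivalent.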

This limitation is specific to the shadow path. The inline-UNNEST path (the default for ontologies under 50K rows, covering the motivating use cases including Yahoo's YAMO) remains fully atomic per-statement and doesn't exhibit either window.

When refresh happens: on every gm compile run with --emit-concept-index. No incremental build. If the user edits ontology YAML, they re-run compile, same as any other DDL change. This matches the compile model users already have for schema changes and avoids adding a second refresh command.

Alternatives considered and rejected:

  • Separate gm build-concept-index step. Adds a command users have to remember and introduces drift between "DDL is up to date" and "index is up to date." Two invocations for one conceptual change.
  • Runtime lazy-build. Rebuild the index in memory on OntologyRuntime.load() and optionally push to BQ. Surfaces inconsistent state when multiple agent instances load simultaneously and makes the "query the index in SQL" path unreliable until someone has pushed.
  • Streaming incremental updates. Possible future work for ontologies with externally-sourced concept rolls. Out of scope for the initial primitive.

Failure modes:

  • Inline CREATE OR REPLACE TABLE path: if the statement fails (quota, permissions, query text too large), gm compile errors with a message naming the concept index. Because CREATE OR REPLACE is atomic, there is no half-written state — the previous table (if any) remains queryable, or no table exists at all. The user can re-run compile without cleanup.
  • Shadow-table swap path: failure mid-swap can leave the pair in an inconsistent state (main renamed but meta not, or either table dropped but not renamed). gm compile retry detects the orphaned _shadow table(s) and resumes from the swap step. Runtime readers during the orphaned window get either "table not found" or ConceptIndexInconsistentPair — both expected transient conditions on the shadow path, tolerated via the operational contract above (pause traffic during refresh, or accept transient failures).

3. EntityResolver Protocol + two reference implementations

Interface-only in core:

from typing import Optional, Protocol
from dataclasses import dataclass

@dataclass
class Candidate:
    entity_name: str         # unique per Candidate in ResolveResult.candidates
    label: str               # the winning label that produced this match
    label_kind: str          # 'name' | 'pref' | 'alt' | 'hidden' | 'synonym' | 'notation'
    scheme: Optional[str]    # scheme the winning match came through (None = entity-scoped or no scheme)
    confidence: float
    reason: str              # 'exact' | 'notation' | 'synonym' | 'fuzzy' | 'none'

@dataclass
class ResolveResult:
    match: Optional[str]           # resolved entity_name (None = no match)
    confidence: float              # 0.0 - 1.0; 1.0 = exact
    candidates: list[Candidate]    # top-k "did you mean" suggestions, one per entity
    reason: str                    # why `match` resolved (or 'none')

class EntityResolver(Protocol):
    def resolve(
        self,
        value: str,
        *,
        scheme: str | None = None,
        entity: str | None = None,
        limit: int = 5,
    ) -> ResolveResult: ...

scheme and entity are mutually exclusive — see "Scope semantics" in the Library API impact section below. Exactly one must be provided.

Candidate dedup contract (important once the index is denormalized per (entity_name, label, label_kind, language, scheme)):

  • ResolveResult.candidates contains at most one entry per entity_name. The denormalized index naturally produces multiple matching rows for the same entity (same entity, different label or different scheme). For an agent-facing "did you mean" list, duplicates are noise — the agent wants a list of distinct concepts, each annotated with the best evidence for why it matched.
  • limit=N means N distinct entities, not N raw rows. Resolvers do the dedup before truncating.
  • Winning-label rule when the same entity matches through multiple rows: pick the row with the highest confidence under the resolver's matching rule. Ties broken by label_kind priority, in this order: name > pref > alt > hidden > synonym > notation. Further ties broken by lexicographic label order for determinism.
  • Candidate.label / Candidate.label_kind / Candidate.scheme / Candidate.reason reflect the winning row. The other rows that also matched are discarded — callers wanting the full provenance use the concept index directly via SQL.
  • reason values are resolver-defined but drawn from a shared vocabulary so callers can branch on them without ambiguity: exact (name match), notation (notation match), synonym (any label other than name), fuzzy (non-exact match produced by a fuzzy resolver), none (no match found — only valid on the ResolveResult.reason, not on Candidate.reason).

SDK ships two references in core:

  • ExactMatchResolver — O(1) lookup against name + skos:notation. Confidence is 1.0 or 0.0. Good for notation-heavy inputs (Nielsen DMA codes, Google Ads Criteria IDs).
  • SynonymResolver — extends ExactMatchResolver by also matching against prefLabel / altLabel / hiddenLabel / synonyms. Still exact on each label; still confidence 1.0 or 0.0.

Everything above exact-match — token-set equality, Jaccard, Levenshtein, phonetic, weighted ensembles — lives in user code or contrib/ packages. Verticals pick (or write) a resolver tuned for their domain.

4. validate_against_ontology — small validation helper

Not resolution — just pass/fail against the declared ontology. Return shape is bounded by design so it stays useful on large concept schemes (IAB Taxonomy, Nielsen DMAs, SNOMED excerpts) where the candidate universe is hundreds to tens of thousands of entries:

rt.validate(
    {"format_ids": ["display_static", "display_banner"]},
    scheme="AdFormat",       # see "Scope semantics" below
    sample_limit=10,         # default — cap on known_values_sample
)
# → ValidationResult(
#     valid=["display_banner"],
#     invalid=["display_static"],
#     known_value_count=47,
#     known_values_sample=["display_banner", "display_native", ...],  # up to sample_limit
#     candidates=None,   # populated only when composed with a resolver
# )

Agents combine validate_against_ontology with a resolver to produce "did you mean." The SDK doesn't match; it only knows what exists.

Design notes on the return shape:

  • known_value_count is always the full count. Tells the caller whether the sample is representative.
  • known_values_sample is capped at sample_limit (default 10). Enough for a "did you mean" hint without bloating every validation miss on a 10K-concept scheme. Callers who genuinely need the full set use rt.in_scheme(...) or rt.entities() — that's what those accessors are for.
  • candidates stays None unless the caller composes validation with a resolver. Keeps validate pure set-membership; keeps ranking logic in resolver-land. No double-duty.
  • Sample order is not specified by the contract — callers should not rely on alphabetical or any other ordering. If deterministic ordering matters for a specific use, pass a sorted known_values_sample through a resolver that ranks.
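The bounded return shape sketches as a small dataclass plus a pure set-membership check. Field names follow the example above; validate_values is a hypothetical standalone helper (the proposal puts this logic on OntologyRuntime), and the exact types are an assumption:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ValidationResult:
    valid: list[str]                   # inputs found in the scoped value set
    invalid: list[str]                 # inputs not found
    known_value_count: int             # full size of the scoped value set
    known_values_sample: list[str]     # at most sample_limit entries, order unspecified
    candidates: Optional[list] = None  # populated only when composed with a resolver

def validate_values(values: list[str], known: set[str], sample_limit: int = 10) -> ValidationResult:
    # Pure set-membership: no matching, no ranking, no resolver coupling.
    valid = [v for v in values if v in known]
    invalid = [v for v in values if v not in known]
    sample = list(known)[:sample_limit]
    return ValidationResult(valid, invalid, len(known), sample)
```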

Library API impact

This section pins down the parts of the proposal that touch existing public APIs, so they're clear before implementation starts.

Compiler output contract

The existing bigquery_ontology.compile_graph(ontology, binding) -> str is documented to be deterministic — "same inputs → byte-identical text." That contract is preserved. Concept-index emission does not modify compile_graph().

Instead, a new sibling function ships alongside:

# existing, unchanged
def compile_graph(ontology: Ontology, binding: Binding) -> str: ...

# new, additive
def compile_concept_index(
    ontology: Ontology,
    binding: Binding,
    *,
    output_table: str,   # required — see "Table naming contract" above
) -> str: ...

Both return deterministic strings. compile_concept_index() extends the compile_graph() byte-identical contract in the same spirit: same inputs → byte-identical DML text, including row order.

"Same inputs" for compile_concept_index() = (ontology, binding, output_table, compiler_version). Everything in the emitted SQL is derived from those four values:

  • compile_id is sha256(ontology_fingerprint || binding_fingerprint || compiler_version)[:12] — deterministic.
  • No per-run timestamps, UUIDs, or process identifiers appear in the emitted SQL. Compile timestamps are recoverable from INFORMATION_SCHEMA.TABLES.creation_time on the emitted tables.
  • Row order is determined by the sort key below, applied before SQL generation.

Rows are sorted by a stable key before SQL generation:

(scheme, entity_name, label_kind, language, label, notation, is_abstract)

with NULLs ordered last consistently per column. is_abstract is last because it's determined by entity_name — included only for defensive stability if the invariant ever loosens. This sort order guarantees that two invocations of compile_concept_index() on the same ontology + binding emit character-identical SQL — critical for diffing compile output in code review, caching compiled artifacts, and verifying that ontology edits produced only the expected row changes.
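The NULLs-last ordering can be sketched as a per-column key transform. sort_rows and _nulls_last are hypothetical names; the row shape is assumed to match the seven-column sort key above:

```python
def _nulls_last(value):
    # Wrap every cell so None sorts after any real value, consistently per column.
    return (1, "") if value is None else (0, value)

def sort_rows(rows):
    """rows: (scheme, entity_name, label_kind, language, label, notation,
    is_abstract) tuples. Stable, deterministic order for byte-identical SQL."""
    return sorted(rows, key=lambda r: tuple(_nulls_last(c) for c in r))
```

Because the key is total and deterministic, re-sorting the same rows always yields the same order, which is the property the byte-identical contract needs.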

Library callers who want only DDL keep calling compile_graph() as today; callers who want the concept index call compile_concept_index() for a separate DML script. The CLI layer composes the two:

# CLI behavior for `gm compile --emit-concept-index`
sql_parts = [compile_graph(ont, binding)]
if args.emit_concept_index:
    sql_parts.append(
        compile_concept_index(ont, binding, output_table=args.concept_index_table)
    )
print("\n\n-- concept index --\n\n".join(sql_parts))

Why a sibling and not a composed option:

  • Preserves the byte-identical contract on compile_graph().
  • No breaking change to existing callers.
  • Each function has one job; easier to test, version, and reason about.
  • CLI callers with shell-orchestrated pipelines can write DDL and DML to separate files if they want — composition stays a caller concern.

Rejected alternatives:

  • compile_graph(..., emit_concept_index=True) returning concatenated DDL+DML — breaks the byte-identical contract for one config mode and creates a function whose return value depends on a flag.
  • Return an object (CompileResult(ddl=..., dml=...)) — breaks every existing caller of compile_graph() that treats the return as a string.
  • CLI-only (no library-layer API for the index DML) — forces library users to reimplement concept-index generation themselves, defeats the point of the primitive.

OntologyRuntime construction: paths and models

The example in section 1 shows OntologyRuntime.load(ontology_path=..., binding_path=...). But existing SDK code already carries validated Ontology and Binding models around in memory (e.g., src/bigquery_agent_analytics/runtime_spec.py:199 passes models directly). Forcing callers to reparse YAML or round-trip through disk would be a step backward.

Two classmethods cover both cases:

class OntologyRuntime:
    @classmethod
    def load(
        cls,
        ontology_path: str | Path,
        binding_path: str | Path,
    ) -> "OntologyRuntime":
        """Load from YAML files on disk."""
        ...

    @classmethod
    def from_models(
        cls,
        ontology: Ontology,
        binding: Binding,
    ) -> "OntologyRuntime":
        """Wrap already-validated models."""
        ...

load() is the convenience path for one-off scripts and the CLI. from_models() is the integration path for the SDK's existing flows — runtime_spec, ontology_orchestrator, adapters downstream of load_ontology() can all wrap without touching disk again.

Internal implementation: load() calls load_ontology() + load_binding() then delegates to from_models(). Zero code duplication.

Scope semantics: scheme vs entity

Resolvers and validate() need an explicit target set. Two mutually-exclusive named parameters, no polymorphism:

# Scheme-scoped: resolve/validate against all members of a concept scheme.
# Most common case — this is what the motivating examples (AdFormat, DMA, IAB) want.
rt.validate({"dma": ["Nielsen 807"]}, scheme="NielsenDMA")
resolver.resolve("San Francisco-Oakland", scheme="NielsenDMA")

# Entity-scoped: resolve/validate against a single named entity. Identity check only.
# Rare — used when you want "is this exactly this one entity?" rather than
# "is this a member of a taxonomy?"
rt.validate({"customer_id": ["C-42"]}, entity="Customer")

Rules:

  • Exactly one of scheme or entity must be provided. Passing both or neither is an error with a clear message.
  • scheme=<name> resolves against the set {e : e.in_scheme(name) or (name == e.name and e.is_abstract_scheme_root)}. This is the motivating case. Works for both explicit SKOS concept schemes and abstract entities that act as taxonomy roots.
  • entity=<name> resolves against the singleton set {e : e.name == name}. Identity check. Returns match iff the input exactly matches the entity's name, notation, or a declared label/synonym.
  • Narrower-closure scoping (e.g., "all narrower-than some abstract node") is explicitly deferred. When the need surfaces, it'll come back as scope=Scope.narrower_closure(name) or similar, without changing the meaning of scheme and entity.
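The exactly-one rule sketches as a small guard shared by resolvers and validate(). check_scope is a hypothetical helper and the error message is illustrative:

```python
def check_scope(*, scheme=None, entity=None):
    # Exactly one of scheme= / entity= must be provided; both or neither is an error.
    if (scheme is None) == (entity is None):
        raise ValueError(
            "Provide exactly one of scheme= or entity= "
            f"(got scheme={scheme!r}, entity={entity!r})."
        )
    return ("scheme", scheme) if scheme is not None else ("entity", entity)
```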

Why not polymorphic ("entity= means scheme-scoped if it's a scheme, entity-scoped otherwise"):

  • Two implementers following the spec would return different answers for the same call, depending on their interpretation of the ontology's structure.
  • Ontology authors who later change an entity from concrete to abstract-scheme-root would silently change the semantics of every entity= call targeting it.
  • Callers would need ontology knowledge to predict what a given entity= call does — defeats the point of a stable API.

Explicit parameters keep the contract boring and predictable.

Non-goals

  • Ship a general string-matching library. BigQuery already provides EDIT_DISTANCE and SOUNDEX built in, and Jaccard is a short UDF away. If the concept index is materialized, users get these for free. Don't wrap.
  • Ship the 5-layer resolver in core. Token-set equality thresholds, Jaccard coefficients, Levenshtein cutoffs — all domain-tuned. Advertising's tuning for DMAs is not the right tuning for SNOMED or legal-entity names. The feedback author's resolver is valuable as a reference for their domain and belongs in contrib/ or a separate package.
  • Promise a <50ms SLA. Latency is a function of index size and resolver choice, both of which vary by user. The SDK can guarantee the primitive shapes; it can't guarantee the performance of every application that uses them.
  • Provide a concept-scheme browser UI. Out of scope — this is an analytics SDK, not an ontology editor.
  • Take a position on "did you mean" phrasing. The SDK returns structured candidates; the agent composes user-facing copy.

How this lands on top of existing code

Using the current SDK's module boundaries:

| Piece | Belongs in | Notes |
| --- | --- | --- |
| OntologyRuntime class with load() + from_models() classmethods | bigquery_agent_analytics/ontology_runtime.py (new) | Wraps load_ontology + load_binding from bigquery_ontology. Pure Python, no BQ calls. Both construction paths share one implementation. |
| compile_graph() (existing) | bigquery_ontology/graph_ddl_compiler.py | Unchanged. Preserves byte-identical contract. |
| compile_concept_index() (new sibling) | bigquery_ontology/graph_ddl_compiler.py (new function) | Separate deterministic DML emitter. CLI composes with compile_graph() when --emit-concept-index is set. |
| EntityResolver Protocol + references | bigquery_agent_analytics/entity_resolver.py (new) | Core SDK layer. Protocol + two implementations. Both accept scheme= or entity= (mutually exclusive). |
| validate_against_ontology | Method on OntologyRuntime | Same scheme= / entity= scope parameters. |
| Domain packs and layered resolvers | bigquery_ontology/contrib/ or external packages | Advertising, healthcare, finance. Never in core. |

Changes to existing modules are limited but not zero. Most of the proposal is additive (new files, new functions, new classmethods). Two concrete edits to existing code are needed:

  • src/bigquery_ontology/cli.py:299 — the existing compile command gains --emit-concept-index and --concept-index-table <name> flags. When --emit-concept-index is set, the command composes the existing compile_graph() output with compile_concept_index(..., output_table=...). Without the flag, the command's behavior is byte-identical to today.
  • src/bigquery_ontology/graph_ddl_compiler.py — adds the new compile_concept_index() function in the same module. compile_graph() itself is not modified.

The runtime accessor (OntologyRuntime) reads the same Ontology/Binding models already loaded today — that path is purely additive in the SDK package.

Ties to issue #57 (SKOS import)

This proposal depends on issue #57 landing first, because the concept index's value comes almost entirely from SKOS annotations (skos:notation, skos:prefLabel, skos:altLabel, skos:broader) being preserved through import. Without #57, the concept index is a thin wrapper over entity names and existing synonyms — useful but not transformative.

Specifically:

  • skos:notation in annotations → notation column in concept index → L1 code match becomes trivial
  • skos:prefLabel / altLabel / hiddenLabel → rows in concept index with label_kind discriminator → L2 lexical becomes trivial
  • skos_broader abstract relationships → rt.broader() traversal → taxonomy-aware "did you mean a parent or sibling"
  • Abstract entities with skos_ prefix → rt.in_scheme() enumerates all concepts in a taxonomy → agent can present the scheme to the LLM as context

Open questions — feedback wanted

  1. Is OntologyRuntime the right wrapper, or should the accessors live as methods on Ontology / Binding directly? Pro-wrapper: keeps bigquery_ontology pure-data and the runtime layer in bigquery_agent_analytics. Pro-direct: fewer classes to learn. Proposal leans wrapper — the accessor layer is SDK-runtime concern, not ontology-package concern.

  2. Should the concept index be opt-in or opt-out? Pro-opt-in: users who don't need it don't pay storage. Pro-opt-out: users discover the primitive because it just exists. Proposal leans opt-in: no silent BQ table creation.

  3. Should OntologyRuntime cache the concept index in memory for pure-Python access, or always go to BQ? Pro-memory: fast, no BQ cost, works offline. Pro-BQ-only: always consistent with DDL, scales to ontologies with 100K+ concepts. Proposal: pure-Python by default for ontologies under some size threshold; explicit BQ-backed resolver for large ones.

  4. Does EntityResolver need an async variant? Resolution against a BQ-backed index is I/O. Proposal: ship sync; add async later if users ask.

  5. Should the SDK ship a richer FuzzyResolver reference (just exact + prefix, not full 5-layer) so users have a middle option? Proposal: no — either exact or bring-your-own. Avoids the "SDK partially solves fuzzy matching" trap where the reference becomes everyone's default despite being domain-unaware.

  6. Should the Protocol be typing.Protocol or an ABC? Protocol allows duck typing; ABC forces inheritance. Proposal: Protocol — matches modern typing conventions and doesn't force users to inherit.

  7. Should rt.validate() also return a nearest field when values are invalid? Would require calling a resolver inside validate, coupling the two. Proposal: no — keep validate pure set-membership, let callers compose it with a resolver.

  8. Concept index: do we need a per-row score or priority for when multiple labels map to the same entity? Some verticals (IAB) prefer one label over another as the "canonical" display form. Proposal: defer — label_kind (name vs pref vs alt) already lets callers prioritize. Add score if needed.

  9. Is contrib/ the right home for domain resolvers, or should they be separate packages? Pro-contrib: easy discovery, versioned together. Pro-separate: community can ship without depending on SDK releases. Proposal: contrib for reference implementations (advertising, healthcare); external packages for user-owned domains.

  10. Should narrower-closure scoping ship in v1? The current proposal settled on two explicit parameters — scheme= for concept-scheme membership and entity= for single-entity identity. A third mode (narrower-closure: "resolve against all entities narrower-than some abstract node") is deferred. Advertising taxonomies nest (IAB Tier 1 → Tier 2), and a caller may want to resolve against the subtree under a specific abstract node rather than a flat scheme. Proposal: ship scheme= and entity= only in v1; add scope=Scope.narrower_closure(name) in v2 if real callers need it. For most cases, scheme membership plus rt.narrower(entity) traversal covers the need without a new API.


Related:

Please comment if you have opinions, real-world resolver implementations you'd like to see supported, or disagreements about where the SDK/agent boundary should sit.


Final design decisions — detailed

After twelve rounds of review the design is frozen. In-repo implementation plan at docs/implementation_plan_concept_index_runtime.md. This section is the design-level recap, split by package.

Ontology package (bigquery_ontology) — changes

New files (all under src/bigquery_ontology/):

  • _fingerprint.py (internal — underscore prefix). Single source of truth for model fingerprinting and the compile_id pair-consistency tag. Two functions: fingerprint_model(model) -> "sha256:<64 hex>" and compile_id(ont_fp, bnd_fp, compiler_version) -> "<12 hex>". Contract pinned in docstring (W1): model_dump(mode="json", by_alias=False, exclude_none=False)json.dumps(sort_keys=True, separators=(",",":"), ensure_ascii=False) → SHA-256. Not re-exported; both packages import via from bigquery_ontology._fingerprint import .... Landed in PR feat(ontology): A1 — internal _fingerprint module for concept-index provenance #71.

  • concept_index.py (module importable but not re-exported in v1). Row builder. Function: build_rows(ontology, binding) -> list[ConceptIndexRow]. Applies the "abstract always included, concrete iff bound" rule. Emits one row per (entity_name, label, label_kind, language, scheme) membership tuple, plus one notation row per skos:notation. Sorts deterministically by (scheme, entity_name, label_kind, language, label, notation, is_abstract) with NULLs last. Package-level re-export may be added later; kept out of the root for v1 to avoid growing semver surface ahead of need.
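The pinned W1 contract makes _fingerprint.py almost mechanical. A sketch using the function names and signatures from the bullet above — any object exposing a pydantic-style model_dump works, and the contract details (canonical JSON, SHA-256, 12-hex truncation) are taken directly from the text:

```python
import hashlib
import json

def fingerprint_model(model) -> str:
    # W1 contract: model_dump(mode="json", by_alias=False, exclude_none=False)
    # -> json.dumps(sort_keys=True, separators=(",",":"), ensure_ascii=False)
    # -> SHA-256, prefixed "sha256:".
    payload = model.model_dump(mode="json", by_alias=False, exclude_none=False)
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def compile_id(ont_fp: str, bnd_fp: str, compiler_version: str) -> str:
    # 12-hex (48-bit) pair-consistency tag; deterministic from its three inputs.
    tag = hashlib.sha256((ont_fp + bnd_fp + compiler_version).encode("utf-8"))
    return tag.hexdigest()[:12]
```

Determinism follows from sort_keys plus the fixed separators: two models with identical dumped payloads always produce identical fingerprints, regardless of dict insertion order.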

Modified files:

  • graph_ddl_compiler.py — gains a new public function compile_concept_index(ontology, binding, *, output_table) -> str alongside the existing compile_graph(). compile_graph() contract is preserved byte-identically; the existing function body is not touched. compile_concept_index() emits two statements by default: CREATE OR REPLACE TABLE {output_table} AS SELECT * FROM UNNEST([STRUCT(...), ...]) for the main index and a matching CREATE OR REPLACE TABLE {output_table}__meta AS SELECT * FROM UNNEST([STRUCT(...)]) for the meta sibling. Shadow-swap fallback activates at > 50K rows. Every row in both tables carries the same compile_id; the meta row additionally carries full ontology_fingerprint and binding_fingerprint.

  • cli.py:299 (the compile command) — gains two new flags: --emit-concept-index (boolean) and --concept-index-table <fqn> (required when --emit-concept-index is set — no silent global default). When both flags are absent, command output is byte-identical to today. No other CLI flags change.

  • __init__.py — adds from .graph_ddl_compiler import compile_concept_index so the new public function is importable as from bigquery_ontology import compile_concept_index, matching the existing compile_graph re-export. No other exports change. _fingerprint stays unexported.

Unchanged:

New CLI surface summary:

gm compile \
  --ontology ontology.yaml \
  --binding binding.yaml \
  --emit-concept-index \
  --concept-index-table my-proj.my_ds.ontology_concept_index

Produces: the existing CREATE PROPERTY GRAPH DDL + two concept-index tables (ontology_concept_index and ontology_concept_index__meta). Re-running the same command produces byte-identical SQL — the compile_id is deterministic from inputs (no timestamps, no UUIDs).

Version bump: Minor — new public function (compile_concept_index) and new CLI flags. Existing API byte-identical.


SDK package (bigquery_agent_analytics) — changes

New files (all under src/bigquery_agent_analytics/):

  • ontology_runtime.py. Hosts OntologyRuntime (the read accessor wrapper), the verification machinery (first-call + TTL re-check), and all four exception classes (ConceptIndexMismatchError, ConceptIndexProvenanceMissing, ConceptIndexInconsistentPair, ConceptIndexRefreshed). OntologyRuntime exposes two constructors — .load(ontology_path, binding_path, ...) and .from_models(ontology, binding, ...) — both routing through one shared implementation.

  • entity_resolver.py. Hosts the EntityResolver Protocol (not ABC — duck-typed for modern typing), the Candidate and ResolveResult dataclasses, and two reference implementations: ExactMatchResolver (name + notation) and SynonymResolver (extends exact with label-based match). Candidate dedup: one candidate per entity, winning-label priority (name > pref > alt > hidden > synonym > notation, lexicographic tiebreaker), limit=N returns N distinct entities.

Modified files:

  • __init__.py — adds to the existing try/except re-export block (same pattern as Client, CodeEvaluator, etc.):
    • OntologyRuntime — from .ontology_runtime
    • EntityResolver, ExactMatchResolver, SynonymResolver, Candidate, ResolveResult — from .entity_resolver
    • ConceptIndexMismatchError, ConceptIndexProvenanceMissing, ConceptIndexInconsistentPair, ConceptIndexRefreshed — from .ontology_runtime

Unchanged:

  • All other SDK modules. The runtime accessor layer is strictly additive.

Read accessors on OntologyRuntime (pure-Python, no BQ round-trip):

| Method | Returns | Notes |
| --- | --- | --- |
| entities() | list[str] | Names of concrete + abstract entities |
| entity(name) | Entity | With annotations, synonyms, abstract flag |
| synonyms(name) | list[str] | Pref + alt + hidden labels |
| annotation(name, key) | str \| None | E.g. skos:notation, skos:definition |
| in_scheme(scheme_name) | list[Entity] | Concepts in a skos:ConceptScheme |
| broader(name) | list[Entity] | skos:broader traversal |
| narrower(name) | list[Entity] | Inverse of broader |
| related(name) | list[Entity] | skos:related traversal |

Identity rules: entities are name-addressed (singular lookup); relationships are traversal-first, not name-addressed — a single skos_broader can repeat across endpoint pairs after #62's relaxed uniqueness, so a hypothetical rt.relationship(name) would have no single answer.

Validation accessor:

| Method | Returns | Notes |
| --- | --- | --- |
| validate_against_ontology(values, *, scheme=None, entity=None, sample_limit=20) | ValidationResult | scheme= and entity= are mutually exclusive; neither or both = ValueError. Bounded output via known_value_count + known_values_sample. candidates is None unless a resolver is explicitly composed by the caller. |

Verification configuration (on construction):

| Parameter | Default | Notes |
|---|---|---|
| `verify_concept_index` | `"strict"` | `"strict"` raises on any provenance issue; `"missing_ok"` tolerates missing meta; `"off"` disables verification entirely (for read-only dashboards) |
| `verify_ttl_seconds` | `60` | `0` = re-check on every call; `None` = snapshot-bound (verify once, never re-check) |

Verification lifecycle:

  1. Construction — OntologyRuntime.load(...) / .from_models(...) computes local ontology_fingerprint and binding_fingerprint (both full SHA-256). No BQ round-trip.
  2. First concept-index access (lazy — not on construction) — reads the __meta sibling, compares fingerprints. Mismatch → ConceptIndexMismatchError. Missing meta → ConceptIndexProvenanceMissing.
  3. TTL re-check (on each resolve / validate call past the TTL window) — runs two queries:
    • SELECT DISTINCT compile_id FROM {output_table} LIMIT 2 — asserts exactly one value (pair consistency).
    • SELECT compile_id, ontology_fingerprint, binding_fingerprint FROM {output_table}__meta LIMIT 1 — full-fingerprint freshness.
    Main/meta disagreement triggers a one-shot retry after 2s; persistent disagreement raises ConceptIndexInconsistentPair. Fingerprints that drift from the cached values raise ConceptIndexRefreshed.

Reading both tables with full fingerprints during the TTL re-check is a W2 watchpoint in the plan: simplifying to a single-table sentinel or a short-compile-id-only comparison would reintroduce either the meta/main race or the 48-bit collision hole.
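The pair-consistency half of the lifecycle can be sketched as pure control flow. This is a hedged illustration of the check order described above — the callables, dict keys, and `retry_sleep` hook are assumptions standing in for the two BQ queries and the 2s wait:

```python
class ConceptIndexInconsistentPair(Exception):
    """Main and meta tables persistently disagree on compile_id."""

class ConceptIndexRefreshed(Exception):
    """Fingerprints in the meta table drifted from the cached values."""

def check_pair(read_main_compile_ids, read_meta_row, cached_fingerprints,
               retry_sleep=lambda: None):
    # 1. Main table must carry exactly one distinct compile_id.
    main_ids = read_main_compile_ids()
    if len(main_ids) != 1:
        raise ConceptIndexInconsistentPair("multiple compile_ids in main table")
    meta = read_meta_row()
    # 2. Main/meta disagreement -> one-shot retry (2s in the real design).
    if meta["compile_id"] != main_ids[0]:
        retry_sleep()
        main_ids, meta = read_main_compile_ids(), read_meta_row()
        if len(main_ids) != 1 or meta["compile_id"] != main_ids[0]:
            raise ConceptIndexInconsistentPair("main/meta disagree after retry")
    # 3. Full-fingerprint freshness against the local cache.
    fp = (meta["ontology_fingerprint"], meta["binding_fingerprint"])
    if fp != cached_fingerprints:
        raise ConceptIndexRefreshed("fingerprints drifted from cache")
```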

Resolver surface:

```python
from bigquery_agent_analytics import (
    OntologyRuntime,
    ExactMatchResolver,
    SynonymResolver,
)

rt = OntologyRuntime.load(
    ontology_path="ontology.yaml",
    binding_path="binding.yaml",
    concept_index_table="my-proj.my_ds.ontology_concept_index",
    verify_concept_index="strict",    # default
    verify_ttl_seconds=60,            # default
)

resolver = SynonymResolver(runtime=rt)
result = resolver.resolve(
    input_value="Consumer Banking",
    scheme="BankingTaxonomy",         # scheme= XOR entity=
    limit=5,
)
# result.candidates: list[Candidate] with entity_name, matched_label, label_kind, scheme
```

Both reference resolvers query the concept index via BigQuery: ExactMatchResolver issues a WHERE label = @input lookup, and SynonymResolver layers the label_kind preference ordering on top of the same lookup.
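A hedged sketch of the parameterized SQL those two resolvers might emit. The column names follow this issue's vocabulary, but the exact concept-index schema is an assumption; the `@input` placeholder is a standard BigQuery named query parameter:

```python
def exact_match_sql(table: str) -> str:
    # ExactMatchResolver: parameterized equality lookup against the index.
    return (
        f"SELECT entity_name, label, label_kind, scheme\n"
        f"FROM `{table}`\n"
        f"WHERE label = @input"
    )

def synonym_sql(table: str) -> str:
    # SynonymResolver: same lookup, ordered by the label_kind preference
    # used for candidate dedup (name > pref > alt > hidden > synonym > notation).
    return exact_match_sql(table) + (
        "\nORDER BY CASE label_kind"
        " WHEN 'name' THEN 0 WHEN 'pref' THEN 1 WHEN 'alt' THEN 2"
        " WHEN 'hidden' THEN 3 WHEN 'synonym' THEN 4 ELSE 5 END"
    )
```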

Version bump: Minor — new public API surface (OntologyRuntime, four resolver-related classes, four exception types). No existing behavior changes.

Existing user code: No deprecation. Users with their own resolution layers continue unaffected until they opt into the SDK primitive.


Sequencing (from the plan)

PR stack, in merge order:

  1. A1 — _fingerprint.py (feat(ontology): A1 — internal _fingerprint module for concept-index provenance #71, open).
  2. A2 — concept_index.py row builder.
  3. A3–A5 — compile_concept_index + inline-UNNEST SQL emission.
  4. A7 — CLI flags.
  5. A8 (partial) — docs/ontology/concept-index.md.
  6. B1–B7 — SDK read accessors + resolver Protocol + reference implementations (verification off as intermediate default).
  7. C1–C6 — verification layer (strict default on) + full shadow-swap implementation + all four exception types.
  8. Phase 4 — integration tests, examples/concept_index_quickstart.py, full docs.
  9. Phase 5 — contrib/ scaffolding (Yahoo advertising resolver, when contributed).

Each PR leaves main shippable.
