Status: Design proposal. Not yet implemented. Comments welcome — especially on the "Open questions" section at the bottom and on the agent-vs-SDK boundary.
Goal
Today the SDK's ontology pipeline stops at DDL compilation (gm compile emits CREATE PROPERTY GRAPH + table scaffolding). Runtime — the point where an agent receives a user/client input like format_ids: ["display_static"] or geo: ["San Francisco-Stockton-Modesto"] and needs to resolve it against a declared ontology — is left entirely to the application layer.
Feedback from a production user building agentic media buying on top of this SDK quantified the gap: ~85% of brief-validation value for their use case sits at runtime, not schema time. They implemented a 5-layer resolver (notation match → lexical → token-set equality → Jaccard → Levenshtein) on top of ~10K lines of TTL (274 SKOS concepts, 942 synonyms, 210 GAM DMA display names). It works — but every vertical building on the SDK will rewrite some version of this, and today there is no supported runtime surface for them to build against.
This issue proposes a small, opinion-light set of runtime primitives that make resolution implementable in application code without pushing domain-specific matching logic into the SDK core.
Guiding principle: SDK provides, agent decides
The SDK and the agent layer make different kinds of claims:
SDK: knows what's declared in the ontology. Entities, relationships, synonyms, notations, concept schemes, taxonomy structure. Stable, typed, queryable.
Agent: knows what's intended. Which matcher to try first, what confidence threshold is safe for this domain, how to phrase a "did you mean" suggestion, whether fuzzy match on a free-text company_name is acceptable or dangerous.
Consequences for the runtime:
The SDK exposes read access over loaded ontologies (annotations, synonyms, scheme membership, taxonomy edges). No matching logic.
The SDK optionally materializes an ontology-derived concept index into BigQuery so agents can do SQL-native fuzzy match using BQ's existing EDIT_DISTANCE / SOUNDEX / UDFs.
The SDK defines an EntityResolver protocol and ships two trivial references (ExactMatchResolver, SynonymResolver). Anything beyond exact-match lives outside core.
Domain-specific resolvers (advertising, healthcare, finance) live in contrib/ or user code, never in the runtime's required surface.
The SDK stays general. Verticals get a contract to build against instead of reaching into YAML or reconstructing structure from BQ tables.
Current gaps
No runtime accessor over loaded ontologies. load_ontology() returns Pydantic models, but there's no shape-agnostic API like rt.synonyms("DMA") or rt.annotation("DMA", "skos:notation"). Agents parse the model directly, which couples them to schema details the SDK otherwise hides.
Annotations are not queryable at runtime. Issue #57 (SKOS import support alongside OWL) proposes persisting SKOS annotations (skos:definition, skos:notation, skos:prefLabel, etc.) through import. Nothing today reads those annotations at runtime. They live in the YAML and die there.
No concept index. Synonyms and notations are scattered across per-entity YAML nodes. Agents that want to do SQL-level matching have to flatten this themselves at query time on every request.
No resolver interface. Every SDK user writes their own resolution entry point, with their own return type, with their own "did you mean" shape. No convention, no reuse.
Proposed primitives
1. OntologyRuntime — read accessor over loaded ontology + binding
Small, stateless, zero external dependencies at read time. Built on top of existing load_ontology() + load_binding().
Covers both concrete and abstract (SKOS-derived) entities and relationships. Abstract elements are first-class at the runtime layer — they're the whole reason users care about SKOS at runtime.
Entities are name-addressed. rt.entity(name), rt.synonyms(name), rt.annotation(name, key) are singular lookups — entity names remain globally unique.
Relationships are traversal-first, not name-addressed. Issue #57 (SKOS import support alongside OWL) relaxes relationship uniqueness to (name, from, to) for abstract relationships, so a single skos_broader can repeat across endpoint pairs. A hypothetical rt.relationship(name) would be unsafe because it has no single answer to return.
All relationship accessors take an entity and traverse. rt.broader(entity), rt.narrower(entity), rt.related(entity) return the set of entities reachable from the given starting point via the named predicate. That's a well-defined question regardless of how many skos_broader edges exist in the ontology.
If a relationship-by-name accessor is ever added, its contract must be compound identity (rt.relationship(name, from, to) -> Relationship | None) or list-returning (rt.relationships(name) -> list[Relationship]). Never singular-by-name.
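A minimal usage sketch of the accessor surface described above. The constructor keywords follow the load() example referenced later in this proposal; the exact return shapes are illustrative, not final:

# Hypothetical usage of the proposed read accessors — shapes are illustrative, not final.
from bigquery_agent_analytics import OntologyRuntime  # proposed export

rt = OntologyRuntime.load(ontology_path="ontology.yaml", binding_path="binding.yaml")

# Entities are name-addressed: singular lookups.
dma = rt.entity("DMA")                        # Entity model for the named entity
labels = rt.synonyms("DMA")                   # pref + alt + hidden labels
code = rt.annotation("DMA", "skos:notation")  # e.g. "807", or None if undeclared

# Relationships are traversal-first: start from an entity, follow the predicate.
parents = rt.broader("DMA")                   # entities reachable via skos_broader
members = rt.in_scheme("NielsenDMA")          # all concepts in a concept scheme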
2. Concept index materialization (opt-in)
At gm compile time, optionally emit a BigQuery sidecar table:
CREATE TABLE `{dataset}.ontology_concept_index` (
  entity_name STRING NOT NULL,
  label       STRING NOT NULL,  -- for label_kind='notation', this holds the notation value
  label_kind  STRING NOT NULL,  -- 'name' | 'pref' | 'alt' | 'hidden' | 'synonym' | 'notation'
  notation    STRING,           -- per-entity notation for display; repeats across rows of the same entity
  scheme      STRING,           -- concept scheme this row's entity belongs to; NULL means "entity is not a member of any scheme"
  language    STRING,           -- ISO-639 tag; NULL means unspecified or N/A (notation rows)
  is_abstract BOOL NOT NULL,    -- TRUE for SKOS-derived informational entities
  compile_id  STRING NOT NULL   -- pair-consistency tag; see "Provenance and compatibility contract"
);
notation is a first-class row kind. For every entity that has a skos:notation, the compiler emits a row with label_kind='notation' and label=<notation value> — so resolvers searching by label naturally catch notation matches without a separate OR notation = @input predicate. The notation column is kept as per-entity metadata that repeats across all rows of the same entity, for display convenience (a caller with a winning match can read the entity's notation directly from the candidate row without a separate lookup).
Row multiplicity contract:
One row per (entity_name, label, label_kind, language, scheme) membership tuple. A SKOS concept can legally belong to multiple skos:inScheme schemes (a DMA concept may be in both NielsenDMA and CensusMSA, a banking concept may be in both BankingTaxonomy and FinancialProductsTaxonomy). This is denormalized — a concept in 3 schemes × 5 labels produces 15 rows. Intentional; see below.
Entities that aren't members of any scheme produce rows with scheme IS NULL. They're still in the index; entity= resolution finds them, scheme= resolution skips them.
notation is per-entity (not per-scheme), so it repeats across membership rows for the same entity. Callers selecting a single notation per entity use DISTINCT notation or aggregate.
Why denormalized rather than ARRAY<STRING> scheme or a separate membership table:
WHERE scheme = @x stays a trivial clustered lookup — critical for the common scheme=<name> resolver path.
Predicate push-down into BQ clustering is straightforward; the clustering key (scheme, entity_name) stays usable.
ARRAY<STRING> forces WHERE @x IN UNNEST(scheme) on every scheme-scoped query, which is less indexable and harder for less-experienced SQL callers to write correctly.
A separate membership table adds a join to every resolver query, defeats the "one-table SQL lookup" simplicity that motivates the index.
Row multiplication is bounded: even for pathological multi-scheme ontologies, row count is linear in (concepts × labels × schemes), which stays tractable at BQ scale.
Agents do fuzzy match in SQL:
-- exact, scheme-scoped (the common case)
SELECT DISTINCT entity_name
FROM ontology_concept_index
WHERE scheme = @scheme AND LOWER(label) = LOWER(@input);

-- fuzzy fallback with BQ native functions
SELECT entity_name, MIN(EDIT_DISTANCE(LOWER(label), LOWER(@input))) AS dist
FROM ontology_concept_index
WHERE scheme = @scheme
  AND EDIT_DISTANCE(LOWER(label), LOWER(@input)) <= 3
GROUP BY entity_name
ORDER BY dist ASC
LIMIT 5;
The DISTINCT/GROUP BY on entity_name is how callers collapse the denormalized rows back to one result per matched concept.
Matches the SDK's agent-native ethos: any action an agent takes in SQL is something a user or another tool can also take. No new Python-only runtime, no new service, no new matcher implementation to maintain.
Opt-in. v1 ships with a CLI flag only: gm compile --emit-concept-index. Default off for users who don't need it. A binding-side toggle (index: concept_index block on Binding) was considered but deferred — it requires schema and loader changes in bigquery_ontology.binding_models + binding_loader.py that are worth scoping as their own change once the CLI behavior is settled. If v2 adds it, the explicit precedence rule will be: CLI flag overrides binding setting; binding setting serves as the project default when the CLI flag is absent.
Index population contract
The existing DDL compiler (src/bigquery_ontology/graph_ddl_compiler.py) only emits schema SQL — CREATE TABLE / CREATE PROPERTY GRAPH. A concept index needs rows, which is a new kind of output. This subsection names who writes those rows and when.
Who writes the rows: the ontology compiler itself, in the same gm compile invocation that emits the DDL. The index is a deterministic function of both the ontology YAML and the binding — see "What's in the index" below. Treating it as a separate build step creates two sources of truth and a refresh-skew class of bugs that the SDK shouldn't inherit.
What's in the index (scope relative to binding): compile_concept_index(ontology, binding) takes both inputs because the index respects the binding's subset semantics. Since a binding may legally realize only a subset of the declared ontology (binding_models.py:147), the compiler needs a rule for which entities participate in the index. The rule is:
All abstract entities from the ontology, regardless of binding — they're informational-only and never bound by construction (issue #57's binding rejection rule). Their value is precisely in being available for runtime resolution even when the agent's BQ tables don't materialize them.
Only concrete entities that are bound in this binding. Concrete + unbound entities are deliberately excluded from this deployment's runtime surface; including them would let a resolver return matches the agent then can't query. That's worse than a miss.
In short: abstract: always. Concrete: iff bound. This matches the SDK-level invariant from the adapter design ("every element in GraphSpec is bindable and has data") while preserving the taxonomy-browse value that abstract SKOS entities add at runtime.
Consequence: two different bindings over the same ontology produce different indexes. A narrow deployment binding only Account and Customer emits a smaller index than a wide deployment binding all 40 concrete entities, but both share the same abstract skos_Banking / skos_FinancialProduct / etc. nodes. Abstract relationships between abstract entities are always in scope; abstract relationships touching an unbound concrete entity are included (they're informational metadata, not runtime operations).
The is_abstract column in the index row lets resolvers filter at query time: a resolver that wants only runtime-materializable matches does WHERE NOT is_abstract; a resolver producing taxonomy-aware "did you mean" suggestions keeps both.
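A sketch of that inclusion rule, assuming illustrative model attributes (entity.is_abstract and a hypothetical bound-name helper on the binding; the real field names live in the Pydantic models):

# Sketch of the "abstract: always; concrete: iff bound" rule. Attribute names are illustrative.
def entities_in_index(ontology, binding):
    bound = set(binding.bound_entity_names())   # hypothetical helper over the binding model
    selected = []
    for entity in ontology.entities:
        if entity.is_abstract:
            selected.append(entity)             # informational-only: always indexed
        elif entity.name in bound:
            selected.append(entity)             # concrete and realized by this binding
        # concrete + unbound: excluded; a match here couldn't be queried downstream
    return selected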
Table naming contract: because two bindings against the same ontology produce legitimately different indexes, a single global table name is unsafe — the second compile would silently overwrite the first. The output table name is therefore a required parameter, not a fixed convention:
Both library and CLI error cleanly if the name is missing when --emit-concept-index is set. No silent global default. Users with a single binding per dataset pick any unique name they like (ontology_concept_index is fine); users with multiple bindings per dataset pick distinct names per binding (ontology_concept_index__retail, ontology_concept_index__investment_bank, etc.).
Why required rather than auto-derived:
Bindings do carry a binding: str identifier (binding_models.py:159), but it isn't a safe or stable source for a BQ table name: it's an identity tag for the binding document, not a deployment-unique BQ-legal identifier. Using it would couple operational naming to a field authors rename for non-operational reasons, and would collide across environments (dev/stage/prod) that share the same binding identity.
Hash-derived defaults like ontology_concept_index__{sha1(binding)[:8]} are collision-free but unreadable and change on every trivial binding edit — bad ergonomics for a table name that appears in user-written resolver SQL.
Explicit naming forces the deployment-operator-level decision at compile time, where it belongs.
OntologyRuntime reads the index via the same name the caller passed at compile time — runtime construction takes a matching concept_index_table: str parameter (or reads it from configuration) so lookups target the right table. The name is not stored on the ontology or binding model; it's a runtime/deployment concern.
Provenance and compatibility contract: because the table name is caller-supplied and binding-scoped, nothing in the data columns alone would catch a mismatched wiring like OntologyRuntime.from_models(ontology_A, binding_B, concept_index_table=table_C) where table_C was actually compiled from a different (ontology, binding) pair. Plausible-but-wrong matches are worse than no matches — the agent gets confident answers against stale or unrelated data.
The compiler therefore emits a sibling metadata table named {output_table}__meta, written in the same gm compile invocation. One row per compile:
CREATE OR REPLACE TABLE `{output_table}__meta` AS
SELECT * FROM UNNEST([
  STRUCT(
    'retail'           AS ontology_name,        -- from Ontology.name
    'sha256:abc123...' AS ontology_fingerprint, -- see "Fingerprint algorithm" below
    'sha256:def456...' AS binding_fingerprint,  -- same algorithm, over Binding model
    'my-project'       AS target_project,       -- from Binding.target.project
    'my_dataset'       AS target_dataset,       -- from Binding.target.dataset
    'gm-1.2.0'         AS compiler_version,     -- version of bigquery_ontology that compiled
    'a1b2c3d4e5f6'     AS compile_id            -- pair-consistency tag; deterministic from inputs
  )
]);
Sibling rather than embedded columns so the bulk of the index (the label/notation rows) stays lean.
Fingerprint algorithm: fingerprints are SHA-256 hashes over a canonical serialization of the validated Ontology / Binding Pydantic models — not over raw YAML text. Concretely:
Load YAML → validated model (existing load_ontology() / load_binding() path). Validation normalizes optional fields, default values, and type coercion.
Serialize the validated model to a canonical JSON form: keys sorted lexicographically at every nesting level, no extra whitespace, UTF-8, stable encoding of None / booleans / numbers, lists preserved in declaration order (list order is semantically meaningful in the ontology model — e.g., key columns).
Hash the resulting bytes with SHA-256, prefix with sha256:.
The same approach is used for both ontology and binding fingerprints, with one difference: ontology fingerprinting covers every field of the Ontology model. Binding fingerprinting covers every field of the Binding model except ephemeral annotations (if any are introduced later) — the binding's identity for the purpose of "does this index correspond to this binding" is its declared structure, not its documentation metadata.
Why model-based and not YAML-text-based:
Two semantically identical YAML documents with different formatting, comment placement, or emitter behavior must produce the same fingerprint. A strict verification gate that rejects non-semantic edits would be a constant source of false positives and would push operators to disable verification — worse than no verification.
Pydantic-validated models are already the canonical in-memory form the SDK works with (src/bigquery_agent_analytics/runtime_spec.py:199 and adjacent). Hashing at that layer matches the layer where the rest of the SDK's determinism lives.
The existing compile contract is already model-based (compile_graph(ontology, binding) -> str takes models, not YAML strings). Keeping fingerprint input at the same layer maintains consistency across compile output and runtime verification.
Two bindings produced from the same source YAML by different emitters (e.g., one with trailing newlines, one without) fingerprint identically. Two bindings that disagree on any declared field — entity names, target dataset, property types — fingerprint differently and correctly fail strict verification.
Canonicalization rules in brief (formal spec in the implementation):
Keys sorted at every nesting level (stable across Python dict iteration).
Model fields serialized via Pydantic's model_dump(mode="json", by_alias=False, exclude_none=False) so defaults materialize consistently.
Enum values serialized as their canonical string form, not member name.
None / missing-but-defaulted fields serialized as explicit null to distinguish "absent" from "defaulted."
List order preserved; no reordering of entity/relationship/property lists (order is semantically load-bearing).
Output encoded as UTF-8 JSON with separators=(",", ":") (no extra whitespace).
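Put together, the two helpers the proposal later pins in a private _fingerprint module reduce to a few lines. A sketch of the contract above, not the final implementation:

import hashlib
import json

def fingerprint_model(model) -> str:
    # Canonical JSON over the validated Pydantic model, then SHA-256.
    payload = model.model_dump(mode="json", by_alias=False, exclude_none=False)
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def compile_id(ontology_fp: str, binding_fp: str, compiler_version: str) -> str:
    # 12-hex-char pair-consistency tag, deterministic from compile inputs.
    material = (ontology_fp + binding_fp + compiler_version).encode("utf-8")
    return hashlib.sha256(material).hexdigest()[:12]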
OntologyRuntime runtime verification:
At construction, OntologyRuntime.load(...) / .from_models(...) computes the same fingerprints on the loaded Ontology and Binding models.
On first access to the concept index (lazy — construction doesn't hit BQ), the runtime reads the __meta sibling and compares fingerprints.
Mismatch raises ConceptIndexMismatchError with a clear message naming the expected vs actual fingerprints and the table name involved. The runtime refuses to return matches from an index that doesn't correspond to the loaded models.
Missing __meta sibling (e.g., a manually-created index or one compiled with an older toolchain) raises a distinct ConceptIndexProvenanceMissing — caller can explicitly opt out with OntologyRuntime(..., verify_concept_index="off") for read-only dashboards or interactive exploration.
Verification re-checks on a configurable TTL, not once-per-lifetime. See "Long-lived runtime verification" below.
Long-lived runtime verification (strict is strict for the whole lifetime, not just the first call). A naive "verify once then cache forever" contract would let a long-lived service sail past an index refresh that swapped in a different (ontology, binding) pair — returning matches against the new index while still believing it was verified. That defeats the "plausible-but-wrong matches are worse than no matches" argument behind the strict default.
The contract:
After the first successful verification, OntologyRuntime caches the expected compile_id, ontology_fingerprint, and binding_fingerprint on the instance.
On each resolve / validate call, the runtime checks whether the cached verification is still fresh under a configurable TTL (verify_ttl_seconds, default 60). If the cache is fresh, the call proceeds without a BQ round-trip.
If the cache is stale, the runtime re-runs the full pair-consistency check plus a full-fingerprint freshness check, not just a single-table sentinel. Concretely:
SELECT DISTINCT compile_id FROM {output_table} LIMIT 2 returns exactly one value.
SELECT compile_id, ontology_fingerprint, binding_fingerprint FROM {output_table}__meta LIMIT 1 — read compile_id and the full fingerprints.
Verify: meta.compile_id == cached.compile_id AND meta.ontology_fingerprint == cached.ontology_fingerprint AND meta.binding_fingerprint == cached.binding_fingerprint (full-fingerprint freshness).
Outcomes:
All checks hold → refresh the cache timestamp and proceed.
Pair consistent but any cached value differs from meta → raise ConceptIndexRefreshed. Service operator recreates OntologyRuntime with updated models; new instance's full fingerprint verification catches whether the new index matches or not.
Main and meta disagree (refresh in progress) → one-shot 2s retry, then raise ConceptIndexInconsistentPair. Same contract as first-load.
Why the sentinel must read both tables, not just meta. An earlier draft checked only meta.compile_id. That has a correctness hole: the inline refresh order is "main first, meta second," so during the swap window a reader could see the old meta compile_id (matches cache, accepted), then query the new main table, and serve data from the refreshed index under stale verification. Reading both tables on TTL re-check closes that window — main's compile_id is authoritative for "which compile does the data belong to," and the meta comparison catches inconsistent pairs.
Why the freshness check compares full fingerprints, not just compile_id. The compile_id column is a 12-hex-char truncation of sha256(ontology_fingerprint || binding_fingerprint || compiler_version) — 48 bits of entropy, chosen to keep the per-row compile_id column short (storage efficiency on a column that repeats across every data row). That's enough for first-pass pair consistency: two tables with different compile_ids definitely belong to different compiles, and the birthday bound on distinct compiles for a single (output_table) over realistic deployment lifetimes is comfortably below collision probability.
But "comfortably below" is not "zero," and a strict verification contract shouldn't rely on it. The meta row carries the fullontology_fingerprint and binding_fingerprint (SHA-256, 256 bits each) — storing those in a single-row meta table costs nothing. The TTL re-check therefore compares all three (compile_id + both full fingerprints) against the cache. A hypothetical 48-bit collision where a legitimately-different (ontology, binding) pair happens to share a 12-char prefix is caught because the full fingerprints won't match.
Pair consistency between the two tables still runs on the short compile_id — it only needs to detect "are these from the same compile or different compiles," and 48 bits is overkill for that single-dataset comparison. The strict freshness check runs on the full 256-bit fingerprints where the safety story demands it.
The three reads are still cheap. Main's SELECT DISTINCT compile_id FROM {output_table} LIMIT 2 reads at most two rows from a clustered column; meta reads exactly one row (and always has, just with more columns than before). Per-TTL-window cost remains negligible even at default 60s.
Configuration surface on OntologyRuntime construction:
verify_ttl_seconds: int = 60 — default 60. Balance between correctness-staleness window and re-verification cost.
verify_ttl_seconds=0 — check on every call. Useful for low-QPS services where correctness matters more than cost.
verify_ttl_seconds=None — snapshot-bound: verify once on first use, never again. Explicit opt-in for services that coordinate refresh out-of-band (e.g., rolling-restart on recompile). Matches the old "verify once" behavior for callers who want it.
Why TTL rather than check-every-call by default: the pair re-check is cheap but not free, and for high-QPS resolver workloads it adds up. A 60s staleness window matches typical service-refresh cadences while keeping per-call cost bounded at O(1) with no BQ hit in the common case.
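A sketch of how the TTL gate and the two-table re-check described above could compose inside the runtime. The query shapes come from the contract above; the BQ client helper and attribute names are illustrative:

import time

def _verify_if_stale(self) -> None:
    # Sketch; self._bq.query_rows is an illustrative client helper returning list[dict].
    if self._verify_ttl_seconds is None:                     # snapshot-bound: verify once, never again
        return
    if time.monotonic() - self._last_verified < self._verify_ttl_seconds:
        return                                               # cache fresh: no BQ round-trip

    # 1. Pair consistency: the data table must carry exactly one compile_id.
    ids = self._bq.query_rows(
        f"SELECT DISTINCT compile_id FROM `{self._index_table}` LIMIT 2")
    # 2. Full-fingerprint freshness from the meta sibling.
    meta = self._bq.query_rows(
        f"SELECT compile_id, ontology_fingerprint, binding_fingerprint "
        f"FROM `{self._index_table}__meta` LIMIT 1")[0]

    if len(ids) != 1 or ids[0]["compile_id"] != meta["compile_id"]:
        raise ConceptIndexInconsistentPair(ids, meta["compile_id"])   # one-shot retry omitted here
    if (meta["compile_id"] != self._cached_compile_id
            or meta["ontology_fingerprint"] != self._cached_ontology_fp
            or meta["binding_fingerprint"] != self._cached_binding_fp):
        raise ConceptIndexRefreshed(self._index_table)
    self._last_verified = time.monotonic()                   # all checks hold: refresh the window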
Pair-consistency contract (the two tables must agree on the same compile). Because {output_table} and {output_table}__meta are written as two separate CREATE OR REPLACE TABLE statements, a reader interleaved with a refresh could otherwise observe:
new meta + old data → strict verification would pass against stale data (plausible-but-wrong matches).
new data + old meta → strict verification would raise an incorrect mismatch.
To make the pair coherent without requiring DDL-level transactions (which BigQuery doesn't offer for CREATE OR REPLACE TABLE), both tables carry a compile_id tag that is derived deterministically from compile inputs — not a per-run UUID or a timestamped value:
(The first 12 hex chars are enough to make accidental collisions vanishingly unlikely while keeping the column short.)
compile_id STRING NOT NULL column on the main table — every row of {output_table} shares the same value.
compile_id field on the single __meta row — same value.
Write order: main table first, meta second. Readers never see "new meta promising data that doesn't exist yet."
Why deterministic rather than per-run:
Preserves the byte-identical output contract on compile_concept_index() (see Compiler output contract below). Two compiles of the same ontology + binding + compiler version produce character-identical SQL.
Pair consistency still works: interleaved compiles with different inputs produce different compile_ids and the runtime check catches the inconsistency. Interleaved compiles with identical inputs produce identical compile_ids and the data is also identical — the worst case is wasted work, not wrong data.
Callers auditing compile output in code review can diff it against the previous compile and see only the changes caused by ontology/binding edits, not a new UUID every run.
compiled_at is deliberately not in the emitted SQL. An earlier draft included a compiled_at TIMESTAMP field in the meta row; that's been removed to preserve byte-identical output. Operators who want compile timestamp visibility can read it from INFORMATION_SCHEMA.TABLES.creation_time on the __meta table, which BigQuery maintains automatically. The tradeoff is deliberate: runtime correctness (deterministic compile output, reviewable diffs) over embedded operator metadata that BQ already provides.
The runtime's check then runs as follows: SELECT * FROM {output_table}__meta LIMIT 1 → get expected_compile_id, expected fingerprints.
SELECT DISTINCT compile_id FROM {output_table} LIMIT 2 → verify exactly one compile_id is present and it equals expected_compile_id.
If compile_id mismatches or multiple distinct compile_ids are observed (which would indicate a broken compile), retry once after a short backoff (default 2 seconds) — handles the narrow interleaving window during normal refresh.
If the retry also fails, raise ConceptIndexInconsistentPair with both observed compile_ids. This is distinct from ConceptIndexMismatchError (which is a wiring/fingerprint error, not a timing one) so callers can handle them differently.
Once pair-consistency is established, fingerprint verification proceeds against the meta row.
The retry is deliberately one-shot and small: a legitimately long refresh window indicates operator misbehavior (concurrent compiles against the same table) and should fail loudly.
Compatibility flag vocabulary:
verify_concept_index="strict" (default) — fingerprint mismatch, missing meta, or persistent pair inconsistency all raise.
verify_concept_index="missing_ok" — fingerprint mismatch and pair inconsistency raise, missing meta warns and proceeds.
verify_concept_index="off" — no verification, purely caller-managed. Intended for explicit "I know what I'm doing" paths.
Rejected alternatives for pair consistency:
Transactional multi-statement BEGIN TRANSACTION; ... COMMIT;. BigQuery's transaction support doesn't cover CREATE OR REPLACE TABLE — DDL is generally non-transactional. Not a viable primitive.
Shadow-version both tables + atomic pointer-table swap. Three tables (main, meta, pointer) adds significant operational complexity for a narrow window. The compile_id approach gets the same correctness property with one extra column.
BQ table OPTIONS(description=...) tagging. Requires INFORMATION_SCHEMA lookups, has its own freshness semantics, more tooling surface. compile_id in a data column is simpler to query from plain SQL.
Rejected alternatives for the provenance storage shape:
Embed provenance as repeated columns on every index row. Wastes storage proportional to the number of rows; makes diff-based review noisier; still requires runtime verification logic. Sibling table is strictly better. (The compile_id column is a deliberate exception — it's a single short fixed-length tag needed for pair consistency, not full provenance.)
Encode provenance in BQ table OPTIONS(description=...). BQ-native and elegant, but INFORMATION_SCHEMA queries have their own cost and tooling constraints, and the sibling table approach is easier to inspect from plain SQL (SELECT * FROM concept_index__meta).
Caller-managed in v1 with no verification. Considered and rejected — shipping a primitive that silently produces wrong results under a plausible operator mistake is a feature bug the SDK shouldn't ship with. v1 ships strict verification on by default, with documented escape hatches.
How the rows reach BQ (atomic-swap semantics for runtime readers): the compiler emits a single CREATE OR REPLACE TABLE ... AS SELECT ... statement. In BigQuery this is atomic — concurrent readers see either the previous table's rows or the new table's rows, never an empty intermediate state. This is the critical difference from a DELETE + INSERT pair, which would expose a window where the index is queryable but empty. For a runtime lookup primitive, that window is a correctness hazard, not just a performance one.
-- gated on --emit-concept-index, name passed via --concept-index-table
-- write order: main table first, __meta second. compile_id ties the pair.
CREATE OR REPLACE TABLE `{output_table}` AS
SELECT * FROM UNNEST([
  STRUCT('DMA' AS entity_name, 'DMA' AS label, 'name' AS label_kind,
         '807' AS notation, 'NielsenDMA' AS scheme,
         CAST(NULL AS STRING) AS language,
         FALSE AS is_abstract,
         'a1b2c3d4e5f6' AS compile_id),
  STRUCT('DMA', 'Designated Market Area', 'synonym',
         '807', 'NielsenDMA', 'en', FALSE, 'a1b2c3d4e5f6'),
  STRUCT('DMA', 'Marché de diffusion désigné', 'pref',
         '807', 'NielsenDMA', 'fr', FALSE, 'a1b2c3d4e5f6'),
  -- first-class notation row: label holds the notation value, label_kind='notation'
  STRUCT('DMA', '807', 'notation',
         '807', 'NielsenDMA', CAST(NULL AS STRING), FALSE, 'a1b2c3d4e5f6'),
  -- multi-scheme example: same entity appearing in two schemes
  STRUCT('BayAreaMetro', 'Bay Area Metro', 'pref',
         CAST(NULL AS STRING), 'NielsenDMA', 'en', FALSE, 'a1b2c3d4e5f6'),
  STRUCT('BayAreaMetro', 'Bay Area Metro', 'pref',
         CAST(NULL AS STRING), 'CensusMSA', 'en', FALSE, 'a1b2c3d4e5f6'),
  -- abstract SKOS concept: informational, no scheme membership
  STRUCT('skos_Banking', 'Banking', 'pref',
         CAST(NULL AS STRING), CAST(NULL AS STRING), 'en', TRUE, 'a1b2c3d4e5f6'),
  ...
]);
The separate CREATE TABLE scaffold is not emitted — CREATE OR REPLACE TABLE creates the table on first run and atomically replaces it on every subsequent run. This collapses "ensure table exists" and "populate rows" into one statement, eliminating the empty-table-existing intermediate state entirely.
For ontologies with tens of thousands of concepts (Yahoo's YAMO example: 274 SKOS concepts × multiple labels ≈ ~1K-10K rows), inline UNNEST(ARRAY<STRUCT<...>>) is well within BigQuery's query-text limits. For ontologies above ~50K rows, the compiler emits a shadow-table swap pattern for both tables in the pair — the pair-consistency contract applies to the shadow path too:
-- both tables get a shadow; suffix is "_shadow" on each production name
CREATE OR REPLACE TABLE `{output_table}_shadow` (...);
INSERT INTO `{output_table}_shadow` VALUES (...);  -- batched, includes compile_id column
CREATE OR REPLACE TABLE `{output_table}__meta_shadow` AS
SELECT * FROM UNNEST([STRUCT(... 'a1b2c3d4e5f6' AS compile_id)]);

-- swap order: data first, then meta (matches the inline-path write order)
DROP TABLE IF EXISTS `{output_table}`;
ALTER TABLE `{output_table}_shadow` RENAME TO <short name from output_table>;
DROP TABLE IF EXISTS `{output_table}__meta`;
ALTER TABLE `{output_table}__meta_shadow` RENAME TO <short name from output_table>__meta;
Two distinct non-atomicity windows exist on this path:
Table-existence window: between DROP and RENAME on each table, that table name does not resolve. Readers get BigQuery's "table not found" error, which they must tolerate as transient.
Pair-inconsistency window: between "main renamed" and "meta renamed", the main table carries the new compile_id while the meta row still carries the old one. Readers in this window see compile_id disagreement → ConceptIndexInconsistentPair on the pair-consistency check.
The pair-inconsistency window on the shadow path can exceed the inline path's one-shot 2-second retry budget, because large-ontology rename operations take longer than small-ontology CREATE OR REPLACE TABLE statements. This means strict verification will raise during a legitimate shadow-path refresh if it happens to sample during the swap. That's by-design, not a defect: strict verification correctly rejects an inconsistent pair even when the inconsistency is transient. The alternative (silently serving old data against a new meta, or vice versa) is the failure mode strict verification exists to prevent.
Operational contract for the shadow path:
Treat shadow-path refreshes as offline/admin operations. Pause reader traffic (or accept ConceptIndexInconsistentPair exceptions) during gm compile runs that hit the shadow path.
If traffic cannot be paused, the caller has two options, neither of which involves missing_ok (which per the verification-mode contract above still raises on pair inconsistency — transient or not):
Increase verify_ttl_seconds so the pair re-check samples less frequently. Reduces the probability of landing inside a swap window at the cost of a longer staleness tolerance.
Catch ConceptIndexInconsistentPair at the application layer and retry the call after a short delay. Cleanest at the service-mesh level where transient 5xx handling already exists.
For services where neither is acceptable: bind the main + meta pair under a higher-level indirection (a separate {output_table}__current pointer table that callers resolve through). Not shipped in v1 — out of scope as a third level of indirection, tracked as follow-up work if real users hit this constraint.
This limitation is specific to the shadow path. The inline-UNNEST path (the default for ontologies under 50K rows, covering the motivating use cases including Yahoo's YAMO) remains fully atomic per-statement and doesn't exhibit either window.
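For the second option in the operational contract above, a minimal application-layer wrapper might look like this. The exception name comes from this proposal; the retry budget is the caller's choice:

import time

def resolve_tolerating_refresh(resolver, value, *, scheme, attempts=3, delay_s=2.0):
    # Sketch: retry a resolve call across a transient shadow-path swap window.
    for attempt in range(attempts):
        try:
            return resolver.resolve(value, scheme=scheme)
        except ConceptIndexInconsistentPair:
            if attempt == attempts - 1:
                raise                    # swap window outlasted the retry budget: fail loudly
            time.sleep(delay_s)          # wait out the main/meta rename gap, then retry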
When refresh happens: on every gm compile run with --emit-concept-index. No incremental build. If the user edits ontology YAML, they re-run compile, same as any other DDL change. This matches the compile model users already have for schema changes and avoids adding a second refresh command.
Alternatives considered and rejected:
Separate gm build-concept-index step. Adds a command users have to remember and introduces drift between "DDL is up to date" and "index is up to date." Two invocations for one conceptual change.
Runtime lazy-build. Rebuild the index in memory on OntologyRuntime.load() and optionally push to BQ. Surfaces inconsistent state when multiple agent instances load simultaneously and makes the "query the index in SQL" path unreliable until someone has pushed.
Streaming incremental updates. Possible future work for ontologies with externally-sourced concept rolls. Out of scope for the initial primitive.
Failure modes:
Inline CREATE OR REPLACE TABLE path: if the statement fails (quota, permissions, query text too large), gm compile errors with a message naming the concept index. Because CREATE OR REPLACE is atomic, there is no half-written state — the previous table (if any) remains queryable, or no table exists at all. The user can re-run compile without cleanup.
Shadow-table swap path: failure mid-swap can leave the pair in an inconsistent state (main renamed but meta not, or either table dropped but not renamed). gm compile retry detects the orphaned _shadow table(s) and resumes from the swap step. Runtime readers during the orphaned window get either "table not found" or ConceptIndexInconsistentPair — both expected transient conditions on the shadow path, tolerated via the operational contract above (pause traffic during refresh, or accept transient failures).
3. EntityResolver Protocol + two reference implementations
Interface-only in core:
from typing import Optional, Protocol
from dataclasses import dataclass


@dataclass
class Candidate:
    entity_name: str        # unique per Candidate in ResolveResult.candidates
    label: str              # the winning label that produced this match
    label_kind: str         # 'name' | 'pref' | 'alt' | 'hidden' | 'synonym' | 'notation'
    scheme: Optional[str]   # scheme the winning match came through (None = entity-scoped or no scheme)
    confidence: float
    reason: str             # 'exact' | 'notation' | 'synonym' | 'fuzzy' | 'none'


@dataclass
class ResolveResult:
    match: Optional[str]         # resolved entity_name (None = no match)
    confidence: float            # 0.0 - 1.0; 1.0 = exact
    candidates: list[Candidate]  # top-k "did you mean" suggestions, one per entity
    reason: str                  # why `match` resolved (or 'none')


class EntityResolver(Protocol):
    def resolve(
        self,
        value: str,
        *,
        scheme: str | None = None,
        entity: str | None = None,
        limit: int = 5,
    ) -> ResolveResult: ...
scheme and entity are mutually exclusive — see "Scope semantics" in the Library API impact section below. Exactly one must be provided.
Candidate dedup contract (important once the index is denormalized per (entity_name, label, label_kind, language, scheme)):
ResolveResult.candidates contains at most one entry per entity_name. The denormalized index naturally produces multiple matching rows for the same entity (same entity, different label or different scheme). For an agent-facing "did you mean" list, duplicates are noise — the agent wants a list of distinct concepts, each annotated with the best evidence for why it matched.
limit=N means N distinct entities, not N raw rows. Resolvers do the dedup before truncating.
Winning-label rule when the same entity matches through multiple rows: pick the row with the highest confidence under the resolver's matching rule. Ties broken by label_kind priority, in this order: name > pref > alt > hidden > synonym > notation. Further ties broken by lexicographic label order for determinism.
Candidate.label / Candidate.label_kind / Candidate.scheme / Candidate.reason reflect the winning row. The other rows that also matched are discarded — callers wanting the full provenance use the concept index directly via SQL.
reason values are resolver-defined but drawn from a shared vocabulary so callers can branch on them without ambiguity: exact (name match), notation (notation match), synonym (any label other than name), fuzzy (non-exact match produced by a fuzzy resolver), none (no match found — only valid on the ResolveResult.reason, not on Candidate.reason).
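A sketch of the dedup and winning-row rule described in the bullets above, assuming the resolver has already scored each matching index row; the (confidence, reason, row) shape and the scoring itself are illustrative:

# Collapse denormalized index rows into at most one Candidate per entity_name.
_LABEL_KIND_PRIORITY = {"name": 0, "pref": 1, "alt": 2, "hidden": 3, "synonym": 4, "notation": 5}

def dedupe_candidates(scored_rows, limit=5):
    # scored_rows: iterable of (confidence, reason, row); row is a dict of concept-index columns.
    best = {}
    for confidence, reason, row in scored_rows:
        rank = (-confidence, _LABEL_KIND_PRIORITY[row["label_kind"]], row["label"])
        key = row["entity_name"]
        if key not in best or rank < best[key][0]:
            best[key] = (rank, confidence, reason, row)
    ranked = sorted(best.values(), key=lambda item: item[0])[:limit]   # N distinct entities, not N rows
    return [
        Candidate(entity_name=row["entity_name"], label=row["label"],
                  label_kind=row["label_kind"], scheme=row["scheme"],
                  confidence=confidence, reason=reason)
        for _, confidence, reason, row in ranked
    ]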
SDK ships two references in core:
ExactMatchResolver — O(1) lookup against name + skos:notation. Confidence is 1.0 or 0.0. Good for notation-heavy inputs (Nielsen DMA codes, Google Ads Criteria IDs).
SynonymResolver — extends ExactMatchResolver by also matching against prefLabel / altLabel / hiddenLabel / synonyms. Still exact on each label; still confidence 1.0 or 0.0.
Everything above exact-match — token-set equality, Jaccard, Levenshtein, phonetic, weighted ensembles — lives in user code or contrib/ packages. Verticals pick (or write) a resolver tuned for their domain.
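For concreteness, a rough shape of the exact-match reference against the pure-Python accessors. This is a sketch, not the shipped implementation; a real one would precompute a lookup map to keep the O(1) claim:

class ExactMatchResolver:
    # Sketch: exact match on entity name + skos:notation, scheme- or entity-scoped.
    def __init__(self, rt):                       # rt: OntologyRuntime
        self._rt = rt

    def resolve(self, value, *, scheme=None, entity=None, limit=5):
        if (scheme is None) == (entity is None):
            raise ValueError("exactly one of scheme= or entity= must be provided")
        targets = self._rt.in_scheme(scheme) if scheme else [self._rt.entity(entity)]
        needle = value.strip().lower()
        for e in targets:
            notation = self._rt.annotation(e.name, "skos:notation")
            if needle == e.name.lower():
                reason, label, kind = "exact", e.name, "name"
            elif notation is not None and needle == notation.lower():
                reason, label, kind = "notation", notation, "notation"
            else:
                continue
            # limit is effectively unused: exact match yields at most one candidate.
            cand = Candidate(entity_name=e.name, label=label, label_kind=kind,
                             scheme=scheme, confidence=1.0, reason=reason)
            return ResolveResult(match=e.name, confidence=1.0, candidates=[cand], reason=reason)
        return ResolveResult(match=None, confidence=0.0, candidates=[], reason="none")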
4. validate_against_ontology — small validation helper
Not resolution — just pass/fail against the declared ontology. Return shape is bounded by design so it stays useful on large concept schemes (IAB Taxonomy, Nielsen DMAs, SNOMED excerpts) where the candidate universe is hundreds to tens of thousands of entries:
rt.validate(
    {"format_ids": ["display_static", "display_banner"]},
    scheme="AdFormat",   # see "Scope semantics" below
    sample_limit=10,     # default — cap on known_values_sample
)
# → ValidationResult(
#     valid=["display_banner"],
#     invalid=["display_static"],
#     known_value_count=47,
#     known_values_sample=["display_banner", "display_native", ...],  # up to sample_limit
#     candidates=None,  # populated only when composed with a resolver
# )
Agents combine validate_against_ontology with a resolver to produce "did you mean." The SDK doesn't match; it only knows what exists.
Design notes on the return shape:
known_value_count is always the full count. Tells the caller whether the sample is representative.
known_values_sample is capped at sample_limit (default 10). Enough for a "did you mean" hint without bloating every validation miss on a 10K-concept scheme. Callers who genuinely need the full set use rt.in_scheme(...) or rt.entities() — that's what those accessors are for.
candidates stays None unless the caller composes validation with a resolver. Keeps validate pure set-membership; keeps ranking logic in resolver-land. No double-duty.
Sample order is not specified by the contract — callers should not rely on alphabetical or any other ordering. If deterministic ordering matters for a specific use, pass a sorted known_values_sample through a resolver that ranks.
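A sketch of that composition: validation stays pure set-membership, and the agent layers a resolver on top of the misses (rt and resolver wiring as in the earlier examples):

result = rt.validate({"format_ids": ["display_static", "display_banner"]}, scheme="AdFormat")

suggestions = {}
for bad_value in result.invalid:
    resolved = resolver.resolve(bad_value, scheme="AdFormat", limit=3)
    suggestions[bad_value] = [c.entity_name for c in resolved.candidates]

# e.g. {"display_static": ["display_banner", "display_native", ...]}
# The agent, not the SDK, decides how (or whether) to phrase this back to the user.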
Library API impact
This section pins down the parts of the proposal that touch existing public APIs, so they're clear before implementation starts.
Compiler output contract
The existing bigquery_ontology.compile_graph(ontology, binding) -> str is documented to be deterministic — "same inputs → byte-identical text." That contract is preserved. Concept-index emission does not modify compile_graph().
The new sibling, compile_concept_index(), extends the compile_graph() byte-identical contract in the same spirit: same inputs → byte-identical DML text, including row order.
"Same inputs" for compile_concept_index() = (ontology, binding, output_table, compiler_version). Everything in the emitted SQL is derived from those four values:
compile_id is sha256(ontology_fingerprint || binding_fingerprint || compiler_version)[:12] — deterministic.
No per-run timestamps, UUIDs, or process identifiers appear in the emitted SQL. Compile timestamps are recoverable from INFORMATION_SCHEMA.TABLES.creation_time on the emitted tables.
Row order is determined by the sort key below, applied before SQL generation.
Rows are sorted by a stable key before SQL generation: (scheme, entity_name, label_kind, language, label, notation, is_abstract), with NULLs ordered last consistently per column. is_abstract is last because it's determined by entity_name — included only for defensive stability if the invariant ever loosens. This sort order guarantees that two invocations of compile_concept_index() on the same ontology + binding emit character-identical SQL — critical for diffing compile output in code review, caching compiled artifacts, and verifying that ontology edits produced only the expected row changes.
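A sketch of that ordering in Python terms, with None standing in for NULL (row attributes mirror the index columns):

def _null_last(value):
    # NULLs (None) order after every real value, consistently per column.
    return (value is None, value if value is not None else "")

def sort_rows(rows):
    return sorted(rows, key=lambda r: (
        _null_last(r.scheme),
        _null_last(r.entity_name),
        _null_last(r.label_kind),
        _null_last(r.language),
        _null_last(r.label),
        _null_last(r.notation),
        r.is_abstract,          # determined by entity_name; kept for defensive stability
    ))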
Library callers who want only DDL keep calling compile_graph() as today; callers who want the concept index call compile_concept_index() for a separate DML script. The CLI layer composes the two:
# CLI behavior for `gm compile --emit-concept-index`
sql_parts = [compile_graph(ont, binding)]
if args.emit_concept_index:
    sql_parts.append(
        compile_concept_index(ont, binding, output_table=args.concept_index_table)
    )
print("\n\n-- concept index --\n\n".join(sql_parts))
Why a sibling and not a composed option:
Preserves the byte-identical contract on compile_graph().
No breaking change to existing callers.
Each function has one job; easier to test, version, and reason about.
CLI callers with shell-orchestrated pipelines can write DDL and DML to separate files if they want — composition stays a caller concern.
Rejected alternatives:
compile_graph(..., emit_concept_index=True) returning concatenated DDL+DML — breaks the byte-identical contract for one config mode and creates a function whose return value depends on a flag.
Return an object (CompileResult(ddl=..., dml=...)) — breaks every existing caller of compile_graph() that treats the return as a string.
CLI-only (no library-layer API for the index DML) — forces library users to reimplement concept-index generation themselves, defeats the point of the primitive.
OntologyRuntime construction: paths and models
The example in section 1 shows OntologyRuntime.load(ontology_path=..., binding_path=...). But existing SDK code already carries validated Ontology and Binding models around in memory (e.g., src/bigquery_agent_analytics/runtime_spec.py:199 passes models directly). Forcing callers to reparse YAML or round-trip through disk would be a step backward.
load() is the convenience path for one-off scripts and the CLI. from_models() is the integration path for the SDK's existing flows — runtime_spec, ontology_orchestrator, adapters downstream of load_ontology() can all wrap without touching disk again.
Internal implementation: load() calls load_ontology() + load_binding() then delegates to from_models(). Zero code duplication.
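The two construction paths side by side (a sketch; the loader call signatures and the table name are illustrative):

# Convenience path: one-off scripts and the CLI load from YAML on disk.
rt = OntologyRuntime.load(
    ontology_path="ontology.yaml",
    binding_path="binding.yaml",
    concept_index_table="my_dataset.ontology_concept_index__retail",
)

# Integration path: SDK flows that already hold validated models in memory.
ontology = load_ontology("ontology.yaml")    # existing loaders from bigquery_ontology
binding = load_binding("binding.yaml")
rt = OntologyRuntime.from_models(
    ontology, binding,
    concept_index_table="my_dataset.ontology_concept_index__retail",
)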
Scope semantics: scheme vs entity
Resolvers and validate() need an explicit target set. Two mutually-exclusive named parameters, no polymorphism:
# Scheme-scoped: resolve/validate against all members of a concept scheme.
# Most common case — this is what the motivating examples (AdFormat, DMA, IAB) want.
rt.validate({"dma": ["Nielsen 807"]}, scheme="NielsenDMA")
resolver.resolve("San Francisco-Oakland", scheme="NielsenDMA")

# Entity-scoped: resolve/validate against a single named entity. Identity check only.
# Rare — used when you want "is this exactly this one entity?" rather than
# "is this a member of a taxonomy?"
rt.validate({"customer_id": ["C-42"]}, entity="Customer")
Rules:
Exactly one of scheme or entity must be provided. Passing both or neither is an error with a clear message.
scheme=<name> resolves against the set {e : e.in_scheme(name) or (e.name == name and e.is_abstract_scheme_root)}. This is the motivating case. Works for both explicit SKOS concept schemes and abstract entities that act as taxonomy roots.
entity=<name> resolves against the singleton set {e : e.name == name}. Identity check. Returns match iff the input exactly matches the entity's name, notation, or a declared label/synonym.
Narrower-closure scoping (e.g., "all narrower-than some abstract node") is explicitly deferred. When the need surfaces, it'll come back as scope=Scope.narrower_closure(name) or similar, without changing the meaning of scheme and entity.
Why not polymorphic ("entity= means scheme-scoped if it's a scheme, entity-scoped otherwise"):
Two implementers following the spec would return different answers for the same call, depending on their interpretation of the ontology's structure.
Ontology authors who later change an entity from concrete to abstract-scheme-root would silently change the semantics of every entity= call targeting it.
Callers would need ontology knowledge to predict what a given entity= call does — defeats the point of a stable API.
Explicit parameters keep the contract boring and predictable.
Non-goals
Ship a general string-matching library. BQ already has EDIT_DISTANCE, SOUNDEX, JACCARD UDFs. If the concept index is materialized, users get these for free. Don't wrap.
Ship the 5-layer resolver in core. Token-set equality thresholds, Jaccard coefficients, Levenshtein cutoffs — all domain-tuned. Advertising's tuning for DMAs is not the right tuning for SNOMED or legal-entity names. The feedback author's resolver is valuable as a reference for their domain and belongs in contrib/ or a separate package.
Promise a <50ms SLA. Latency is a function of index size and resolver choice, both of which vary by user. The SDK can guarantee the primitive shapes; it can't guarantee the performance of every application that uses them.
Provide a concept-scheme browser UI. Out of scope — this is an analytics SDK, not an ontology editor.
Take a position on "did you mean" phrasing. The SDK returns structured candidates; the agent composes user-facing copy.
How this lands on top of existing code
Using the current SDK's module boundaries:
| Piece | Belongs in | Notes |
| --- | --- | --- |
| OntologyRuntime class with load() + from_models() classmethods | bigquery_agent_analytics/ontology_runtime.py (new) | Wraps load_ontology + load_binding from bigquery_ontology. Pure Python, no BQ calls. Both construction paths share one implementation. |
| compile_graph() (existing) | bigquery_ontology/graph_ddl_compiler.py | Unchanged. Preserves byte-identical contract. |
| compile_concept_index() (new sibling) | bigquery_ontology/graph_ddl_compiler.py (new function) | Separate deterministic DML emitter. CLI composes with compile_graph() when --emit-concept-index is set. |
| EntityResolver Protocol + references | bigquery_agent_analytics/entity_resolver.py (new) | Core SDK layer. Protocol + two implementations. Both accept scheme= or entity= (mutually exclusive). |
| validate_against_ontology | Method on OntologyRuntime | Same scheme= / entity= scope parameters. |
| Domain packs and layered resolvers | bigquery_ontology/contrib/ or external packages | Advertising, healthcare, finance. Never in core. |
Changes to existing modules are limited but not zero. Most of the proposal is additive (new files, new functions, new classmethods). Two concrete edits to existing code are needed:
src/bigquery_ontology/cli.py:299 — the existing compile command gains --emit-concept-index and --concept-index-table <name> flags. When --emit-concept-index is set, the command composes the existing compile_graph() output with compile_concept_index(..., output_table=...). Without the flag, the command's behavior is byte-identical to today.
src/bigquery_ontology/graph_ddl_compiler.py — adds the new compile_concept_index() function in the same module. compile_graph() itself is not modified.
The runtime accessor (OntologyRuntime) reads the same Ontology/Binding models already loaded today — that path is purely additive in the SDK package.
This proposal depends on issue #57 landing first, because the concept index's value comes almost entirely from SKOS annotations (skos:notation, skos:prefLabel, skos:altLabel, skos:broader) being preserved through import. Without #57, the concept index is a thin wrapper over entity names and existing synonyms — useful but not transformative.
Specifically:
skos:notation in annotations → notation column in concept index → L1 code match becomes trivial
skos:prefLabel / altLabel / hiddenLabel → rows in concept index with label_kind discriminator → L2 lexical becomes trivial
skos_broader abstract relationships → rt.broader() traversal → taxonomy-aware "did you mean a parent or sibling"
Abstract entities with skos_ prefix → rt.in_scheme() enumerates all concepts in a taxonomy → agent can present the scheme to the LLM as context
Open questions — feedback wanted
Is OntologyRuntime the right wrapper, or should the accessors live as methods on Ontology / Binding directly? Pro-wrapper: keeps bigquery_ontology pure-data and the runtime layer in bigquery_agent_analytics. Pro-direct: fewer classes to learn. Proposal leans wrapper — the accessor layer is SDK-runtime concern, not ontology-package concern.
Should the concept index be opt-in or opt-out? Pro-opt-in: users who don't need it don't pay storage. Pro-opt-out: users discover the primitive because it just exists. Proposal leans opt-in: no silent BQ table creation.
Should OntologyRuntime cache the concept index in memory for pure-Python access, or always go to BQ? Pro-memory: fast, no BQ cost, works offline. Pro-BQ-only: always consistent with DDL, scales to ontologies with 100K+ concepts. Proposal: pure-Python by default for ontologies under some size threshold; explicit BQ-backed resolver for large ones.
Does EntityResolver need an async variant? Resolution against a BQ-backed index is I/O. Proposal: ship sync; add async later if users ask.
Should the SDK ship a richer FuzzyResolver reference (just exact + prefix, not full 5-layer) so users have a middle option? Proposal: no — either exact or bring-your-own. Avoids the "SDK partially solves fuzzy matching" trap where the reference becomes everyone's default despite being domain-unaware.
Should the Protocol be typing.Protocol or an ABC? Protocol allows duck typing; ABC forces inheritance. Proposal: Protocol — matches modern typing conventions and doesn't force users to inherit.
Should rt.validate() also return a nearest field when values are invalid? Would require calling a resolver inside validate, coupling the two. Proposal: no — keep validate pure set-membership, let callers compose it with a resolver.
Concept index: do we need a per-row score or priority for when multiple labels map to the same entity? Some verticals (IAB) prefer one label over another as the "canonical" display form. Proposal: defer — label_kind (name vs pref vs alt) already lets callers prioritize. Add score if needed.
Is contrib/ the right home for domain resolvers, or should they be separate packages? Pro-contrib: easy discovery, versioned together. Pro-separate: community can ship without depending on SDK releases. Proposal: contrib for reference implementations (advertising, healthcare); external packages for user-owned domains.
Should narrower-closure scoping ship in v1? The current proposal settled on two explicit parameters — scheme= for concept-scheme membership and entity= for single-entity identity. A third mode (narrower-closure: "resolve against all entities narrower-than some abstract node") is deferred. Advertising taxonomies nest (IAB Tier 1 → Tier 2), and a caller may want to resolve against the subtree under a specific abstract node rather than a flat scheme. Proposal: ship scheme= and entity= only in v1; add scope=Scope.narrower_closure(name) in v2 if real callers need it. For most cases, scheme membership plus rt.narrower(entity) traversal covers the need without a new API.
Please comment if you have opinions, real-world resolver implementations you'd like to see supported, or disagreements about where the SDK/agent boundary should sit.
Ontology package (bigquery_ontology) — changes
New files (all under src/bigquery_ontology/):
_fingerprint.py (internal — underscore prefix). Single source of truth for model fingerprinting and the compile_id pair-consistency tag. Two functions: fingerprint_model(model) -> "sha256:<64 hex>" and compile_id(ont_fp, bnd_fp, compiler_version) -> "<12 hex>". Contract pinned in docstring (W1): model_dump(mode="json", by_alias=False, exclude_none=False) → json.dumps(sort_keys=True, separators=(",",":"), ensure_ascii=False) → SHA-256. Not re-exported; both packages import via from bigquery_ontology._fingerprint import .... Landed in PR #71 (feat(ontology): A1 — internal _fingerprint module for concept-index provenance).
concept_index.py (module importable but not re-exported in v1). Row builder. Function: build_rows(ontology, binding) -> list[ConceptIndexRow]. Applies the "abstract always included, concrete iff bound" rule. Emits one row per (entity_name, label, label_kind, language, scheme) membership tuple, plus one notation row per skos:notation. Sorts deterministically by (scheme, entity_name, label_kind, language, label, notation, is_abstract) with NULLs last. Package-level re-export may be added later; kept out of the root for v1 to avoid growing semver surface ahead of need.
Modified files:
graph_ddl_compiler.py — gains a new public function compile_concept_index(ontology, binding, *, output_table) -> str alongside the existing compile_graph(). compile_graph() contract is preserved byte-identically; the existing function body is not touched. compile_concept_index() emits two statements by default: CREATE OR REPLACE TABLE {output_table} AS SELECT * FROM UNNEST([STRUCT(...), ...]) for the main index and a matching CREATE OR REPLACE TABLE {output_table}__meta AS SELECT * FROM UNNEST([STRUCT(...)]) for the meta sibling. Shadow-swap fallback activates at > 50K rows. Every row in both tables carries the same compile_id; the meta row additionally carries full ontology_fingerprint and binding_fingerprint.
cli.py:299 (the compile command) — gains two new flags: --emit-concept-index (boolean) and --concept-index-table <fqn> (required when --emit-concept-index is set — no silent global default). When both flags are absent, command output is byte-identical to today. No other CLI flags change.
__init__.py — adds from .graph_ddl_compiler import compile_concept_index so the new public function is importable as from bigquery_ontology import compile_concept_index, matching the existing compile_graph re-export. No other exports change. _fingerprint stays unexported.
binding_models.py — no changes in v1. A binding-side index: opt-in block was considered and deferred to v2; precedence rule documented in the plan when/if it lands.
Produces: the existing CREATE PROPERTY GRAPH DDL + two concept-index tables (ontology_concept_index and ontology_concept_index__meta). Re-running the same command produces byte-identical SQL — the compile_id is deterministic from inputs (no timestamps, no UUIDs).
Version bump:Minor — new public function (compile_concept_index) and new CLI flags. Existing API byte-identical.
SDK package (bigquery_agent_analytics) — changes
New files (all under src/bigquery_agent_analytics/):
ontology_runtime.py. Hosts OntologyRuntime (the read accessor wrapper), the verification machinery (first-call + TTL re-check), and all four exception classes (ConceptIndexMismatchError, ConceptIndexProvenanceMissing, ConceptIndexInconsistentPair, ConceptIndexRefreshed). OntologyRuntime exposes two constructors — .load(ontology_path, binding_path, ...) and .from_models(ontology, binding, ...) — both routing through one shared implementation.
entity_resolver.py. Hosts the EntityResolverProtocol (not ABC — duck-typed for modern typing), the Candidate and ResolveResult dataclasses, and two reference implementations: ExactMatchResolver (name + notation) and SynonymResolver (extends exact with label-based match). Candidate dedup: one candidate per entity, winning-label priority (name > pref > alt > hidden > synonym > notation, lexicographic tiebreaker), limit=N returns N distinct entities.
Modified files:
__init__.py — adds to the existing try/except re-export block (same pattern as Client, CodeEvaluator, etc.):
OntologyRuntime — from .ontology_runtime
EntityResolver, ExactMatchResolver, SynonymResolver, Candidate, ResolveResult — from .entity_resolver
ConceptIndexMismatchError, ConceptIndexProvenanceMissing, ConceptIndexInconsistentPair, ConceptIndexRefreshed — from .ontology_runtime
Unchanged:
All other SDK modules. The runtime accessor layer is strictly additive.
Read accessors on OntologyRuntime (pure-Python, no BQ round-trip):
Method
Returns
Notes
entities()
list[str]
Names of concrete + abstract entities
entity(name)
Entity
With annotations, synonyms, abstract flag
synonyms(name)
list[str]
Pref + alt + hidden labels
annotation(name, key)
str | None
E.g. skos:notation, skos:definition
in_scheme(scheme_name)
list[Entity]
Concepts in a skos:ConceptScheme
broader(name)
list[Entity]
skos:broader traversal
narrower(name)
list[Entity]
Inverse
related(name)
list[Entity]
skos:related traversal
Identity rules: entities are name-addressed (singular lookup); relationships are traversal-first, not name-addressed — a single skos_broader can repeat across endpoint pairs after #62's relaxed uniqueness, so a hypothetical rt.relationship(name) would have no single answer.
scheme= and entity= are mutually exclusive; neither or both = ValueError. Bounded output via known_value_count + known_values_sample. candidates is None unless a resolver is explicitly composed by the caller.
Verification configuration (on construction):
Parameter
Default
Notes
verify_concept_index
"strict"
"strict" (raises on any provenance issue), "missing_ok" (tolerates missing meta), "off" (disables verification entirely — for read-only dashboards)
Construction — OntologyRuntime.load(...) / .from_models(...) computes local ontology_fingerprint and binding_fingerprint (both full SHA-256). No BQ round-trip.
First concept-index access (lazy — not on construction) — reads the __meta sibling, compares fingerprints. Mismatch → ConceptIndexMismatchError. Missing meta → ConceptIndexProvenanceMissing.
TTL re-check (each resolve / validate call past the TTL window) — runs two queries:
SELECT DISTINCT compile_id FROM {output_table} LIMIT 2 — asserts exactly one value (pair consistency).
Fingerprints drift from cache = ConceptIndexRefreshed.
The TTL re-check reading both tables with full fingerprints is a W2 watchpoint in the plan — a single-table sentinel or short-compile-id-only comparison reintroduces either the meta/main race or the 48-bit collision hole.
Both reference resolvers query the concept index via BigQuery; ExactMatchResolver uses WHERE label = @input and SynonymResolver composes with label_kind preference ordering.
Version bump:Minor — new public API surface (OntologyRuntime, four resolver-related classes, four exception types). No existing behavior changes.
Existing user code: No deprecation. Users with their own resolution layers continue unaffected until they opt into the SDK primitive.
Goal
Today the SDK's ontology pipeline stops at DDL compilation (
gm compileemitsCREATE PROPERTY GRAPH+ table scaffolding). Runtime — the point where an agent receives a user/client input likeformat_ids: ["display_static"]orgeo: ["San Francisco-Stockton-Modesto"]and needs to resolve it against a declared ontology — is left entirely to the application layer.Feedback from a production user building agentic media buying on top of this SDK quantified the gap: ~85% of brief-validation value for their use case sits at runtime, not schema time. They implemented a 5-layer resolver (notation match → lexical → token-set equality → Jaccard → Levenshtein) on top of ~10K lines of TTL (274 SKOS concepts, 942 synonyms, 210 GAM DMA display names). It works — but every vertical building on the SDK will rewrite some version of this, and today there is no supported runtime surface for them to build against.
This issue proposes a small, opinion-light set of runtime primitives that make resolution implementable in application code without pushing domain-specific matching logic into the SDK core.
Guiding principle: SDK provides, agent decides
The SDK and the agent layer make different kinds of claims:
company_nameis acceptable or dangerous.Consequences for the runtime:
EDIT_DISTANCE/SOUNDEX/ UDFs.EntityResolverprotocol and ships two trivial references (ExactMatchResolver,SynonymResolver). Anything beyond exact-match lives outside core.contrib/or user code, never in the runtime's required surface.The SDK stays general. Verticals get a contract to build against instead of reaching into YAML or reconstructing structure from BQ tables.
Current gaps
No runtime accessor over loaded ontologies.
load_ontology()returns Pydantic models, but there's no shape-agnostic API likert.synonyms("DMA")orrt.annotation("DMA", "skos:notation"). Agents parse the model directly, which couples them to schema details the SDK otherwise hides.Annotations are not queryable at runtime. Issue Feat: SKOS import support alongside OWL (design proposal — feedback wanted) #57 proposes persisting SKOS annotations (
skos:definition,skos:notation,skos:prefLabel, etc.) through import. Nothing today reads those annotations at runtime. They live in the YAML and die there.No concept index. Synonyms and notations are scattered across per-entity YAML nodes. Agents that want to do SQL-level matching have to flatten this themselves at query time on every request.
No resolver interface. Every SDK user writes their own resolution entry point, with their own return type, with their own "did you mean" shape. No convention, no reuse.
Proposed primitives
1. OntologyRuntime — read accessor over loaded ontology + binding
Small, stateless, zero external dependencies at read time. Built on top of the existing load_ontology() + load_binding().
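To make the intended call shape concrete, here is a rough usage sketch assuming the accessor names proposed in this section (paths and returned values are illustrative, not a final API):

```python
from bigquery_agent_analytics import OntologyRuntime  # proposed export, not yet implemented

# Load and validate the declared ontology + binding; pure Python, no BigQuery round-trip.
rt = OntologyRuntime.load(
    ontology_path="ontology/advertising.yaml",    # hypothetical paths
    binding_path="bindings/advertising_prod.yaml",
)

rt.synonyms("DMA")                      # e.g. ["Designated Market Area", "Nielsen DMA", ...]
rt.annotation("DMA", "skos:notation")   # e.g. "807", or None if not declared
rt.in_scheme("NielsenDMA")              # all concepts declared in that scheme
rt.broader("display_static")            # parent concepts via skos:broader
```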
Design notes:
rt.annotation(name, key) treats skos:definition, owl:equivalentClass, or a user's custom annotation identically.
Identity rules (important after #57 lands):
rt.entity(name), rt.synonyms(name), rt.annotation(name, key) are singular lookups — entity names remain globally unique.
After #62's relaxed uniqueness, relationship identity is (name, from, to) for abstract relationships, so a single skos_broader can repeat across endpoint pairs. A hypothetical rt.relationship(name) would be unsafe because it has no single answer to return.
rt.broader(entity), rt.narrower(entity), rt.related(entity) return the set of entities reachable from the given starting point via the named predicate. That's a well-defined question regardless of how many skos_broader edges exist in the ontology.
Any future relationship lookup would be either fully qualified (rt.relationship(name, from, to) -> Relationship | None) or list-returning (rt.relationships(name) -> list[Relationship]). Never singular-by-name.
2. Concept index materialization (opt-in)
At gm compile time, optionally emit a BigQuery sidecar table.
notation is a first-class row kind. For every entity that has a skos:notation, the compiler emits a row with label_kind='notation' and label=<notation value> — so resolvers searching by label naturally catch notation matches without a separate OR notation = @input predicate. The notation column is kept as per-entity metadata that repeats across all rows of the same entity, for display convenience (a caller with a winning match can read the entity's notation directly from the candidate row without a separate lookup).
Row multiplicity contract:
One row per (entity_name, label, label_kind, language, scheme) membership tuple. A SKOS concept can legally belong to multiple skos:inScheme schemes (a DMA concept may be in both NielsenDMA and CensusMSA, a banking concept may be in both BankingTaxonomy and FinancialProductsTaxonomy). This is denormalized — a concept in 3 schemes × 5 labels produces 15 rows. Intentional; see below.
Entities with no scheme membership carry scheme IS NULL. They're still in the index; entity= resolution finds them, scheme= resolution skips them.
notation is per-entity (not per-scheme), so it repeats across membership rows for the same entity. Callers selecting a single notation per entity use DISTINCT notation or aggregate.
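For illustration, the row shape implied by this contract could look roughly like the following (a sketch, not the shipped ConceptIndexRow model; the field names are the column names used throughout this proposal):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ConceptIndexRow:
    # One row per (entity_name, label, label_kind, language, scheme) membership tuple.
    entity_name: str
    label: str
    label_kind: str            # 'name' | 'pref' | 'alt' | 'hidden' | 'synonym' | 'notation'
    language: Optional[str]
    scheme: Optional[str]      # NULL for entities with no scheme membership
    notation: Optional[str]    # per-entity metadata, repeated across that entity's rows
    is_abstract: bool
    compile_id: str            # shared by every row written in one compile
```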
ARRAY<STRING> schemeor a separate membership table:WHERE scheme = @xstays a trivial clustered lookup — critical for the commonscheme=<name>resolver path.(scheme, entity_name)stays usable.ARRAY<STRING>forcesWHERE @x IN UNNEST(scheme)on every scheme-scoped query, which is less indexable and harder for less-experienced SQL callers to write correctly.Agents do fuzzy match in SQL:
The
DISTINCT/GROUP BYonentity_nameis how callers collapse the denormalized rows back to one result per matched concept.Matches the SDK's agent-native ethos: any action an agent takes in SQL is something a user or another tool can also take. No new Python-only runtime, no new service, no new matcher implementation to maintain.
Opt-in. v1 ships with a CLI flag only:
gm compile --emit-concept-index. Default off for users who don't need it. A binding-side toggle (index: concept_indexblock onBinding) was considered but deferred — it requires schema and loader changes inbigquery_ontology.binding_models+binding_loader.pythat are worth scoping as their own change once the CLI behavior is settled. If v2 adds it, the explicit precedence rule will be: CLI flag overrides binding setting; binding setting serves as the project default when the CLI flag is absent.Index population contract
The existing DDL compiler (
src/bigquery_ontology/graph_ddl_compiler.py) only emits schema SQL —CREATE TABLE/CREATE PROPERTY GRAPH. A concept index needs rows, which is a new kind of output. This subsection names who writes those rows and when.Who writes the rows: the ontology compiler itself, in the same
gm compileinvocation that emits the DDL. The index is a deterministic function of both the ontology YAML and the binding — see "What's in the index" below. Treating it as a separate build step creates two sources of truth and a refresh-skew class of bugs that the SDK shouldn't inherit.What's in the index (scope relative to binding):
compile_concept_index(ontology, binding)takes both inputs because the index respects the binding's subset semantics. Since a binding may legally realize only a subset of the declared ontology (binding_models.py:147), the compiler needs a rule for which entities participate in the index. The rule is:In short: abstract: always. Concrete: iff bound. This matches the SDK-level invariant from the adapter design ("every element in GraphSpec is bindable and has data") while preserving the taxonomy-browse value that abstract SKOS entities add at runtime.
Consequence: two different bindings over the same ontology produce different indexes. A narrow deployment binding only
AccountandCustomeremits a smaller index than a wide deployment binding all 40 concrete entities, but both share the same abstractskos_Banking/skos_FinancialProduct/ etc. nodes. Abstract relationships between abstract entities are always in scope; abstract relationships touching an unbound concrete entity are included (they're informational metadata, not runtime operations).The
is_abstractcolumn in the index row lets resolvers filter at query time: a resolver that wants only runtime-materializable matches doesWHERE NOT is_abstract; a resolver producing taxonomy-aware "did you mean" suggestions keeps both.Table naming contract: because two bindings against the same ontology produce legitimately different indexes, a single global table name is unsafe — the second compile would silently overwrite the first. The output table name is therefore a required parameter, not a fixed convention:
CLI:
gm compile --emit-concept-index \ --concept-index-table my-project.my_dataset.ontology_concept_index__retailBoth library and CLI error cleanly if the name is missing when
--emit-concept-indexis set. No silent global default. Users with a single binding per dataset pick any unique name they like (ontology_concept_indexis fine); users with multiple bindings per dataset pick distinct names per binding (ontology_concept_index__retail,ontology_concept_index__investment_bank, etc.).Why required rather than auto-derived:
binding: stridentifier (binding_models.py:159), but it isn't a safe or stable source for a BQ table name: it's an identity tag for the binding document, not a deployment-unique BQ-legal identifier. Using it would couple operational naming to a field authors rename for non-operational reasons, and would collide across environments (dev/stage/prod) that share the same binding identity.ontology_concept_index__{sha1(binding)[:8]}are collision-free but unreadable and change on every trivial binding edit — bad ergonomics for a table name that appears in user-written resolver SQL.OntologyRuntimereads the index via the same name the caller passed at compile time — runtime construction takes a matchingconcept_index_table: strparameter (or reads it from configuration) so lookups target the right table. The name is not stored on the ontology or binding model; it's a runtime/deployment concern.Provenance and compatibility contract: because the table name is caller-supplied and binding-scoped, nothing in the data columns alone would catch a mismatched wiring like
OntologyRuntime.from_models(ontology_A, binding_B, concept_index_table=table_C)wheretable_Cwas actually compiled from a different(ontology, binding)pair. Plausible-but-wrong matches are worse than no matches — the agent gets confident answers against stale or unrelated data.The compiler therefore emits a sibling metadata table named
{output_table}__meta, written in the samegm compileinvocation. One row per compile:Sibling rather than embedded columns so the bulk of the index (the label/notation rows) stays lean.
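For concreteness, the single meta row would carry roughly this payload (column names from this proposal; the values shown are fabricated):

```python
# Illustrative __meta row — one row per compile; values here are made-up examples.
meta_row = {
    "compile_id": "a3f9c2d41b07",                  # 12-hex tag, shared with every row of the main table
    "ontology_fingerprint": "sha256:9b1f...e2a7",  # full SHA-256 over the canonical Ontology model
    "binding_fingerprint": "sha256:4c88...91d0",   # full SHA-256 over the canonical Binding model
}
```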
Fingerprint algorithm: fingerprints are SHA-256 hashes over a canonical serialization of the validated Ontology / Binding Pydantic models — not over raw YAML text. Concretely:
load_ontology()/load_binding()path). Validation normalizes optional fields, default values, and type coercion.None/ booleans / numbers, lists preserved in declaration order (list order is semantically meaningful in the ontology model — e.g., key columns).sha256:.The same approach is used for both ontology and binding fingerprints, with one runtime difference: ontology fingerprinting covers every field of the Ontology model. Binding fingerprinting covers every field of the Binding model except ephemeral annotations (if any are introduced later) — the binding's identity for the purpose of "does this index correspond to this binding" is its declared structure, not its documentation metadata.
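A minimal sketch of the two fingerprint helpers, following the canonicalization contract described here (the _fingerprint module is the source of truth; the exact concatenation scheme inside compile_id is an assumption):

```python
import hashlib
import json

def fingerprint_model(model) -> str:
    """Canonical JSON of the validated Pydantic model, then SHA-256, prefixed with 'sha256:'."""
    payload = model.model_dump(mode="json", by_alias=False, exclude_none=False)
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def compile_id(ont_fp: str, bnd_fp: str, compiler_version: str) -> str:
    """First 12 hex chars of sha256 over the concatenated fingerprints + compiler version.
    Shown without separators here; the real separator scheme is pinned in _fingerprint.py."""
    return hashlib.sha256((ont_fp + bnd_fp + compiler_version).encode("utf-8")).hexdigest()[:12]
```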
Why model-based and not YAML-text-based:
src/bigquery_agent_analytics/runtime_spec.py:199and adjacent). Hashing at that layer matches the layer where the rest of the SDK's determinism lives.compile_graph(ontology, binding) -> strtakes models, not YAML strings). Keeping fingerprint input at the same layer maintains consistency across compile output and runtime verification.Two bindings produced from the same source YAML by different emitters (e.g., one with trailing newlines, one without) fingerprint identically. Two bindings that disagree on any declared field — entity names, target dataset, property types — fingerprint differently and correctly fail strict verification.
Canonicalization rules in brief (formal spec in the implementation):
Pydantic.model_dump(mode="json", by_alias=False, exclude_none=False)so defaults materialize consistently.None/ missing-but-defaulted fields serialized as explicitnullto distinguish "absent" from "defaulted."separators=(",", ":")(no extra whitespace).OntologyRuntimeruntime verification:OntologyRuntime.load(...)/.from_models(...)computes the same fingerprints on the loaded Ontology and Binding models.__metasibling and compares fingerprints.ConceptIndexMismatchErrorwith a clear message naming the expected vs actual fingerprints and the table name involved. The runtime refuses to return matches from an index that doesn't correspond to the loaded models.__metasibling (e.g., a manually-created index or one compiled with an older toolchain) raises a distinctConceptIndexProvenanceMissing— caller can explicitly opt out withOntologyRuntime(..., verify_concept_index="off")for read-only dashboards or interactive exploration.Long-lived runtime verification (strict is strict for the whole lifetime, not just the first call). A naive "verify once then cache forever" contract would let a long-lived service sail past an index refresh that swapped in a different
(ontology, binding)pair — returning matches against the new index while still believing it was verified. That defeats the "plausible-but-wrong matches are worse than no matches" argument behind the strict default.The contract:
OntologyRuntimecaches the expectedcompile_id,ontology_fingerprint, andbinding_fingerprinton the instance.verify_ttl_seconds, default 60). If the cache is fresh, the call proceeds without a BQ round-trip.SELECT DISTINCT compile_id FROM {output_table} LIMIT 2returns exactly one value.SELECT compile_id, ontology_fingerprint, binding_fingerprint FROM {output_table}__meta LIMIT 1— read compile_id and the full fingerprints.main.compile_id == meta.compile_id(pair consistency).meta.compile_id == cached.compile_idANDmeta.ontology_fingerprint == cached.ontology_fingerprintANDmeta.binding_fingerprint == cached.binding_fingerprint(full-fingerprint freshness).ConceptIndexRefreshed. Service operator recreatesOntologyRuntimewith updated models; new instance's full fingerprint verification catches whether the new index matches or not.ConceptIndexInconsistentPair. Same contract as first-load.Why the sentinel must read both tables, not just meta. An earlier draft checked only
meta.compile_id. That has a correctness hole: the inline refresh order is "main first, meta second," so during the swap window a reader could see the old meta compile_id (matches cache, accepted), then query the new main table, and serve data from the refreshed index under stale verification. Reading both tables on TTL re-check closes that window — main's compile_id is authoritative for "which compile does the data belong to," and the meta comparison catches inconsistent pairs.Why the freshness check compares full fingerprints, not just
compile_id. Thecompile_idcolumn is a 12-hex-char truncation ofsha256(ontology_fingerprint || binding_fingerprint || compiler_version)— 48 bits of entropy, chosen to keep the per-rowcompile_idcolumn short (storage efficiency on a column that repeats across every data row). That's enough for first-pass pair consistency: two tables with different compile_ids definitely belong to different compiles, and the birthday bound on distinct compiles for a single(output_table)over realistic deployment lifetimes is comfortably below collision probability.But "comfortably below" is not "zero," and a strict verification contract shouldn't rely on it. The meta row carries the full
ontology_fingerprintandbinding_fingerprint(SHA-256, 256 bits each) — storing those in a single-row meta table costs nothing. The TTL re-check therefore compares all three (compile_id+ both full fingerprints) against the cache. A hypothetical 48-bit collision where a legitimately-different(ontology, binding)pair happens to share a 12-char prefix is caught because the full fingerprints won't match.Pair consistency between the two tables still runs on the short
compile_id— it only needs to detect "are these from the same compile or different compiles," and 48 bits is overkill for that single-dataset comparison. The strict freshness check runs on the full 256-bit fingerprints where the safety story demands it.The three reads are still cheap. Main's
SELECT DISTINCT compile_id FROM {output_table} LIMIT 2reads at most two rows from a clustered column; meta reads exactly one row (and always has, just with more columns than before). Per-TTL-window cost remains negligible even at default 60s.Configuration surface on
OntologyRuntimeconstruction:verify_ttl_seconds: int = 60— default 60. Balance between correctness-staleness window and re-verification cost.verify_ttl_seconds=0— check on every call. Useful for low-QPS services where correctness matters more than cost.verify_ttl_seconds=None— snapshot-bound: verify once on first use, never again. Explicit opt-in for services that coordinate refresh out-of-band (e.g., rolling-restart on recompile). Matches the old "verify once" behavior for callers who want it.Why TTL rather than check-every-call by default: the pair re-check is cheap but not free, and for high-QPS resolver workloads it adds up. A 60s staleness window matches typical service-refresh cadences while keeping per-call cost bounded at
O(1)with no BQ hit in the common case.Pair-consistency contract (the two tables must agree on the same compile). Because
{output_table}and{output_table}__metaare written as two separateCREATE OR REPLACE TABLEstatements, a reader interleaved with a refresh could otherwise observe:To make the pair coherent without requiring DDL-level transactions (which BigQuery doesn't offer for
CREATE OR REPLACE TABLE), both tables carry acompile_idtag that is derived deterministically from compile inputs — not a per-run UUID or a timestamped value:(First 12 hex chars is enough to make accidental collisions vanishingly unlikely while keeping the column short.)
compile_id STRING NOT NULLcolumn on the main table — every row of{output_table}shares the same value.compile_idfield on the single__metarow — same value.Why deterministic rather than per-run:
compile_concept_index()(see Compiler output contract below). Two compiles of the same ontology + binding + compiler version produce character-identical SQL.compiled_atis deliberately not in the emitted SQL. An earlier draft included acompiled_at TIMESTAMPfield in the meta row; that's been removed to preserve byte-identical output. Operators who want compile timestamp visibility can read it fromINFORMATION_SCHEMA.TABLES.creation_timeon the__metatable, which BigQuery maintains automatically. The tradeoff is deliberate: runtime correctness (deterministic compile output, reviewable diffs) over embedded operator metadata that BQ already provides.Runtime pair-consistency check (first concept-index access):
SELECT * FROM {output_table}__meta LIMIT 1→ getexpected_compile_id, expected fingerprints.SELECT DISTINCT compile_id FROM {output_table} LIMIT 2→ verify exactly one compile_id is present and it equalsexpected_compile_id.compile_idmismatches or multiple distinct compile_ids are observed (which would indicate a broken compile), retry once after a short backoff (default 2 seconds) — handles the narrow interleaving window during normal refresh.ConceptIndexInconsistentPairwith both observed compile_ids. This is distinct fromConceptIndexMismatchError(which is a wiring/fingerprint error, not a timing one) so callers can handle them differently.The retry is deliberately one-shot and small: a legitimately long refresh window indicates operator misbehavior (concurrent compiles against the same table) and should fail loudly.
Compatibility flag vocabulary:
verify_concept_index="strict"(default) — fingerprint mismatch, missing meta, or persistent pair inconsistency all raise.verify_concept_index="missing_ok"— fingerprint mismatch and pair inconsistency raise, missing meta warns and proceeds.verify_concept_index="off"— no verification, purely caller-managed. Intended for explicit "I know what I'm doing" paths.Rejected alternatives for pair consistency:
BEGIN TRANSACTION; ... COMMIT;. BigQuery's transaction support doesn't coverCREATE OR REPLACE TABLE— DDL is generally non-transactional. Not a viable primitive.OPTIONS(description=...)tagging. Requires INFORMATION_SCHEMA lookups, has its own freshness semantics, more tooling surface. compile_id in a data column is simpler to query from plain SQL.Rejected alternatives for the provenance storage shape:
compile_idcolumn is a deliberate exception — it's a single short fixed-length tag needed for pair consistency, not full provenance.)OPTIONS(description=...). BQ-native and elegant, but INFORMATION_SCHEMA queries have their own cost and tooling constraints, and the sibling table approach is easier to inspect from plain SQL (SELECT * FROM concept_index__meta).How the rows reach BQ (atomic-swap semantics for runtime readers): the compiler emits a single
CREATE OR REPLACE TABLE ... AS SELECT ...statement. In BigQuery this is atomic — concurrent readers see either the previous table's rows or the new table's rows, never an empty intermediate state. This is the critical difference from aDELETE + INSERTpair, which would expose a window where the index is queryable but empty. For a runtime lookup primitive, that window is a correctness hazard, not just a performance one.The separate
CREATE TABLEscaffold is not emitted —CREATE OR REPLACE TABLEcreates the table on first run and atomically replaces it on every subsequent run. This collapses "ensure table exists" and "populate rows" into one statement, eliminating the empty-table-existing intermediate state entirely.For ontologies with tens of thousands of concepts (Yahoo's YAMO example: 274 SKOS concepts × multiple labels ≈ ~1K-10K rows), inline
UNNEST(ARRAY<STRUCT<...>>)is well within BigQuery's query-text limits. For ontologies above ~50K rows, the compiler emits a shadow-table swap pattern for both tables in the pair — the pair-consistency contract applies to the shadow path too:Two distinct non-atomicity windows exist on this path:
DROPandRENAMEon each table, that table name does not resolve. Readers get BigQuery's "table not found" error, which they must tolerate as transient.compile_idwhile the meta row still carries the old one. Readers in this window seecompile_iddisagreement →ConceptIndexInconsistentPairon the pair-consistency check.The pair-inconsistency window on the shadow path can exceed the inline path's one-shot 2-second retry budget, because large-ontology rename operations take longer than small-ontology
CREATE OR REPLACE TABLEstatements. This means strict verification will raise during a legitimate shadow-path refresh if it happens to sample during the swap. That's by-design, not a defect: strict verification correctly rejects an inconsistent pair even when the inconsistency is transient. The alternative (silently serving old data against a new meta, or vice versa) is the failure mode strict verification exists to prevent.Operational contract for the shadow path:
ConceptIndexInconsistentPairexceptions) duringgm compileruns that hit the shadow path.missing_ok(which per the verification-mode contract above still raises on pair inconsistency — transient or not):verify_ttl_secondsso the pair re-check samples less frequently. Reduces the probability of landing inside a swap window at the cost of a longer staleness tolerance.ConceptIndexInconsistentPairat the application layer and retry the call after a short delay. Cleanest at the service-mesh level where transient 5xx handling already exists.{output_table}__currentpointer table that callers resolve through). Not shipped in v1 — out of scope as a third level of indirection, tracked as follow-up work if real users hit this constraint.This limitation is specific to the shadow path. The inline-UNNEST path (the default for ontologies under 50K rows, covering the motivating use cases including Yahoo's YAMO) remains fully atomic per-statement and doesn't exhibit either window.
When refresh happens: on every
gm compilerun with--emit-concept-index. No incremental build. If the user edits ontology YAML, they re-run compile, same as any other DDL change. This matches the compile model users already have for schema changes and avoids adding a second refresh command.Alternatives considered and rejected:
gm build-concept-indexstep. Adds a command users have to remember and introduces drift between "DDL is up to date" and "index is up to date." Two invocations for one conceptual change.OntologyRuntime.load()and optionally push to BQ. Surfaces inconsistent state when multiple agent instances load simultaneously and makes the "query the index in SQL" path unreliable until someone has pushed.Failure modes:
CREATE OR REPLACE TABLEpath: if the statement fails (quota, permissions, query text too large),gm compileerrors with a message naming the concept index. BecauseCREATE OR REPLACEis atomic, there is no half-written state — the previous table (if any) remains queryable, or no table exists at all. The user can re-run compile without cleanup.gm compileretry detects the orphaned_shadowtable(s) and resumes from the swap step. Runtime readers during the orphaned window get either "table not found" orConceptIndexInconsistentPair— both expected transient conditions on the shadow path, tolerated via the operational contract above (pause traffic during refresh, or accept transient failures).3.
EntityResolver Protocol + two reference implementations
Interface-only in core:
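A sketch of that interface, assembled from the contracts in this section (the exact field and parameter names beyond candidates, reason, label_kind, and scheme are assumptions):

```python
from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass(frozen=True)
class Candidate:
    entity_name: str
    label: str            # the winning label (name > pref > alt > hidden > synonym > notation)
    label_kind: str
    scheme: Optional[str]
    reason: str           # 'exact' | 'notation' | 'synonym' | 'fuzzy'
    confidence: float     # the reference resolvers only ever return 1.0

@dataclass(frozen=True)
class ResolveResult:
    value: str                       # the input being resolved
    candidates: list[Candidate]      # at most one entry per entity_name
    reason: str                      # adds 'none' when nothing matched

class EntityResolver(Protocol):
    def resolve(self, value: str, *, scheme: Optional[str] = None,
                entity: Optional[str] = None, limit: int = 5) -> ResolveResult: ...
```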
scheme and entity are mutually exclusive — see "Scope semantics" in the Library API impact section below. Exactly one must be provided.
Candidate dedup contract (important once the index is denormalized per (entity_name, label, label_kind, language, scheme)):
ResolveResult.candidates contains at most one entry per entity_name. The denormalized index naturally produces multiple matching rows for the same entity (same entity, different label or different scheme). For an agent-facing "did you mean" list, duplicates are noise — the agent wants a list of distinct concepts, each annotated with the best evidence for why it matched.
limit=N means N distinct entities, not N raw rows. Resolvers do the dedup before truncating.
Within an entity, the winning row is chosen by label_kind priority, in this order: name > pref > alt > hidden > synonym > notation. Further ties broken by lexicographic label order for determinism.
Candidate.label / Candidate.label_kind / Candidate.scheme / Candidate.reason reflect the winning row. The other rows that also matched are discarded — callers wanting the full provenance use the concept index directly via SQL.
reason values are resolver-defined but drawn from a shared vocabulary so callers can branch on them without ambiguity: exact (name match), notation (notation match), synonym (any label other than name), fuzzy (non-exact match produced by a fuzzy resolver), none (no match found — only valid on the ResolveResult.reason, not on Candidate.reason).
SDK ships two references in core:
ExactMatchResolver — O(1) lookup against name + skos:notation. Confidence is 1.0 or 0.0. Good for notation-heavy inputs (Nielsen DMA codes, Google Ads Criteria IDs).
SynonymResolver — extends ExactMatchResolver by also matching against prefLabel / altLabel / hiddenLabel / synonyms. Still exact on each label; still confidence 1.0 or 0.0.
Everything above exact-match — token-set equality, Jaccard, Levenshtein, phonetic, weighted ensembles — lives in user code or contrib/ packages. Verticals pick (or write) a resolver tuned for their domain.
4. validate_against_ontology — small validation helper
Not resolution — just pass/fail against the declared ontology. Return shape is bounded by design so it stays useful on large concept schemes (IAB Taxonomy, Nielsen DMAs, SNOMED excerpts) where the candidate universe is hundreds to tens of thousands of entries:
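One plausible return shape, sketched from the design notes below (the pass/fail fields are assumptions; known_value_count, known_values_sample, and candidates are from this proposal):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ValidationResult:
    invalid_values: list[str]          # assumed field for the pass/fail part
    known_value_count: int             # always the full size of the target set
    known_values_sample: list[str]     # capped at sample_limit
    candidates: Optional[list] = None  # stays None unless the caller composes a resolver

# rt is an OntologyRuntime instance (see the sketch in section 1); scheme name is hypothetical.
result = rt.validate_against_ontology(
    ["display_static", "display_statik"],   # second value is a deliberate typo
    scheme="ad_formats",
)
```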
Agents combine validate_against_ontology with a resolver to produce "did you mean." The SDK doesn't match; it only knows what exists.
Design notes on the return shape:
known_value_count is always the full count. Tells the caller whether the sample is representative.
known_values_sample is capped at sample_limit (default 10). Enough for a "did you mean" hint without bloating every validation miss on a 10K-concept scheme. Callers who genuinely need the full set use rt.in_scheme(...) or rt.entities() — that's what those accessors are for.
candidates stays None unless the caller composes validation with a resolver. Keeps validate pure set-membership; keeps ranking logic in resolver-land. No double-duty. Callers who want ranked suggestions pipe known_values_sample through a resolver that ranks.
Library API impact
This section pins down the parts of the proposal that touch existing public APIs, so they're clear before implementation starts.
Compiler output contract
The existing bigquery_ontology.compile_graph(ontology, binding) -> str is documented to be deterministic — "same inputs → byte-identical text." That contract is preserved. Concept-index emission does not modify compile_graph(). Instead, a new sibling function ships alongside:
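In code, the public surface would look like this (signatures as stated in this proposal; bodies elided):

```python
# bigquery_ontology/graph_ddl_compiler.py — public surface after this proposal (sketch)
def compile_graph(ontology: "Ontology", binding: "Binding") -> str:
    """Existing function: CREATE PROPERTY GRAPH + table DDL. Unchanged, byte-identical contract."""
    ...

def compile_concept_index(ontology: "Ontology", binding: "Binding", *, output_table: str) -> str:
    """New sibling: deterministic DML for {output_table} and {output_table}__meta."""
    ...
```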
Both return deterministic strings.
compile_concept_index()extends thecompile_graph()byte-identical contract in the same spirit: same inputs → byte-identical DML text, including row order."Same inputs" for
compile_concept_index() = (ontology, binding, output_table, compiler_version). Everything in the emitted SQL is derived from those four values:
compile_id is sha256(ontology_fingerprint || binding_fingerprint || compiler_version)[:12] — deterministic.
No timestamps or UUIDs are embedded; operators who want the compile time read INFORMATION_SCHEMA.TABLES.creation_time on the emitted tables.
Rows are sorted by a stable key — (scheme, entity_name, label_kind, language, label, notation, is_abstract) — before SQL generation, with NULLs ordered last consistently per column. is_abstract is last because it's determined by entity_name — included only for defensive stability if the invariant ever loosens. This sort order guarantees that two invocations of compile_concept_index() on the same ontology + binding emit character-identical SQL — critical for diffing compile output in code review, caching compiled artifacts, and verifying that ontology edits produced only the expected row changes.
Library callers who want only DDL keep calling
compile_graph() as today; callers who want the concept index call compile_concept_index() for a separate DML script. The CLI layer composes the two.
Why a sibling and not a composed option: it keeps the byte-identical contract trivially intact for compile_graph().
Rejected alternatives:
compile_graph(..., emit_concept_index=True) returning concatenated DDL+DML — breaks the byte-identical contract for one config mode and creates a function whose return value depends on a flag.
A structured return (CompileResult(ddl=..., dml=...)) — breaks every existing caller of compile_graph() that treats the return as a string.
OntologyRuntime construction: paths and models
The example in section 1 shows
OntologyRuntime.load(ontology_path=..., binding_path=...). But existing SDK code already carries validated Ontology and Binding models around in memory (e.g., src/bigquery_agent_analytics/runtime_spec.py:199 passes models directly). Forcing callers to reparse YAML or round-trip through disk would be a step backward.
Two classmethods cover both cases:
load() is the convenience path for one-off scripts and the CLI.
from_models() is the integration path for the SDK's existing flows — runtime_spec, ontology_orchestrator, adapters downstream of load_ontology() can all wrap without touching disk again.
Internal implementation: load() calls load_ontology() + load_binding() then delegates to from_models(). Zero code duplication.
Scope semantics: scheme vs entity
validate()need an explicit target set. Two mutually-exclusive named parameters, no polymorphism:Rules:
schemeorentitymust be provided. Passing both or neither is an error with a clear message.scheme=<name>resolves against the set{e : e.in_scheme(name) or name == e.name and e.is_abstract_scheme_root}. This is the motivating case. Works for both explicit SKOS concept schemes and abstract entities that act as taxonomy roots.entity=<name>resolves against the singleton set{e : e.name == name}. Identity check. Returns match iff the input exactly matches the entity's name, notation, or a declared label/synonym.scope=Scope.narrower_closure(name)or similar, without changing the meaning ofschemeandentity.Why not polymorphic ("
entity=means scheme-scoped if it's a scheme, entity-scoped otherwise"):entity=call targeting it.entity=call does — defeats the point of a stable API.Explicit parameters keep the contract boring and predictable.
Non-goals
EDIT_DISTANCE,SOUNDEX,JACCARDUDFs. If the concept index is materialized, users get these for free. Don't wrap.contrib/or a separate package.<50msSLA. Latency is a function of index size and resolver choice, both of which vary by user. The SDK can guarantee the primitive shapes; it can't guarantee the performance of every application that uses them.How this lands on top of existing code
Using the current SDK's module boundaries:
OntologyRuntimeclass withload()+from_models()classmethodsbigquery_agent_analytics/ontology_runtime.py(new)load_ontology+load_bindingfrombigquery_ontology. Pure Python, no BQ calls. Both construction paths share one implementation.compile_graph()(existing)bigquery_ontology/graph_ddl_compiler.pycompile_concept_index()(new sibling)bigquery_ontology/graph_ddl_compiler.py(new function)compile_graph()when--emit-concept-indexis set.EntityResolverProtocol + referencesbigquery_agent_analytics/entity_resolver.py(new)scheme=orentity=(mutually exclusive).validate_against_ontologyOntologyRuntimescheme=/entity=scope parameters.bigquery_ontology/contrib/or external packagesChanges to existing modules are limited but not zero. Most of the proposal is additive (new files, new functions, new classmethods). Two concrete edits to existing code are needed:
src/bigquery_ontology/cli.py:299— the existingcompilecommand gains--emit-concept-indexand--concept-index-table <name>flags. When--emit-concept-indexis set, the command composes the existingcompile_graph()output withcompile_concept_index(..., output_table=...). Without the flag, the command's behavior is byte-identical to today.src/bigquery_ontology/graph_ddl_compiler.py— adds the newcompile_concept_index()function in the same module.compile_graph()itself is not modified.The runtime accessor (
OntologyRuntime) reads the same Ontology/Binding models already loaded today — that path is purely additive in the SDK package.Ties to issue #57 (SKOS import)
This proposal depends on issue #57 landing first, because the concept index's value comes almost entirely from SKOS annotations (
skos:notation,skos:prefLabel,skos:altLabel,skos:broader) being preserved through import. Without #57, the concept index is a thin wrapper over entity names and existing synonyms — useful but not transformative.Specifically:
skos:notation in annotations → notation column in concept index → L1 code match becomes trivial
skos:prefLabel / altLabel / hiddenLabel → rows in concept index with label_kind discriminator → L2 lexical becomes trivial
skos_broader abstract relationships → rt.broader() traversal → taxonomy-aware "did you mean a parent or sibling"
Concept schemes (skos_ prefix) → rt.in_scheme() enumerates all concepts in a taxonomy → agent can present the scheme to the LLM as context
Is
OntologyRuntimethe right wrapper, or should the accessors live as methods onOntology/Bindingdirectly? Pro-wrapper: keepsbigquery_ontologypure-data and the runtime layer inbigquery_agent_analytics. Pro-direct: fewer classes to learn. Proposal leans wrapper — the accessor layer is SDK-runtime concern, not ontology-package concern.Should the concept index be opt-in or opt-out? Pro-opt-in: users who don't need it don't pay storage. Pro-opt-out: users discover the primitive because it just exists. Proposal leans opt-in: no silent BQ table creation.
Should
OntologyRuntimecache the concept index in memory for pure-Python access, or always go to BQ? Pro-memory: fast, no BQ cost, works offline. Pro-BQ-only: always consistent with DDL, scales to ontologies with 100K+ concepts. Proposal: pure-Python by default for ontologies under some size threshold; explicit BQ-backed resolver for large ones.Does
EntityResolverneed anasyncvariant? Resolution against a BQ-backed index is I/O. Proposal: ship sync; add async later if users ask.Should the SDK ship a richer
FuzzyResolverreference (just exact + prefix, not full 5-layer) so users have a middle option? Proposal: no — either exact or bring-your-own. Avoids the "SDK partially solves fuzzy matching" trap where the reference becomes everyone's default despite being domain-unaware.Should the
Protocolbetyping.Protocolor anABC? Protocol allows duck typing; ABC forces inheritance. Proposal: Protocol — matches modern typing conventions and doesn't force users to inherit.Should
rt.validate()also return anearestfield when values are invalid? Would require calling a resolver inside validate, coupling the two. Proposal: no — keepvalidatepure set-membership, let callers compose it with a resolver.Concept index: do we need a per-row
scoreorpriorityfor when multiple labels map to the same entity? Some verticals (IAB) prefer one label over another as the "canonical" display form. Proposal: defer —label_kind(namevsprefvsalt) already lets callers prioritize. Add score if needed.Is
contrib/the right home for domain resolvers, or should they be separate packages? Pro-contrib: easy discovery, versioned together. Pro-separate: community can ship without depending on SDK releases. Proposal: contrib for reference implementations (advertising, healthcare); external packages for user-owned domains.Should narrower-closure scoping ship in v1? The current proposal settled on two explicit parameters —
scheme=for concept-scheme membership andentity=for single-entity identity. A third mode (narrower-closure: "resolve against all entities narrower-than some abstract node") is deferred. Advertising taxonomies nest (IAB Tier 1 → Tier 2), and a caller may want to resolve against the subtree under a specific abstract node rather than a flat scheme. Proposal: shipscheme=andentity=only in v1; addscope=Scope.narrower_closure(name)in v2 if real callers need it. For most cases, scheme membership plusrt.narrower(entity)traversal covers the need without a new API.Related:
Please comment if you have opinions, real-world resolver implementations you'd like to see supported, or disagreements about where the SDK/agent boundary should sit.
Final design decisions — detailed
After twelve rounds of review the design is frozen. In-repo implementation plan at
docs/implementation_plan_concept_index_runtime.md. This section is the design-level recap, split by package.
Ontology package (bigquery_ontology) — changes
New files (all under src/bigquery_ontology/):
_fingerprint.py (internal — underscore prefix). Single source of truth for model fingerprinting and the compile_id pair-consistency tag. Two functions: fingerprint_model(model) -> "sha256:<64 hex>" and compile_id(ont_fp, bnd_fp, compiler_version) -> "<12 hex>". Contract pinned in docstring (W1): model_dump(mode="json", by_alias=False, exclude_none=False) → json.dumps(sort_keys=True, separators=(",",":"), ensure_ascii=False) → SHA-256. Not re-exported; both packages import via from bigquery_ontology._fingerprint import .... Landed in PR feat(ontology): A1 — internal _fingerprint module for concept-index provenance #71.
concept_index.py (module importable but not re-exported in v1). Row builder. Function: build_rows(ontology, binding) -> list[ConceptIndexRow]. Applies the "abstract always included, concrete iff bound" rule. Emits one row per (entity_name, label, label_kind, language, scheme) membership tuple, plus one notation row per skos:notation. Sorts deterministically by (scheme, entity_name, label_kind, language, label, notation, is_abstract) with NULLs last. Package-level re-export may be added later; kept out of the root for v1 to avoid growing semver surface ahead of need.
Modified files:
graph_ddl_compiler.py — gains a new public function compile_concept_index(ontology, binding, *, output_table) -> str alongside the existing compile_graph(). compile_graph() contract is preserved byte-identically; the existing function body is not touched. compile_concept_index() emits two statements by default: CREATE OR REPLACE TABLE {output_table} AS SELECT * FROM UNNEST([STRUCT(...), ...]) for the main index and a matching CREATE OR REPLACE TABLE {output_table}__meta AS SELECT * FROM UNNEST([STRUCT(...)]) for the meta sibling. Shadow-swap fallback activates at > 50K rows. Every row in both tables carries the same compile_id; the meta row additionally carries full ontology_fingerprint and binding_fingerprint.
cli.py:299 (the compile command) — gains two new flags: --emit-concept-index (boolean) and --concept-index-table <fqn> (required when --emit-concept-index is set — no silent global default). When both flags are absent, command output is byte-identical to today. No other CLI flags change.
__init__.py — adds from .graph_ddl_compiler import compile_concept_index so the new public function is importable as from bigquery_ontology import compile_concept_index, matching the existing compile_graph re-export. No other exports change. _fingerprint stays unexported.
Unchanged:
ontology_models.py — model changes for abstract: bool = False landed in feat(owl-import): SKOS support alongside OWL #62 (issue Feat: SKOS import support alongside OWL (design proposal — feedback wanted) #57). No further model changes for concept-index work.
binding_models.py — no changes in v1. A binding-side index: opt-in block was considered and deferred to v2; precedence rule documented in the plan when/if it lands.
All other bigquery_ontology/*.py — untouched.
New CLI surface summary:
Produces: the existing CREATE PROPERTY GRAPH DDL + two concept-index tables (ontology_concept_index and ontology_concept_index__meta). Re-running the same command produces byte-identical SQL — the compile_id is deterministic from inputs (no timestamps, no UUIDs).
Version bump: Minor — new public function (compile_concept_index) and new CLI flags. Existing API byte-identical.
SDK package (bigquery_agent_analytics) — changes
New files (all under src/bigquery_agent_analytics/):
ontology_runtime.py. Hosts OntologyRuntime (the read accessor wrapper), the verification machinery (first-call + TTL re-check), and all four exception classes (ConceptIndexMismatchError, ConceptIndexProvenanceMissing, ConceptIndexInconsistentPair, ConceptIndexRefreshed). OntologyRuntime exposes two constructors — .load(ontology_path, binding_path, ...) and .from_models(ontology, binding, ...) — both routing through one shared implementation.
entity_resolver.py. Hosts the EntityResolver Protocol (not ABC — duck-typed for modern typing), the Candidate and ResolveResult dataclasses, and two reference implementations: ExactMatchResolver (name + notation) and SynonymResolver (extends exact with label-based match). Candidate dedup: one candidate per entity, winning-label priority (name > pref > alt > hidden > synonym > notation, lexicographic tiebreaker), limit=N returns N distinct entities.
Modified files:
__init__.py — adds to the existing try/except re-export block (same pattern as Client, CodeEvaluator, etc.): OntologyRuntime from .ontology_runtime; EntityResolver, ExactMatchResolver, SynonymResolver, Candidate, ResolveResult from .entity_resolver; ConceptIndexMismatchError, ConceptIndexProvenanceMissing, ConceptIndexInconsistentPair, ConceptIndexRefreshed from .ontology_runtime.
Unchanged: all other SDK modules. The runtime accessor layer is strictly additive.
Read accessors on OntologyRuntime (pure-Python, no BQ round-trip):
entities() -> list[str] — names of concrete + abstract entities.
entity(name) -> Entity — with annotations, synonyms, abstract flag.
synonyms(name) -> list[str] — pref + alt + hidden labels.
annotation(name, key) -> str | None — e.g. skos:notation, skos:definition.
in_scheme(scheme_name) -> list[Entity] — concepts in a skos:ConceptScheme.
broader(name) -> list[Entity] — skos:broader traversal.
narrower(name) -> list[Entity] — inverse.
related(name) -> list[Entity] — skos:related traversal.
Identity rules: entities are name-addressed (singular lookup); relationships are traversal-first, not name-addressed — a single skos_broader can repeat across endpoint pairs after #62's relaxed uniqueness, so a hypothetical rt.relationship(name) would have no single answer.
Validation accessor: validate_against_ontology(values, *, scheme=None, entity=None, sample_limit=20) -> ValidationResult. scheme= and entity= are mutually exclusive; neither or both = ValueError. Bounded output via known_value_count + known_values_sample. candidates is None unless a resolver is explicitly composed by the caller.
Verification configuration (on construction):
verify_concept_index — default "strict". "strict" raises on any provenance issue; "missing_ok" tolerates missing meta; "off" disables verification entirely (for read-only dashboards).
verify_ttl_seconds — default 60. 0 = every-call check; None = snapshot-bound (verify once, never re-check).
Verification lifecycle:
Construction — OntologyRuntime.load(...) / .from_models(...) computes local ontology_fingerprint and binding_fingerprint (both full SHA-256). No BQ round-trip.
First concept-index access (lazy — not on construction) — reads the __meta sibling, compares fingerprints. Mismatch → ConceptIndexMismatchError. Missing meta → ConceptIndexProvenanceMissing.
TTL re-check (each resolve / validate call past the TTL window) — runs two queries: SELECT DISTINCT compile_id FROM {output_table} LIMIT 2 (asserts exactly one value — pair consistency) and SELECT compile_id, ontology_fingerprint, binding_fingerprint FROM {output_table}__meta LIMIT 1 (full-fingerprint freshness). Pair mismatch → ConceptIndexInconsistentPair. Fingerprints drift from cache → ConceptIndexRefreshed.
The TTL re-check reading both tables with full fingerprints is a W2 watchpoint in the plan — a single-table sentinel or short-compile-id-only comparison reintroduces either the meta/main race or the 48-bit collision hole.
Resolver surface:
Both reference resolvers query the concept index via BigQuery; ExactMatchResolver uses WHERE label = @input and SynonymResolver composes with label_kind preference ordering.
Version bump: Minor — new public API surface (OntologyRuntime, four resolver-related classes, four exception types). No existing behavior changes.
Existing user code: No deprecation. Users with their own resolution layers continue unaffected until they opt into the SDK primitive.
Sequencing (from the plan)
PR stack, in merge order:
_fingerprint.py — feat(ontology): A1 — internal _fingerprint module for concept-index provenance #71, open.
concept_index.py row builder.
compile_concept_index + inline-UNNEST SQL emission.
docs/ontology/concept-index.md.
examples/concept_index_quickstart.py, full docs.
contrib/ scaffolding (Yahoo advertising resolver when contributed).
mainshippable.