Context
This is the plan for post #3 in the Medium blog series tracked in #51. Post #1 is live on Google Cloud Community. Post #2 is in DevRel review (blog repo PR #17, SDK release 0.2.2); its closing line forward-references this post:
Post #3 in this series picks up the semantic side — LLM-as-Judge for the things that don't fit a budget.
Post #3 delivers on that promise. Topic: scoring production traffic for correctness, hallucination, and sentiment using client.evaluate(evaluator=LLMAsJudge.*()). The execution cascade is AI.GENERATE → legacy ML.GENERATE_TEXT → direct Gemini API via google-genai; the AI.GENERATE path keeps evaluation inside BigQuery, the API fallback trades that for portability when no CONNECTION is wired up.
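For orientation, a minimal sketch of that call shape. The import path, client constructor, and the LLMAsJudge.correctness() factory name are illustrative assumptions; only client.evaluate(evaluator=LLMAsJudge.*()) and the three criteria are given above.

```python
# Illustrative sketch only — constructor and keyword names are assumptions,
# not the SDK's confirmed surface.
from bigquery_agent_analytics import AgentAnalyticsClient, LLMAsJudge  # assumed import path

client = AgentAnalyticsClient(project="my-project")  # assumed constructor
report = client.evaluate(
    evaluator=LLMAsJudge.correctness(),  # also hallucination() / sentiment() per the plan
    threshold=0.7,                       # assumed keyword, mirroring the CLI --threshold flag
)
print(report.details)                    # EvaluationReport.details, per the F2 discussion below
```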
Slot ordering — resolved per review
#51's original ranking placed analyst-friendly views at slot 3 and LLM-as-Judge at slot 4. Reviewer confirmed the swap: post #3 = LLM-as-Judge (this issue), post #4 = analyst views (slot 4 going forward). Honors post #2's published forward-reference and continues the CI-gate arc — post #2 wired up deterministic gates; post #3 wires up semantic gates against the same SDK and the same agent_events corpus.
Title candidates
"Some regressions don't fit a budget. Score them in BigQuery instead." ← recommended
"Score every session for correctness, hallucination, and sentiment without the data leaving BigQuery."
"Your agent_events table can also score itself. Here's LLM-as-Judge in BigQuery."
Recommendation: #1. It's a direct callback to post #2's "things that don't fit a budget" pull-quote, the three-section editorial cadence (problem → SDK → demo) lands cleanly, and the "in BigQuery instead" closer is honest about where the AI.GENERATE path runs (and is hedged in the body for the API-fallback path).
Target audience
Eval-curious engineers / quality leads / agent owners who:
Have hit the limit of "is the latency / token / error rate okay?" — those gates pass, but the agent is still subtly wrong: hallucinating, refusing, going off-tone, missing the user's intent.
Are wary of LLM-as-Judge because they've seen it produce "vibe scores" — they want strictness, deterministic prompts, and a path to gate CI on the result, not a dashboard of vibes.
Have BigQuery and (sometimes) a CONNECTION for AI.GENERATE. The post explicitly handles both cases: connection wired up → AI.GENERATE SQL path; no connection → Gemini API fallback via google-genai. Reader picks based on what's available.
Post #2's audience (platform engineers wiring up CI) is the same here — this post extends what they already shipped.
Structure (Medium best practices)
Target length: 1,400–1,800 words (6–8 min read, same as posts #1 and #2). Cover image: real bq-agent-sdk evaluate --evaluator=llm-judge --exit-code failure output with the LLM justification visible (proof that the score is explained, not vibes). Inline shots: side-by-side cost table, terminal failure output with strict-mode behavior on AI.GENERATE NULL rows, INFORMATION_SCHEMA pivot showing eval-llm-judge as a separate sdk_feature row from post #2's eval-code.
H1: Some regressions don't fit a budget. Score them in BigQuery instead.
Sub: Use LLM-as-Judge over BigQuery's AI.GENERATE — with a transparent
fallback chain when AI.GENERATE isn't available — to gate CI on
correctness, hallucination, and sentiment at fleet scale.
1. Hook (80 words)
- Real example: a session where every deterministic gate from post
#2 passes (latency fine, tokens fine, no tool errors) but the
agent confidently states a fact that's wrong. Render the trace
tree with post #1's `client.get_session_trace(...).render()`,
point at the LLM_RESPONSE that fabricated it.
- "Latency you can measure. Hallucination you have to *score*.
The SDK does both, on the same data, in the same warehouse —
when the warehouse can call the model."
2. The problem in one paragraph (120 words)
- Post #2's deterministic gates catch things you can put a number on.
Some regressions hide under those numbers — confident wrong answers,
refusals, off-tone responses, schema-shaped hallucinations.
- You can't scale "did a human read it" to fleet traffic. You also
can't trust a vibe score from a hand-rolled judge prompt.
- The SDK ships three pre-built judges (correctness, hallucination,
sentiment), each backed by a frozen prompt template, defaulting
to `gemini-2.5-flash`, runnable over thousands of sessions in a
single SQL statement when AI.GENERATE is available.
3. The SDK is already AI-in-warehouse (200 words)
- `bq-agent-sdk evaluate --evaluator=llm-judge --criterion=correctness
--threshold=0.7 --strict --exit-code` — same exit-code shape as
post #2's deterministic gates, slots into the same workflow YAML.
- Execution cascade — pinned to what the SDK actually does today:
1. **AI.GENERATE path** (when endpoint is not a legacy BQML ref).
Single BigQuery SQL job. No data leaves the warehouse.
Requires a `CONNECTION` with `aiplatform.user`. Reads as one
job in `INFORMATION_SCHEMA`, labeled `sdk_feature=eval-llm-judge`.
2. **Legacy `ML.GENERATE_TEXT` fallback** (when AI.GENERATE
raises and the endpoint is a BQML model ref). Same in-warehouse
guarantee; older SQL surface.
3. **Gemini API fallback** (when neither BQ-native path works).
Requires `bigquery-agent-analytics[improvement]` for
`google-genai`. Reads traces from BQ into Python, calls
Gemini directly, scores per session.
- `AI.GENERATE` keeps evaluation in BigQuery; the API fallback
trades that for portability. Either way, scores are produced
against the same prompt templates and the same threshold.
- **(Caveat — see SDK polish below.)** Today the AI.GENERATE
path uses a slightly truncated form of the Python prompt
template, and the SDK does not currently surface which path
fired in `report.details`. Both are required polish items
before this post publishes.
- `--strict` flag — see section 4 for the exact semantics.
4. The demo (450 words, the core)
- Real scenario: "The hallucinated booking confirmation"
- A regressed PR adjusts the Calendar-Assistant prompt to be more
"decisive." The agent now confirms bookings even when
book_meeting hasn't run successfully — the deterministic gate
passes (latency normal, no tool errors, token usage within
budget), but the user got told their meeting is booked when
it isn't.
- Add one step to post #2's workflow YAML:
`bq-agent-sdk evaluate --evaluator=llm-judge --criterion=correctness
--threshold=0.7 --strict --exit-code --last=24h --agent-id=…`
- PR goes red on "Correctness". The CI log shows session id,
score, threshold, *and the judge's justification* (post-polish —
today the FAIL line stops at `score=` and `threshold=`; the
justification fix is a hard publish blocker).
- **Strict-mode demo — pivoted to parse-error visibility, not
pass/fail flipping.** Empty/NULL-scores AI.GENERATE rows
already fail without `--strict` (locked in by
`TestFalsePassFix.test_empty_score_fails`); both BQ-native
judge methods compute `passed = bool(scores) and all(...)`,
which fails empty-scores rows trivially. The
`--exit-code` gate sees the same outcome with or without
`--strict` for those rows. What `--strict` does: walks the
report and stamps `details["parse_error"] = True` per
affected session, plus a report-level `parse_errors` /
`parse_error_rate` counter — investigation visibility that
lets a dashboard distinguish "low score" from "no parseable
score" failures. The API-fallback path coerces malformed
output to `score=0.0`, so its parse failures present as
low-score failures and don't surface through `--strict`
today. Demo (revised): capture two views of the same red
CI run — first the FAIL output without `--strict` (every
failing session looks like a low-score failure), then the
same run with `--strict` adding the `parse_errors` counter
so the reader sees "of those N failures, K were judge
parse errors, not actual low scores." Recommend `--strict`
for dashboards and post-incident investigation; explicitly
note it's a no-op for pass/fail-only `--exit-code` gates.
- Sidebar: setting thresholds for judges. Unlike a latency budget,
a correctness threshold of 0.7 is opinionated. Recipe: run
without `--exit-code` over the last 30 days, look at the score
distribution, set the threshold a couple of points below the
5th percentile of normal traffic. Revisit weekly. (Recipe sketch after this outline.)
5. Going deeper (200 words)
- Stack judges: `correctness` AND `hallucination` AND `sentiment`
as three separate workflow steps. Each gets its own threshold,
each shows up as its own row in the CI log. Same pattern as
post #2's four deterministic gates.
- Cost-vs-confidence sidebar: side-by-side table at 1k / 10k /
100k sessions for AI.GENERATE vs. API fallback, with the
reminder that AI.GENERATE bills inference on the AI Platform
side separately from the BigQuery slot ms shown in
INFORMATION_SCHEMA.
- **Compare/contrast box vs categorical-eval** (~80 words):
`LLMAsJudge` produces *continuous* scores (0.0–1.0); good for
thresholded gates and for tracking distributions over time.
`categorical-eval` (post #2) produces *discrete* one-of-N
classifications; good for pass-rate gates and dashboards
where the categories are the story. Use both in the same
workflow when both shapes apply.
- Forward-reference to post #4 (analyst views) — for teams that
want the score *distribution* over time as a Looker Studio
chart instead of a CI gate, the next post in the series.
6. What the plugin labels show over time (180 words)
- Same INFORMATION_SCHEMA pivot from posts #1 and #2, this time
with three rows: `eval-code`, `eval-llm-judge`, `trace-read`.
The AI.GENERATE judge's jobs report under `sdk_feature=eval-llm-judge`
with an `sdk_ai_function=ai-generate` (or `ml-generate-text`)
sub-label. (Query sketch after this outline.)
- One gotcha: `AI.GENERATE` jobs *do* show up in INFORMATION_SCHEMA,
but Vertex AI inference is billed separately on the AI Platform
side. The pivot tells you what the BigQuery side costs; check
the AI Platform billing report for the inference side.
7. Try it (100 words)
- Two-action CTA:
(1) Add one step to your post-#2 workflow with
`--evaluator=llm-judge --criterion=correctness --strict
--exit-code`. Pick a threshold from the sidebar.
(2) If you don't have a `CONNECTION`, install
`bigquery-agent-analytics[improvement]` so the API
fallback is available.
- "If you only run one judge, run correctness. The other two
are tuning."
- Forward-reference to post #4: "Want the same scores as a
trend chart instead of a CI gate? Point dbt + Looker Studio
at the views the plugin already created. That's next."
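Two hedged sketches to accompany the outline above. First, the threshold-setting recipe from section 4's sidebar; the score list is a stand-in, and only the recipe itself (5th percentile of normal traffic, minus a couple of points) comes from the plan.

```python
# Threshold recipe sketch: run the judge without --exit-code over recent traffic,
# then derive a starting --threshold from the score distribution.
import statistics

scores = [0.91, 0.88, 0.95, 0.79, 0.84, 0.90, 0.87, 0.93, 0.81, 0.89]  # stand-in for 30 days of sessions

p5 = statistics.quantiles(scores, n=20)[0]  # 5th percentile of the distribution
threshold = round(p5 - 0.02, 2)             # a couple of points below normal traffic's low end
print(f"5th percentile ≈ {p5:.2f}; start the correctness gate at --threshold={threshold}")
```

Second, the section-6 pivot, assuming the plugin stamps sdk_feature / sdk_ai_function as BigQuery job labels (the project id and region below are placeholders).

```python
# INFORMATION_SCHEMA pivot sketch: BigQuery-side cost per SDK feature label.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project
sql = """
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'sdk_feature')     AS sdk_feature,
  (SELECT value FROM UNNEST(labels) WHERE key = 'sdk_ai_function') AS sdk_ai_function,
  COUNT(*)           AS jobs,
  SUM(total_slot_ms) AS total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND EXISTS (SELECT 1 FROM UNNEST(labels) WHERE key = 'sdk_feature')
GROUP BY 1, 2
ORDER BY total_slot_ms DESC
"""
for row in client.query(sql).result():
    print(dict(row))  # expect eval-code, eval-llm-judge, trace-read rows; slot ms excludes Vertex AI inference
```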
Demo requirements
demo_calendar_assistant.py carries over from posts #1 and #2 — the regressed-branch demo for post #3 is a third prompt variant that produces plausible-looking but wrong outputs (e.g., confirming a booking the tools didn't actually complete). New file: demo_calendar_assistant_hallucinated.py. Lives in the blog repo alongside the other two demos.
Real CI screenshots — actual GitHub Actions failure showing the LLM-judge step red, with the judge's justification visible. Same sandbox repo as post #2 (caohy1988/bqaa-ci-sandbox); add a third PR variant that triggers correctness failures. Cover screenshot can only be captured after the SDK polish lands — see the F2/F5 polish items below.
Real AI.GENERATE connection — set up aiplatform.user-scoped connection in test-project-0728-467323 so the AI.GENERATE path is exercised live for the cover screenshot.
Side-by-side cost table at 1k/10k/100k scale — extrapolate from a real 100-session run on the sandbox (don't fake the 100k row).
Reusable gist — gists/08_llm_judge_correctness_gate.sh plus the three-judge stack as gists/09_three_judge_workflow.yml.
SDK improvements to ship alongside the post
Reviewer audit flagged five spots where the post's draft narrative ran ahead of what the SDK actually ships today. Three of them are hard publish blockers; one is a clarity blocker; one re-shapes the strict-mode prose.
Required before publish
AI.GENERATE prompt-template parity with the Python path (publish blocker, F1).
_ai_generate_judge passes only criterion.prompt_template.split("{trace_text}")[0] as judge_prompt — i.e., everything after the {trace_text} placeholder in the Python template is silently dropped on the SQL path. (See client.py:1057 and the prompt template at evaluators.py:865.) The Python API-fallback path uses the whole template via prompt_template.format(trace_text=…, final_response=…) (evaluators.py:664). Net result: the two paths can produce different scores for the same session because they're seeing different prompts.
Fix: rebuild the AI.GENERATE prompt to include both the prefix and the suffix of the Python template, with {final_response} appended after the SQL-side trace text (or move to a structured output schema where the prompt is the same on both paths). Track in a small SDK PR.
Without this, the post can't honestly say "same scores, different mechanics."
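A toy illustration of the gap; the template text below is invented for the example, and the real frozen template lives at evaluators.py:865.

```python
# Toy template standing in for criterion.prompt_template — not the SDK's real prompt.
template = (
    "You are a strict judge. Score the session for correctness.\n"
    "Trace:\n{trace_text}\n"
    "Final response:\n{final_response}\n"
    'Reply as JSON: {{"score": <0.0-1.0>, "justification": "<one sentence>"}}'
)

# What the SQL path sees today (client.py:1057): only the text before {trace_text}.
sql_prompt_today = template.split("{trace_text}")[0]

# What the API-fallback path sees (evaluators.py:664): the whole template.
full_prompt = template.format(trace_text="<trace rows>", final_response="<agent reply>")

# Different prompts for the same session -> potentially different scores.
print(sql_prompt_today)
print("---")
print(full_prompt)
```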
Surface execution_mode + fallback_reason in EvaluationReport.details for LLM-judge (publish blocker, F2).
Categorical eval already does this (client.py:1363) — details["execution_mode"] is one of ai-classify / ai-generate / api-fallback, and details["fallback_reason"] carries the exception message when a fallback fired. LLM-judge does not — both _evaluate_llm_judge (client.py:966) and _api_judge (client.py:1183) build their reports without these fields.
Fix: parity with categorical. Add details["execution_mode"] ∈ {ai-generate, ml-generate-text, api-fallback} plus details["fallback_reason"] when a path fired after another raised. Without this, the post's "auditable AI.GENERATE vs. fallback" claim has nothing to point at.
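A hand-written example of the post-F2 shape, mirroring what the plan says categorical eval already exposes; the field names come from the F2 description and the values are illustrative.

```python
# Post-F2 report.details sketch — a plain dict standing in for EvaluationReport.details.
details = {
    "execution_mode": "api-fallback",  # or "ai-generate" / "ml-generate-text"
    "fallback_reason": "AI.GENERATE raised: connection not found",  # illustrative message
}

mode = details.get("execution_mode")
if mode != "ai-generate":
    # This is the line the post's "auditable fallback" claim needs to point at.
    print(f"LLM-judge ran via {mode}: {details.get('fallback_reason', 'n/a')}")
```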
evaluate --exit-code FAIL output for LLM-judge: include criterion, threshold, AND justification snippet (publish blocker, F5).
Today's fallback FAIL line for an LLM-judge session prints score=0.4 threshold=0.7 (cli.py:419) but does not include the criterion name (the metric_name is generic) or SessionScore.llm_feedback (where the judge's justification lives). The post's whole differentiator vs. a hand-rolled judge is "the score is explained" — without a bounded justification snippet on the FAIL line, the reader has nothing to take a screenshot of.
Fix: extend _emit_evaluate_failures to detect LLM-judge SessionScores (presence of llm_feedback) and append a bounded-length snippet (~120 chars, single-line, ellipsis on overflow) after the score/threshold pair. Same one-sitting reviewable change.
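A sketch of the bounding behavior F5 asks for; the helper and the FAIL-line prefix below are illustrative, not the actual _emit_evaluate_failures change.

```python
# Bound the judge's justification to a single ~120-char line for the FAIL output.
def bound_justification(feedback: str, limit: int = 120) -> str:
    one_line = " ".join(feedback.split())  # collapse newlines and runs of whitespace
    return one_line if len(one_line) <= limit else one_line[: limit - 1] + "…"

feedback = ("The agent confirmed a booking even though the book_meeting tool never "
            "returned success, which contradicts the trace.")
print(f"FAIL criterion=correctness score=0.4 threshold=0.7 justification={bound_justification(feedback)}")
```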
Required for clarity
--strict doc + help-text rewrite to match shipped behavior (F4 — shipped in #44, with one correction below).
The first attempt at this rewrite (in #44 head before the latest commit) said --strict flips empty-scores AI.GENERATE rows from "silently passing" to "explicitly failed and counted." That's wrong — both BQ-native judge methods compute passed = bool(scores) and all(...), so empty-scores rows already have passed=False regardless of --strict. TestFalsePassFix.test_empty_score_fails (tests/test_sdk_client.py:860) locks that in.
Corrected understanding (now landed in #44 commit 254eb4c): --strict is a visibility knob, not a pass/fail-affecting flag. It walks the report and stamps SessionScore.details["parse_error"] = True per empty-scores session, plus adds report-level parse_errors / parse_error_rate counters under report.details. For pass/fail-only consumers (CI gates with --exit-code), --strict is a no-op. Reach for it when a dashboard or post-incident investigation needs to distinguish "low score" from "no parseable score" failures.
Three doc surfaces shipped in #44: CLI --help text, SDK.md §4 Strict Mode, and the CHANGELOG.md [Unreleased] entry. No code change.
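A toy sketch of consuming those visibility fields; the hand-written dicts below stand in for report.details and the per-session SessionScore.details.

```python
# Split "no parseable score" failures from genuine low-score failures, per --strict.
report_details = {"parse_errors": 2, "parse_error_rate": 0.02}
session_details = {"s-101": {"parse_error": True}, "s-102": {}, "s-103": {"parse_error": True}}

parse_error_sessions = [sid for sid, d in session_details.items() if d.get("parse_error")]
print(f"{report_details['parse_errors']} of the failures were judge parse errors "
      f"({report_details['parse_error_rate']:.0%} of sessions): {parse_error_sessions}")
```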
Strongly recommended
Make the three-tier fallback chain visible in output (F3, partial overlap with F2).
The current cascade is AI.GENERATE → ML.GENERATE_TEXT → Gemini API (client.py:974, per the docstring). The post should not pretend the legacy middle tier doesn't exist. F2's execution_mode field covers this if the value space includes ml-generate-text. Confirm during F2 implementation; no extra work if the implementation handles all three.
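The conceptual shape of that cascade, as described above; this is illustrative control flow with stubbed stages, not the SDK's actual code.

```python
# Three-tier cascade sketch: AI.GENERATE -> ML.GENERATE_TEXT -> Gemini API.
def ai_generate_judge(criterion):
    raise RuntimeError("no CONNECTION with aiplatform.user")       # stand-in failure

def ml_generate_text_judge(criterion):
    raise RuntimeError("endpoint is not a legacy BQML model ref")  # stand-in failure

def api_judge(criterion, reason):
    return {"execution_mode": "api-fallback", "fallback_reason": reason}

def run_llm_judge(criterion="correctness"):
    try:
        return ai_generate_judge(criterion)           # 1. single SQL job, data stays in BigQuery
    except Exception as bq_error:
        try:
            return ml_generate_text_judge(criterion)  # 2. legacy in-warehouse surface
        except Exception:
            return api_judge(criterion, str(bq_error))  # 3. traces read into Python, google-genai call

print(run_llm_judge())
```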
Deferred
evaluate --baseline-report for LLM-judge. Post #2 deferred this for the deterministic path. Same shape would help LLM-judge thresholds even more (a 0.7 correctness threshold needs distribution data to defend). Track as a follow-up; not blocking post #3.
Dependency
SDK 0.2.2 (the --threshold semantics + evaluate --exit-code failure output that post #3 also relies on) is live on PyPI as of 2026-04-24.
AI.GENERATE connection on test-project-0728-467323 — needs to be wired before the demo cover screenshot can be captured. ~10 minutes of gcloud bq mk --connection + IAM grant.
bigquery-agent-analytics[improvement] extra installed in the sandbox CI — for the API fallback path the post demos.
Blog PR opens after the SDK polish lands. Per reviewer guidance — the cover screenshot depends on the F1/F2/F3/F5 polish items above.
Medium-specific tactics
Tags: BigQuery, AI Agents, LLM, Google Cloud, Observability. Swap CI/CD (post #2) for LLM — this post is about quality scoring, not workflow. Ordered for reader-intent per Medium's tag guidance.
Opening image / cover: real GHA failure on the LLM-judge step with the judge's justification visible. Higher value than the deterministic-gate cover from post #2 because the explanation is the differentiator. Cover capture blocks on F1/F2/F5 polish.
Code blocks: one-line CLI invocations are the hero. The workflow YAML excerpt is a small adjunct. Embed the gists from gists/08 and gists/09.
Callouts: candidate pull-quotes:
"Latency you can measure. Hallucination you have to score."
"--strict is the difference between a silent skip and a counted failure."
Timeline
Week 1: SDK polish PR(s) — F1 (prompt parity), F2 (execution_mode/fallback_reason), F3 (visible in F2 output), F4 (strict help/doc), F5 (justification snippet on FAIL lines). Could be one PR or two; small, reviewable in one sitting each.
Week 1: Provision the AI.GENERATE connection + IAM on test-project-0728-467323.
Week 1–2: Build demo_calendar_assistant_hallucinated.py + a third sandbox PR that trips correctness. Capture live GHA failure (red on Correctness step, justification visible — post-F5 polish).
Week 2: Run a 100-session sandbox fleet, capture INFORMATION_SCHEMA pivot showing eval-llm-judge row. Compute side-by-side cost table at 1k/10k/100k by extrapolation. Write the prose against the now-accurate SDK surface.
Compressed timeline relative to post #2 because the demo agent + sandbox CI scaffolding all carries over; only the prompt variant, AI.GENERATE connection, and one new judge step are new.
Open questions — resolved per review
Slot swap with #51 — resolved: post #3 = LLM-as-Judge.
Anchor judge for the demo — resolved: correctness (maps cleanly to the hallucinated-booking scenario).
AI.GENERATE-only or AI.GENERATE + API fallback in cover demo — resolved: AI.GENERATE in the cover, but only after F1/F2/F5 polish lands so the path the cover shows is the path the reader will reproduce.
Categorical-eval cross-link depth — resolved: short compare/contrast box (~80 words) in section 5; main post stays focused on continuous judge scores.
Blog PR timing relative to SDK polish — resolved: blog PR waits on SDK polish landing. Cover screenshot depends on justification-rich FAIL output.
Related
--threshold + --exit-code output — merged, shipped in 0.2.2
categorical-eval --exit-code — merged, shipped in 0.2.2
examples/ci/evaluate_thresholds.yml — companion reference workflow