Context
This is the plan for post #3 in the Medium blog series tracked in #51. Post #1 is live on Google Cloud Community. Post #2 is in DevRel review (blog repo PR #17, SDK release 0.2.2); its closing line forward-references this post:
Post #3 in this series picks up the semantic side — LLM-as-Judge for the things that don't fit a budget.
Post #3 delivers on that promise. Topic: scoring production traffic for correctness, hallucination, and sentiment using client.evaluate(evaluator=LLMAsJudge.*()). The execution cascade is AI.GENERATE → legacy ML.GENERATE_TEXT → direct Gemini API via google-genai; the AI.GENERATE path keeps evaluation inside BigQuery, the API fallback trades that for portability when no CONNECTION is wired up.
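For orientation, a minimal sketch of that call shape. The import path, client constructor, and the LLMAsJudge.correctness() factory name are illustrative assumptions; only client.evaluate(evaluator=LLMAsJudge.*()) and the three criteria are given above.

```python
# Illustrative sketch only — constructor and keyword names are assumptions,
# not the SDK's confirmed surface.
from bigquery_agent_analytics import AgentAnalyticsClient, LLMAsJudge  # assumed import path

client = AgentAnalyticsClient(project="my-project")  # assumed constructor
report = client.evaluate(
    evaluator=LLMAsJudge.correctness(),  # also hallucination() / sentiment() per the plan
    threshold=0.7,                       # assumed keyword, mirroring the CLI --threshold flag
)
print(report.details)                    # EvaluationReport.details, per the F2 discussion below
```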
Slot ordering — resolved per review
#51's original ranking placed analyst-friendly views at slot 3 and LLM-as-Judge at slot 4. Reviewer confirmed the swap: post #3 = LLM-as-Judge (this issue), post #4 = analyst views (slot 4 going forward). Honors post #2's published forward-reference and continues the CI-gate arc — post #2 wired up deterministic gates; post #3 wires up semantic gates against the same SDK and the same agent_events corpus.
Title candidates
"Some regressions don't fit a budget. Score them in BigQuery instead." ← recommended
"Score every session for correctness, hallucination, and sentiment without the data leaving BigQuery."
"Your agent_events table can also score itself. Here's LLM-as-Judge in BigQuery."
Recommendation: #1. It's a direct callback to post #2's "things that don't fit a budget" pull-quote, the three-section editorial cadence (problem → SDK → demo) lands cleanly, and the "in BigQuery instead" closer is honest about where the AI.GENERATE path runs (and is hedged in the body for the API-fallback path).
Target audience
Eval-curious engineers / quality leads / agent owners who:
Have hit the limit of "is the latency / token / error rate okay?" — those gates pass, but the agent is still subtly wrong: hallucinating, refusing, going off-tone, missing the user's intent.
Are wary of LLM-as-Judge because they've seen it produce "vibe scores" — they want strictness, deterministic prompts, and a path to gate CI on the result, not a dashboard of vibes.
Have BigQuery and (sometimes) a CONNECTION for AI.GENERATE. The post explicitly handles both cases: connection wired up → AI.GENERATE SQL path; no connection → Gemini API fallback via google-genai. Reader picks based on what's available.
Post #2's audience (platform engineers wiring up CI) is the same here — this post extends what they already shipped.
Structure (Medium best practices)
Target length: 1,400–1,800 words (6–8 min read, same as posts #1 and #2). Cover image: real bq-agent-sdk evaluate --evaluator=llm-judge --exit-code failure output with the LLM justification visible (proof that the score is explained, not vibes). Inline shots: side-by-side cost table, terminal failure output with strict-mode behavior on AI.GENERATE NULL rows, INFORMATION_SCHEMA pivot showing eval-llm-judge as a separate sdk_feature row from post #2's eval-code.
H1: Some regressions don't fit a budget. Score them in BigQuery instead.
Sub: Use LLM-as-Judge over BigQuery's AI.GENERATE — with a transparent
fallback chain when AI.GENERATE isn't available — to gate CI on
correctness, hallucination, and sentiment at fleet scale.
1. Hook (80 words)
- Real example: a session where every deterministic gate from post
#2 passes (latency fine, tokens fine, no tool errors) but the
agent confidently states a fact that's wrong. Render the trace
tree with post #1's `client.get_session_trace(...).render()`,
point at the LLM_RESPONSE that fabricated it.
- "Latency you can measure. Hallucination you have to *score*.
The SDK does both, on the same data, in the same warehouse —
when the warehouse can call the model."
2. The problem in one paragraph (120 words)
- Post #2's deterministic gates catch things you can put a number on.
Some regressions hide under those numbers — confident wrong answers,
refusals, off-tone responses, schema-shaped hallucinations.
- You can't scale "did a human read it" to fleet traffic. You also
can't trust a vibe score from a hand-rolled judge prompt.
- The SDK ships three pre-built judges (correctness, hallucination,
sentiment), each backed by a frozen prompt template, defaulting
to `gemini-2.5-flash`, runnable over thousands of sessions in a
single SQL statement when AI.GENERATE is available.
3. The SDK is already AI-in-warehouse (200 words)
- `bq-agent-sdk evaluate --evaluator=llm-judge --criterion=correctness
--threshold=0.7 --strict --exit-code` — same exit-code shape as
post #2's deterministic gates, slots into the same workflow YAML.
- Execution cascade — pinned to what the SDK actually does today:
1. **AI.GENERATE path** (when endpoint is not a legacy BQML ref).
Single BigQuery SQL job. No data leaves the warehouse.
Requires a `CONNECTION` with `aiplatform.user`. Reads as one
job in `INFORMATION_SCHEMA`, labeled `sdk_feature=eval-llm-judge`.
2. **Legacy `ML.GENERATE_TEXT` fallback** (when AI.GENERATE
raises and the endpoint is a BQML model ref). Same in-warehouse
guarantee; older SQL surface.
3. **Gemini API fallback** (when neither BQ-native path works).
Requires `bigquery-agent-analytics[improvement]` for
`google-genai`. Reads traces from BQ into Python, calls
Gemini directly, scores per session.
- `AI.GENERATE` keeps evaluation in BigQuery; the API fallback
trades that for portability. Either way, scores are produced
against the same prompt templates and the same threshold.
- **(Caveat — see SDK polish below.)** Today the AI.GENERATE
path uses a slightly truncated form of the Python prompt
template, and the SDK does not currently surface which path
fired in `report.details`. Both are required polish items
before this post publishes.
- `--strict` flag — see section 4 for the exact semantics.
4. The demo (450 words, the core)
- Real scenario: "The hallucinated booking confirmation"
- A regressed PR adjusts the Calendar-Assistant prompt to be more
"decisive." The agent now confirms bookings even when
book_meeting hasn't run successfully — the deterministic gate
passes (latency normal, no tool errors, token usage within
budget), but the user got told their meeting is booked when
it isn't.
- Add one step to post #2's workflow YAML:
`bq-agent-sdk evaluate --evaluator=llm-judge --criterion=correctness
--threshold=0.7 --strict --exit-code --last=24h --agent-id=…`
- PR goes red on "Correctness". The CI log shows session id,
score, threshold, *and the judge's justification* (post-polish —
today the FAIL line stops at `score=` and `threshold=`; the
justification fix is a hard publish blocker).
- **Strict-mode demo — pivoted to parse-error visibility, not
pass/fail flipping.** Empty/NULL-scores AI.GENERATE rows
already fail without `--strict` (locked in by
`TestFalsePassFix.test_empty_score_fails`); both BQ-native
judge methods compute `passed = bool(scores) and all(...)`,
which fails empty-scores rows trivially. The
`--exit-code` gate sees the same outcome with or without
`--strict` for those rows. What `--strict` does: walks the
report and stamps `details["parse_error"] = True` per
affected session, plus a report-level `parse_errors` /
`parse_error_rate` counter — investigation visibility that
lets a dashboard distinguish "low score" from "no parseable
score" failures. The API-fallback path coerces malformed
output to `score=0.0`, so its parse failures present as
low-score failures and don't surface through `--strict`
today. Demo (revised): capture two views of the same red
CI run — first the FAIL output without `--strict` (every
failing session looks like a low-score failure), then the
same run with `--strict` adding the `parse_errors` counter
so the reader sees "of those N failures, K were judge
parse errors, not actual low scores." Recommend `--strict`
for dashboards and post-incident investigation; explicitly
note it's a no-op for pass/fail-only `--exit-code` gates.
- Sidebar: setting thresholds for judges. Unlike a latency budget,
a correctness threshold of 0.7 is opinionated. Recipe: run
without `--exit-code` over the last 30 days, look at the score
distribution, set the threshold a couple of points below the
5th percentile of normal traffic. Revisit weekly. (Recipe sketch after this outline.)
5. Going deeper (200 words)
- Stack judges: `correctness` AND `hallucination` AND `sentiment`
as three separate workflow steps. Each gets its own threshold,
each shows up as its own row in the CI log. Same pattern as
post #2's four deterministic gates.
- Cost-vs-confidence sidebar: side-by-side table at 1k / 10k /
100k sessions for AI.GENERATE vs. API fallback, with the
reminder that AI.GENERATE bills inference on the AI Platform
side separately from the BigQuery slot ms shown in
INFORMATION_SCHEMA.
- **Compare/contrast box vs categorical-eval** (~80 words):
`LLMAsJudge` produces *continuous* scores (0.0–1.0); good for
thresholded gates and for tracking distributions over time.
`categorical-eval` (post #2) produces *discrete* one-of-N
classifications; good for pass-rate gates and dashboards
where the categories are the story. Use both in the same
workflow when both shapes apply.
- Forward-reference to post #4 (analyst views) — for teams that
want the score *distribution* over time as a Looker Studio
chart instead of a CI gate, the next post in the series.
6. What the plugin labels show over time (180 words)
- Same INFORMATION_SCHEMA pivot from posts #1 and #2, this time
with three rows: `eval-code`, `eval-llm-judge`, `trace-read`.
The AI.GENERATE judge's jobs report under `sdk_feature=eval-llm-judge`
with an `sdk_ai_function=ai-generate` (or `ml-generate-text`)
sub-label. (Query sketch after this outline.)
- One gotcha: `AI.GENERATE` jobs *do* show up in INFORMATION_SCHEMA,
but Vertex AI inference is billed separately on the AI Platform
side. The pivot tells you what the BigQuery side costs; check
the AI Platform billing report for the inference side.
7. Try it (100 words)
- Two-action CTA:
(1) Add one step to your post-#2 workflow with
`--evaluator=llm-judge --criterion=correctness --strict
--exit-code`. Pick a threshold from the sidebar.
(2) If you don't have a `CONNECTION`, install
`bigquery-agent-analytics[improvement]` so the API
fallback is available.
- "If you only run one judge, run correctness. The other two
are tuning."
- Forward-reference to post #4: "Want the same scores as a
trend chart instead of a CI gate? Point dbt + Looker Studio
at the views the plugin already created. That's next."
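Two hedged sketches to accompany the outline above. First, the threshold-setting recipe from section 4's sidebar; the score list is a stand-in, and only the recipe itself (5th percentile of normal traffic, minus a couple of points) comes from the plan.

```python
# Threshold recipe sketch: run the judge without --exit-code over recent traffic,
# then derive a starting --threshold from the score distribution.
import statistics

scores = [0.91, 0.88, 0.95, 0.79, 0.84, 0.90, 0.87, 0.93, 0.81, 0.89]  # stand-in for 30 days of sessions

p5 = statistics.quantiles(scores, n=20)[0]  # 5th percentile of the distribution
threshold = round(p5 - 0.02, 2)             # a couple of points below normal traffic's low end
print(f"5th percentile ≈ {p5:.2f}; start the correctness gate at --threshold={threshold}")
```

Second, the section-6 pivot, assuming the plugin stamps sdk_feature / sdk_ai_function as BigQuery job labels (the project id and region below are placeholders).

```python
# INFORMATION_SCHEMA pivot sketch: BigQuery-side cost per SDK feature label.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project
sql = """
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'sdk_feature')     AS sdk_feature,
  (SELECT value FROM UNNEST(labels) WHERE key = 'sdk_ai_function') AS sdk_ai_function,
  COUNT(*)           AS jobs,
  SUM(total_slot_ms) AS total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND EXISTS (SELECT 1 FROM UNNEST(labels) WHERE key = 'sdk_feature')
GROUP BY 1, 2
ORDER BY total_slot_ms DESC
"""
for row in client.query(sql).result():
    print(dict(row))  # expect eval-code, eval-llm-judge, trace-read rows; slot ms excludes Vertex AI inference
```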
Demo requirements
demo_calendar_assistant.py carries over from posts #1 and #2 — the regressed-branch demo for post #3 is a third prompt variant that produces plausible-looking but wrong outputs (e.g., confirming a booking the tools didn't actually complete). New file: demo_calendar_assistant_hallucinated.py. Lives in the blog repo alongside the other two demos.
Real CI screenshots — actual GitHub Actions failure showing the LLM-judge step red, with the judge's justification visible. Same sandbox repo as post #2 (caohy1988/bqaa-ci-sandbox); add a third PR variant that triggers correctness failures. Cover screenshot can only be captured after the SDK polish lands — see the F2/F5 polish items below.
Real AI.GENERATE connection — set up aiplatform.user-scoped connection in test-project-0728-467323 so the AI.GENERATE path is exercised live for the cover screenshot.
Side-by-side cost table at 1k/10k/100k scale — extrapolate from a real 100-session run on the sandbox (don't fake the 100k row).
Reusable gist — gists/08_llm_judge_correctness_gate.sh plus the three-judge stack as gists/09_three_judge_workflow.yml.
SDK improvements to ship alongside the post
Reviewer audit flagged five spots where the post's draft narrative ran ahead of what the SDK actually ships today. Three of them are hard publish blockers; one is a clarity blocker; one re-shapes the strict-mode prose.
Required before publish
AI.GENERATE prompt-template parity with the Python path (publish blocker, F1).
_ai_generate_judge passes only criterion.prompt_template.split("{trace_text}")[0] as judge_prompt — i.e., everything after the {trace_text} placeholder in the Python template is silently dropped on the SQL path. (See client.py:1057 and the prompt template at evaluators.py:865.) The Python API-fallback path uses the whole template via prompt_template.format(trace_text=…, final_response=…) (evaluators.py:664). Net result: the two paths can produce different scores for the same session because they're seeing different prompts.
Fix: rebuild the AI.GENERATE prompt to include both the prefix and the suffix of the Python template, with {final_response} appended after the SQL-side trace text (or move to a structured output schema where the prompt is the same on both paths). Track in a small SDK PR.
Without this, the post can't honestly say "same scores, different mechanics."
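A toy illustration of the gap; the template text below is invented for the example, and the real frozen template lives at evaluators.py:865.

```python
# Toy template standing in for criterion.prompt_template — not the SDK's real prompt.
template = (
    "You are a strict judge. Score the session for correctness.\n"
    "Trace:\n{trace_text}\n"
    "Final response:\n{final_response}\n"
    'Reply as JSON: {{"score": <0.0-1.0>, "justification": "<one sentence>"}}'
)

# What the SQL path sees today (client.py:1057): only the text before {trace_text}.
sql_prompt_today = template.split("{trace_text}")[0]

# What the API-fallback path sees (evaluators.py:664): the whole template.
full_prompt = template.format(trace_text="<trace rows>", final_response="<agent reply>")

# Different prompts for the same session -> potentially different scores.
print(sql_prompt_today)
print("---")
print(full_prompt)
```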
Surface execution_mode + fallback_reason in EvaluationReport.details for LLM-judge (publish blocker, F2).
Categorical eval already does this (client.py:1363) — details["execution_mode"] is one of ai-classify / ai-generate / api-fallback, and details["fallback_reason"] carries the exception message when a fallback fired. LLM-judge does not — both _evaluate_llm_judge (client.py:966) and _api_judge (client.py:1183) build their reports without these fields.
Fix: parity with categorical. Add details["execution_mode"] ∈ {ai-generate, ml-generate-text, api-fallback} plus details["fallback_reason"] when a path fired after another raised. Without this, the post's "auditable AI.GENERATE vs. fallback" claim has nothing to point at.
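A hand-written example of the post-F2 shape, mirroring what the plan says categorical eval already exposes; the field names come from the F2 description and the values are illustrative.

```python
# Post-F2 report.details sketch — a plain dict standing in for EvaluationReport.details.
details = {
    "execution_mode": "api-fallback",  # or "ai-generate" / "ml-generate-text"
    "fallback_reason": "AI.GENERATE raised: connection not found",  # illustrative message
}

mode = details.get("execution_mode")
if mode != "ai-generate":
    # This is the line the post's "auditable fallback" claim needs to point at.
    print(f"LLM-judge ran via {mode}: {details.get('fallback_reason', 'n/a')}")
```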
evaluate --exit-code FAIL output for LLM-judge: include criterion, threshold, AND justification snippet (publish blocker, F5).
Today's fallback FAIL line for an LLM-judge session prints score=0.4 threshold=0.7 (cli.py:419) but does not include the criterion name (the metric_name is generic) or SessionScore.llm_feedback (where the judge's justification lives). The post's whole differentiator vs. a hand-rolled judge is "the score is explained" — without a bounded justification snippet on the FAIL line, the reader has nothing to take a screenshot of.
Fix: extend _emit_evaluate_failures to detect LLM-judge SessionScores (presence of llm_feedback) and append a bounded-length snippet (~120 chars, single-line, ellipsis on overflow) after the score/threshold pair. Same one-sitting reviewable change.
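A sketch of the bounding behavior F5 asks for; the helper and the FAIL-line prefix below are illustrative, not the actual _emit_evaluate_failures change.

```python
# Bound the judge's justification to a single ~120-char line for the FAIL output.
def bound_justification(feedback: str, limit: int = 120) -> str:
    one_line = " ".join(feedback.split())  # collapse newlines and runs of whitespace
    return one_line if len(one_line) <= limit else one_line[: limit - 1] + "…"

feedback = ("The agent confirmed a booking even though the book_meeting tool never "
            "returned success, which contradicts the trace.")
print(f"FAIL criterion=correctness score=0.4 threshold=0.7 justification={bound_justification(feedback)}")
```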
Required for clarity
--strict doc + help-text rewrite to match shipped behavior (F4 — shipped in #44, with one correction below).
The first attempt at this rewrite (in #44 head before the latest commit) said --strict flips empty-scores AI.GENERATE rows from "silently passing" to "explicitly failed and counted." That's wrong — both BQ-native judge methods compute passed = bool(scores) and all(...), so empty-scores rows already have passed=False regardless of --strict. TestFalsePassFix.test_empty_score_fails (tests/test_sdk_client.py:860) locks that in.
Corrected understanding (now landed in #44 commit 254eb4c): --strict is a visibility knob, not a pass/fail-affecting flag. It walks the report and stamps SessionScore.details["parse_error"] = True per empty-scores session, plus adds report-level parse_errors / parse_error_rate counters under report.details. For pass/fail-only consumers (CI gates with --exit-code), --strict is a no-op. Reach for it when a dashboard or post-incident investigation needs to distinguish "low score" from "no parseable score" failures.
Three doc surfaces shipped in #44: CLI --help text, SDK.md §4 Strict Mode, and the CHANGELOG.md [Unreleased] entry. No code change.
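A toy sketch of consuming those visibility fields; the hand-written dicts below stand in for report.details and the per-session SessionScore.details.

```python
# Split "no parseable score" failures from genuine low-score failures, per --strict.
report_details = {"parse_errors": 2, "parse_error_rate": 0.02}
session_details = {"s-101": {"parse_error": True}, "s-102": {}, "s-103": {"parse_error": True}}

parse_error_sessions = [sid for sid, d in session_details.items() if d.get("parse_error")]
print(f"{report_details['parse_errors']} of the failures were judge parse errors "
      f"({report_details['parse_error_rate']:.0%} of sessions): {parse_error_sessions}")
```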
Strongly recommended
Make the three-tier fallback chain visible in output (F3, partial overlap with F2).
The current cascade is AI.GENERATE → ML.GENERATE_TEXT → Gemini API (client.py:974, per the docstring). The post should not pretend the legacy middle tier doesn't exist. F2's execution_mode field covers this if the value space includes ml-generate-text. Confirm during F2 implementation; no extra work if the implementation handles all three.
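The conceptual shape of that cascade, as described above; this is illustrative control flow with stubbed stages, not the SDK's actual code.

```python
# Three-tier cascade sketch: AI.GENERATE -> ML.GENERATE_TEXT -> Gemini API.
def ai_generate_judge(criterion):
    raise RuntimeError("no CONNECTION with aiplatform.user")       # stand-in failure

def ml_generate_text_judge(criterion):
    raise RuntimeError("endpoint is not a legacy BQML model ref")  # stand-in failure

def api_judge(criterion, reason):
    return {"execution_mode": "api-fallback", "fallback_reason": reason}

def run_llm_judge(criterion="correctness"):
    try:
        return ai_generate_judge(criterion)           # 1. single SQL job, data stays in BigQuery
    except Exception as bq_error:
        try:
            return ml_generate_text_judge(criterion)  # 2. legacy in-warehouse surface
        except Exception:
            return api_judge(criterion, str(bq_error))  # 3. traces read into Python, google-genai call

print(run_llm_judge())
```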
Deferred
evaluate --baseline-report for LLM-judge. Post #2 deferred this for the deterministic path. Same shape would help LLM-judge thresholds even more (a 0.7 correctness threshold needs distribution data to defend). Track as a follow-up; not blocking post #3.
Dependency
SDK 0.2.2 (the --threshold semantics + evaluate --exit-code failure output that post #3 also relies on) is live on PyPI as of 2026-04-24.
AI.GENERATE connection on test-project-0728-467323 — needs to be wired before the demo cover screenshot can be captured. ~10 minutes of gcloud bq mk --connection + IAM grant.
bigquery-agent-analytics[improvement] extra installed in the sandbox CI — for the API fallback path the post demos.
Blog PR opens after the SDK polish lands. Per reviewer guidance — the cover screenshot depends on the F1/F2/F3/F5 polish items above.
Medium-specific tactics
Tags: BigQuery, AI Agents, LLM, Google Cloud, Observability. Swap CI/CD (post #2) for LLM — this post is about quality scoring, not workflow. Ordered for reader-intent per Medium's tag guidance.
Opening image / cover: real GHA failure on the LLM-judge step with the judge's justification visible. Higher value than the deterministic-gate cover from post #2 because the explanation is the differentiator. Cover capture blocks on F1/F2/F5 polish.
Code blocks: one-line CLI invocations are the hero. The workflow YAML excerpt is a small adjunct. Embed the gists from gists/08 and gists/09.
Callouts: candidate pull-quotes:
"Latency you can measure. Hallucination you have to score."
"--strict is the difference between a silent skip and a counted failure."
Timeline
Week 1: SDK polish PR(s) — F1 (prompt parity), F2 (execution_mode/fallback_reason), F3 (visible in F2 output), F4 (strict help/doc), F5 (justification snippet on FAIL lines). Could be one PR or two; small, reviewable in one sitting each.
Week 1: Provision the AI.GENERATE connection + IAM on test-project-0728-467323.
Week 1–2: Build demo_calendar_assistant_hallucinated.py + a third sandbox PR that trips correctness. Capture live GHA failure (red on Correctness step, justification visible — post-F5 polish).
Week 2: Run a 100-session sandbox fleet, capture INFORMATION_SCHEMA pivot showing eval-llm-judge row. Compute side-by-side cost table at 1k/10k/100k by extrapolation. Write the prose against the now-accurate SDK surface.
Compressed timeline relative to post #2 because the demo agent + sandbox CI scaffolding all carries over; only the prompt variant, AI.GENERATE connection, and one new judge step are new.
Open questions — resolved per review
Slot swap with #51 — resolved: post #3 = LLM-as-Judge.
Anchor judge for the demo — resolved: correctness (maps cleanly to the hallucinated-booking scenario).
AI.GENERATE-only or AI.GENERATE + API fallback in cover demo — resolved: AI.GENERATE in the cover, but only after F1/F2/F5 polish lands so the path the cover shows is the path the reader will reproduce.
Categorical-eval cross-link depth — resolved: short compare/contrast box (~80 words) in section 5; main post stays focused on continuous judge scores.
Blog PR timing relative to SDK polish — resolved: blog PR waits on SDK polish landing. Cover screenshot depends on justification-rich FAIL output.
Related
--threshold + --exit-code output — merged, shipped in 0.2.2
categorical-eval --exit-code — merged, shipped in 0.2.2
examples/ci/evaluate_thresholds.yml — companion reference workflow