Context
This is the plan for post #2 in the Medium blog series tracked in #51. Post #1 is live on Google Cloud - Community ("Your BigQuery Agent Analytics table is a graph. Here's how to see it via SDK"). That post closes with an explicit teaser:
The natural follow-up — turning this filter into an automated eval check that runs on every deploy — is the next post in this series. Spoiler: client.evaluate_categorical(...) plus three lines of CategoricalMetricDefinition gets you a CI gate. Your agent_events table is also a test suite.
Post #2 delivers on that promise. Topic: code-based evals in CI, using bq-agent-sdk evaluate --exit-code to fail a PR when the SDK's deterministic metrics (latency, error rate, token cost) regress against the last 24 hours of production traffic.
Per the series ranking in #51, this is slot 2 (top-of-funnel, universal-audience). It converts the "I can see my traces now" readers from post #1 into "I can guard against regressions with them" readers.
Title candidates
"Your agent_events table is also a test suite. Here's how to wire it into CI." ← recommended
"Stop shipping agent regressions: a 20-line GitHub Action that fails on prod-traffic quality drops."
"The minimum agent quality gate is 20 lines of YAML."
Recommendation: #1. It directly cashes in post #1's closing pull-quote ("your agent_events table is also a test suite"), which is already a rhetorical hook in the public record and gives the series a callback. #2 is clickier but fights the continuity. #3 is tight but loses the narrative arc.
Target audience
Platform / infra / devops engineers responsible for agent reliability in production. Specifically:
Already run CI on their agent code — unit tests, lint, maybe a smoke-test suite against a golden set.
Have been bitten at least once by an agent regression that passed golden-set tests but blew up in production (latency cliff, tool-error spike, token-cost runaway).
Would fork a working GitHub Actions YAML if someone handed it to them.
Post #1's audience (ADK developers new to the plugin) is not the primary here. Post #2 assumes the reader has traces — the question is what to do with them structurally, not how to read them.
Structure (Medium best practices)
Target length: 1,400–1,800 words (6–8 min read). One lede image, 3–4 inline images (workflow YAML on GitHub, red/green CI status, SDK output with threshold-miss formatting, production-vs-golden-set delta chart), one closing image.
H1: Your agent_events table is also a test suite. Here's how to wire it into CI.
Sub: Twenty lines of GitHub Actions YAML, a threshold, and the last 24 hours
of production traffic — the minimum agent quality gate, running
entirely against the SDK's deterministic code evaluators.
1. Hook (80 words)
- Real screenshot: a Slack-style incident thread — "p95 latency spiked
after merge #842. Rollback in progress."
- "This is avoidable. Your agent_events table already has the data
that would have caught it. The gate is 20 lines of CI."
2. The problem in one paragraph (120 words)
- Golden-set tests catch known shapes. Production traffic is bigger,
weirder, and moves faster than any golden set.
- The SDK's code evaluator already knows how to score production
traffic on latency, turn count, tool error rate, token efficiency,
TTFT, and cost — deterministically, no LLM call, no golden trace.
- The only missing piece is "run this in CI and block the merge when
a threshold busts." That piece already exists too: --exit-code.
3. The SDK is already CI-friendly (150 words)
- One metric per `bq-agent-sdk evaluate` invocation today — each
takes `--evaluator=<name>` + `--threshold=<value>` + `--exit-code`.
A multi-metric gate is *multiple sequential commands*, not a
single command with multiple flags. Pattern is exactly what
`examples/ci_eval_pipeline.sh` already does in-repo.
- Exit code: 0 = every evaluated session passed the threshold;
1 = at least one session fell below; 2 = config/auth error.
(See "what `--exit-code` actually measures" below — it is
per-session, not a percentile gate.)
- Zero LLM tokens on the deterministic path — cheap enough to run
on every PR, not just every deploy.
- Screenshot of the exit-code path in a terminal, red/green.
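The sequential pattern section 3 describes can be sketched as a small runner. This is a hypothetical sketch, not SDK code: the flag names (`--evaluator`, `--threshold`, `--exit-code`) are the ones this plan documents, and the threshold values are illustrative.

```python
import subprocess
import sys

# One metric per `bq-agent-sdk evaluate` invocation, as the plan describes;
# CI fails on the first exit 1. Threshold values here are placeholders.
GATE = [
    ["bq-agent-sdk", "evaluate", "--evaluator=latency", "--threshold=5000", "--exit-code"],
    ["bq-agent-sdk", "evaluate", "--evaluator=error_rate", "--threshold=0.05", "--exit-code"],
    ["bq-agent-sdk", "evaluate", "--evaluator=token_efficiency", "--threshold=0.8", "--exit-code"],
    ["bq-agent-sdk", "evaluate", "--evaluator=turn_count", "--threshold=12", "--exit-code"],
]

def run_gate(commands):
    """Run each command in order; return the first failing command, or None."""
    for cmd in commands:
        if subprocess.run(cmd).returncode != 0:
            return cmd  # stop at the first threshold miss
    return None

# Stand-in commands so the sketch runs without the SDK installed.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
bad = [sys.executable, "-c", "raise SystemExit(1)"]
ok_result = run_gate([ok, ok])    # None: every gate passed
bad_result = run_gate([ok, bad])  # the failing command is returned
```

In a real workflow, `GATE` would be the four commands in the YAML and the "return the first failing command" step is what the per-threshold messaging polish below makes legible in the CI log.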
4. The demo (600 words, the core)
- Real scenario: "The per-session token budget regression"
- A feature PR changes the agent's system prompt to add more few-shot
examples. Looks fine locally. Pushes a subset of sessions over
a token-per-session budget the team set two months ago.
- Workflow YAML (20 lines, embedded as Gist) runs four `evaluate`
commands in sequence — one per metric — mirroring the existing
`examples/ci_eval_pipeline.sh` pattern: latency, error_rate,
token_efficiency, turn_count. Each has its own threshold, each
runs with `--exit-code`, CI fails on the first exit 1.
- PR goes red: the CI log shows which command failed, which
metric, and which session(s) fell below the threshold. (This is
exactly what the per-threshold exit-code messaging polish in the
SDK-improvements section below buys us; today's output is less
readable than the demo wants.)
- Fix — scope the prompt change to the 30% of sessions that need
it, not all of them. PR flips green.
- **Explicit note for the reader**: today's gate is per-session
("any session fell below threshold → fail"), not p95. Post
explicitly calls this out — the reader gets a working CI gate
today, with a pointer to the percentile-gating polish proposed
below for the teams that want statistical thresholds instead.
- Sidebar: how thresholds get set. For per-session gates, p95 or
p99 of last 30 days is a defensible starting point (set the
threshold below that line so outliers trip it); revisit after
week 1.
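The sidebar's threshold-setting guidance can be sketched as a few lines of arithmetic. Everything here is an assumption for illustration: the sample latencies, the nearest-rank percentile choice, and the 10% buffer (mirroring the proposed `--buffer=0.10`); no SDK helper is involved.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (q in (0, 100]) over a non-empty sample."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Assumed sample: per-session latencies (ms) from the last 30 days of prod.
latencies_ms = [820, 910, 1050, 1200, 1350, 1500, 1800, 2400, 3100, 5200]

p95 = percentile(latencies_ms, 95)   # 5200 on this sample
suggested_budget_ms = p95 * 1.10     # 5720.0 with a 10% buffer
```

This is exactly the computation the `--baseline-report` helper proposed below would automate; until it exists, readers run it by hand against their own 30-day window.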
5. Going deeper (250 words)
- Cross-link back to post #1's fleet filter: the ambiguity pattern
from post #1 is itself a CI-worthy check — *"did this PR push
the multi-match rate above 20%?"* Worked example in prose, no
code, because there isn't a deterministic CLI path for this yet.
- One-paragraph teaser for categorical-eval gating: `categorical-eval`
exists today and produces pass rates, but does **not** support
`--exit-code` or a `--pass-rate-threshold` flag yet, so it isn't
a shippable CI gate. Tracked as an SDK polish item (see below)
and deferred in depth to post #4 in the series (LLM-as-Judge).
6. What the plugin labels show over time (200 words)
- The SDK labels every query with sdk_feature, so you can point
INFORMATION_SCHEMA at your CI runs and see exactly what the
gate is costing. Sample INFORMATION_SCHEMA query + real result
(matches section 6 format from post #1).
- CI should be a budget line, not a surprise bill.
7. Try it (100 words)
- Fork the workflow file — direct link to a Gist or example repo.
- Three-action CTA:
(1) Add the workflow to a test agent repo.
(2) Pick four thresholds based on last 30 days of prod.
(3) Watch your next PR run the gate.
- "The minimum agent quality gate is 20 lines of YAML and the
table you already have."
Demo requirements
Real threshold numbers — pull the p95 / error-rate / token baselines from the same sandbox project (test-project-0728-467323 / agent_analytics_demo) that post #1 used. Keeps the demo reproducible against the same corpus.
Reusable workflow YAML — committed as examples/ci/evaluate_thresholds.yml in the SDK repo or blog repo. Must be fork-and-ship working on day one; not pseudocode.
SDK improvements to ship alongside the post
Review of #77 confirmed four places where today's CLI is behind where the original draft of this post assumed it was. The first two are hard publish blockers (the demo cannot honestly hit its narrative otherwise); the third reshapes the story for teams that want statistical gates; the fourth is a post #4 concern.
Required before publish
--threshold means a raw metric budget, not a normalization denominator (demo-blocking — publish blocker).
Today --threshold=5000 on --evaluator=latency does not mean "fail when avg_latency_ms > 5000." The threshold is passed into a normalized-score function — score = 1.0 - (avg_latency_ms / threshold_ms), clamped to [0, 1] (udf_kernels.py:147) — and pass/fail is checked against a hardcoded 0.5 score cutoff (evaluators.py:249, inside CodeEvaluator.latency()). Net result: --threshold=5000 actually fails around avg_latency_ms > 2500. Same shape for the other prebuilt evaluators.
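The mismatch is easiest to see as standalone arithmetic. The sketch below reproduces the scoring shape quoted above (it is not the SDK's source): with a typed threshold of 5000 ms, the 0.5 score cutoff trips the gate near 2500 ms.

```python
# Sketch of the described behavior: score = 1.0 - (avg / threshold), clamped.
def normalized_latency_score(avg_latency_ms, threshold_ms):
    return max(0.0, min(1.0, 1.0 - (avg_latency_ms / threshold_ms)))

PASS_CUTOFF = 0.5  # the hardcoded score cutoff the review identified

def passes(avg_latency_ms, threshold_ms):
    return normalized_latency_score(avg_latency_ms, threshold_ms) >= PASS_CUTOFF

# With --threshold=5000 the gate actually trips near 2500 ms, not 5000 ms:
assert passes(2400, 5000)        # score 0.52: pass
assert not passes(2600, 5000)    # score 0.48: fail
assert not passes(4900, 5000)    # well inside the "5 s budget", still fails
```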
A reader who forks a YAML with --threshold=5000 expecting a 5-second budget and gets a 2.5-second gate will lose trust in the post and the SDK at the same time. This cannot ship as-is. Two options to fix — pick one in the SDK polish PR before publish:
Option 1 — change prebuilt evaluator gates to compare raw values (preferred). CodeEvaluator.latency(threshold_ms=5000) passes avg_latency_ms <= 5000 directly, no normalization, no 0.5 cutoff. Apply to all prebuilt evaluators (latency, error_rate, turn_count, token_efficiency, ttft, cost). Minor behavior change; the current normalized semantics are effectively a latent bug — nobody in CI would have deliberately chosen "fail at half the threshold I typed." Mitigate with a CHANGELOG entry + a deprecation note if anyone's relying on the old curve.
Option 2 — add explicit raw-budget flags (additive). Keep the normalized scorer as-is but add --latency-budget-ms, --error-rate-budget, --token-budget-per-session (etc.) that bypass the normalization. Stricter semantics for CI users; the normalized score stays for reporting/dashboard consumers who want a 0–1 signal.
Preference: Option 1. Simpler mental model for the reader, smaller API surface. But either closes the blocker.
Until this lands, the post cannot honestly CTA "fork this YAML."
Per-session exit-code messaging (demo-blocking). When --exit-code triggers exit 1 in the current bq-agent-sdk evaluate output (cli.py:267), the CI log doesn't reliably name which session(s) fell below threshold, for which metric, or by how much. Readable CI logs are table stakes for a post whose CTA is "fork this YAML and ship a gate." Audit and tighten: metric name, threshold value, observed value, failing session ids (capped at top N), one line per failure. Same framing as post #1's Span.tool_name / trace.render(color=) polish — small, reviewable in one sitting.
Strongly recommended (reshapes the narrative)
Percentile / baseline gating on evaluate (narrative-blocking for the p95 story). Current CodeEvaluator semantics are per-session: any single session below the evaluator's threshold trips exit 1 (cli.py:360 — if exit_code and report.pass_rate < 1.0). That's a legitimate and useful gate, but it is not the "p95 regressed vs last week" story the first draft told. Options:
Option A (minimal): keep the post narrative strictly per-session, remove any p95 language from the demo. Workflow ships today.
Option B (polish adds narrative back): add --aggregate=p95|p99|mean + --threshold-aggregate flags so bq-agent-sdk evaluate computes a statistic over the window and compares the aggregate to the threshold. Pairs with the --baseline-report helper below.
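The difference between the two semantics can be sketched over one sample. This is a hypothetical illustration, assuming raw-budget comparison (lower latency is better, per Option 1 above) and assumed session numbers; it is not SDK code.

```python
def per_session_gate(latencies_ms, budget_ms):
    """Today's semantics: any single session over budget trips exit 1."""
    return all(x <= budget_ms for x in latencies_ms)

def p95_gate(latencies_ms, budget_ms):
    """Proposed --aggregate=p95 semantics: one statistic vs. the budget."""
    ordered = sorted(latencies_ms)
    idx = max(0, round(0.95 * len(ordered)) - 1)  # nearest-rank p95
    return ordered[idx] <= budget_ms

# 20 sessions, one latency outlier (assumed numbers for illustration).
sample = [1000 + 25 * i for i in range(19)] + [7200]

per_session_gate(sample, 5000)  # False: the 7200 ms outlier blocks the PR
p95_gate(sample, 5000)          # True: p95 sits at 1450 ms, under budget
```

Same data, opposite verdicts: which one a team wants is exactly the Option A / Option B decision.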
--baseline-report helper on evaluate (optional, reduces prose). A convenience that reads the last N days of production traffic and emits a pasteable suggested-thresholds block — e.g., bq-agent-sdk evaluate --baseline-report --last=30d --buffer=0.10 --format=text prints per-evaluator p50/p95/p99 + a suggested threshold set. Not ranking as mandatory, but it halves the "what do I set my thresholds to?" friction the post otherwise has to answer in a sidebar. Naming note per review: --baseline-report over --suggest-thresholds — more neutral, lets the output carry a suggested_thresholds: block without sounding authoritative.
Deferred (references to post #4, not blocking post #2)
categorical-eval --exit-code / --pass-rate-threshold. categorical-eval (cli.py:589) produces pass rates today but doesn't expose --exit-code or a threshold flag — so the post should not present categorical eval as a ready-to-ship CI gate. Either defer entirely to post #4 (LLM-as-Judge), or ship this flag pair as a small SDK polish before post #4 publishes. Post #2 mentions it in one paragraph as a pointer forward; it does not demo it.
Timeline
Week 1: SDK polish PR — per-session exit-code messaging (demo-blocking, required) + the optional --baseline-report helper if scope allows. Small, reviewable in one sitting.
Week 1–2: Build the regressed-branch variant of the Calendar-Assistant demo; cut a CI workflow YAML (multiple sequential evaluate commands, not a single multi-threshold call); capture red/green screenshots from a real run.
Week 2: Draft the post in Google Docs, iterate on hook and screenshots.
One week accelerated vs post #1's timeline because no new agent build, no new dataset setup, no new Vertex/API key dance. The reusable demo agent is the force multiplier.
Resolved (from code-verified review)
The review of this plan against the current SDK + latest ADK plugin resolved the following items; folding them into the plan rather than leaving them as open questions.
Sandbox dataset. Same sandbox project (test-project-0728-467323) for continuity with post #1, but under a separate table suffix or dataset (e.g. agent_events_ci_demo / ..._regression) so the CI-regression staging doesn't pollute post #1's baseline.
Categorical-eval depth. One-paragraph teaser only (see the restructured "Going deeper" section above). Real categorical CI gating defers to post #4 unless the categorical-eval --exit-code --pass-rate-threshold polish lands first.
Workflow YAML home. SDK repo at examples/ci/evaluate_thresholds.yml. Blog repo links to the pinned SDK file. Discoverability wins; one source of truth.
Helper naming. --baseline-report over --suggest-thresholds. More neutral; output can carry a suggested_thresholds: block without sounding authoritative.
Screenshot durability. Use the public blog demo repo if it already owns Calendar-Assistant assets, but do not rely on GitHub Actions logs as permanent evidence. Check the captured red/green screenshots into the blog repo alongside post #1's screenshots/ directory, and keep a re-runnable workflow / branch so readers can reproduce.
Option A vs Option B on percentile gating. Resolved: Option A. Ship strictly per-session in v1 with the raw-budget fix above. Percentile / aggregate gating (--aggregate=p95 --threshold-aggregate=...) stays on the strongly-recommended list as a post-publish polish that triggers an editor's-note update to the post. No hold on the post for it.
Publish first vs wait for the categorical-eval --exit-code polish? Resolved: publish first. Categorical eval stays as a one-paragraph forward-reference to post #4 in this version; categorical-eval gating is a post #4 concern, not a post #2 blocker.
Remaining open questions
(None gating publication. Both previously open questions are resolved above. The one remaining publish blocker — --threshold raw-budget semantics — is in the SDK improvements section as a required polish with a preferred option named.)
Dependency
Reuses the Calendar-Assistant demo agent (demo_calendar_assistant.py in the blog repo), so no new agent build is required; the regressed-branch variant is the only new scaffolding. The post rests on the existing CodeEvaluator metrics plus optional categorical evals, both of which are stable today.
Medium-specific tactics
Tags: bigquery, ai-agents, google-cloud, python, ci-cd (swap observability for ci-cd vs post #1).
Related
Calendar-Assistant demo agent: demo_calendar_assistant.py in the blog repo