
Docs: Medium blog post #2 plan — 'Your agent_events table is also a test suite' #77

@caohy1988

Context

This is the plan for post #2 in the Medium blog series tracked in #51. Post #1 is live on Google Cloud - Community ("Your BigQuery Agent Analytics table is a graph. Here's how to see it via SDK"). That post closes with an explicit teaser:

> The natural follow-up — turning this filter into an automated eval check that runs on every deploy — is the next post in this series. Spoiler: `client.evaluate_categorical(...)` plus three lines of `CategoricalMetricDefinition` gets you a CI gate. Your agent_events table is also a test suite.

Post #2 delivers on that promise. Topic: code-based evals in CI, using `bq-agent-sdk evaluate --exit-code` to fail a PR when the SDK's deterministic metrics (latency, error rate, token cost) regress against the last 24 hours of production traffic.

Per the series ranking in #51, this is slot 2 (top-of-funnel, universal-audience). It converts the "I can see my traces now" readers from post #1 into "I can guard against regressions with them" readers.

Title candidates

  1. "Your agent_events table is also a test suite. Here's how to wire it into CI." ← recommended
  2. "Stop shipping agent regressions: a 20-line GitHub Action that fails on prod-traffic quality drops."
  3. "The minimum agent quality gate is 20 lines of YAML."

Recommendation: #1. It directly cashes in post #1's closing pull-quote ("your agent_events table is also a test suite"), which is already a rhetorical hook in the public record and gives the series a callback. #2 is clickier but fights the continuity. #3 is tight but loses the narrative arc.

Target audience

Platform / infra / devops engineers responsible for agent reliability in production. Specifically:

  • Have installed the BQ Agent Analytics plugin (or read post #1 and plan to).
  • Already run CI on their agent code — unit tests, lint, maybe a smoke-test suite against a golden set.
  • Have been bitten at least once by an agent regression that passed golden-set tests but blew up in production (latency cliff, tool-error spike, token-cost runaway).
  • Would fork a working GitHub Actions YAML if someone handed it to them.

Post #1's audience (ADK developers new to the plugin) is not the primary here. Post #2 assumes the reader has traces — the question is what to do with them structurally, not how to read them.

Structure (Medium best practices)

Target length: 1,400–1,800 words (6–8 min read). One lede image, 3–4 inline images (workflow YAML on GitHub, red/green CI status, SDK output with threshold-miss formatting, production-vs-golden-set delta chart), one closing image.

H1:  Your agent_events table is also a test suite. Here's how to wire it into CI.
Sub: Twenty lines of GitHub Actions YAML, a threshold, and the last 24 hours
     of production traffic — the minimum agent quality gate, running
     entirely against the SDK's deterministic code evaluators.

1. Hook (80 words)
   - Real screenshot: a Slack-style incident thread — "p95 latency spiked
     after merge #842. Rollback in progress."
   - "This is avoidable. Your agent_events table already has the data
     that would have caught it. The gate is 20 lines of CI."

2. The problem in one paragraph (120 words)
   - Golden-set tests catch known shapes. Production traffic is bigger,
     weirder, and moves faster than any golden set.
   - The SDK's code evaluator already knows how to score production
     traffic on latency, turn count, tool error rate, token efficiency,
     TTFT, and cost — deterministically, no LLM call, no golden trace.
   - The only missing piece is "run this in CI and block the merge when
     a threshold busts." That piece already exists too: --exit-code.

3. The SDK is already CI-friendly (150 words)
   - One metric per `bq-agent-sdk evaluate` invocation today — each
     takes `--evaluator=<name>` + `--threshold=<value>` + `--exit-code`.
     A multi-metric gate is *multiple sequential commands*, not a
     single command with multiple flags. Pattern is exactly what
     `examples/ci_eval_pipeline.sh` already does in-repo.
   - Exit code: 0 = every evaluated session passed the threshold;
     1 = at least one session fell below; 2 = config/auth error.
     (See "what `--exit-code` actually measures" below — it is
     per-session, not a percentile gate; a shell sketch follows
     this list.)
   - Zero LLM tokens on the deterministic path — cheap enough to run
     on every PR, not just every deploy.
   - Screenshot of the exit-code path in a terminal, red/green.
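To make the exit-code contract concrete, a minimal shell sketch of a single gate invocation (the flags are the ones listed above; the evaluator name and threshold value are illustrative):

```bash
# One metric per invocation; the exit code is the CI contract.
# Evaluator name and threshold value are illustrative.
bq-agent-sdk evaluate --evaluator=latency --threshold=5000 --exit-code
case $? in
  0) echo "every evaluated session passed the threshold" ;;
  1) echo "at least one session fell below the threshold" ;;
  2) echo "config/auth error: fail the job, not the metric" ;;
esac
```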

4. The demo (600 words, the core)
   - Real scenario: "The per-session token budget regression"
   - A feature PR changes the agent's system prompt to add more few-shot
     examples. Looks fine locally. Pushes a subset of sessions over
     a token-per-session budget the team set two months ago.
   - Workflow YAML (20 lines, embedded as Gist) runs four `evaluate`
     commands in sequence — one per metric — mirroring the existing
     `examples/ci_eval_pipeline.sh` pattern: latency, error_rate,
     token_efficiency, turn_count. Each has its own threshold, each
     runs with `--exit-code`, CI fails on the first exit 1. (A
     sketch of the workflow follows this section.)
   - PR goes red: the CI log shows which command failed, which
     metric, and which session(s) fell below the threshold. (This is
     exactly what the per-threshold exit-code messaging polish in the
     SDK-improvements section below buys us; today's output is less
     readable than the demo wants.)
   - Fix — scope the prompt change to the 30% of sessions that need
     it, not all of them. PR flips green.
   - **Explicit note for the reader**: today's gate is per-session
     ("any session fell below threshold → fail"), not p95. Post
     explicitly calls this out — the reader gets a working CI gate
     today, with a pointer to the percentile-gating polish proposed
     below for the teams that want statistical thresholds instead.
   - Sidebar: how thresholds get set. For per-session gates, p95 or
     p99 of the last 30 days is a defensible starting point (set the
     threshold on the safe side of that line so only genuine
     outliers trip it); revisit after week 1. (A baseline query
     sketch appears under the --baseline-report item below.)
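For concreteness, a hedged sketch of what the 20-line workflow could look like. The pip package name, evaluator names, and threshold values are placeholders to confirm against the repo, and the GCP auth step is elided:

```yaml
# Sketch of examples/ci/evaluate_thresholds.yml. Illustrative only:
# package name, evaluator names, and thresholds are placeholders.
name: agent-quality-gate
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... GCP auth step (e.g. google-github-actions/auth) here ...
      - run: pip install bq-agent-sdk   # package name assumed
      - name: Gate on the last 24h of production traffic
        run: |
          # Default bash runs with -e: the job fails on the first exit 1.
          bq-agent-sdk evaluate --evaluator=latency          --threshold=5000 --exit-code
          bq-agent-sdk evaluate --evaluator=error_rate       --threshold=0.05 --exit-code
          bq-agent-sdk evaluate --evaluator=token_efficiency --threshold=8000 --exit-code
          bq-agent-sdk evaluate --evaluator=turn_count       --threshold=12   --exit-code
```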

5. Going deeper (250 words)
   - Cross-link back to post #1's fleet filter: the ambiguity pattern
     from post #1 is itself a CI-worthy check — *"did this PR push
     the multi-match rate above 20%?"* Worked example in prose, no
     code, because there isn't a deterministic CLI path for this yet.
   - One-paragraph teaser for categorical-eval gating: `categorical-eval`
     exists today and produces pass rates, but does **not** support
     `--exit-code` or a `--pass-rate-threshold` flag yet, so it isn't
     a shippable CI gate. Tracked as an SDK polish item (see below)
     and deferred in depth to post #4 in the series (LLM-as-Judge).

6. What the plugin labels show over time (200 words)
   - The SDK labels every query with sdk_feature, so you can point
     INFORMATION_SCHEMA at your CI runs and see exactly what the
     gate is costing. Sample INFORMATION_SCHEMA query + real result
     (matches section 6 format from post #1; sketched below).
   - CI should be a budget line, not a surprise bill.
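A hedged sketch of that query (the sdk_feature label key follows the SDK behavior described above; the 'evaluate' label value and the region qualifier are assumptions to verify):

```sql
-- What did the CI gate cost this week? Sums bytes billed for jobs
-- the SDK labeled. The label value 'evaluate' is an assumption;
-- confirm against the plugin's actual labeling.
SELECT
  DATE(creation_time) AS day,
  COUNT(*) AS eval_jobs,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(labels) AS label
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND label.key = 'sdk_feature'
  AND label.value = 'evaluate'
GROUP BY day
ORDER BY day
```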

7. Try it (100 words)
   - Fork the workflow file — direct link to a Gist or example repo.
   - Three-action CTA:
     (1) Add the workflow to a test agent repo.
     (2) Pick four thresholds based on last 30 days of prod.
     (3) Watch your next PR run the gate.
   - "The minimum agent quality gate is 20 lines of YAML and the
     table you already have."

Demo requirements

  1. Reusable Calendar-Assistant agent — reuse the demo from post #1 (same repo, same seeded contacts, same tools). Add a deliberately-regressed variant in a feature branch so the demo PR actually trips the gate.
  2. Real CI screenshots — take actual GitHub Actions run screenshots (red + green) against a fork that runs the workflow. Same authenticity standard as post #1's render captures.
  3. Real threshold numbers — pull the p95 / error-rate / token baselines from the same sandbox project (test-project-0728-467323 / agent_analytics_demo) that post #1 used. Keeps the demo reproducible against the same corpus.
  4. Reusable workflow YAML — committed as examples/ci/evaluate_thresholds.yml in the SDK repo or blog repo. Must be fork-and-ship working on day one; not pseudocode.

SDK improvements to ship alongside the post

Review of #77 confirmed four places where today's CLI is behind where the original draft of this post assumed it was. The first two are hard publish blockers (the demo cannot honestly hit its narrative otherwise); the third reshapes the story for teams that want statistical gates; the fourth is a post #4 concern.

Required before publish

  1. Make --threshold mean a raw metric budget, not a normalization denominator (demo-blocking — publish blocker).

    Today `--threshold=5000` on `--evaluator=latency` does not mean "fail when `avg_latency_ms > 5000`." The threshold is passed into a normalized-score function — `score = 1.0 - (avg_latency_ms / threshold_ms)`, clamped to [0, 1] (udf_kernels.py:147) — and pass/fail is checked against a hardcoded 0.5 score cutoff (evaluators.py:249, inside `CodeEvaluator.latency()`). Net result: `--threshold=5000` actually fails around `avg_latency_ms > 2500`. Same shape for the other prebuilt evaluators.
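    To make the mismatch concrete, a small Python rendering of the semantics as described above (function names hypothetical; the real code lives in udf_kernels.py and evaluators.py):

```python
# Hypothetical rendering of today's normalized-score semantics as
# described above; NOT the SDK's actual code.
PASS_CUTOFF = 0.5  # hardcoded score cutoff (per evaluators.py:249)

def latency_score(avg_latency_ms: float, threshold_ms: float) -> float:
    # Normalized score, clamped to [0, 1] (per udf_kernels.py:147).
    return max(0.0, min(1.0, 1.0 - avg_latency_ms / threshold_ms))

def passes(avg_latency_ms: float, threshold_ms: float) -> bool:
    return latency_score(avg_latency_ms, threshold_ms) >= PASS_CUTOFF

# With --threshold=5000, the effective budget is ~2500 ms:
assert passes(2400, 5000)       # score 0.52 -> pass
assert not passes(2600, 5000)   # score 0.48 -> fail, despite 2600 < 5000
```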

    A reader who forks a YAML with --threshold=5000 expecting a 5-second budget and gets a 2.5-second gate will lose trust in the post and the SDK at the same time. This cannot ship as-is. Two options to fix — pick one in the SDK polish PR before publish:

    • Option 1 — change prebuilt evaluator gates to compare raw values (preferred). CodeEvaluator.latency(threshold_ms=5000) passes avg_latency_ms <= 5000 directly, no normalization, no 0.5 cutoff. Apply to all prebuilt evaluators (latency, error_rate, turn_count, token_efficiency, ttft, cost). Minor behavior change; the current normalized semantics are effectively a latent bug — nobody in CI would have deliberately chosen "fail at half the threshold I typed." Mitigate with a CHANGELOG entry + a deprecation note if anyone's relying on the old curve.
    • Option 2 — add explicit raw-budget flags (additive). Keep the normalized scorer as-is but add --latency-budget-ms, --error-rate-budget, --token-budget-per-session (etc.) that bypass the normalization. Stricter semantics for CI users; the normalized score stays for reporting/dashboard consumers who want a 0–1 signal.

    Preference: Option 1. Simpler mental model for the reader, smaller API surface. But either closes the blocker.

    Until this lands, the post cannot honestly CTA "fork this YAML."

  2. Per-session exit-code messaging (demo-blocking). When --exit-code triggers exit 1 in the current bq-agent-sdk evaluate output (cli.py:267), the CI log doesn't reliably name which session(s) fell below threshold, for which metric, by how much. Readable CI logs are table stakes for a post whose CTA is "fork this YAML and ship a gate." Audit and tighten: metric name, threshold value, observed value, failing session ids (capped at top N), one line per failure. Same framing as post #1's Span.tool_name / trace.render(color=) polish — small, reviewable in one sitting.
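    One possible shape for the tightened failure output (an illustrative mock, not the CLI's current log format):

```text
FAIL  evaluator=latency  threshold=5000  3 of 412 sessions fell below threshold (top 3 shown)
  session=<session_id>  observed=<value>  threshold=5000
  session=<session_id>  observed=<value>  threshold=5000
  session=<session_id>  observed=<value>  threshold=5000
exit code: 1
```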

Strongly recommended (reshapes the narrative)

  1. Percentile / baseline gating on evaluate (narrative-blocking for the p95 story). Current CodeEvaluator semantics are per-session: any single session below the evaluator's threshold trips exit 1 (cli.py:360, `if exit_code and report.pass_rate < 1.0`). That's a legitimate and useful gate, but it is not the "p95 regressed vs last week" story the first draft told. Options:

    • Option A (minimal): keep the post narrative strictly per-session, remove any p95 language from the demo. Workflow ships today.
    • Option B (polish adds narrative back): add --aggregate=p95|p99|mean + --threshold-aggregate flags so bq-agent-sdk evaluate computes a statistic over the window and compares the aggregate to the threshold. Pairs with the --baseline-report helper below.

    Preference: ship Option A in the first version of the post; track Option B as a follow-up SDK PR whose landing triggers a post #2.5 or an editor's-note update to post #2.

  2. --baseline-report helper on evaluate (optional, reduces prose). A convenience that reads the last N days of production traffic and emits a pasteable suggested-thresholds block — e.g., bq-agent-sdk evaluate --baseline-report --last=30d --buffer=0.10 --format=text prints per-evaluator p50/p95/p99 + a suggested threshold set. Not ranking as mandatory, but it halves the "what do I set my thresholds to?" friction the post otherwise has to answer in a sidebar. Naming note per review: --baseline-report over --suggest-thresholds — more neutral, lets the output carry a suggested_thresholds: block without sounding authoritative.
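    Until such a helper lands, the sidebar's baseline numbers can be pulled directly; a hedged SQL sketch (the project and dataset are the sandbox named under Demo requirements; the column names session_id, latency_ms, and event_timestamp are assumptions about the plugin's schema):

```sql
-- Per-session latency percentiles over the last 30 days, plus a
-- suggested threshold using the 10% buffer the --baseline-report
-- proposal describes. Column names are assumptions about the schema.
WITH per_session AS (
  SELECT session_id, AVG(latency_ms) AS avg_latency_ms
  FROM `test-project-0728-467323.agent_analytics_demo.agent_events`
  WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  GROUP BY session_id
)
SELECT
  APPROX_QUANTILES(avg_latency_ms, 100)[OFFSET(50)] AS p50,
  APPROX_QUANTILES(avg_latency_ms, 100)[OFFSET(95)] AS p95,
  APPROX_QUANTILES(avg_latency_ms, 100)[OFFSET(99)] AS p99,
  ROUND(APPROX_QUANTILES(avg_latency_ms, 100)[OFFSET(99)] * 1.10) AS suggested_threshold_ms
FROM per_session
```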

Deferred (references to post #4, not blocking post #2)

  1. categorical-eval --exit-code --pass-rate-threshold. categorical-eval (cli.py:589) produces pass rates today but doesn't expose --exit-code or a threshold flag — so the post should not present categorical eval as a ready-to-ship CI gate. Either defer entirely to post #4 (LLM-as-Judge), or ship this flag pair as a small SDK polish before post #4 publishes. Post #2 mentions it in one paragraph as a pointer forward; does not demo it.

Dependency

Medium-specific tactics

Timeline

  • Week 1: SDK polish PR — the --threshold raw-budget fix (Option 1) plus per-session exit-code messaging (both demo-blocking, required) + the optional --baseline-report helper if scope allows. Small, reviewable in one sitting.
  • Week 1–2: Build the regressed-branch variant of the Calendar-Assistant demo; cut a CI workflow YAML (multiple sequential evaluate commands, not a single multi-threshold call); capture red/green screenshots from a real run.
  • Week 2: Draft the post in Google Docs, iterate on hook and screenshots.
  • Week 3: Internal review (Google Cloud DevRel) — same reviewer path as post #1.
  • Week 3 / 4: Publish.

One week accelerated vs post #1's timeline because no new agent build, no new dataset setup, no new Vertex/API key dance. The reusable demo agent is the force multiplier.

Resolved (from code-verified review)

The review of this plan against the current SDK + latest ADK plugin resolved the following items; folding them into the plan rather than leaving them as open questions.

  1. Reuse or fresh dataset? Reuse the same sandbox project (test-project-0728-467323) for continuity with post #1, but under a separate table suffix or dataset (e.g. agent_events_ci_demo / ..._regression) so the CI-regression staging doesn't pollute post #1's baseline.
  2. Categorical-eval depth. One-paragraph teaser only (see the restructured "Going deeper" section above). Real categorical CI gating defers to post #4 unless the categorical-eval --exit-code --pass-rate-threshold polish lands first.
  3. Workflow YAML home. SDK repo at examples/ci/evaluate_thresholds.yml. Blog repo links to the pinned SDK file. Discoverability wins; one source of truth.
  4. Helper naming. --baseline-report over --suggest-thresholds. More neutral; output can carry a suggested_thresholds: block without sounding authoritative.
  5. Screenshot durability. Use the public blog demo repo if it already owns Calendar-Assistant assets, but do not rely on GitHub Actions logs as permanent evidence. Check in the captured red/green screenshots to the blog repo alongside post #1's screenshots/ directory, and keep a re-runnable workflow / branch so readers can reproduce.
  6. Option A vs Option B on percentile gating. Resolved: Option A. Ship strictly per-session in v1 with the raw-budget fix above. Percentile / aggregate gating (--aggregate=p95 --threshold-aggregate=...) stays on the strongly-recommended list as a post-publish polish that triggers an editor's-note update to the post. No hold on the post for it.
  7. Publish before or after the categorical-eval --exit-code polish? Resolved: publish first. Categorical eval stays as a one-paragraph forward-reference to post #4 in this version. Categorical-eval gating is a post #4 concern, not a post #2 blocker.

Remaining open questions

(None gating publication. Both previously-open questions are resolved above. The two publish blockers — --threshold raw-budget semantics and per-session exit-code messaging — are in the SDK improvements section as required polish, each with a preferred fix named.)

Related
