
Docs: Medium blog post #2 plan — 'Your agent_events table is also a test suite' #77

@caohy1988

Context

This is the plan for post #2 in the Medium blog series tracked in #51. Post #1 is live on Google Cloud - Community ("Your BigQuery Agent Analytics table is a graph. Here's how to see it via SDK"). That post closes with an explicit teaser:

> The natural follow-up — turning this filter into an automated eval check that runs on every deploy — is the next post in this series. Spoiler: `client.evaluate_categorical(...)` plus three lines of `CategoricalMetricDefinition` gets you a CI gate. Your agent_events table is also a test suite.

Post #2 delivers on that promise. Topic: code-based evals in CI, using `bq-agent-sdk evaluate --exit-code` to fail a PR when the SDK's deterministic metrics (latency, error rate, token cost) regress against the last 24 hours of production traffic.

Per the series ranking in #51, this is slot 2 (top-of-funnel, universal-audience). It converts the "I can see my traces now" readers from post #1 into "I can guard against regressions with them" readers.

Title candidates

  1. "Your agent_events table is also a test suite. Here's how to wire it into CI." ← recommended
  2. "Stop shipping agent regressions: a 20-line GitHub Action that fails on prod-traffic quality drops."
  3. "The minimum agent quality gate is 20 lines of YAML."

Recommendation: #1. It directly cashes in post #1's closing pull-quote ("your agent_events table is also a test suite"), which is already a rhetorical hook in the public record and gives the series a callback. #2 is clickier but fights the continuity. #3 is tight but loses the narrative arc.

Target audience

Platform / infra / devops engineers responsible for agent reliability in production. Specifically:

  • Have installed the BQ Agent Analytics plugin (or read post #1 and plan to).
  • Already run CI on their agent code — unit tests, lint, maybe a smoke-test suite against a golden set.
  • Have been bitten at least once by an agent regression that passed golden-set tests but blew up in production (latency cliff, tool-error spike, token-cost runaway).
  • Would fork a working GitHub Actions YAML if someone handed it to them.

Post #1's audience (ADK developers new to the plugin) is not the primary here. Post #2 assumes the reader has traces — the question is what to do with them structurally, not how to read them.

Structure (Medium best practices)

Target length: 1,400–1,800 words (6–8 min read). One lede image, 3–4 inline images (workflow YAML on GitHub, red/green CI status, SDK output with threshold-miss formatting, production-vs-golden-set delta chart), one closing image.

H1:  Your agent_events table is also a test suite. Here's how to wire it into CI.
Sub: Twenty lines of GitHub Actions YAML, a threshold, and the last 24 hours
     of production traffic — the minimum agent quality gate, running
     entirely against the SDK's deterministic code evaluators.

1. Hook (80 words)
   - Real screenshot: a Slack-style incident thread — "p95 latency spiked
     after merge #842. Rollback in progress."
   - "This is avoidable. Your agent_events table already has the data
     that would have caught it. The gate is 20 lines of CI."

2. The problem in one paragraph (120 words)
   - Golden-set tests catch known shapes. Production traffic is bigger,
     weirder, and moves faster than any golden set.
   - The SDK's code evaluator already knows how to score production
     traffic on latency, turn count, tool error rate, token efficiency,
     TTFT, and cost — deterministically, no LLM call, no golden trace.
   - The only missing piece is "run this in CI and block the merge when
     a threshold busts." That piece already exists too: --exit-code.

3. The SDK is already CI-friendly (150 words)
   - One metric per `bq-agent-sdk evaluate` invocation today — each
     takes `--evaluator=<name>` + `--threshold=<value>` + `--exit-code`.
     A multi-metric gate is *multiple sequential commands*, not a
     single command with multiple flags. Pattern is exactly what
     `examples/ci_eval_pipeline.sh` already does in-repo.
   - Exit code: 0 = every evaluated session passed the threshold;
     1 = at least one session fell below; 2 = config/auth error.
     (See "what `--exit-code` actually measures" below — it is
     per-session, not a percentile gate; a shell sketch follows
     this list.)
   - Zero LLM tokens on the deterministic path — cheap enough to run
     on every PR, not just every deploy.
   - Screenshot of the exit-code path in a terminal, red/green.
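To make the exit-code contract concrete, a minimal shell sketch of a single gate invocation (the flags are the ones listed above; the evaluator name and threshold value are illustrative):

```bash
# One metric per invocation; the exit code is the CI contract.
# Evaluator name and threshold value are illustrative.
bq-agent-sdk evaluate --evaluator=latency --threshold=5000 --exit-code
case $? in
  0) echo "every evaluated session passed the threshold" ;;
  1) echo "at least one session fell below the threshold" ;;
  2) echo "config/auth error: fail the job, not the metric" ;;
esac
```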

4. The demo (600 words, the core)
   - Real scenario: "The per-session token budget regression"
   - A feature PR changes the agent's system prompt to add more few-shot
     examples. Looks fine locally. Pushes a subset of sessions over
     a token-per-session budget the team set two months ago.
   - Workflow YAML (20 lines, embedded as Gist) runs four `evaluate`
     commands in sequence — one per metric — mirroring the existing
     `examples/ci_eval_pipeline.sh` pattern: latency, error_rate,
     token_efficiency, turn_count. Each has its own threshold, each
     runs with `--exit-code`, CI fails on the first exit 1. (A
     sketch of the workflow follows this section.)
   - PR goes red: the CI log shows which command failed, which
     metric, and which session(s) fell below the threshold. (This is
     exactly what the per-threshold exit-code messaging polish in the
     SDK-improvements section below buys us; today's output is less
     readable than the demo wants.)
   - Fix — scope the prompt change to the 30% of sessions that need
     it, not all of them. PR flips green.
   - **Explicit note for the reader**: today's gate is per-session
     ("any session fell below threshold → fail"), not p95. Post
     explicitly calls this out — the reader gets a working CI gate
     today, with a pointer to the percentile-gating polish proposed
     below for the teams that want statistical thresholds instead.
   - Sidebar: how thresholds get set. For per-session gates, p95 or
     p99 of the last 30 days is a defensible starting point (set the
     threshold on the safe side of that line so only genuine
     outliers trip it); revisit after week 1. (A baseline query
     sketch appears under the --baseline-report item below.)
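For concreteness, a hedged sketch of what the 20-line workflow could look like. The pip package name, evaluator names, and threshold values are placeholders to confirm against the repo, and the GCP auth step is elided:

```yaml
# Sketch of examples/ci/evaluate_thresholds.yml. Illustrative only:
# package name, evaluator names, and thresholds are placeholders.
name: agent-quality-gate
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... GCP auth step (e.g. google-github-actions/auth) here ...
      - run: pip install bq-agent-sdk   # package name assumed
      - name: Gate on the last 24h of production traffic
        run: |
          # Default bash runs with -e: the job fails on the first exit 1.
          bq-agent-sdk evaluate --evaluator=latency          --threshold=5000 --exit-code
          bq-agent-sdk evaluate --evaluator=error_rate       --threshold=0.05 --exit-code
          bq-agent-sdk evaluate --evaluator=token_efficiency --threshold=8000 --exit-code
          bq-agent-sdk evaluate --evaluator=turn_count       --threshold=12   --exit-code
```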

5. Going deeper (250 words)
   - Cross-link back to post #1's fleet filter: the ambiguity pattern
     from post #1 is itself a CI-worthy check — *"did this PR push
     the multi-match rate above 20%?"* Worked example in prose, no
     code, because there isn't a deterministic CLI path for this yet.
   - One-paragraph teaser for categorical-eval gating: `categorical-eval`
     exists today and produces pass rates, but does **not** support
     `--exit-code` or a `--pass-rate-threshold` flag yet, so it isn't
     a shippable CI gate. Tracked as an SDK polish item (see below)
     and deferred in depth to post #4 in the series (LLM-as-Judge).

6. What the plugin labels show over time (200 words)
   - The SDK labels every query with sdk_feature, so you can point
     INFORMATION_SCHEMA at your CI runs and see exactly what the
     gate is costing. Sample INFORMATION_SCHEMA query + real result
     (matches section 6 format from post #1; sketched below).
   - CI should be a budget line, not a surprise bill.
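A hedged sketch of that query (the sdk_feature label key follows the SDK behavior described above; the 'evaluate' label value and the region qualifier are assumptions to verify):

```sql
-- What did the CI gate cost this week? Sums bytes billed for jobs
-- the SDK labeled. The label value 'evaluate' is an assumption;
-- confirm against the plugin's actual labeling.
SELECT
  DATE(creation_time) AS day,
  COUNT(*) AS eval_jobs,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(labels) AS label
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND label.key = 'sdk_feature'
  AND label.value = 'evaluate'
GROUP BY day
ORDER BY day
```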

7. Try it (100 words)
   - Fork the workflow file — direct link to a Gist or example repo.
   - Three-action CTA:
     (1) Add the workflow to a test agent repo.
     (2) Pick four thresholds based on last 30 days of prod.
     (3) Watch your next PR run the gate.
   - "The minimum agent quality gate is 20 lines of YAML and the
     table you already have."

Demo requirements

  1. Reusable Calendar-Assistant agent — reuse the demo from post #1 (same repo, same seeded contacts, same tools). Add a deliberately-regressed variant in a feature branch so the demo PR actually trips the gate.
  2. Real CI screenshots — take actual GitHub Actions run screenshots (red + green) against a fork that runs the workflow. Same authenticity standard as post #1's render captures.
  3. Real threshold numbers — pull the p95 / error-rate / token baselines from the same sandbox project (test-project-0728-467323 / agent_analytics_demo) that post #1 used. Keeps the demo reproducible against the same corpus.
  4. Reusable workflow YAML — committed as examples/ci/evaluate_thresholds.yml in the SDK repo or blog repo. Must be fork-and-ship working on day one; not pseudocode.

SDK improvements to ship alongside the post

Review of #77 confirmed four places where today's CLI is behind where the original draft of this post assumed it was. The first two are hard publish blockers (the demo cannot honestly hit its narrative otherwise); the third reshapes the story for teams that want statistical gates; the fourth is a post #4 concern.

Required before publish

  1. Make --threshold mean a raw metric budget, not a normalization denominator (demo-blocking — publish blocker).

    Today `--threshold=5000` on `--evaluator=latency` does not mean "fail when `avg_latency_ms > 5000`." The threshold is passed into a normalized-score function — `score = 1.0 - (avg_latency_ms / threshold_ms)`, clamped to [0, 1] (udf_kernels.py:147) — and pass/fail is checked against a hardcoded 0.5 score cutoff (evaluators.py:249, inside `CodeEvaluator.latency()`). Net result: `--threshold=5000` actually fails around `avg_latency_ms > 2500`. Same shape for the other prebuilt evaluators.
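    To make the mismatch concrete, a small Python rendering of the semantics as described above (function names hypothetical; the real code lives in udf_kernels.py and evaluators.py):

```python
# Hypothetical rendering of today's normalized-score semantics as
# described above; NOT the SDK's actual code.
PASS_CUTOFF = 0.5  # hardcoded score cutoff (per evaluators.py:249)

def latency_score(avg_latency_ms: float, threshold_ms: float) -> float:
    # Normalized score, clamped to [0, 1] (per udf_kernels.py:147).
    return max(0.0, min(1.0, 1.0 - avg_latency_ms / threshold_ms))

def passes(avg_latency_ms: float, threshold_ms: float) -> bool:
    return latency_score(avg_latency_ms, threshold_ms) >= PASS_CUTOFF

# With --threshold=5000, the effective budget is ~2500 ms:
assert passes(2400, 5000)       # score 0.52 -> pass
assert not passes(2600, 5000)   # score 0.48 -> fail, despite 2600 < 5000
```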

    A reader who forks a YAML with --threshold=5000 expecting a 5-second budget and gets a 2.5-second gate will lose trust in the post and the SDK at the same time. This cannot ship as-is. Two options to fix — pick one in the SDK polish PR before publish:

    • Option 1 — change prebuilt evaluator gates to compare raw values (preferred). CodeEvaluator.latency(threshold_ms=5000) passes avg_latency_ms <= 5000 directly, no normalization, no 0.5 cutoff. Apply to all prebuilt evaluators (latency, error_rate, turn_count, token_efficiency, ttft, cost). Minor behavior change; the current normalized semantics are effectively a latent bug — nobody in CI would have deliberately chosen "fail at half the threshold I typed." Mitigate with a CHANGELOG entry + a deprecation note if anyone's relying on the old curve.
    • Option 2 — add explicit raw-budget flags (additive). Keep the normalized scorer as-is but add --latency-budget-ms, --error-rate-budget, --token-budget-per-session (etc.) that bypass the normalization. Stricter semantics for CI users; the normalized score stays for reporting/dashboard consumers who want a 0–1 signal.

    Preference: Option 1. Simpler mental model for the reader, smaller API surface. But either closes the blocker.

    Until this lands, the post cannot honestly CTA "fork this YAML."

  2. Per-session exit-code messaging (demo-blocking). When --exit-code triggers exit 1 in the current bq-agent-sdk evaluate output (cli.py:267), the CI log doesn't reliably name which session(s) fell below threshold, for which metric, by how much. Readable CI logs are table stakes for a post whose CTA is "fork this YAML and ship a gate." Audit and tighten: metric name, threshold value, observed value, failing session ids (capped at top N), one line per failure. Same framing as post #1's Span.tool_name / trace.render(color=) polish — small, reviewable in one sitting.
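    One possible shape for the tightened failure output (an illustrative mock, not the CLI's current log format):

```text
FAIL  evaluator=latency  threshold=5000  3 of 412 sessions fell below threshold (top 3 shown)
  session=<session_id>  observed=<value>  threshold=5000
  session=<session_id>  observed=<value>  threshold=5000
  session=<session_id>  observed=<value>  threshold=5000
exit code: 1
```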

Strongly recommended (reshapes the narrative)

  1. Percentile / baseline gating on evaluate (narrative-blocking for the p95 story). Current CodeEvaluator semantics are per-session: any single session below the evaluator's threshold trips exit 1 (cli.py:360, `if exit_code and report.pass_rate < 1.0`). That's a legitimate and useful gate, but it is not the "p95 regressed vs last week" story the first draft told. Options:

    • Option A (minimal): keep the post narrative strictly per-session, remove any p95 language from the demo. Workflow ships today.
    • Option B (polish adds narrative back): add --aggregate=p95|p99|mean + --threshold-aggregate flags so bq-agent-sdk evaluate computes a statistic over the window and compares the aggregate to the threshold. Pairs with the --baseline-report helper below.

    Preference: ship Option A in the first version of the post; track Option B as a follow-up SDK PR whose landing triggers a post #2.5 or an editor's-note update to post #2.

  2. --baseline-report helper on evaluate (optional, reduces prose). A convenience that reads the last N days of production traffic and emits a pasteable suggested-thresholds block — e.g., bq-agent-sdk evaluate --baseline-report --last=30d --buffer=0.10 --format=text prints per-evaluator p50/p95/p99 + a suggested threshold set. Not ranking as mandatory, but it halves the "what do I set my thresholds to?" friction the post otherwise has to answer in a sidebar. Naming note per review: --baseline-report over --suggest-thresholds — more neutral, lets the output carry a suggested_thresholds: block without sounding authoritative.
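    Until such a helper lands, the sidebar's baseline numbers can be pulled directly; a hedged SQL sketch (the project and dataset are the sandbox named under Demo requirements; the column names session_id, latency_ms, and event_timestamp are assumptions about the plugin's schema):

```sql
-- Per-session latency percentiles over the last 30 days, plus a
-- suggested threshold using the 10% buffer the --baseline-report
-- proposal describes. Column names are assumptions about the schema.
WITH per_session AS (
  SELECT session_id, AVG(latency_ms) AS avg_latency_ms
  FROM `test-project-0728-467323.agent_analytics_demo.agent_events`
  WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  GROUP BY session_id
)
SELECT
  APPROX_QUANTILES(avg_latency_ms, 100)[OFFSET(50)] AS p50,
  APPROX_QUANTILES(avg_latency_ms, 100)[OFFSET(95)] AS p95,
  APPROX_QUANTILES(avg_latency_ms, 100)[OFFSET(99)] AS p99,
  ROUND(APPROX_QUANTILES(avg_latency_ms, 100)[OFFSET(99)] * 1.10) AS suggested_threshold_ms
FROM per_session
```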

Deferred (references to post #4, not blocking post #2)

  1. categorical-eval --exit-code --pass-rate-threshold. categorical-eval (cli.py:589) produces pass rates today but doesn't expose --exit-code or a threshold flag — so the post should not present categorical eval as a ready-to-ship CI gate. Either defer entirely to post #4 (LLM-as-Judge), or ship this flag pair as a small SDK polish before post #4 publishes. Post #2 mentions it in one paragraph as a pointer forward; does not demo it.

Dependency

Medium-specific tactics

Timeline

  • Week 1: SDK polish PR — the --threshold raw-budget fix (Option 1) plus per-session exit-code messaging (both demo-blocking, required) + the optional --baseline-report helper if scope allows. Small, reviewable in one sitting.
  • Week 1–2: Build the regressed-branch variant of the Calendar-Assistant demo; cut a CI workflow YAML (multiple sequential evaluate commands, not a single multi-threshold call); capture red/green screenshots from a real run.
  • Week 2: Draft the post in Google Docs, iterate on hook and screenshots.
  • Week 3: Internal review (Google Cloud DevRel) — same reviewer path as post #1.
  • Week 3 / 4: Publish.

One week accelerated vs post #1's timeline because no new agent build, no new dataset setup, no new Vertex/API key dance. The reusable demo agent is the force multiplier.

Resolved (from code-verified review)

The review of this plan against the current SDK + latest ADK plugin resolved the following items; folding them into the plan rather than leaving them as open questions.

  1. Reuse or fresh dataset? Reuse the same sandbox project (test-project-0728-467323) for continuity with post #1, but under a separate table suffix or dataset (e.g. agent_events_ci_demo / ..._regression) so the CI-regression staging doesn't pollute post #1's baseline.
  2. Categorical-eval depth. One-paragraph teaser only (see the restructured "Going deeper" section above). Real categorical CI gating defers to post #4 unless the categorical-eval --exit-code --pass-rate-threshold polish lands first.
  3. Workflow YAML home. SDK repo at examples/ci/evaluate_thresholds.yml. Blog repo links to the pinned SDK file. Discoverability wins; one source of truth.
  4. Helper naming. --baseline-report over --suggest-thresholds. More neutral; output can carry a suggested_thresholds: block without sounding authoritative.
  5. Screenshot durability. Use the public blog demo repo if it already owns Calendar-Assistant assets, but do not rely on GitHub Actions logs as permanent evidence. Check in the captured red/green screenshots to the blog repo alongside post #1's screenshots/ directory, and keep a re-runnable workflow / branch so readers can reproduce.
  6. Option A vs Option B on percentile gating. Resolved: Option A. Ship strictly per-session in v1 with the raw-budget fix above. Percentile / aggregate gating (--aggregate=p95 --threshold-aggregate=...) stays on the strongly-recommended list as a post-publish polish that triggers an editor's-note update to the post. No hold on the post for it.
  7. Publish before or after the categorical-eval --exit-code polish? Resolved: publish first. Categorical eval stays as a one-paragraph forward-reference to post #4 in this version. Categorical-eval gating is a post #4 concern, not a post #2 blocker.

Remaining open questions

(None gating publication. Both previously-open questions are resolved above. The two publish blockers — --threshold raw-budget semantics and per-session exit-code messaging — are in the SDK improvements section as required polish, each with a preferred fix named.)

Related
