Proposal: Automated Agent Quality Scorecard #63

@haiyuan-eng-google

Description

Author: Gayathri Radhakrishnan

Date: April 20, 2026

1. Executive Summary

This project evolves the BigQuery Agent Analytics (BQ AA) platform from a reactive diagnostic tool into a proactive fleet-management system. It implements a systematic "AI Judge" that automatically evaluates and grades every agent interaction, turning raw telemetry into actionable performance KPIs.

2. The "Evaluation Gap"

The current Closed-Loop RCA workflow is a powerful tool for deep-diving into why a specific session failed. However, as agent deployments scale, manual root-cause analysis becomes a bottleneck. Organizations need a way to:

  • Identify high-performing vs. low-performing agent versions at a glance.
  • Monitor global quality trends without manual intervention.
  • Flag policy or safety violations in real-time across thousands of logs.

3. Proposed Solution: The Quality Scorecard

I propose building a modular evaluation pipeline on top of the BigQuery event logs. The system will use BigQuery’s native AI capabilities (AI.GENERATE) to "grade" sessions across three key pillars:

  • Helpfulness Score (1–5): Did the agent resolve the user’s intent effectively?
  • Accuracy & Grounding (1–5): Did the agent use the available tools correctly and avoid hallucinations?
  • Policy Compliance (Pass/Fail): Did the response adhere to GRC standards (e.g., no PII leakage, authorized tool usage)?
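As a rough sketch, the per-session grading could be expressed as a single query over the event logs. All table, connection, and endpoint names below are illustrative placeholders, not final choices; the only hard requirements are the `session_id` and `content` fields plus a configured Cloud resource connection for AI.GENERATE:

```sql
-- Illustrative sketch only: table, connection, and model endpoint names are placeholders.
-- Grades each session across the three pillars with one AI.GENERATE call per row.
SELECT
  session_id,
  AI.GENERATE(
    CONCAT(
      'You are an impartial judge. Rate the agent transcript below. ',
      'Return JSON with keys: helpfulness (1-5), accuracy (1-5), ',
      'policy_compliant (true/false), and rationale.\n\nTranscript:\n',
      content
    ),
    connection_id => 'us.judge_connection',
    endpoint => 'gemini-2.0-flash'
  ).result AS judge_verdict
FROM `my_project.agent_analytics.agent_logs`;
```

In practice the model's JSON output would be parsed (e.g. with JSON extraction functions) into typed score columns before aggregation.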

4. Key Features & Flexibility

  • Data-Agnostic Design: The evaluation logic is decoupled from specific table names. It can be routed to point at any existing logs table or a fresh "v4" schema, requiring only standard session_id and content fields to function.
  • Fleet-Level Benchmarking: Aggregates scores into a "Leaderboard" view, allowing the team to compare performance across different regions, model versions, or system prompts.
  • Automated Triage: Automatically flags sessions with a score below a certain threshold for immediate human-in-the-loop (HITL) review.
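Assuming the judge output has already been parsed into numeric columns, the leaderboard and triage features above might be sketched as two views over the scored table. Every name and threshold here is a hypothetical placeholder:

```sql
-- Hypothetical leaderboard view: average scores per agent version and region.
CREATE OR REPLACE VIEW `my_project.agent_analytics.agent_leaderboard` AS
SELECT
  agent_version,
  region,
  AVG(helpfulness_score)        AS avg_helpfulness,
  AVG(accuracy_score)           AS avg_accuracy,
  COUNTIF(NOT policy_compliant) AS policy_violations,
  COUNT(*)                      AS sessions_scored
FROM `my_project.agent_analytics.agent_quality_metrics`
GROUP BY agent_version, region;

-- Hypothetical triage queue: sessions below the review threshold go to HITL.
CREATE OR REPLACE VIEW `my_project.agent_analytics.hitl_triage_queue` AS
SELECT session_id, helpfulness_score, accuracy_score, policy_compliant
FROM `my_project.agent_analytics.agent_quality_metrics`
WHERE helpfulness_score < 3
   OR accuracy_score < 3
   OR NOT policy_compliant;
```

Keeping both as views (rather than materialized tables) means threshold changes take effect immediately without a backfill, at the cost of re-scanning the metrics table on each read.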

5. Technical Impact for the Team

  • Showcases Platform Capability: Demonstrates the power of using BigQuery as a Governance and Evaluation engine, not just a storage layer.
  • Zero Infrastructure Friction: Operates entirely within the BigQuery ecosystem—no external APIs, new IAM permissions, or complex deployments required.
  • Modular Architecture: The "Judge" logic can be reused as a template for enterprise customers looking to build their own internal audit trails.

6. Implementation Roadmap

  • Phase 1: Develop the SQL-based "AI Judge" prompt and test on a sample dataset.
  • Phase 2: Create the aggregated agent_quality_metrics table for reporting.
  • Phase 3: Integrate a "Global Agent Health" visualization into the existing dashboard.
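For Phase 2, one possible shape for the aggregated table is sketched below; the schema is a suggestion to anchor discussion, and the column names are illustrative:

```sql
-- Suggested schema for the Phase 2 metrics table; names and types are illustrative.
CREATE TABLE IF NOT EXISTS `my_project.agent_analytics.agent_quality_metrics` (
  session_id        STRING NOT NULL,
  agent_version     STRING,
  region            STRING,
  helpfulness_score INT64,    -- 1-5, from the AI Judge
  accuracy_score    INT64,    -- 1-5, from the AI Judge
  policy_compliant  BOOL,     -- pass/fail GRC check
  judge_rationale   STRING,   -- free-text explanation from the model
  scored_at         TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
```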
