evaluate --exit-code: escape quotes (or emit JSON Lines) in LLM-judge FAIL feedback field #84

@caohy1988

Context

PR #45 (merged) verified the LLM-judge AI.GENERATE path end-to-end against live BigQuery. During that smoke test, one FAIL line surfaced an unescaped double-quote embedded in the feedback="..." value, because the judge's justification itself contained quoted text:

FAIL session=4ca31e85 metric=faithfulness score=0.6 feedback="The agent added "(Design)" to Jordan Lee's name, which was not present in the user's request or any provided context."

That reads fine to a human eyeball, but it breaks awk -F'"', cut -d'"', or any shell parser that splits on " to extract the feedback field. Apostrophes inside the feedback (Jordan Lee's) also wouldn't survive a switch to single-quote wrapping.
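The failure mode can be reproduced in a couple of lines. The FAIL line below is copied from the smoke output above, and the split mimics what awk -F'"' or cut -d'"' -f2 would do:

```python
# The FAIL line from the smoke test, with the judge's own quotes embedded.
line = ('FAIL session=4ca31e85 metric=faithfulness score=0.6 '
        'feedback="The agent added "(Design)" to Jordan Lee\'s name, which '
        'was not present in the user\'s request or any provided context."')

# A naive reader that splits on '"' (awk -F'"', cut -d'"' -f2, etc.)
# only sees the text up to the first embedded quote.
feedback = line.split('"')[1]
print(feedback)  # -> The agent added   (truncated at the inner quote)
```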

Why this is not a blocker

The exit-code behavior itself is unaffected; only downstream machine parsing of the human-readable FAIL line is at risk. Tracking as a follow-up so it doesn't get lost.

Two options

  1. Escape " and \ inside the feedback value. Single-line, minimal change to cli._format_feedback_snippet (or to the FAIL-line composition in _emit_evaluate_failures):

    escaped = feedback.replace("\\", "\\\\").replace('"', '\\"')
    parts.append(f'feedback="{escaped}"')

    Pros: smallest delta; preserves the current key=value shape readers may already grep.
    Cons: still a hand-rolled DSL — anyone parsing it has to know about the SDK's escape convention.

  2. Add an opt-in JSON Lines mode for CI parsing. New flag like --exit-code --emit=jsonl that emits one JSON object per failing (session, metric) pair to stderr instead of the current key=value line. Default stays human-readable.

    Pros: CI integrations get a proper machine-parseable contract; no escape ambiguity ever; future fields (e.g. multi-metric judge results, criterion arrays) can grow without breaking parsers.
    Cons: bigger surface; needs a flag-shape decision; existing readers stay on the human format anyway.
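For comparison, Option 2's emitter could look roughly like this; the function name and field names here are illustrative, not a decided contract:

```python
import json
import sys

def emit_failure_jsonl(session, metric, score, feedback, stream=sys.stderr):
    # One JSON object per failing (session, metric) pair. json.dumps
    # handles all quote/backslash escaping, so the ambiguity disappears,
    # and new fields can be added later without breaking parsers.
    record = {
        "session": session,
        "metric": metric,
        "score": score,
        "feedback": feedback,
    }
    stream.write(json.dumps(record) + "\n")
```

A CI consumer then reads stderr line by line through json.loads and never needs to know an escape convention.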

Recommendation

Ship Option 1 first (a ~10-line change that drops into the next polish window). Track Option 2 as a separate, larger ask if a CI consumer actually needs the JSONL contract — don't pre-add the surface for a hypothetical use.

Surface to touch

  • src/bigquery_agent_analytics/cli.py::_format_feedback_snippet (or the line composition in _emit_evaluate_failures).
  • tests/test_cli.py::TestFormatFeedbackSnippet — add an escape-roundtrip case.
  • tests/test_cli.py::test_evaluate_exit_code_llm_judge_emits_feedback_snippet — extend the assertion to cover an embedded-quote justification (use the real (Design) example from the PR #45 smoke as the regression seed).
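The escape-roundtrip case could take roughly this shape; the local _escape helper is a stand-in for cli._format_feedback_snippet's escaping so the sketch is self-contained, and the real test would call the actual helper:

```python
def _escape(feedback: str) -> str:
    # Stand-in for the Option 1 escaping in cli._format_feedback_snippet.
    return feedback.replace("\\", "\\\\").replace('"', '\\"')

def test_escape_roundtrip():
    cases = [
        # Regression seed: the real "(Design)" justification from PR #45.
        'The agent added "(Design)" to Jordan Lee\'s name.',
        'trailing backslash \\',
    ]
    for raw in cases:
        snippet = f'feedback="{_escape(raw)}"'
        inner = snippet[len('feedback="'):-1]
        # No bare (unescaped) double quote may survive inside the value,
        # so split-on-quote readers see one contiguous field.
        assert '"' not in inner.replace('\\\\', '').replace('\\"', '')
```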
