PR #45 (merged) verified the LLM-judge AI.GENERATE path end-to-end against live BigQuery. During that smoke, one FAIL line surfaced an unescaped embedded double-quote inside the feedback="..." value when the judge's justification itself contained quoted text:
FAIL session=4ca31e85 metric=faithfulness score=0.6 feedback="The agent added "(Design)" to Jordan Lee's name, which was not present in the user's request or any provided context."
That parses fine for a human eyeball but breaks awk -F'"', cut -d'"', or any shell parser that splits on " to extract the feedback field. Apostrophes inside the feedback (Jordan Lee's) also wouldn't survive a switch to single-quote wrapping.
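To make the breakage concrete, here is a minimal Python reproduction of what a quote-splitting extractor (the equivalent of cut -d'"' -f2) does to this FAIL line; the line literal below is copied from the smoke output above:

```python
# Demonstration only (not SDK code): naive quote-splitting mis-parses the FAIL line.
line = (
    'FAIL session=4ca31e85 metric=faithfulness score=0.6 '
    'feedback="The agent added "(Design)" to Jordan Lee\'s name, '
    'which was not present in the user\'s request or any provided context."'
)

# Mimics `cut -d'"' -f2`: take the second field after splitting on double quotes.
# The embedded quote before "(Design)" terminates the field early.
feedback = line.split('"')[1]
print(feedback)  # prints the truncated value: The agent added 
```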
Why this is not a blocker
Doesn't affect SDK correctness, AI.GENERATE behavior, or judge scoring.
Tracking as a follow-up so it doesn't get lost.
Two options
Option 1: Escape " and \ inside the feedback value. Single-line, minimal change to cli._format_feedback_snippet (or to the FAIL-line composition in _emit_evaluate_failures):
Pros: smallest delta; preserves the current key=value shape readers may already grep.
Cons: still a hand-rolled DSL — anyone parsing it has to know about the SDK's escape convention.
Option 2: Add an opt-in JSON Lines mode for CI parsing. New flag like --exit-code --emit=jsonl that emits one JSON object per failing (session, metric) pair to stderr instead of the current key=value line. Default stays human-readable.
Pros: CI integrations get a proper machine-parseable contract; no escape ambiguity ever; future fields (e.g. multi-metric judge results, criterion arrays) can grow without breaking parsers.
Cons: bigger surface; needs a flag-shape decision; existing readers stay on the human format anyway.
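If the JSONL mode is ever picked up, the emitter reduces to a few lines, since json.dumps handles all quote and backslash escaping. The flag shape and field set below are assumptions sketched from this issue, not a committed interface:

```python
import json
import sys


def emit_failure_jsonl(session: str, metric: str, score: float, feedback: str) -> None:
    """Write one JSON object per failing (session, metric) pair to stderr.

    Hypothetical sketch of the --emit=jsonl path: json.dumps escapes embedded
    quotes and backslashes, so the judge's justification can never break a
    downstream parser.
    """
    record = {
        "session": session,
        "metric": metric,
        "score": score,
        "feedback": feedback,
    }
    sys.stderr.write(json.dumps(record) + "\n")


emit_failure_jsonl(
    "4ca31e85",
    "faithfulness",
    0.6,
    'The agent added "(Design)" to Jordan Lee\'s name',
)
```

A CI consumer then reads one json.loads per stderr line instead of grepping a hand-rolled key=value format.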
Recommendation
Ship Option 1 first (~10-line change, drops in next polish window). Track Option 2 as a separate larger ask if a CI consumer actually needs the JSONL contract — don't pre-add the surface for a hypothetical use.
Surface to touch
src/bigquery_agent_analytics/cli.py — _format_feedback_snippet (or the line composition in _emit_evaluate_failures).
tests/test_cli.py::TestFormatFeedbackSnippet — add an escape-roundtrip case.
tests/test_cli.py::test_evaluate_exit_code_llm_judge_emits_feedback_snippet — extend the assertion to cover an embedded-quote justification (use the real (Design) example from the PR #45 smoke as the regression seed).