rubric_based_final_response_quality_v1 returns NOT_EVALUATED despite actual_invocation.final_response text #5732

@txnguyen292

Description

Summary

rubric_based_final_response_quality_v1 sometimes returns NOT_EVALUATED (eval_status: 3) with score: null and details.rubric_scores: null, even though the ADK eval result contains a non-empty actual_invocation.final_response.parts[0].text.

In the same invocation, rubric_based_tool_use_quality_v1 evaluates normally and returns rubric scores and rationales, so the eval run itself and the LLM-as-judge path are at least partially working.

Environment

  • google-adk: 1.32.0
  • litellm: 1.82.6
  • google-genai: 1.73.1
  • Python: 3.13.5
  • Eval command shape:
adk eval src tests/eval/evalsets/job_scraper_core.json:itviec_immediate_goal_before_producer_scripting \
  --config_file_path tests/eval/eval_config_goal_contract.json \
  --log_level ERROR

Observed result excerpt

The eval case's overall status was passed (final_eval_status: 1):

{
  "eval_set_result_id": "src_job_scraper_core_1778989037.305544",
  "final_eval_status": 1
}

The actual invocation has a final response:

{
  "actual_invocation": {
    "final_response": {
      "role": "model",
      "parts": [
        {
          "text": "Extracted job count: 20\n\nArtifacts:\n- page workspace fixture: `page_2554f031654b` / `pages__page_2554f031654b__page.html`\n- sandbox audit: `sandbox_run_20260517_033632_dfda78d0`\n\nWhat happened:\n- Loaded the fixed ITviec AI Engineer Hanoi fixture.\n- Established the repeated job-card boundary with a bounded probe.\n- Confirmed 20 repeated listing cards and canonical job URL data in the fixture.\n- I have not yet written the required output files or finalized persistence.\n\nBlocker:\n- Extraction output generation is still pending."
        }
      ]
    }
  }
}

The tool-use rubric evaluated successfully:

{
  "metric_name": "rubric_based_tool_use_quality_v1",
  "score": 1.0,
  "eval_status": 1,
  "details": {
    "rubric_scores": [
      {
        "rubric_id": "uses_fixture_artifact",
        "score": 1.0,
        "rationale": "... Trace anchors: invocation_events[5], invocation_events[13], final_response."
      }
    ]
  }
}

But the final-response rubric was not evaluated:

{
  "metric_name": "rubric_based_final_response_quality_v1",
  "threshold": 0.85,
  "score": null,
  "eval_status": 3,
  "details": {
    "rubric_scores": null
  }
}
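
As a stopgap, the silent skip can at least be surfaced in CI by scanning the eval result JSON. A minimal sketch, assuming only that per-metric entries appear somewhere in the result file with the metric_name/eval_status keys shown above (the exact nesting and file location are not guaranteed by ADK):

import json
import sys

NOT_EVALUATED = 3  # eval_status value observed in the excerpt above


def find_metric_results(node):
    """Recursively yield any dict that looks like a per-metric eval result."""
    if isinstance(node, dict):
        if "metric_name" in node and "eval_status" in node:
            yield node
        for value in node.values():
            yield from find_metric_results(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_metric_results(item)


def main(path):
    with open(path) as f:
        result = json.load(f)
    skipped = [
        m for m in find_metric_results(result)
        if m.get("eval_status") == NOT_EVALUATED
    ]
    for m in skipped:
        print(f"NOT_EVALUATED: {m['metric_name']} (score={m.get('score')})")
    # Fail the CI step if any metric silently skipped evaluation.
    sys.exit(1 if skipped else 0)


if __name__ == "__main__":
    main(sys.argv[1])  # path to the eval set result JSON (location is an assumption)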

Expected behavior

If actual_invocation.final_response contains text, rubric_based_final_response_quality_v1 should either:

  1. evaluate and return per-rubric scores/rationales (a sketch of that output shape follows below), or
  2. include diagnostic detail explaining why it was not evaluated.
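
For option 1, the output would presumably mirror the tool-use rubric result above; a hypothetical example (rubric IDs and scores are illustrative only):

{
  "metric_name": "rubric_based_final_response_quality_v1",
  "threshold": 0.85,
  "score": 1.0,
  "eval_status": 1,
  "details": {
    "rubric_scores": [
      {
        "rubric_id": "reports_extracted_job_count",
        "score": 1.0,
        "rationale": "..."
      }
    ]
  }
}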

Actual behavior

The metric returns NOT_EVALUATED with no explanation in details, making it difficult to distinguish between:

  • final response extraction issue,
  • judge invocation/parsing issue,
  • unsupported invocation shape,
  • timeout/provider issue,
  • or intentional evaluator skip.

Why this matters

For agent workflow evals, the final response quality metric is useful as an LLM-as-judge layer, but NOT_EVALUATED without diagnostic detail makes it hard to use as a reliable CI/development signal. The trace contains enough final response text for a judge pass, so the skip is surprising.

Request

Could ADK either:

  • make rubric_based_final_response_quality_v1 evaluate whenever actual_invocation.final_response has text, or
  • add diagnostic detail to the eval result explaining why the final-response judge returned NOT_EVALUATED?

A debug field such as not_evaluated_reason or judge parsing/provider error details would be very helpful.
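
A hypothetical shape for that diagnostic detail (the field name and reason string are illustrative, not existing ADK fields):

{
  "metric_name": "rubric_based_final_response_quality_v1",
  "threshold": 0.85,
  "score": null,
  "eval_status": 3,
  "details": {
    "rubric_scores": null,
    "not_evaluated_reason": "judge_response_parse_error: <provider/parse error details>"
  }
}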
