feat(telemetry): Fix error message telemetry for tool calls #5797

Open
Achuth17 wants to merge 2 commits into google:main from Achuth17:feat-telemetry-tool-error-messages

Conversation

@Achuth17
Contributor

Tool error details are not populated correctly in traces for some tool call categories, such as REST Tool or MCP Tool.

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

  • Closes: #issue_number
  • Related: #issue_number

2. Or, if no issue exists, describe the change:

If applicable, please follow the issue templates to provide as much detail as
possible.

Problem:
A clear and concise description of what the problem is.

Tool spans were missing the error.type attribute when a tool returned an error as a dict (e.g. {'error': ...} from a FunctionTool, {'isError': True} from an MCP tool, {'status': 'error'} from a REST/discovery tool) rather than raising an exception. Since most ADK tool categories signal errors via return value rather than exceptions, this meant tool error telemetry was effectively absent for the majority of real-world tool failures, making it hard to detect and diagnose tool errors from traces.
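To illustrate the gap, here is a minimal sketch of telemetry that only records an error type when a tool raises. The helper `error_type_from_exception_only` and both sample tools are hypothetical stand-ins, not ADK code; the response shape mirrors the `{'error': ...}` example above.

```python
from typing import Any, Callable, Optional


def error_type_from_exception_only(call: Callable[[], Any]) -> Optional[str]:
  """Hypothetical stand-in: sets an error type only when the tool raises."""
  try:
    call()
  except Exception as exc:
    return type(exc).__name__
  return None


def raising_tool() -> dict:
  # Signals failure by raising, so exception-based telemetry sees it.
  raise ValueError('not found')


def returning_tool() -> dict:
  # Signals failure via the return value, so exception-based telemetry
  # records nothing.
  return {'error': 'not found'}


raised = error_type_from_exception_only(raising_tool)
returned = error_type_from_exception_only(returning_tool)
```

Here `raised` holds the exception class name, while `returned` stays `None` even though the tool reported a failure.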

Solution:
A clear and concise description of what you want to happen and why you choose
this solution.

What: Detect dict-shaped errors in tool responses and populate error.type on the tool span with a category-specific string (e.g. TOOL_ERROR, MCP_TOOL_ERROR, HTTP_ERROR). Implemented via an opt-in private hook _detect_error_in_response(self, response) on individual tool classes (FunctionTool, MCPTool, RestApiTool, BashTool, GoogleTool, DiscoveryEngineSearchTool, environment tools, skill tools); the function-call dispatcher calls it via getattr after a successful tool invocation and threads the detected string through tel_ctx.error_type into trace_tool_call. Exception-raised errors continue to take precedence.
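A rough sketch of the opt-in hook and the `getattr` lookup follows. The classes are simplified stand-ins rather than the real ADK classes, `detect_error_type` is a hypothetical helper, and the pairing of each category string with a tool class and response shape is an assumption based on the examples in this description.

```python
from typing import Any, Optional


class BaseTool:
  """Stand-in base class; deliberately defines no detection hook."""

  def __init__(self, name: str):
    self.name = name


class FunctionTool(BaseTool):

  def _detect_error_in_response(self, response: Any) -> Optional[str]:
    # Assumed rule: a dict with a truthy 'error' key signals failure.
    if isinstance(response, dict) and response.get('error'):
      return 'TOOL_ERROR'
    return None


class MCPTool(BaseTool):

  def _detect_error_in_response(self, response: Any) -> Optional[str]:
    # Assumed rule: MCP-style results flag failure with isError=True.
    if isinstance(response, dict) and response.get('isError') is True:
      return 'MCP_TOOL_ERROR'
    return None


class RestApiTool(BaseTool):

  def _detect_error_in_response(self, response: Any) -> Optional[str]:
    # Assumed rule: REST-style results flag failure with status='error'.
    if isinstance(response, dict) and response.get('status') == 'error':
      return 'HTTP_ERROR'
    return None


def detect_error_type(tool: BaseTool, response: Any) -> Optional[str]:
  """Looks up the optional hook via getattr; tools without it are skipped."""
  detector = getattr(tool, '_detect_error_in_response', None)
  if detector is None:
    return None
  return detector(response)
```

Because the lookup goes through `getattr`, a tool that never defines the hook simply yields `None`, and the base class stays unchanged.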

Why this solution:

  • Opt-in per tool — each tool decides what counts as an error (shape and category string), avoiding a one-size-fits-all rule that wouldn't fit the heterogeneous response schemas.
  • Private hook (no BaseTool API change) — call sites use getattr(tool, '_detect_error_in_response', None), so we don't commit to a public API while the design settles; a public redesign can come later without breaking tools.
  • Defensive integration — detection is wrapped in try/except (errors swallowed + logged) and skipped for auth/confirmation requests, so a buggy detector can't break tool execution.
  • Minimal blast radius — exception path is untouched; error_type is only consulted when no exception was raised, preserving existing behavior for tools that already throw.
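The defensive integration and the precedence rule could look roughly like the sketch below. `safe_detect` and `resolve_error_type` are hypothetical names, the `skip` flag stands in for the auth/confirmation check, and using the exception class name as the error type on the exception path is an assumption rather than something stated in this PR.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def safe_detect(tool: Any, response: Any, skip: bool = False) -> Optional[str]:
  """Runs the optional hook without letting a detector failure propagate."""
  if skip:
    # Assumed to be set when the call ended in an auth or confirmation request.
    return None
  detector = getattr(tool, '_detect_error_in_response', None)
  if detector is None:
    return None
  try:
    return detector(response)
  except Exception:
    # Swallow and log so a buggy detector cannot break tool execution.
    logger.warning('Error detector failed for %r', tool, exc_info=True)
    return None


def resolve_error_type(
    raised: Optional[BaseException], detected: Optional[str]
) -> Optional[str]:
  """A raised exception takes precedence over a detected dict error."""
  if raised is not None:
    return type(raised).__name__
  return detected
```

The detected string is consulted only when `raised` is `None`, so tools that already throw keep their existing behavior.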

Testing Plan

Please describe the tests that you ran to verify your changes. This is required
for all PRs that are not small documentation or typo fixes.

Unit Tests:

Tests run:

  • tests/unittests/telemetry/test_spans.py + tests/unittests/flows/llm_flows/test_functions_simple.py (targeted) — 99/99 passed. Covers new tests for each tool's _detect_error_in_response hook, the error_type kwarg on trace_tool_call (including exception-takes-precedence), guard that BaseTool doesn't expose the hook, and e2e tests verifying: dict errors recorded, success dicts ignored, exception precedence, detection skipped on auth/confirmation requests, and detector exceptions don't break tool calls.

  • tests/unittests/telemetry/ + tests/unittests/flows/llm_flows/ (broader regression) — 517/517 passed.

  • tests/unittests/tools/ (regression on the 8 modified tool files) — 1394/1394 passed.

  • I have added or updated unit tests for my change.

  • All unit tests pass locally.

Please include a summary of passed pytest results.

Manual End-to-End (E2E) Tests:

Please provide instructions on how to manually test your changes, including any
necessary setup or configuration. Please provide logs or screenshots to help
reviewers better understand the fix.

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

Additional context

Add any other context or screenshots about the feature request here.

@adk-bot added the tracing label May 21, 2026
@adk-bot
Collaborator

adk-bot commented May 21, 2026

Response from ADK Triaging Agent

Hello @Achuth17, thank you for creating this PR!

This PR is a feature/bug fix related to telemetry. To help our reviewers better understand and verify the changes, could you please provide:

  • Manual End-to-End (E2E) Tests / Verification Evidence: Please provide instructions on how to manually test your changes, along with any relevant console logs or screenshots showing the updated telemetry traces after the fix is applied.

This information will help the reviewers review your PR more efficiently. Thanks!

@Achuth17 marked this pull request as ready for review May 21, 2026 23:26
@xuanyang15 self-assigned this May 21, 2026
Labels

tracing [Component] This issue is related to OpenTelemetry tracing
