feat(llm): introduce lightweight circuit breaker to prevent rate-limit bans and resource exhaustion #2095

Merged
WillemJiang merged 1 commit into bytedance:main from ag9920:feat_circuit_breaker
Apr 12, 2026

@ag9920
Contributor

ag9920 commented Apr 10, 2026

🤔 What's the problem this PR solves?

Currently, LLMErrorHandlingMiddleware implements a robust retry mechanism with exponential backoff. However, when an LLM provider experiences a hard outage (e.g., persistent 502/503 errors) or when the user's IP/account is heavily rate-limited, the system will still blindly attempt to send requests for every new interaction.

For self-hosted or single-tenant deployments, this causes two major issues:

  1. API Key Bans / Throttling Penalties: Continuously bombarding a struggling API provider (like OpenAI or Anthropic) with failing requests can trigger their anti-abuse systems, leading to longer ban periods or even account suspension.
  2. Resource Exhaustion (Hangs): Each blind request attempts the full retry cycle, keeping the thread/async-task hanging for the maximum backoff duration, which degrades the overall responsiveness of the local application.

🛠️ What's the proposed solution?

This PR introduces a minimal, dependency-free Circuit Breaker pattern directly into the middleware.

  • Fast Fail (OPEN): If the LLM provider fails consecutively for N complete model calls (default 5, evaluated after internal retries are exhausted), the circuit trips to OPEN. Subsequent requests within the recovery window (default 60s) are immediately rejected with a graceful error message, bypassing the network and retry loop entirely.
  • Self-Healing (HALF-OPEN) & Concurrency Control: After the timeout, the circuit enters HALF-OPEN state and allows exactly one probe request using an explicit in-flight flag. Other concurrent requests will fast-fail, preventing a "thundering herd" effect on the struggling provider. If the probe succeeds, the circuit closes; if it fails, the circuit re-opens and extends the window.
  • Lightweight & Quiet: No external dependencies (like pybreaker or wrappers) were added. It uses a thread-safe state machine, and error logs are optimized to only print during state transitions, avoiding log spam during sustained outages.
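The three states described above can be sketched as a small thread-safe class. This is an illustration only, not the merged middleware code: the class name `CircuitBreakerSketch`, its method names, and the injectable `clock` parameter are hypothetical, chosen so the behaviour can be exercised without sleeping.

```python
import threading
import time


class CircuitBreakerSketch:
    """Illustrative closed/open/half_open breaker (hypothetical names)."""

    def __init__(self, failure_threshold=5, recovery_timeout_sec=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_sec = recovery_timeout_sec
        self._clock = clock  # injectable so tests need not sleep
        self._lock = threading.Lock()
        self._state = "closed"
        self._failures = 0
        self._open_until = 0.0
        self._probe_in_flight = False

    def should_fast_fail(self):
        """Return True if the caller must skip the provider call."""
        with self._lock:
            if self._state == "closed":
                return False
            if self._state == "open":
                if self._clock() < self._open_until:
                    return True
                # Recovery window elapsed: admit this caller as the single probe.
                self._state = "half_open"
                self._probe_in_flight = True
                return False
            # half_open: only one probe may be in flight at a time.
            if self._probe_in_flight:
                return True
            self._probe_in_flight = True
            return False

    def record_success(self):
        with self._lock:
            self._state = "closed"
            self._failures = 0
            self._probe_in_flight = False

    def record_failure(self):
        """Call once per model call, after internal retries are exhausted."""
        with self._lock:
            self._failures += 1
            if self._state == "half_open" or self._failures >= self.failure_threshold:
                # Trip (or re-trip) the circuit and start a fresh recovery window.
                self._state = "open"
                self._open_until = self._clock() + self.recovery_timeout_sec
            self._probe_in_flight = False
```

A caller would check `should_fast_fail()` before touching the network and report exactly one outcome afterwards. Taking the lock around every state read and write is what keeps the single-probe guarantee intact under concurrent requests.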

📊 Why is this necessary for self-hosted users?

Even for personal deployments, users pay per request or have strict rate limits. When the upstream API is down, failing fast saves the user from waiting 10+ seconds per interaction just to see a timeout error, and protects their API keys from being penalized for aggressive polling.

@WillemJiang
Collaborator

@ag9920 thanks for the contribution. Please add unit tests and configuration support for this new feature.

WillemJiang added the "question" (Further information is requested) label on Apr 10, 2026
ag9920 force-pushed the feat_circuit_breaker branch from fe14c65 to 91f74cb on April 10, 2026 13:12
@ag9920
Contributor Author

ag9920 commented Apr 10, 2026

@WillemJiang Hi, I just updated the unit tests. Please take a look.

Contributor

Copilot AI left a comment

Pull request overview

Introduces a lightweight circuit breaker inside LLMErrorHandlingMiddleware to fast-fail repeated transient LLM provider failures, reducing retry-loop hangs and avoiding repeated calls during outages/rate-limits.

Changes:

  • Add circuit breaker state/config to LLMErrorHandlingMiddleware and fast-fail when OPEN.
  • Record successes/failures to reset or trip the circuit and add a user-facing circuit-breaker message.
  • Add sync + async unit tests covering circuit breaker trip/open/half-open/recovery behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| backend/packages/harness/deerflow/agents/middlewares/llm_error_handling_middleware.py | Adds circuit breaker state, open-window logic, and fast-fail path around model calls. |
| backend/tests/test_llm_error_handling_middleware.py | Adds circuit breaker tests for sync/async execution and non-retriable error handling. |

Comment thread backend/tests/test_llm_error_handling_middleware.py Outdated
@WillemJiang
Collaborator

@ag9920, thanks for your contribution. Please fix the lint error and address the review comments from Copilot.

ag9920 force-pushed the feat_circuit_breaker branch 2 times, most recently from 464d2be to 7bba514, on April 10, 2026 15:53
@WillemJiang
Collaborator

@ag9920 two comments need to be addressed.

@ag9920
Contributor Author

ag9920 commented Apr 11, 2026

@WillemJiang I have reviewed all the comments generated by Copilot. Actually, the latest commit 7bba514 has already addressed all of these concerns:

  1. Half-open concurrency issue: The _circuit_probe_in_flight flag and explicit state tracking have been fully implemented in _check_circuit() exactly as suggested. This prevents the thundering herd problem during the half_open state.
  2. Failure counting semantics: This was a false positive from Copilot. Inside the wrap_model_call loop, we use continue during retries. _record_failure() is only called after all retries are exhausted and the function is about to return. It counts per overall call, not per retry attempt.
  3. Log noise: The _record_failure() method has been updated to log errors only when the state actually transitions to open.
  4. Linting (Ruff): The missing blank lines and comment spaces in the test file have been fixed, and all CI checks are passing.

The current code behaves correctly and matches the PR description. I will go ahead and mark those unresolved conversations as resolved.
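The per-call counting described in item 2 can be illustrated with a simplified retry loop. This is a sketch under stated assumptions, not the body of wrap_model_call: `call_with_retries` and `CountingBreaker` are hypothetical names, every exception is treated as transient, backoff sleeps are omitted, and `None` stands in for the graceful error message the real middleware returns.

```python
class CountingBreaker:
    """Hypothetical stand-in that only counts reported outcomes."""

    def __init__(self):
        self.failures = 0
        self.successes = 0

    def record_failure(self):
        self.failures += 1

    def record_success(self):
        self.successes += 1


def call_with_retries(handler, request, breaker, max_attempts=3):
    """Run handler with retries; report exactly one outcome per call."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = handler(request)
        except Exception:
            if attempt < max_attempts:
                continue  # retry: the breaker is not touched
            breaker.record_failure()  # retries exhausted: counts once
            return None  # placeholder for a user-facing error message
        breaker.record_success()
        return response
```

With `max_attempts=3`, a handler that always raises is invoked three times but increments the failure count by one, so a threshold of 5 corresponds to five fully failed calls rather than five failed attempts.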

ag9920 requested a review from WillemJiang on April 11, 2026 02:53
@WillemJiang
Collaborator

@ag9920 Here are some comments on the code:

  1. GraphBubbleUp leaves the circuit stuck forever

When the circuit is in half_open and the probe request raises GraphBubbleUp, neither _record_success() nor _record_failure() is called. This leaves _circuit_probe_in_flight = True permanently.

Code path (llm_error_handling_middleware.py:210-215 and :252-257):

if self._check_circuit():          # probe_in_flight set to True
    return AIMessage(...)

# ...
try:
    response = handler(request)
    self._record_success()         # NOT reached on GraphBubbleUp
    return response
except GraphBubbleUp:
    raise                          # probe_in_flight stays True!

Result: On every subsequent call, _check_circuit() sees state == half_open and probe_in_flight == True → returns True (fast-fail). The circuit is deadlocked in half_open: it will never recover because no probe is ever sent, and no success/failure is ever recorded.

Fix: Reset probe_in_flight in the GraphBubbleUp handler, or refactor so the probe flag is managed outside the try/except:

try:
    response = handler(request)
    self._record_success()
    return response
except GraphBubbleUp:
    # Probe was in-flight but never completed; release it
    # so the circuit can retry on the next request.
    with self._circuit_lock:
        if self._circuit_state == "half_open":
            self._circuit_probe_in_flight = False
    raise
  2. No external configuration.
    The thresholds are class attributes (circuit_failure_threshold: int = 5, circuit_recovery_timeout_sec: int = 60) set programmatically but not wired to the app config file.
    Self-hosted users (the stated audience) may want to tune these without code changes.
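For item 1, the alternative mentioned in the fix (managing the probe flag outside the try/except) could look roughly like the sketch below. It is hypothetical: `HalfOpenProbe` and `run_probe` are illustrative names, and `GraphBubbleUp` is redefined locally as a stand-in so the example is self-contained. Only the attribute names `_circuit_lock`, `_circuit_state`, and `_circuit_probe_in_flight` come from the review above.

```python
import threading


class GraphBubbleUp(Exception):
    """Local stand-in for the control-flow exception named in the review."""


class HalfOpenProbe:
    """Hypothetical helper holding only the state needed to show probe release."""

    def __init__(self):
        self._circuit_lock = threading.Lock()
        self._circuit_state = "half_open"
        self._circuit_probe_in_flight = True  # pretend a probe was just admitted

    def _record_success(self):
        with self._circuit_lock:
            self._circuit_state = "closed"
            self._circuit_probe_in_flight = False

    def run_probe(self, handler, request):
        outcome_recorded = False
        try:
            response = handler(request)
            self._record_success()
            outcome_recorded = True
            return response
        finally:
            # Runs on every exit path, including GraphBubbleUp propagating out.
            if not outcome_recorded:
                with self._circuit_lock:
                    if self._circuit_state == "half_open":
                        self._circuit_probe_in_flight = False
```

The `finally` block releases the probe whenever no outcome was recorded, so a future exception type that bypasses both recorders cannot reintroduce the deadlock. In this sketch ordinary errors would also merely release the probe; a real middleware would still catch them and record a failure.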

@CLAassistant

CLAassistant commented Apr 11, 2026

CLA assistant check
All committers have signed the CLA.

ag9920 force-pushed the feat_circuit_breaker branch 3 times, most recently from f7ea2e8 to 5d828af, on April 11, 2026 11:12
@WillemJiang
Collaborator

@ag9920 Please fix the lint error.

ag9920 force-pushed the feat_circuit_breaker branch from 5d828af to c6d83b8 on April 11, 2026 15:25
@ag9920
Contributor Author

ag9920 commented Apr 11, 2026

@WillemJiang Thanks for the review! I've addressed all the concerns:

  1. GraphBubbleUp deadlock: Added a _circuit_probe_in_flight reset in both the sync (wrap_model_call) and async (awrap_model_call) handlers for GraphBubbleUp. The lock is held while checking and modifying the state.
  2. External configuration: Added CircuitBreakerConfig to AppConfig with failure_threshold (default 5) and recovery_timeout_sec (default 60). The middleware now loads these from app_config.yaml if available, and falls back to the defaults in test environments.
  3. Unit tests: Added both sync and async tests to verify that GraphBubbleUp in the half_open state resets the probe flag.
  4. Code formatting: Ran ruff format to ensure all files meet lint requirements.

All 10 unit tests are passing locally, and ruff check is clean.
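The configuration fallback described in item 2 might be sketched as follows. This is an assumption-laden illustration rather than the merged code: it uses a plain dataclass instead of whatever model AppConfig actually uses, the loader name `load_circuit_breaker_config` is hypothetical, and the `circuit_breaker` key is a guess at the YAML section name. Only the two field names and their defaults come from the comment above.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class CircuitBreakerConfig:
    failure_threshold: int = 5
    recovery_timeout_sec: int = 60


def load_circuit_breaker_config(raw: Optional[Mapping[str, Any]]) -> CircuitBreakerConfig:
    """Build the config from an already-parsed YAML mapping, with defaults.

    `raw` may be None (no config file, e.g. in tests); the section key
    "circuit_breaker" is assumed for illustration.
    """
    section = (raw or {}).get("circuit_breaker") or {}
    defaults = CircuitBreakerConfig()
    return CircuitBreakerConfig(
        failure_threshold=int(section.get("failure_threshold", defaults.failure_threshold)),
        recovery_timeout_sec=int(section.get("recovery_timeout_sec", defaults.recovery_timeout_sec)),
    )
```

Passing `None` yields the defaults of 5 and 60, while a partial section overrides only the keys it names, which is the behaviour a self-hosted user would likely expect when tuning one value.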

WillemJiang merged commit 4d4ddb3 into bytedance:main on Apr 12, 2026
4 checks passed
tmartin2113 added a commit to tmartin2113/paperclip that referenced this pull request Apr 12, 2026
Port of bytedance/deer-flow#2095. Thread-safe circuit breaker
(closed/open/half_open) on LLMErrorHandlingMiddleware. After 5
consecutive failures, fast-fails for 60s before probing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MarkHoch pushed a commit to MarkHoch/deer-flow that referenced this pull request Apr 16, 2026
Labels

question Further information is requested

4 participants