feat(llm): introduce lightweight circuit breaker to prevent rate-limit bans and resource exhaustion (#2095)
Conversation
@ag9920 thanks for the contribution. Please add a unit test and configuration support for this new feature.
Force-pushed fe14c65 to 91f74cb
@WillemJiang hi, I just updated the unit test, please take a look.
Pull request overview
Introduces a lightweight circuit breaker inside LLMErrorHandlingMiddleware to fast-fail repeated transient LLM provider failures, reducing retry-loop hangs and avoiding repeated calls during outages/rate-limits.
Changes:
- Add circuit breaker state/config to `LLMErrorHandlingMiddleware` and fast-fail when OPEN.
- Record successes/failures to reset or trip the circuit and add a user-facing circuit-breaker message.
- Add sync + async unit tests covering circuit breaker trip/open/half-open/recovery behavior.
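A trip/open/half-open/recovery test in the spirit described above might look like the sketch below. `FakeBreaker` is a hypothetical stand-in with an artificially short recovery window; the real tests exercise `LLMErrorHandlingMiddleware` directly.

```python
# Illustrative test sketch (not the file in the PR): drives a minimal
# breaker through trip -> open (fast-fail) -> half-open probe -> recovery.
import time


class FakeBreaker:
    """Hypothetical stand-in for the middleware's circuit state."""

    def __init__(self, threshold=5, window=0.05):
        self.threshold, self.window = threshold, window
        self.state, self.failures, self.opened_at = "closed", 0, 0.0

    def record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.threshold:
            self.state, self.opened_at = "open", time.monotonic()

    def record_success(self):
        self.state, self.failures = "closed", 0

    def allow(self):
        # Once the recovery window has elapsed, permit exactly one probe.
        if self.state == "open" and time.monotonic() - self.opened_at >= self.window:
            self.state = "half_open"
        return self.state != "open"


def test_trip_open_half_open_recovery():
    cb = FakeBreaker()
    for _ in range(5):
        cb.record_failure()
    assert cb.state == "open" and not cb.allow()   # tripped: fast-fail
    time.sleep(0.06)
    assert cb.allow() and cb.state == "half_open"  # window expired: probe allowed
    cb.record_success()
    assert cb.state == "closed"                    # probe succeeded: recovered
```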
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| backend/packages/harness/deerflow/agents/middlewares/llm_error_handling_middleware.py | Adds circuit breaker state, open-window logic, and fast-fail path around model calls. |
| backend/tests/test_llm_error_handling_middleware.py | Adds circuit breaker tests for sync/async execution and non-retriable error handling. |
@ag9920, thanks for your contribution. Please fix the lint error and address the review comments from Copilot.
Force-pushed 464d2be to 7bba514
@ag9920 two comments need to be addressed.
@WillemJiang I have reviewed all the comments generated by Copilot. The latest commit 7bba514 has already addressed all of these concerns.
@ag9920 Here are some comments for the code.

When the circuit is in `half_open` and the probe request raises `GraphBubbleUp`, neither `_record_success()` nor `_record_failure()` is called. This leaves `_circuit_probe_in_flight = True` permanently.

Code path: `llm_error_handling_middleware.py:210-215` and `:252-257`.

Result: on every subsequent call, `_check_circuit()` sees `state == half_open` and `probe_in_flight == True`, so it returns `True` (fast-fail). The circuit is deadlocked in `half_open`: it will never recover, because no probe can ever complete.

Fix: reset `probe_in_flight` in the `GraphBubbleUp` handler, or refactor so the probe flag is managed outside the try/except.
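The suggested fix can be sketched as follows. `probe_in_flight`, `_record_success()`, and `_record_failure()` come from the review; the `GraphBubbleUp` and `ProbeGuard` classes here are illustrative stand-ins, not the middleware's actual code.

```python
# Sketch of the fix: manage the half-open probe flag in a try/finally so
# that control-flow exceptions like GraphBubbleUp still release the slot.
import threading


class GraphBubbleUp(Exception):
    """Stand-in for the graph framework's control-flow exception."""


class ProbeGuard:
    def __init__(self):
        self._lock = threading.Lock()
        self.state = "half_open"
        self.probe_in_flight = False

    def _record_success(self):
        self.state = "closed"

    def _record_failure(self):
        self.state = "open"

    def run_probe(self, do_call):
        with self._lock:
            if self.probe_in_flight:
                raise RuntimeError("fast-fail: probe already in flight")
            self.probe_in_flight = True
        try:
            result = do_call()
        except GraphBubbleUp:
            # Deliberately records neither success nor failure...
            raise
        except Exception:
            self._record_failure()
            raise
        else:
            self._record_success()
            return result
        finally:
            # ...but the finally always clears the flag, so the circuit
            # can never deadlock in half_open with a phantom probe.
            with self._lock:
                self.probe_in_flight = False
```

Moving the flag reset into `finally` means every exit path, including exceptions that intentionally bypass the success/failure accounting, frees the probe slot.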
Force-pushed f7ea2e8 to 5d828af
@ag9920 Please fix the lint error. |
Force-pushed 5d828af to c6d83b8
@WillemJiang Thanks for the review! I've addressed all the concerns.
Port of bytedance/deer-flow#2095. Thread-safe circuit breaker (closed/open/half_open) on `LLMErrorHandlingMiddleware`. After 5 consecutive failures, fast-fails for 60s before probing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat(llm): introduce lightweight circuit breaker to prevent rate-limit bans and resource exhaustion (bytedance#2095)
🤔 What's the problem this PR solves?
Currently, `LLMErrorHandlingMiddleware` implements a robust retry mechanism with exponential backoff. However, when an LLM provider experiences a hard outage (e.g., persistent 502/503 errors) or when the user's IP/account is heavily rate-limited, the system will still blindly attempt to send requests for every new interaction. For self-hosted or single-tenant deployments, this causes two major issues: every interaction hangs for many seconds on doomed retries, and the aggressive polling risks further rate-limit penalties on the user's API key.
🛠️ What's the proposed solution?
This PR introduces a minimal, dependency-free Circuit Breaker pattern directly into the middleware.
- After `N` consecutive failed model calls (default 5, evaluated after internal retries are exhausted), the circuit trips to `OPEN`. Subsequent requests within the recovery window (default 60s) are immediately rejected with a graceful error message, bypassing the network and retry loop entirely.
- After the recovery window, the circuit enters a `HALF-OPEN` state and allows exactly one probe request using an explicit in-flight flag. Other concurrent requests will fast-fail, preventing a "thundering herd" effect on the struggling provider. If the probe succeeds, the circuit closes; if it fails, the circuit re-opens and extends the window.
- No external libraries (such as `pybreaker` or wrappers) were added. It uses a thread-safe state machine, and error logs are optimized to only print during state transitions, avoiding log spam during sustained outages.

📊 Why is this necessary for self-hosted users?
Even for personal deployments, users pay per request or have strict rate limits. When the upstream API is down, failing fast saves the user from waiting 10+ seconds per interaction just to see a timeout error, and protects their API keys from being penalized for aggressive polling.
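The closed/open/half-open machine the PR describes can be sketched as a stand-alone class. This is an illustrative sketch mirroring the stated defaults (5 failures, 60s window), not the actual middleware implementation; names like `should_fast_fail` are hypothetical.

```python
# Minimal, dependency-free circuit breaker sketch: thread-safe state
# machine with a failure threshold, a recovery window, and a single
# half-open probe slot guarded by an in-flight flag.
import threading
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_window=60.0):
        self._lock = threading.Lock()
        self._state = "closed"
        self._failures = 0
        self._opened_at = 0.0
        self._probe_in_flight = False
        self.failure_threshold = failure_threshold
        self.recovery_window = recovery_window

    def should_fast_fail(self):
        """Return True if the call must be rejected without hitting the API."""
        with self._lock:
            now = time.monotonic()
            if self._state == "open":
                if now - self._opened_at < self.recovery_window:
                    return True
                self._state = "half_open"  # window expired: try one probe
            if self._state == "half_open":
                if self._probe_in_flight:
                    return True  # another probe is already running
                self._probe_in_flight = True
                return False
            return False  # closed: let the call through

    def record_success(self):
        with self._lock:
            self._state = "closed"
            self._failures = 0
            self._probe_in_flight = False

    def record_failure(self):
        with self._lock:
            self._probe_in_flight = False
            self._failures += 1
            if self._state == "half_open" or self._failures >= self.failure_threshold:
                self._state = "open"
                self._opened_at = time.monotonic()
```

A failed half-open probe immediately re-opens the circuit and restarts the window, which gives the "extends the window" behavior described above.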