fix(sandbox): stale sandbox URL recovery before tool execution by ZhouJ-sh · Pull Request #2081 · bytedance/deer-flow

ZhouJ-sh · 2026-04-10T05:55:30Z

Summary
Automatically recover from stale sandbox URLs before tool execution by probing cached AIO sandboxes in AioSandboxProvider.get() and rediscovering or recreating the live endpoint when the old URL is no longer reachable.

Problem
A thread could keep a cached sandbox_id after its original sandbox URL had already become invalid. In that state, later tool calls such as bash failed with Error: [Errno 111] Connection refused instead of self-healing and continuing.

Root Cause
The sandbox recovery path stopped too early:

ensure_sandbox_initialized() trusted provider.get(sandbox_id) as long as it returned an object
AioSandboxProvider.get() only returned the in-memory sandbox instance and refreshed activity timestamps
no health check was performed on the cached sandbox URL before reuse
when the underlying sandbox had already been reclaimed, the stale cached object was still returned and the failure only surfaced later during HTTP command execution

That left the system unable to recover from the common case where the sandbox URL had already gone stale before the next tool call started.

Changes

add a lightweight health probe in AioSandboxProvider.get() for cached AioSandbox instances
if the cached sandbox URL is unreachable, attempt rediscovery via the backend before returning it
if rediscovery succeeds, replace the stale cached sandbox with a fresh in-memory instance bound to the live URL
if rediscovery fails, clear the stale cached state and return None so the existing upper-layer initialization path can reacquire a sandbox
add regression tests covering both:
- stale cached sandbox successfully rediscovered
- stale cached sandbox not rediscoverable and therefore dropped

Testing

deer-flow/backend/.venv/bin/pytest backend/tests/test_aio_sandbox_provider.py -q
deer-flow/backend/.venv/bin/pytest backend/tests/test_aio_sandbox.py backend/tests/test_sandbox_tools_security.py -q

These runs confirm the new stale-URL recovery behavior and verify that adjacent sandbox and tool behavior still passes unchanged.

Copilot

Pull request overview

This PR improves DeerFlow’s AIO sandbox resilience by detecting and recovering from stale cached sandbox URLs before tool execution, so tool calls don’t fail later with connection-refused errors.

Changes:

Added an HTTP health probe in AioSandboxProvider.get() to detect unreachable cached sandbox endpoints.
Implemented a recovery path that rediscoveries (or drops) stale sandboxes and refreshes the in-memory cache accordingly.
Added regression tests for both “rediscovered” and “not rediscoverable” stale-sandbox scenarios.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
backend/packages/harness/deerflow/community/aio_sandbox/aio_sandbox_provider.py	Adds URL health probing in `get()` and a recovery helper to rediscover/refresh stale sandboxes.
backend/tests/test_aio_sandbox_provider.py	Adds tests asserting `get()` rediscovery behavior and cache eviction when rediscovery fails.

Copilot · 2026-04-10T10:17:10Z

+        with self._lock:
+            sandbox = self._sandboxes.get(sandbox_id)
+        if sandbox is None:
+            return None
+
+        # Cached AIO sandboxes can outlive the underlying container/URL.
+        # Probe before returning so callers do not execute against a dead endpoint.
+        if isinstance(sandbox, AioSandbox) and not self._sandbox_url_is_healthy(sandbox.base_url):
+            logger.warning(f"Sandbox {sandbox_id} at {sandbox.base_url} is unreachable, attempting rediscovery")
+            return self._recover_stale_sandbox(sandbox_id)


get() now performs a blocking HTTP probe (requests.get(..., timeout=2)) on every cached AioSandbox access. Since ensure_sandbox_initialized() calls provider.get() on essentially every tool invocation, this adds a network round-trip (and up to 2s latency on timeouts) to the hot path and can materially slow tool execution or amplify transient network issues. Consider throttling the probe (e.g., only if time.time() - _last_activity[sandbox_id] exceeds a small threshold, or caching the last health-check time/result per sandbox) so health checks don’t run on every get() call while still recovering from genuinely stale URLs.

Health checks are now throttled with a short cache window, so get() no longer probes on every hot-path access while still recovering stale sandbox URLs.

Copilot · 2026-04-10T10:17:10Z

+        with self._lock:
+            self._sandboxes.pop(sandbox_id, None)
+            self._sandbox_infos.pop(sandbox_id, None)
+            self._last_activity.pop(sandbox_id, None)
+
+            if discovered is None:
+                logger.warning(f"Sandbox {sandbox_id} could not be rediscovered after health-check failure")
+                return None


When _recover_stale_sandbox() drops cached state after a failed health check, it clears _sandboxes, _sandbox_infos, and _last_activity but leaves any entries in _thread_sandboxes that still point at this sandbox_id. That keeps the provider’s internal mapping inconsistent until the next acquire() happens to clean it up. Consider removing thread mappings for this sandbox ID here as well (mirroring release()/destroy()) so the provider state is immediately consistent after recovery failure.

Recovery failure now removes any _thread_sandboxes entries pointing at the stale sandbox ID so provider state is immediately consistent.

WillemJiang · 2026-04-11T09:08:19Z

@ZhouJ-sh thanks for your contribution. Please take a look at the review comments.

…houJ-sh/deer-flow into fix/sandbox-stale-url-recovery

WillemJiang · 2026-04-12T07:46:58Z

@ZhouJ-sh Thread-to-sandbox mapping not restored after recovery (_recover_stale_sandbox)

When recovery fails (discover returns None), the method clears _thread_sandboxes[tid] for all threads that pointed to the stale sandbox. This is correct — it ensures the next acquire(thread_id) creates a fresh sandbox. But when recovery succeeds, the thread-to-sandbox mapping is NOT restored. Looking at the cleanup:

thread_ids_to_remove = [tid for tid, sid in self._thread_sandboxes.items() if sid == sandbox_id]
for tid in thread_ids_to_remove:
    del self._thread_sandboxes[tid]

Then the fresh sandbox is registered in _sandboxes but _thread_sandboxes for those threads is gone. This means if those threads later call ensure_sandbox_initialized(), they won't find their sandbox_id in _thread_sandboxes and may acquire a second sandbox unnecessarily.

However, looking at ensure_sandbox_initialized(), the sandbox_id is stored in runtime.state["sandbox"]["sandbox_id"], which persists across tool calls. The _thread_sandboxes mapping is only used by acquire() for warm-pool reclaim. So the consequence is: the next acquire() for that thread won't find the existing sandbox and may create a duplicate. This is wasteful but not incorrect — the old sandbox eventually times out via the idle checker.

Consider restoring the thread mappings after successful recovery:

for tid in thread_ids_to_remove:
     self._thread_sandboxes[tid] = discovered.sandbox_id

ZhouJ-sh · 2026-04-13T08:12:55Z

Fixed.

On successful rediscovery, _recover_stale_sandbox() now restores the thread_id -> sandbox_id mappings it temporarily clears while dropping the stale cache entry. We still remove those mappings when rediscovery fails, so the next acquire(thread_id) can create a fresh sandbox.

I also added a regression assertion in test_get_rediscovers_stale_cached_sandbox to verify the thread mapping is preserved after recovery.

Verified with:
cd backend && uv run pytest -q tests/test_aio_sandbox_provider.py

Fix stale sandbox URL recovery before tool execution

db4dc69

ZhouJ-sh changed the title ~~fix(sanbox): stale sandbox URL recovery before tool execution~~ fix(sandbox): stale sandbox URL recovery before tool execution Apr 10, 2026

WillemJiang requested a review from Copilot April 10, 2026 10:13

Copilot started reviewing on behalf of WillemJiang April 10, 2026 10:13 View session

Copilot AI reviewed Apr 10, 2026

View reviewed changes

WillemJiang added question Further information is requested labels Apr 10, 2026

ZhouJ-sh added 3 commits April 12, 2026 14:30

Throttle sandbox health probes and clean stale thread mappings

d467502

Merge branch 'main' into fix/sandbox-stale-url-recovery

359de66

Merge branch 'fix/sandbox-stale-url-recovery' of https://github.com/Z…

a108359

…houJ-sh/deer-flow into fix/sandbox-stale-url-recovery

fix: restore thread mapping after sandbox recovery

88dd74a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sandbox): stale sandbox URL recovery before tool execution#2081

fix(sandbox): stale sandbox URL recovery before tool execution#2081
ZhouJ-sh wants to merge 5 commits intobytedance:mainfrom
ZhouJ-sh:fix/sandbox-stale-url-recovery

ZhouJ-sh commented Apr 10, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

ZhouJ-sh Apr 12, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

ZhouJ-sh Apr 12, 2026

Uh oh!

WillemJiang commented Apr 11, 2026

Uh oh!

WillemJiang commented Apr 12, 2026

Uh oh!

ZhouJ-sh commented Apr 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ZhouJ-sh commented Apr 10, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

ZhouJ-sh Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

ZhouJ-sh Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

WillemJiang commented Apr 11, 2026

Uh oh!

WillemJiang commented Apr 12, 2026

Uh oh!

ZhouJ-sh commented Apr 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants