Skip to content

fix(sandbox): stale sandbox URL recovery before tool execution#2081

Open
ZhouJ-sh wants to merge 5 commits intobytedance:mainfrom
ZhouJ-sh:fix/sandbox-stale-url-recovery
Open

fix(sandbox): stale sandbox URL recovery before tool execution#2081
ZhouJ-sh wants to merge 5 commits intobytedance:mainfrom
ZhouJ-sh:fix/sandbox-stale-url-recovery

Conversation

@ZhouJ-sh
Copy link
Copy Markdown
Contributor

Summary
Automatically recover from stale sandbox URLs before tool execution by probing cached AIO sandboxes in AioSandboxProvider.get() and rediscovering or recreating the live endpoint when the old URL is no longer reachable.

Problem
A thread could keep a cached sandbox_id after its original sandbox URL had already become invalid. In that state, later tool calls such as bash failed with Error: [Errno 111] Connection refused instead of self-healing and continuing.

Root Cause
The sandbox recovery path stopped too early:

  • ensure_sandbox_initialized() trusted provider.get(sandbox_id) as long as it returned an object
  • AioSandboxProvider.get() only returned the in-memory sandbox instance and refreshed activity timestamps
  • no health check was performed on the cached sandbox URL before reuse
  • when the underlying sandbox had already been reclaimed, the stale cached object was still returned and the failure only surfaced later during HTTP command execution

That left the system unable to recover from the common case where the sandbox URL had already gone stale before the next tool call started.

Changes

  • add a lightweight health probe in AioSandboxProvider.get() for cached AioSandbox instances
  • if the cached sandbox URL is unreachable, attempt rediscovery via the backend before returning it
  • if rediscovery succeeds, replace the stale cached sandbox with a fresh in-memory instance bound to the live URL
  • if rediscovery fails, clear the stale cached state and return None so the existing upper-layer initialization path can reacquire a sandbox
  • add regression tests covering both:
    • stale cached sandbox successfully rediscovered
    • stale cached sandbox not rediscoverable and therefore dropped

Testing

  • deer-flow/backend/.venv/bin/pytest backend/tests/test_aio_sandbox_provider.py -q
  • deer-flow/backend/.venv/bin/pytest backend/tests/test_aio_sandbox.py backend/tests/test_sandbox_tools_security.py -q

These runs confirm the new stale-URL recovery behavior and verify that adjacent sandbox and tool behavior still passes unchanged.

@ZhouJ-sh ZhouJ-sh changed the title fix(sanbox): stale sandbox URL recovery before tool execution fix(sandbox): stale sandbox URL recovery before tool execution Apr 10, 2026
@WillemJiang WillemJiang requested a review from Copilot April 10, 2026 10:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves DeerFlow’s AIO sandbox resilience by detecting and recovering from stale cached sandbox URLs before tool execution, so tool calls don’t fail later with connection-refused errors.

Changes:

  • Added an HTTP health probe in AioSandboxProvider.get() to detect unreachable cached sandbox endpoints.
  • Implemented a recovery path that rediscoveries (or drops) stale sandboxes and refreshes the in-memory cache accordingly.
  • Added regression tests for both “rediscovered” and “not rediscoverable” stale-sandbox scenarios.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
backend/packages/harness/deerflow/community/aio_sandbox/aio_sandbox_provider.py Adds URL health probing in get() and a recovery helper to rediscover/refresh stale sandboxes.
backend/tests/test_aio_sandbox_provider.py Adds tests asserting get() rediscovery behavior and cache eviction when rediscovery fails.

Comment on lines +606 to +615
with self._lock:
sandbox = self._sandboxes.get(sandbox_id)
if sandbox is None:
return None

# Cached AIO sandboxes can outlive the underlying container/URL.
# Probe before returning so callers do not execute against a dead endpoint.
if isinstance(sandbox, AioSandbox) and not self._sandbox_url_is_healthy(sandbox.base_url):
logger.warning(f"Sandbox {sandbox_id} at {sandbox.base_url} is unreachable, attempting rediscovery")
return self._recover_stale_sandbox(sandbox_id)
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get() now performs a blocking HTTP probe (requests.get(..., timeout=2)) on every cached AioSandbox access. Since ensure_sandbox_initialized() calls provider.get() on essentially every tool invocation, this adds a network round-trip (and up to 2s latency on timeouts) to the hot path and can materially slow tool execution or amplify transient network issues. Consider throttling the probe (e.g., only if time.time() - _last_activity[sandbox_id] exceeds a small threshold, or caching the last health-check time/result per sandbox) so health checks don’t run on every get() call while still recovering from genuinely stale URLs.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Health checks are now throttled with a short cache window, so get() no longer probes on every hot-path access while still recovering stale sandbox URLs.

Comment on lines +636 to +643
with self._lock:
self._sandboxes.pop(sandbox_id, None)
self._sandbox_infos.pop(sandbox_id, None)
self._last_activity.pop(sandbox_id, None)

if discovered is None:
logger.warning(f"Sandbox {sandbox_id} could not be rediscovered after health-check failure")
return None
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When _recover_stale_sandbox() drops cached state after a failed health check, it clears _sandboxes, _sandbox_infos, and _last_activity but leaves any entries in _thread_sandboxes that still point at this sandbox_id. That keeps the provider’s internal mapping inconsistent until the next acquire() happens to clean it up. Consider removing thread mappings for this sandbox ID here as well (mirroring release()/destroy()) so the provider state is immediately consistent after recovery failure.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recovery failure now removes any _thread_sandboxes entries pointing at the stale sandbox ID so provider state is immediately consistent.

@WillemJiang WillemJiang added question Further information is requested labels Apr 10, 2026
@WillemJiang
Copy link
Copy Markdown
Collaborator

@ZhouJ-sh thanks for your contribution. Please take a look at the review comments.

@WillemJiang
Copy link
Copy Markdown
Collaborator

@ZhouJ-sh Thread-to-sandbox mapping not restored after recovery (_recover_stale_sandbox)

When recovery fails (discover returns None), the method clears _thread_sandboxes[tid] for all threads that pointed to the stale sandbox. This is correct — it ensures the next acquire(thread_id) creates a fresh sandbox. But when recovery succeeds, the thread-to-sandbox mapping is NOT restored. Looking at the cleanup:

thread_ids_to_remove = [tid for tid, sid in self._thread_sandboxes.items() if sid == sandbox_id]
for tid in thread_ids_to_remove:
    del self._thread_sandboxes[tid]

Then the fresh sandbox is registered in _sandboxes but _thread_sandboxes for those threads is gone. This means if those threads later call ensure_sandbox_initialized(), they won't find their sandbox_id in _thread_sandboxes and may acquire a second sandbox unnecessarily.

However, looking at ensure_sandbox_initialized(), the sandbox_id is stored in runtime.state["sandbox"]["sandbox_id"], which persists across tool calls. The _thread_sandboxes mapping is only used by acquire() for warm-pool reclaim. So the consequence is: the next acquire() for that thread won't find the existing sandbox and may create a duplicate. This is wasteful but not incorrect — the old sandbox eventually times out via the idle checker.

Consider restoring the thread mappings after successful recovery:

for tid in thread_ids_to_remove:
     self._thread_sandboxes[tid] = discovered.sandbox_id

@ZhouJ-sh
Copy link
Copy Markdown
Contributor Author

Fixed.

On successful rediscovery, _recover_stale_sandbox() now restores the thread_id -> sandbox_id mappings it temporarily clears while dropping the stale cache entry. We still remove those mappings when rediscovery fails, so the next acquire(thread_id) can create a fresh sandbox.

I also added a regression assertion in test_get_rediscovers_stale_cached_sandbox to verify the thread mapping is preserved after recovery.

Verified with:
cd backend && uv run pytest -q tests/test_aio_sandbox_provider.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

question Further information is requested

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants