fix(sandbox): stale sandbox URL recovery before tool execution#2081
fix(sandbox): stale sandbox URL recovery before tool execution#2081ZhouJ-sh wants to merge 5 commits intobytedance:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves DeerFlow’s AIO sandbox resilience by detecting and recovering from stale cached sandbox URLs before tool execution, so tool calls don’t fail later with connection-refused errors.
Changes:
- Added an HTTP health probe in
AioSandboxProvider.get()to detect unreachable cached sandbox endpoints. - Implemented a recovery path that rediscoveries (or drops) stale sandboxes and refreshes the in-memory cache accordingly.
- Added regression tests for both “rediscovered” and “not rediscoverable” stale-sandbox scenarios.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| backend/packages/harness/deerflow/community/aio_sandbox/aio_sandbox_provider.py | Adds URL health probing in get() and a recovery helper to rediscover/refresh stale sandboxes. |
| backend/tests/test_aio_sandbox_provider.py | Adds tests asserting get() rediscovery behavior and cache eviction when rediscovery fails. |
| with self._lock: | ||
| sandbox = self._sandboxes.get(sandbox_id) | ||
| if sandbox is None: | ||
| return None | ||
|
|
||
| # Cached AIO sandboxes can outlive the underlying container/URL. | ||
| # Probe before returning so callers do not execute against a dead endpoint. | ||
| if isinstance(sandbox, AioSandbox) and not self._sandbox_url_is_healthy(sandbox.base_url): | ||
| logger.warning(f"Sandbox {sandbox_id} at {sandbox.base_url} is unreachable, attempting rediscovery") | ||
| return self._recover_stale_sandbox(sandbox_id) |
There was a problem hiding this comment.
get() now performs a blocking HTTP probe (requests.get(..., timeout=2)) on every cached AioSandbox access. Since ensure_sandbox_initialized() calls provider.get() on essentially every tool invocation, this adds a network round-trip (and up to 2s latency on timeouts) to the hot path and can materially slow tool execution or amplify transient network issues. Consider throttling the probe (e.g., only if time.time() - _last_activity[sandbox_id] exceeds a small threshold, or caching the last health-check time/result per sandbox) so health checks don’t run on every get() call while still recovering from genuinely stale URLs.
There was a problem hiding this comment.
Health checks are now throttled with a short cache window, so get() no longer probes on every hot-path access while still recovering stale sandbox URLs.
| with self._lock: | ||
| self._sandboxes.pop(sandbox_id, None) | ||
| self._sandbox_infos.pop(sandbox_id, None) | ||
| self._last_activity.pop(sandbox_id, None) | ||
|
|
||
| if discovered is None: | ||
| logger.warning(f"Sandbox {sandbox_id} could not be rediscovered after health-check failure") | ||
| return None |
There was a problem hiding this comment.
When _recover_stale_sandbox() drops cached state after a failed health check, it clears _sandboxes, _sandbox_infos, and _last_activity but leaves any entries in _thread_sandboxes that still point at this sandbox_id. That keeps the provider’s internal mapping inconsistent until the next acquire() happens to clean it up. Consider removing thread mappings for this sandbox ID here as well (mirroring release()/destroy()) so the provider state is immediately consistent after recovery failure.
There was a problem hiding this comment.
Recovery failure now removes any _thread_sandboxes entries pointing at the stale sandbox ID so provider state is immediately consistent.
|
@ZhouJ-sh thanks for your contribution. Please take a look at the review comments. |
|
@ZhouJ-sh Thread-to-sandbox mapping not restored after recovery (_recover_stale_sandbox) When recovery fails (discover returns None), the method clears _thread_sandboxes[tid] for all threads that pointed to the stale sandbox. This is correct — it ensures the next acquire(thread_id) creates a fresh sandbox. But when recovery succeeds, the thread-to-sandbox mapping is NOT restored. Looking at the cleanup: Then the fresh sandbox is registered in _sandboxes but _thread_sandboxes for those threads is gone. This means if those threads later call ensure_sandbox_initialized(), they won't find their sandbox_id in _thread_sandboxes and may acquire a second sandbox unnecessarily. However, looking at ensure_sandbox_initialized(), the sandbox_id is stored in runtime.state["sandbox"]["sandbox_id"], which persists across tool calls. The _thread_sandboxes mapping is only used by acquire() for warm-pool reclaim. So the consequence is: the next acquire() for that thread won't find the existing sandbox and may create a duplicate. This is wasteful but not incorrect — the old sandbox eventually times out via the idle checker. Consider restoring the thread mappings after successful recovery: |
|
Fixed. On successful rediscovery, I also added a regression assertion in Verified with: |
Summary
Automatically recover from stale sandbox URLs before tool execution by probing cached AIO sandboxes in
AioSandboxProvider.get()and rediscovering or recreating the live endpoint when the old URL is no longer reachable.Problem
A thread could keep a cached
sandbox_idafter its original sandbox URL had already become invalid. In that state, later tool calls such asbashfailed withError: [Errno 111] Connection refusedinstead of self-healing and continuing.Root Cause
The sandbox recovery path stopped too early:
ensure_sandbox_initialized()trustedprovider.get(sandbox_id)as long as it returned an objectAioSandboxProvider.get()only returned the in-memory sandbox instance and refreshed activity timestampsThat left the system unable to recover from the common case where the sandbox URL had already gone stale before the next tool call started.
Changes
AioSandboxProvider.get()for cachedAioSandboxinstancesNoneso the existing upper-layer initialization path can reacquire a sandboxTesting
deer-flow/backend/.venv/bin/pytest backend/tests/test_aio_sandbox_provider.py -qdeer-flow/backend/.venv/bin/pytest backend/tests/test_aio_sandbox.py backend/tests/test_sandbox_tools_security.py -qThese runs confirm the new stale-URL recovery behavior and verify that adjacent sandbox and tool behavior still passes unchanged.