Skip to content

Fix unbounded attachable growth when monitoring actors exit#13

Open
jachris wants to merge 5 commits intomainfrom
fix/monitor-attachable-growth
Open

Fix unbounded attachable growth when monitoring actors exit#13
jachris wants to merge 5 commits intomainfrom
fix/monitor-attachable-growth

Conversation

@jachris
Copy link
Copy Markdown

@jachris jachris commented Apr 7, 2026

Problem

Two separate code paths in CAF's monitor implementation had the same fundamental flaw: when a monitoring actor exits before the monitored actor, stale entries accumulated in the monitored actor's attachable list and kept the monitoring actor's actor_control_block alive indefinitely.

New callback API (monitor(whom, fn)scheduled_actor::do_monitor): used a functor_attachable with token::anonymous, which detach() can never match. So even calling dispose() on the returned handle couldn't remove the entry. The actor_addr capture kept the control block alive.

Legacy API (monitor(whom)local_actor::do_monitor): used default_attachable, which is detachable via do_demonitor() — but nothing ever called it on actor exit. Same unbounded growth and control block leak resulted.

Concrete manifestation: tenzir::async_mail creates a temporary actor_companion per request and calls companion->monitor(receiver) (legacy API) on the long-lived receiver. Each companion exits after receiving its response but left a permanent stale default_attachable in the receiver's list. Under sustained async_mail traffic against a single actor this caused unbounded memory growth.

Solution

New callback API (scheduled_actor):

  • Add token::monitor = 3 to attachable::token.
  • Introduce detail::monitor_attachable: a detachable attachable keyed by abstract_monitor_action*, replacing the anonymous functor_attachable.
  • Introduce monitor_disposable_impl: dispose() cancels the action and calls detach() on the monitored actor via a weak_actor_ptr.
  • scheduled_actor tracks active monitors in active_monitors_; on_cleanup() disposes all of them automatically.

Legacy API (local_actor):

  • Track each monitored actor as a weak_actor_ptr in local_actor::monitored_actors_.
  • In on_cleanup(), lock each entry and call do_demonitor() to detach the stale default_attachable. Dead entries (lock returns null) are skipped safely.

Review

For the new API: start in detail/monitor_attachable.hpp, then scheduled_actor.cpp for monitor_disposable_impl and the rewritten do_monitor.

For the legacy API: local_actor.cppdo_monitor and on_cleanup are the only changes, plus monitored_actors_ in the header.

Race safety: actor_exited is called after the monitored actor releases its mutex (see abstract_actor::cleanup), so calling detach() from on_cleanup cannot deadlock. Concurrent exits are safe: detach() on an already-swept list is a benign no-op; set_reason() on an already-disposed action returns false, preventing double-delivery.

jachris and others added 4 commits January 26, 2026 09:41
When a detached actor reaches its max throughput limit in resume(),
it was unconditionally calling sched->delay() to reschedule itself.
This causes detached actors to run on the shared scheduler thread pool
instead of their dedicated private thread, leading to thread starvation
and potential deadlocks when many detached actors compete for the
limited worker threads.

Fix this by checking for a private_thread_ first and resuming on it
when available, falling back to sched->delay() only for non-detached
actors.
When actor A calls monitor(B, fn), do_monitor previously attached a
functor_attachable (token::anonymous) to B's attachable list capturing
A's address. Two problems followed:

1. B's list grew without bound: each monitoring actor that died before B
   left a stale entry that could only be cleared when B itself exited.
2. A's actor_control_block was kept alive by the actor_addr capture in
   the functor_attachable even after A's strong count hit zero.

Root cause: functor_attachable uses token::anonymous, so detach() cannot
find or remove it, and dispose() on the returned disposable only marked
the action as cancelled without touching B's list.

Fix:
- Add token::monitor (= 3) to attachable::token.
- Introduce detail::monitor_attachable, a detachable attachable with a
  token keyed on the abstract_monitor_action pointer. It replaces the
  anonymous functor_attachable in do_monitor.
- Introduce monitor_disposable_impl: calling dispose() now both cancels
  the monitor_action and calls detach() on the monitored actor via a
  weak_actor_ptr, removing the entry from B's list immediately.
- scheduled_actor tracks active monitors in active_monitors_. On
  on_cleanup() all are disposed automatically, so the monitoring actor's
  exit eagerly unregisters it from every actor it was watching.
  update_watched_disposables() prunes fired entries after normal delivery.
jachris added a commit to tenzir/tenzir that referenced this pull request Apr 7, 2026
Points to tenzir/actor-framework#13, which fixes two problems in
scheduled_actor::do_monitor:

- B's attachable list grew without bound when many short-lived actors
  monitored B and exited before it, because functor_attachable
  (token::anonymous) cannot be removed via detach().
- A's actor_control_block was kept alive by the actor_addr capture in
  the stale attachable.

The fix introduces a detachable monitor_attachable and a
monitor_disposable_impl whose dispose() removes the entry from the
monitored actor immediately. scheduled_actor now tracks active monitors
in active_monitors_ and disposes them all in on_cleanup().
local_actor::do_monitor (the non-callback, deprecated overload) attached
a default_attachable to the monitored actor but never tracked the
relationship on the monitoring side. When the monitoring actor exited,
its entries were never removed from the monitored actor's attachable
list, causing:

1. Unbounded list growth — one stale entry per dead monitoring actor.
2. Control block leak — default_attachable::observer_ held a weak ref
   to the monitoring actor's actor_control_block until the monitored
   actor itself exited.

The default_attachable already carries a matchable observe_token, so
do_demonitor() can remove it correctly. The only missing piece was
tracking and cleanup.

Fix: record each monitored actor as a weak_actor_ptr in
local_actor::monitored_actors_. In on_cleanup(), lock each entry and
call do_demonitor() to detach the stale default_attachable before the
actor is fully torn down. Entries that are already dead (lock returns
null) are skipped safely.
jachris added a commit to tenzir/tenzir that referenced this pull request Apr 7, 2026
Picks up tenzir/actor-framework#13, which fixes unbounded attachable
list growth and actor_control_block leaks in both monitor() code paths:

- Legacy monitor(whom): local_actor now tracks monitored actors and
  calls do_demonitor() for each in on_cleanup(), removing the stale
  default_attachable immediately on actor exit.
- Callback monitor(whom, fn): new monitor_attachable with a detachable
  token replaces the anonymous functor_attachable; dispose() now removes
  the entry from the monitored actor's list.

Both paths were triggered by tenzir::async_mail, which creates a
temporary actor_companion per request that monitors the long-lived
receiver via the legacy API.
if (ptr != nullptr) {
ptr->attach(
default_attachable::make_monitor(ptr->address(), address(), priority));
monitored_actors_.emplace_back(ptr->ctrl(), add_ref);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weak_actor_ptrs keep the control blocks that are pointed to alive, so we need to remove entries from this map non-lazily. I think that needs to be done in do_demonitor(), but also when a down_msg (+ node_down_msg?) is processed in scheduled actors. For completeness we might also want to do the same in blocking actors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants