Fix unbounded attachable growth when monitoring actors exit by jachris · Pull Request #13 · tenzir/actor-framework

jachris · 2026-04-07T13:30:29Z

Problem

Two separate code paths in CAF's monitor implementation had the same fundamental flaw: when a monitoring actor exits before the monitored actor, stale entries accumulated in the monitored actor's attachable list and kept the monitoring actor's actor_control_block alive indefinitely.

New callback API (monitor(whom, fn) → scheduled_actor::do_monitor): used a functor_attachable with token::anonymous, which detach() can never match. So even calling dispose() on the returned handle couldn't remove the entry. The actor_addr capture kept the control block alive.

Legacy API (monitor(whom) → local_actor::do_monitor): used default_attachable, which is detachable via do_demonitor() — but nothing ever called it on actor exit. Same unbounded growth and control block leak resulted.

Concrete manifestation: tenzir::async_mail creates a temporary actor_companion per request and calls companion->monitor(receiver) (legacy API) on the long-lived receiver. Each companion exits after receiving its response but left a permanent stale default_attachable in the receiver's list. Under sustained async_mail traffic against a single actor this caused unbounded memory growth.

Solution

New callback API (scheduled_actor):

Add token::monitor = 3 to attachable::token.
Introduce detail::monitor_attachable: a detachable attachable keyed by abstract_monitor_action*, replacing the anonymous functor_attachable.
Introduce monitor_disposable_impl: dispose() cancels the action and calls detach() on the monitored actor via a weak_actor_ptr.
scheduled_actor tracks active monitors in active_monitors_; on_cleanup() disposes all of them automatically.

Legacy API (local_actor):

Track each monitored actor as a weak_actor_ptr in local_actor::monitored_actors_.
In on_cleanup(), lock each entry and call do_demonitor() to detach the stale default_attachable. Dead entries (lock returns null) are skipped safely.

Review

For the new API: start in detail/monitor_attachable.hpp, then scheduled_actor.cpp for monitor_disposable_impl and the rewritten do_monitor.

For the legacy API: local_actor.cpp — do_monitor and on_cleanup are the only changes, plus monitored_actors_ in the header.

Race safety: actor_exited is called after the monitored actor releases its mutex (see abstract_actor::cleanup), so calling detach() from on_cleanup cannot deadlock. Concurrent exits are safe: detach() on an already-swept list is a benign no-op; set_reason() on an already-disposed action returns false, preventing double-delivery.

This reverts commit 78bb2bd.

When a detached actor reaches its max throughput limit in resume(), it was unconditionally calling sched->delay() to reschedule itself. This causes detached actors to run on the shared scheduler thread pool instead of their dedicated private thread, leading to thread starvation and potential deadlocks when many detached actors compete for the limited worker threads. Fix this by checking for a private_thread_ first and resuming on it when available, falling back to sched->delay() only for non-detached actors.

When actor A calls monitor(B, fn), do_monitor previously attached a functor_attachable (token::anonymous) to B's attachable list capturing A's address. Two problems followed: 1. B's list grew without bound: each monitoring actor that died before B left a stale entry that could only be cleared when B itself exited. 2. A's actor_control_block was kept alive by the actor_addr capture in the functor_attachable even after A's strong count hit zero. Root cause: functor_attachable uses token::anonymous, so detach() cannot find or remove it, and dispose() on the returned disposable only marked the action as cancelled without touching B's list. Fix: - Add token::monitor (= 3) to attachable::token. - Introduce detail::monitor_attachable, a detachable attachable with a token keyed on the abstract_monitor_action pointer. It replaces the anonymous functor_attachable in do_monitor. - Introduce monitor_disposable_impl: calling dispose() now both cancels the monitor_action and calls detach() on the monitored actor via a weak_actor_ptr, removing the entry from B's list immediately. - scheduled_actor tracks active monitors in active_monitors_. On on_cleanup() all are disposed automatically, so the monitoring actor's exit eagerly unregisters it from every actor it was watching. update_watched_disposables() prunes fired entries after normal delivery.

Points to tenzir/actor-framework#13, which fixes two problems in scheduled_actor::do_monitor: - B's attachable list grew without bound when many short-lived actors monitored B and exited before it, because functor_attachable (token::anonymous) cannot be removed via detach(). - A's actor_control_block was kept alive by the actor_addr capture in the stale attachable. The fix introduces a detachable monitor_attachable and a monitor_disposable_impl whose dispose() removes the entry from the monitored actor immediately. scheduled_actor now tracks active monitors in active_monitors_ and disposes them all in on_cleanup().

local_actor::do_monitor (the non-callback, deprecated overload) attached a default_attachable to the monitored actor but never tracked the relationship on the monitoring side. When the monitoring actor exited, its entries were never removed from the monitored actor's attachable list, causing: 1. Unbounded list growth — one stale entry per dead monitoring actor. 2. Control block leak — default_attachable::observer_ held a weak ref to the monitoring actor's actor_control_block until the monitored actor itself exited. The default_attachable already carries a matchable observe_token, so do_demonitor() can remove it correctly. The only missing piece was tracking and cleanup. Fix: record each monitored actor as a weak_actor_ptr in local_actor::monitored_actors_. In on_cleanup(), lock each entry and call do_demonitor() to detach the stale default_attachable before the actor is fully torn down. Entries that are already dead (lock returns null) are skipped safely.

Picks up tenzir/actor-framework#13, which fixes unbounded attachable list growth and actor_control_block leaks in both monitor() code paths: - Legacy monitor(whom): local_actor now tracks monitored actors and calls do_demonitor() for each in on_cleanup(), removing the stale default_attachable immediately on actor exit. - Callback monitor(whom, fn): new monitor_attachable with a detachable token replaces the anonymous functor_attachable; dispose() now removes the entry from the monitored actor's list. Both paths were triggered by tenzir::async_mail, which creates a temporary actor_companion per request that monitors the long-lived receiver via the legacy API.

tobim · 2026-04-08T10:26:27Z

+  if (ptr != nullptr) {
    ptr->attach(
      default_attachable::make_monitor(ptr->address(), address(), priority));
+    monitored_actors_.emplace_back(ptr->ctrl(), add_ref);


weak_actor_ptrs keep the control blocks that are pointed to alive, so we need to remove entries from this map non-lazily. I think that needs to be done in do_demonitor(), but also when a down_msg (+ node_down_msg?) is processed in scheduled actors. For completeness we might also want to do the same in blocking actors.

jachris and others added 4 commits January 26, 2026 09:41

Add actor_clock setter to actor_system_config

308452d

Revert "Avoid using init_fun for spawning function objects"

e4080f1

This reverts commit 78bb2bd.

jachris mentioned this pull request Apr 7, 2026

Bump CAF submodule to fix unbounded monitor attachable growth tenzir/tenzir#5990

Open

tobim requested changes Apr 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix unbounded attachable growth when monitoring actors exit#13

Fix unbounded attachable growth when monitoring actors exit#13
jachris wants to merge 5 commits intomainfrom
fix/monitor-attachable-growth

jachris commented Apr 7, 2026 •

edited

Loading

Uh oh!

tobim Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jachris commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

Review

Uh oh!

tobim Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jachris commented Apr 7, 2026 •

edited

Loading