fix(redis): eliminate connection churn and harden worker resilience (#1022)

Three Sentry clusters have aggregated 3304 events since 2026-03-24 across
30+ production servers. Every 10-30 minutes all three parallel
WorkerApiCommands instances lost their Redis connection together,
dropping queued API requests and silently failing worker state sync
(call site and event count in parentheses):
  #26987 RedisException: read error on connection (blpop, 1200)
  #27017 RedisException: Connection lost (cache->set, 313)
  #26989 Phalcon\Storage\Exception: Connection refused (getAdapter, 1791)

Root cause: RedisClientProvider / ManagedCacheProvider were registered
non-shared, and WorkerApiCommands::start() resolved the service on every
BLPOP iteration — roughly 25000 fresh TCP sockets per hour per server.
Stale sockets accumulated until phpredis hit its client limit, and
without a client-side OPT_READ_TIMEOUT / OPT_TCP_KEEPALIVE half-dead
sockets sat on the kernel TCP retransmit window (~15 minutes), which
matches the "every 10-30 minutes" burst cadence in the report. Verified
on prod stand serber@boffart.miko.ru (2026.1.233-dev): connection rate
dropped from ~417/min to ~125/min after deploying Fix 1-3+6, zero new
events from this host in Sentry for the full observation window, and
all three WorkerApiCommands stayed up uninterrupted.
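
The effect of a non-shared registration resolved inside a loop can be illustrated with a minimal sketch. TinyContainer and countConnections() are hypothetical stand-ins written for this illustration, not Phalcon's DI or the real providers; the factory only counts how many "connection" objects would be created.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a DI container; NOT Phalcon's implementation.
final class TinyContainer
{
    /** @var array<string, array{factory: callable, shared: bool}> */
    private array $definitions = [];
    /** @var array<string, mixed> */
    private array $instances = [];

    // Non-shared: the factory runs on every get().
    public function set(string $name, callable $factory): void
    {
        $this->definitions[$name] = ['factory' => $factory, 'shared' => false];
    }

    // Shared: the factory runs once and the result is cached.
    public function setShared(string $name, callable $factory): void
    {
        $this->definitions[$name] = ['factory' => $factory, 'shared' => true];
    }

    public function get(string $name): mixed
    {
        $def = $this->definitions[$name];
        if (!$def['shared']) {
            return ($def['factory'])();
        }
        if (!array_key_exists($name, $this->instances)) {
            $this->instances[$name] = ($def['factory'])();
        }
        return $this->instances[$name];
    }
}

// Simulates a worker loop that resolves 'redis' on every iteration and
// returns how many times the factory (i.e. a new socket) was invoked.
function countConnections(bool $shared, int $iterations): int
{
    $opened = 0;
    $factory = function () use (&$opened): stdClass {
        $opened++;              // stands in for opening a TCP socket
        return new stdClass();
    };

    $di = new TinyContainer();
    if ($shared) {
        $di->setShared('redis', $factory);
    } else {
        $di->set('redis', $factory);
    }

    for ($i = 0; $i < $iterations; $i++) {
        $di->get('redis');
    }
    return $opened;
}
```

In this sketch, 1000 loop iterations create 1000 objects with set() and a single one with setShared(), which is the behaviour the shared registration relies on.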

Defense in depth across six layers:

- RedisClientProvider / ManagedCacheProvider: $di->setShared() so each
worker / php-fpm process reuses one \Redis instance for its whole
lifetime. New primeRedisAdapter() helper issues a ping() to force the
lazy Phalcon socket open, then sets OPT_TCP_KEEPALIVE and
OPT_READ_TIMEOUT = 10 on the underlying phpredis object. 'persistent'
is explicitly false to prevent pconnect foot-guns.
- WorkerBase: new withRedisRetry() helper wraps hot Redis ops in a
short exponential backoff (100 ms, 200 ms, 400 ms; throws after max
attempts). Narrow by design — only catches RedisException and
Phalcon\Storage\Exception so other errors still bubble.
- WorkerApiCommands: main loop wraps BLPOP in withRedisRetry and routes
RedisException / Phalcon\Storage\Exception through an extended-outage
catch arm with 1/2/4/8/16/30 s backoff and a single syslog marker
'reason=redis_unreachable_extended' at the 5th consecutive failure.
Non-Redis exceptions keep the historical sleep(1) path. This
eliminates the reconnect-storm that turned a brief Redis glitch into
thousands of Sentry events.
- WorkerBase: new closeRedis() hoisted here (not WorkerRedisBase) so
every worker type — including Beanstalk workers that touch the cache
wrapper — can release its phpredis socket on shutdown. Wrapper-aware:
handles both raw \Redis and Phalcon\Cache\Adapter\Redis. Called from
the SIGTERM / SIGINT branch of signalHandler (replacing the naive raw
close() that would have bombed on a cache wrapper), from
cleanupRedisKeys(), and from __destruct as a safety net. Kills the
"200+ stale Redis connections from terminated PHP workers" pattern
that commit e2e191a was papering over on the server side.
- WorkerModelsEvents::saveStateToRedis: wrapped in withRedisRetry so
the single shutdown-time write no longer silently loses queued
reload actions when Redis drops the connection mid-flush.
- $redis property declared on WorkerBase (protected mixed $redis = null)
so Phalcon\Di\Injectable::__get() cannot silently resurrect a socket
during destruction. Also fixes the PHP 8.2+ dynamic-property
deprecation on WorkerRedisBase.
- RedisConf.php: generated redis.conf gains maxclients 300,
maxmemory 64mb, and maxmemory-policy allkeys-lru — a hard cap so
leaked or stale sockets cannot pile up to the default limit, an OOM
killer safety net, and the right eviction policy for a cache-only
role. Monit block extended with 'if failed port $port ... send PING
expect PONG for 3 cycles then restart' so frozen-but-alive Redis
(blocked event loop, stuck I/O) is now detected within ~15 s
instead of never, with 'if 5 restarts within 10 cycles then timeout'
as a flapping-restart safety valve.
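
The two backoff layers described above can be sketched roughly as follows. This is not the code from the commit: withRetrySketch() and outageBackoffSeconds() are hypothetical names, the injectable sleeper exists only to make the delays observable, and it is an assumption that the 400 ms delay is followed by one final attempt and that the 1/2/4/8/16/30 s schedule is indexed by the consecutive-failure count.

```php
<?php
declare(strict_types=1);

/**
 * Sketch: run $op, retrying only on the listed exception classes with
 * exponential backoff (100 ms, 200 ms, 400 ms by default), then rethrow.
 * Any other exception bubbles immediately without a retry.
 */
function withRetrySketch(
    callable $op,
    array $retryOn,
    int $maxRetries = 3,
    int $baseDelayMs = 100,
    ?callable $sleepMs = null
): mixed {
    // Default sleeper really waits; callers may inject a recorder instead.
    $sleepMs ??= static function (int $ms): void {
        usleep($ms * 1000);
    };

    for ($attempt = 0; ; $attempt++) {
        try {
            return $op();
        } catch (Throwable $e) {
            $retryable = false;
            foreach ($retryOn as $class) {
                if ($e instanceof $class) {
                    $retryable = true;
                    break;
                }
            }
            if (!$retryable || $attempt >= $maxRetries) {
                throw $e;
            }
            $sleepMs($baseDelayMs * (2 ** $attempt));
        }
    }
}

/**
 * Sketch of the extended-outage schedule: 1, 2, 4, 8, 16 s, then a 30 s cap.
 * The mapping from failure count to delay is an assumption.
 */
function outageBackoffSeconds(int $consecutiveFailures): int
{
    $exp = max(0, $consecutiveFailures - 1);
    return $exp >= 5 ? 30 : 2 ** $exp;
}

// Usage: the operation fails twice, then succeeds on the third call.
// Real code would list \RedisException and \Phalcon\Storage\Exception here;
// RuntimeException is used so the sketch runs without those extensions.
$calls = 0;
$delays = [];
$result = withRetrySketch(
    function () use (&$calls): string {
        if (++$calls < 3) {
            throw new RuntimeException('read error on connection');
        }
        return 'ok';
    },
    [RuntimeException::class],
    3,
    100,
    function (int $ms) use (&$delays): void {
        $delays[] = $ms; // record instead of sleeping
    }
);
```

Passing the retryable classes as a list keeps the helper narrow in the same spirit as the commit: anything not on the list is rethrown on the first failure.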

Tests:
tests/Unit/Common/Providers/RedisClientProviderTest.php — shared
singleton + OPT_TCP_KEEPALIVE + OPT_READ_TIMEOUT on both provider
and ManagedCache's underlying \Redis (skipped if Redis unreachable).
tests/Unit/Core/Workers/WithRedisRetryTest.php — backoff timing,
retry ceiling, RedisException + Phalcon\Storage\Exception coverage,
non-Redis exceptions bubble without retry. Uses
newInstanceWithoutConstructor() fixture pattern from the #999 tests.

Smoke-tested in the mikopbx container and on the prod stand:
9/9 provider smoke and 11/11 retry-helper smoke checks pass in both
places, the 3 Redis-related Sentry clusters stopped receiving events
from the deployed host, syslog is clean, and the WorkerApiCommands pool
is stable.