Skip to content

Commit bc4875c

Browse files
Alexey PortnovAlexey Portnov
authored andcommitted
fix(redis): eliminate connection churn and harden worker resilience (#1022)
Three Sentry clusters with 3304 aggregated events since 2026-03-24 across 30+ production servers — every 10-30 minutes all three parallel WorkerApiCommands instances would lose their Redis connection together, dropping queued API requests and silently failing worker state sync: #26987 RedisException: read error on connection (blpop, 1200) #27017 RedisException: Connection lost (cache->set, 313) #26989 Phalcon\Storage\Exception: Connection refused (getAdapter, 1791) Root cause: RedisClientProvider / ManagedCacheProvider were registered non-shared, and WorkerApiCommands::start() resolved the service on every BLPOP iteration — roughly 25000 fresh TCP sockets per hour per server. Stale sockets accumulated until phpredis hit its client limit, and without a client-side OPT_READ_TIMEOUT / OPT_TCP_KEEPALIVE half-dead sockets sat on the kernel TCP retransmit window (~15 minutes), which matches the "every 10-30 minutes" burst cadence in the report. Verified on prod stand serber@boffart.miko.ru (2026.1.233-dev): connection rate dropped from ~417/min to ~125/min after deploying Fix 1-3+6, zero new events from this host in Sentry for the full observation window, and all three WorkerApiCommands stayed up uninterrupted. Defense in depth across six layers: - RedisClientProvider / ManagedCacheProvider: $di->setShared() so each worker / php-fpm process reuses one \Redis instance for its whole lifetime. New primeRedisAdapter() helper issues a ping() to force the lazy Phalcon socket open, then sets OPT_TCP_KEEPALIVE and OPT_READ_TIMEOUT = 10 on the underlying phpredis object. 'persistent' is explicitly false to prevent pconnect foot-guns. - WorkerBase: new withRedisRetry() helper wraps hot Redis ops in a short exponential backoff (100 ms, 200 ms, 400 ms; throws after max attempts). Narrow by design — only catches RedisException and Phalcon\Storage\Exception so other errors still bubble. - WorkerApiCommands: main loop wraps BLPOP in withRedisRetry and routes RedisException / Phalcon\Storage\Exception through an extended-outage catch arm with 1/2/4/8/16/30 s backoff and a single syslog marker 'reason=redis_unreachable_extended' at the 5th consecutive failure. Non-Redis exceptions keep the historical sleep(1) path. This eliminates the reconnect-storm that turned a brief Redis glitch into thousands of Sentry events. - WorkerBase: new closeRedis() hoisted here (not WorkerRedisBase) so every worker type — including Beanstalk workers that touch the cache wrapper — can release its phpredis socket on shutdown. Wrapper-aware: handles both raw \Redis and Phalcon\Cache\Adapter\Redis. Called from the SIGTERM / SIGINT branch of signalHandler (replacing the naive raw close() that would have bombed on a cache wrapper), from cleanupRedisKeys(), and from __destruct as a safety net. Kills the "200+ stale Redis connections from terminated PHP workers" pattern that commit e2e191a was papering over on the server side. - WorkerModelsEvents::saveStateToRedis: wrapped in withRedisRetry so the single shutdown-time write no longer silently loses queued reload actions when Redis drops the connection mid-flush. - $redis property declared on WorkerBase (protected mixed $redis = null) so Phalcon\Di\Injectable::__get() cannot silently resurrect a socket during destruction. Also fixes the PHP 8.2+ dynamic-property deprecation on WorkerRedisBase. - RedisConf.php: generated redis.conf gains maxclients 300, maxmemory 64mb, and maxmemory-policy allkeys-lru — a hard cap so leaked or stale sockets cannot pile up to the default limit, an OOM killer safety net, and the right eviction policy for a cache-only role. Monit block extended with 'if failed port $port ... send PING expect PONG for 3 cycles then restart' so frozen-but-alive Redis (blocked event loop, stuck I/O) is now detected within ~15 s instead of never, with 'if 5 restarts within 10 cycles then timeout' as a flapping-restart safety valve. Tests: tests/Unit/Common/Providers/RedisClientProviderTest.php — shared singleton + OPT_TCP_KEEPALIVE + OPT_READ_TIMEOUT on both provider and ManagedCache's underlying \Redis (skipped if Redis unreachable). tests/Unit/Core/Workers/WithRedisRetryTest.php — backoff timing, retry ceiling, RedisException + Phalcon\Storage\Exception coverage, non-Redis exceptions bubble without retry. Uses newInstanceWithoutConstructor() fixture pattern from the #999 tests. Smoke-tested in the mikopbx container and on the prod stand: 9/9 provider smoke + 11/11 retry helper smoke passing both places, 3 Redis-related Sentry clusters stopped receiving events from the deployed host, syslog clean, WorkerApiCommands pool stable.
1 parent 74c1a9f commit bc4875c

10 files changed

Lines changed: 582 additions & 38 deletions

File tree

src/Common/Providers/CLAUDE.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -58,7 +58,7 @@ Providers/
5858
| LoggerAuthProvider | `loggerAuth` | Yes | Syslog AUTH | Auth audit trail |
5959
| LoggerProvider | `logger` | Yes | Syslog | UDP 127.0.0.1:514 |
6060
| MainDatabaseProvider | `db` | Yes | SQLite | mikopbx.db |
61-
| ManagedCacheProvider | `managedCache` | No | Redis DB4 | 1h TTL |
61+
| ManagedCacheProvider | `managedCache` | Yes | Redis DB4 | 1h TTL, shared singleton per process (issue #1022) |
6262
| MarketPlaceProvider | `license` | Yes | License class | Module marketplace |
6363
| MessagesProvider | `messages` | Yes | Files/Cache | 29 languages |
6464
| ModelsAnnotationsProvider | `annotations` | Yes | Memory | Model annotations |
@@ -69,7 +69,7 @@ Providers/
6969
| PBXConfModulesProvider | `pbxConfModules` | Yes | DB | Module hooks + priority |
7070
| PBXCoreRESTClientProvider | `restAPIClient` | No | HTTP | GuzzleHttp, 30s timeout |
7171
| RecordingStorageDatabaseProvider | `dbRecordingStorage` | Yes | SQLite | recording_storage.db |
72-
| RedisClientProvider | `redis` | No | Redis DB1 | Worker IPC |
72+
| RedisClientProvider | `redis` | Yes | Redis DB1 | Worker IPC, shared singleton per process (issue #1022) |
7373
| RegistryProvider | `registry` | Yes | Memory | Global state |
7474
| RouterProvider | `router` | No | Config | Module route integration |
7575
| SentryErrorHandlerProvider | `sentryErrorHandler` | No | Sentry API | Production error tracking |

src/Common/Providers/ManagedCacheProvider.php

Lines changed: 16 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -22,9 +22,9 @@
2222

2323
namespace MikoPBX\Common\Providers;
2424

25+
use Phalcon\Cache\Adapter\Redis as CacheAdapterRedis;
2526
use Phalcon\Di\DiInterface;
2627
use Phalcon\Di\ServiceProviderInterface;
27-
use Phalcon\Cache\Adapter\Redis as CacheAdapterRedis;
2828
use Phalcon\Storage\SerializerFactory;
2929

3030
/**
@@ -45,20 +45,30 @@ class ManagedCacheProvider implements ServiceProviderInterface
4545
public function register(DiInterface $di): void
4646
{
4747
$config = $di->getShared(ConfigProvider::SERVICE_NAME);
48-
$di->set(
48+
// Shared (singleton) per process — see issue #1022. Before this change
49+
// the provider was registered with $di->set(), so every di->get() (and
50+
// some workers call it every 5s inside BLPOP loops) created a fresh
51+
// TCP socket. Those sockets accumulated until phpredis/kernel ran out
52+
// of descriptors and the whole worker pool lost Redis simultaneously.
53+
$di->setShared(
4954
self::SERVICE_NAME,
5055
function () use ($config) {
5156
$serializerFactory = new SerializerFactory();
52-
5357
$options = [
5458
'lifetime' => 3600,
5559
'host' => $config->path('redis.host'),
5660
'port' => $config->path('redis.port'),
5761
'index' => self::DATABASE_INDEX,
58-
'prefix' => self::CACHE_PREFIX
62+
'prefix' => self::CACHE_PREFIX,
63+
'persistent' => false,
5964
];
60-
61-
return new CacheAdapterRedis($serializerFactory, $options);
65+
$cacheAdapter = new CacheAdapterRedis($serializerFactory, $options);
66+
// Prime the underlying phpredis socket so OPT_TCP_KEEPALIVE /
67+
// OPT_READ_TIMEOUT apply on a live connection. The wrapper
68+
// keeps its reference to the same \Redis, so later cache
69+
// operations benefit from these options transparently.
70+
RedisClientProvider::primeRedisAdapter($cacheAdapter->getAdapter());
71+
return $cacheAdapter;
6272
}
6373
);
6474
}

src/Common/Providers/RedisClientProvider.php

Lines changed: 68 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,7 @@
2525
use Phalcon\Di\ServiceProviderInterface;
2626
use Phalcon\Storage\Adapter\Redis as AdapterRedis;
2727
use Phalcon\Storage\SerializerFactory;
28+
use Throwable;
2829

2930
/**
3031
* The RedisClientProvider class is responsible for registering the Redis client service.
@@ -38,31 +39,90 @@ class RedisClientProvider implements ServiceProviderInterface
3839
public const string CACHE_PREFIX = '_PH_REDIS_CLIENT:';
3940
public const int DATABASE_INDEX = 1;
4041

42+
/**
43+
* phpredis OPT_READ_TIMEOUT (seconds). Controls how long a blocking read
44+
* (e.g. BLPOP) waits on a silent socket before raising RedisException.
45+
* This is THE primary defence against the "every 10-30 minutes" cadence
46+
* in issue #1022: without it phpredis inherits the kernel TCP timeout
47+
* (~15 min), so workers sit on half-dead sockets for the whole TCP
48+
* retransmission window instead of surfacing the failure.
49+
*/
50+
public const int READ_TIMEOUT = 10;
51+
52+
/**
53+
* phpredis OPT_TCP_KEEPALIVE — in this build it is a 0/1 enable flag
54+
* (phpredis 6.x normalises any positive value to 1). We enable SO_KEEPALIVE
55+
* on the socket; the actual probe timing is taken from kernel sysctls
56+
* (net.ipv4.tcp_keepalive_time / _intvl / _probes). On MikoPBX the kernel
57+
* defaults are 7200 s / 75 s / 9 probes which is slower than we would
58+
* like, but combined with OPT_READ_TIMEOUT above the client still breaks
59+
* out of blocked reads within 10 s. The keepalive flag is kept as a
60+
* belt-and-braces cleanup for truly-idle connections.
61+
*/
62+
public const int TCP_KEEPALIVE_ENABLED = 1;
63+
64+
/**
65+
* Prime a freshly created phpredis object so issue #1022 protections take
66+
* effect. Phalcon's Storage\Adapter\Redis lazy-opens its socket on the
67+
* first command, and {@see \Redis::OPT_TCP_KEEPALIVE} must be set on a
68+
* live socket, so we have to force the connect with a ping() first.
69+
*
70+
* Any failure here is swallowed intentionally — the caller (worker loop,
71+
* retry helper) will see the RedisException on its next real operation
72+
* and deal with it via the standard backoff path.
73+
*/
74+
public static function primeRedisAdapter(\Redis $redis): \Redis
75+
{
76+
try {
77+
$redis->ping();
78+
} catch (Throwable) {
79+
// Socket is not open yet; options below still register with
80+
// phpredis and will be applied the next time a connect succeeds.
81+
}
82+
try {
83+
// OPT_TCP_KEEPALIVE is stored in phpredis' redis_sock struct
84+
// here but is only actually applied to the OS socket on the
85+
// NEXT connect()/reconnect() — the setsockopt() call lives in
86+
// phpredis' RedisSock_connect() path. So on the current socket
87+
// we rely on OPT_READ_TIMEOUT below; OPT_TCP_KEEPALIVE is a
88+
// belt-and-braces flag for the reconnect case.
89+
$redis->setOption(\Redis::OPT_TCP_KEEPALIVE, self::TCP_KEEPALIVE_ENABLED);
90+
$redis->setOption(\Redis::OPT_READ_TIMEOUT, self::READ_TIMEOUT);
91+
} catch (Throwable) {
92+
// Some builds of phpredis refuse setOption before the first
93+
// successful connect — not fatal for us, next reconnect retries.
94+
}
95+
return $redis;
96+
}
4197

4298
/**
4399
* Register the Redis client service provider.
44100
*
45101
* @param DiInterface $di The DI container.
46102
*/
47103
public function register(DiInterface $di): void
48-
{
104+
{
49105
$config = $di->getShared(ConfigProvider::SERVICE_NAME);
50-
$di->set(
106+
// Shared (singleton) per process: a worker or php-fpm child reuses
107+
// one \Redis socket for its entire lifetime instead of opening a new
108+
// TCP connection on every di->get() call inside the main loop.
109+
// See issue #1022 root-cause analysis.
110+
$di->setShared(
51111
self::SERVICE_NAME,
52-
function () use ($di, $config ) {
112+
function () use ($config) {
53113
$serializerFactory = new SerializerFactory();
54-
55114
$options = [
56115
'defaultSerializer' => 'Php',
57116
'lifetime' => 3600,
58117
'host' => $config->path('redis.host'),
59118
'port' => $config->path('redis.port'),
60119
'index' => self::DATABASE_INDEX,
61-
'prefix' => self::CACHE_PREFIX
120+
'prefix' => self::CACHE_PREFIX,
121+
'persistent' => false,
62122
];
63-
64-
return (new AdapterRedis($serializerFactory, $options))->getAdapter();
123+
$adapter = new AdapterRedis($serializerFactory, $options);
124+
return self::primeRedisAdapter($adapter->getAdapter());
65125
}
66126
);
67127
}
68-
}
128+
}

src/Core/System/Configs/RedisConf.php

Lines changed: 32 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -127,12 +127,25 @@ public function generateMonitConf(): bool{
127127

128128
$busyboxPath = Util::which('busybox');
129129
$confPath = $this->getMainMonitConfFile();
130-
130+
$redisPort = $this->port !== '' ? $this->port : '6379';
131+
132+
// Monit now also runs a TCP PING/PONG probe on the Redis port on top
133+
// of the pidfile check. This catches frozen-but-alive Redis processes
134+
// (blocked event loop, stuck I/O) that the PID-only check could not
135+
// detect — see issue #1022. `for 3 cycles` at the 5 s daemon tick
136+
// means the probe must fail for ~15 s before monit restarts Redis,
137+
// which avoids flapping restarts on brief hiccups. The hard limit of
138+
// 5 restarts per 10 cycles is a safety valve against restart loops.
131139
$conf = 'check process '.self::PROC_NAME.' with pidfile /var/run/redis.pid '.PHP_EOL.
132140
' start program = "'.$this->startCommand.'"'.PHP_EOL.
133141
' as uid root and gid root'.PHP_EOL.
134142
' stop program = "'.$busyboxPath.' sh -c \''.$busyboxPath.' killall '.self::PROC_NAME.'; rm -f /var/run/redis.pid\''.'"'.PHP_EOL.
135-
' as uid root and gid root';
143+
' as uid root and gid root'.PHP_EOL.
144+
' if failed port '.$redisPort.' type tcp protocol default'.PHP_EOL.
145+
' send "PING\r\n" expect "PONG" timeout 5 seconds'.PHP_EOL.
146+
' for 3 cycles'.PHP_EOL.
147+
' then restart'.PHP_EOL.
148+
' if 5 restarts within 10 cycles then timeout';
136149

137150
$this->saveFileContent($confPath, $conf);
138151
return true;
@@ -167,6 +180,23 @@ private function configure(): void
167180
# Prevents write-blocking when /var/tmp (tmpfs) overflows (issue #651).
168181
$conf .= "save \"\"" . PHP_EOL;
169182
$conf .= "stop-writes-on-bgsave-error no" . PHP_EOL;
183+
184+
# Client/memory caps (issue #1022).
185+
# - maxclients: hard ceiling so leaked or stale sockets cannot pile
186+
# up to the phpredis/Redis default limit and knock out all three
187+
# WorkerApiCommands instances simultaneously. 300 leaves ample
188+
# headroom for the usual 50-ish active clients (3 API workers +
189+
# php-fpm pool + module workers).
190+
# - maxmemory: small systems must not let Redis grow unbounded and
191+
# trigger OOM killer. 64 MB matches the current working set
192+
# (heartbeat keys + metadata cache + short-lived queues).
193+
# - maxmemory-policy allkeys-lru: evict the oldest keys instead of
194+
# rejecting writes under pressure (the default `noeviction` would
195+
# otherwise make every `set` fail when the cap is reached).
196+
$conf .= "maxclients 300" . PHP_EOL;
197+
$conf .= "maxmemory 64mb" . PHP_EOL;
198+
$conf .= "maxmemory-policy allkeys-lru" . PHP_EOL;
199+
170200
file_put_contents(self::CONF_FILE, $conf);
171201
}
172202

src/Core/Workers/WorkerBase.php

Lines changed: 100 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -137,6 +137,17 @@ abstract class WorkerBase extends Injectable implements WorkerInterface
137137
*/
138138
protected int $workerState = self::STATE_STARTING;
139139

140+
/**
141+
* Redis client held by this worker. Can be a raw phpredis \Redis (from
142+
* RedisClientProvider) or a Phalcon cache wrapper (from
143+
* ManagedCacheProvider) depending on the subclass. Declared explicitly
144+
* to avoid the PHP 8.2+ dynamic-property deprecation AND to stop
145+
* Phalcon\Di\Injectable::__get() from silently resurrecting a fresh
146+
* connection inside destructors or signal handlers — a real hazard
147+
* raised in the #1022 code review.
148+
*/
149+
protected mixed $redis = null;
150+
140151

141152

142153
/**
@@ -185,6 +196,90 @@ public static function getCheckInterval(): int
185196
return self::KEEP_ALLIVE_CHECK_INTERVAL;
186197
}
187198

199+
/**
200+
* Run a Redis operation with a short exponential backoff retry loop.
201+
*
202+
* Used to survive transient Redis glitches without dropping in-flight
203+
* jobs. The backoff is intentionally short (100 ms / 200 ms / 400 ms)
204+
* so the BLPOP main loop does not block for more than ~1 s total —
205+
* if Redis is down for longer, the outer main-loop catch handles the
206+
* extended outage with a larger backoff and a single syslog marker
207+
* (`reason=redis_unreachable_extended`) instead of spamming Sentry.
208+
*
209+
* Phalcon Storage\Adapter\Redis reconnects automatically through its
210+
* `checkConnect()` on every operation, so we only need to retry and
211+
* wait; there is no need to tear down the shared adapter here.
212+
*
213+
* Introduced for issue #1022.
214+
*
215+
* @template T
216+
* @param callable(): T $op Redis operation to execute.
217+
* @param int $maxAttempts Attempts including the first try.
218+
* @return T
219+
* @throws \RedisException|\Phalcon\Storage\Exception on terminal failure
220+
* @throws \InvalidArgumentException when $maxAttempts is < 1
221+
*/
222+
protected function withRedisRetry(callable $op, int $maxAttempts = 3): mixed
223+
{
224+
if ($maxAttempts < 1) {
225+
throw new \InvalidArgumentException(
226+
'withRedisRetry: $maxAttempts must be >= 1, got ' . $maxAttempts
227+
);
228+
}
229+
$lastException = null;
230+
for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
231+
try {
232+
return $op();
233+
} catch (\RedisException | \Phalcon\Storage\Exception $e) {
234+
$lastException = $e;
235+
if ($attempt === $maxAttempts) {
236+
break;
237+
}
238+
// 100 ms, 200 ms, 400 ms — geometric, capped at 2 s total.
239+
// The shift is clamped to prevent integer overflow when a
240+
// subclass passes a pathologically large $maxAttempts.
241+
$shift = min($attempt - 1, 20);
242+
usleep(min((1 << $shift) * 100_000, 2_000_000));
243+
}
244+
}
245+
throw $lastException;
246+
}
247+
248+
/**
249+
* Best-effort close of the phpredis socket held by this worker.
250+
*
251+
* Hoisted here (instead of WorkerRedisBase) so that every subclass —
252+
* Beanstalk workers that touch ManagedCacheProvider, Redis-pool
253+
* workers, and shutdown signal handlers — can release their socket
254+
* without duplicating the wrapper-vs-raw detection logic.
255+
*
256+
* Safe to call multiple times and to call when Redis is unreachable:
257+
* any error is swallowed so the shutdown path never blocks.
258+
* Introduced for issue #1022 — prevents the "200+ stale Redis
259+
* connections from terminated PHP workers" pattern documented in
260+
* commit e2e191abb.
261+
*/
262+
protected function closeRedis(): void
263+
{
264+
try {
265+
if (isset($this->redis) && is_object($this->redis)) {
266+
if (method_exists($this->redis, 'close')) {
267+
// phpredis \Redis::close() — hard close the socket.
268+
@$this->redis->close();
269+
} elseif (method_exists($this->redis, 'getAdapter')) {
270+
// Phalcon cache wrapper — reach through to phpredis.
271+
$inner = $this->redis->getAdapter();
272+
if (is_object($inner) && method_exists($inner, 'close')) {
273+
@$inner->close();
274+
}
275+
}
276+
}
277+
} catch (Throwable) {
278+
// Ignore — the whole point of this method is to release the
279+
// socket on the way out, not to surface errors.
280+
}
281+
}
282+
188283
/**
189284
* Sets resource limits for the worker process
190285
*/
@@ -377,12 +472,11 @@ public function signalHandler(int $signal): void
377472
case SIGINT:
378473
$this->setWorkerState(self::STATE_STOPPING);
379474

380-
// Cleanup for Redis-based workers
381-
if ($this instanceof WorkerRedisBase) {
382-
if ($this->redis) {
383-
$this->redis->close();
384-
}
385-
}
475+
// Release the Redis socket via the wrapper-aware helper
476+
// (the raw `$this->redis->close()` path would bomb on a
477+
// Phalcon cache wrapper now that providers can return one —
478+
// see issue #1022 code review BLOCKER).
479+
$this->closeRedis();
386480
exit(0);
387481

388482
default:

src/Core/Workers/WorkerModelsEvents.php

Lines changed: 13 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -127,7 +127,15 @@ public static function getCheckInterval(): int
127127
}
128128

129129
/**
130-
* Save worker state to Redis
130+
* Save worker state to Redis.
131+
*
132+
* Called from the shutdown path (handleShutdownSignal), so the single
133+
* retry matters: without it, losing the shutdown-time write means the
134+
* queued reload actions are gone on the next restart. This was Sentry
135+
* issue #27017 under #1022 — `Connection lost` in Redis::set during
136+
* shutdown. One retry with the short-backoff helper is cheap insurance
137+
* and survives both a stale-socket case (retry reopens) and a flapping
138+
* Redis case (retry waits 100 ms).
131139
*/
132140
private function saveStateToRedis(): void
133141
{
@@ -138,9 +146,11 @@ private function saveStateToRedis(): void
138146
'last_change' => $this->last_change,
139147
'timestamp' => time()
140148
];
141-
149+
142150
$workerKey = self::REDIS_PREFIX . self::class . ':' . getmypid();
143-
$this->managedCache->set($workerKey, $state, self::REDIS_TTL);
151+
$this->withRedisRetry(
152+
fn() => $this->managedCache->set($workerKey, $state, self::REDIS_TTL)
153+
);
144154
} catch (Throwable $e) {
145155
CriticalErrorsHandler::handleExceptionWithSyslog($e);
146156
}

0 commit comments

Comments
 (0)