Motivation
Apache Pinot currently has mechanisms to kill runaway queries based on CPU time and memory pressure via QueryResourceAggregator. However, these mechanisms have limitations:
- Reactive, not proactive: The background thread (QueryResourceAggregator) samples resource usage periodically (default 100ms). A query scanning billions of entries can cause significant damage before being detected.
- Delayed detection: A full table scan query can process 100M+ entries between background thread samples, causing CPU spikes and impacting other queries before termination.
- Gap in protection: The existing killMostExpensiveQuery() (memory-based) and killCPUTimeExceedQueries() (CPU-based) work well for memory-intensive and CPU-intensive queries respectively. However, a query doing a massive filter scan may not immediately trigger high memory usage but will consume significant CPU cycles processing entries that ultimately get filtered out.
Problem: In production environments with tables containing billions of rows, a single missing filter predicate or low-selectivity filter can trigger full segment scans. By the time the background thread detects high CPU usage, the query has already consumed significant resources, degraded cluster performance, and increased tail latency for most other queries.
To solve this problem, I would like to propose a Proactive Query Killing Framework that terminates queries inline based on scan thresholds, complementing the existing background-based CPU/memory killing. This framework plugs into the existing query killing mechanism and introduces a proactive way to track and terminate queries instead of relying on overall JVM memory or CPU usage. The strategy will be made pluggable so it can be extended with more cost measurements in the future.
Core Components
QueryCostContext Interface: Exposes real-time query metrics:
- getNumEntriesScannedInFilter()
- getNumDocsScanned()
- getElapsedTimeMs()
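A minimal Java sketch of what this interface could look like, based on the three metrics listed above. The SimpleQueryCostContext holder class is a hypothetical addition for illustration, not existing Pinot code:

```java
// Sketch of the proposed QueryCostContext interface, exposing the
// real-time scan metrics listed above.
interface QueryCostContext {
  long getNumEntriesScannedInFilter();
  long getNumDocsScanned();
  long getElapsedTimeMs();
}

// Hypothetical immutable snapshot of a query's current cost, useful for
// passing a consistent view of the metrics to a killing strategy.
class SimpleQueryCostContext implements QueryCostContext {
  private final long _entriesScannedInFilter;
  private final long _docsScanned;
  private final long _elapsedTimeMs;

  SimpleQueryCostContext(long entriesScannedInFilter, long docsScanned, long elapsedTimeMs) {
    _entriesScannedInFilter = entriesScannedInFilter;
    _docsScanned = docsScanned;
    _elapsedTimeMs = elapsedTimeMs;
  }

  @Override public long getNumEntriesScannedInFilter() { return _entriesScannedInFilter; }
  @Override public long getNumDocsScanned() { return _docsScanned; }
  @Override public long getElapsedTimeMs() { return _elapsedTimeMs; }
}
```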
QueryKillingStrategy Interface: Pluggable strategy for termination decisions:
```java
interface QueryKillingStrategy {
  boolean shouldTerminate(QueryCostContext context);
  String getTerminationReason();
}
```
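As an illustration of the pluggable design, a threshold-based implementation might look like the following. The interface signatures come from the proposal above; the MaxEntriesScannedStrategy class name is an assumption for this sketch, not existing code:

```java
// Interfaces as proposed above.
interface QueryCostContext {
  long getNumEntriesScannedInFilter();
  long getNumDocsScanned();
  long getElapsedTimeMs();
}

interface QueryKillingStrategy {
  boolean shouldTerminate(QueryCostContext context);
  String getTerminationReason();
}

// Hypothetical strategy: terminate a query once its filter phase has
// scanned more entries than the configured threshold. The scan operator
// would consult this strategy inline, rather than waiting for the
// background resource-usage thread.
class MaxEntriesScannedStrategy implements QueryKillingStrategy {
  private final long _maxEntriesScannedInFilter;

  MaxEntriesScannedStrategy(long maxEntriesScannedInFilter) {
    _maxEntriesScannedInFilter = maxEntriesScannedInFilter;
  }

  @Override
  public boolean shouldTerminate(QueryCostContext context) {
    return context.getNumEntriesScannedInFilter() > _maxEntriesScannedInFilter;
  }

  @Override
  public String getTerminationReason() {
    return "Exceeded entriesScannedInFilter threshold of " + _maxEntriesScannedInFilter;
  }
}
```

Because termination decisions go through this single interface, additional cost measurements (e.g. docs scanned, elapsed time) can be added as further strategy implementations without touching the scan code.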
Configuration
```properties
pinot.server.query.killer.enabled=true
pinot.server.query.killer.threshold.entriesScannedInFilter=100000000
pinot.server.query.killer.threshold.docsScanned=10000000
# Can also be overridden with table-specific overrides
```
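For illustration, these server-level keys could be parsed into a simple threshold holder. The property keys match the configuration above; the ScanThresholds class and the use of plain java.util.Properties are assumptions for this sketch, not Pinot's actual configuration API:

```java
import java.util.Properties;

// Hypothetical holder for the proposed killer configuration. Missing
// threshold keys default to Long.MAX_VALUE, i.e. effectively disabled.
class ScanThresholds {
  final boolean enabled;
  final long maxEntriesScannedInFilter;
  final long maxDocsScanned;

  ScanThresholds(boolean enabled, long maxEntriesScannedInFilter, long maxDocsScanned) {
    this.enabled = enabled;
    this.maxEntriesScannedInFilter = maxEntriesScannedInFilter;
    this.maxDocsScanned = maxDocsScanned;
  }

  // Parse the server-level keys listed in the Configuration section.
  static ScanThresholds fromProperties(Properties props) {
    boolean enabled = Boolean.parseBoolean(
        props.getProperty("pinot.server.query.killer.enabled", "false"));
    long maxEntries = Long.parseLong(props.getProperty(
        "pinot.server.query.killer.threshold.entriesScannedInFilter",
        String.valueOf(Long.MAX_VALUE)));
    long maxDocs = Long.parseLong(props.getProperty(
        "pinot.server.query.killer.threshold.docsScanned",
        String.valueOf(Long.MAX_VALUE)));
    return new ScanThresholds(enabled, maxEntries, maxDocs);
  }
}
```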
#### Non-Goals for now (can be a future prospect)
- Broker-side query killing (future work)
- Query fingerprinting / pattern-based blocking
- Adaptive thresholds based on cluster load
#### Similar design in other systems
- ClickHouse provides comprehensive query complexity settings that limit scan operations inline, e.g. max_rows_to_read, max_rows_to_read_leaf, read_overflow_mode.
- Trino provides query.max-scan-physical-bytes to limit data scanned: "The maximum number of bytes that can be scanned by a query during its execution. When this limit is reached, query processing is terminated to prevent excessive resource usage."