Feature Request: SPARSE_MAP — A New Column Type for Sparse Semi-Structured Data
Problem
Many Pinot tables contain sparse, semi-structured fields — error metadata, optional tags/properties — where only <10%-15% of documents contain any given key. Users currently have two options, both with significant trade-offs:
| Approach |
Storage Cost |
Query Performance |
Schema Flexibility |
| Flattened columns |
High — null vectors + indexes for millions of absent rows |
Fast — native columnar access |
Poor — schema change per key |
| JSON columns |
Moderate — raw blob per doc |
Slow — full blob parse per doc |
Good — schemaless |
There is no option that provides storage efficiency, fast analytical queries, and schema flexibility together.
Proposal
Introduce SPARSE_MAP, a new field type that stores map columns using a per-key columnar layout within a single binary index file. Each key gets its own presence bitmap, typed forward index, and optional inverted index — storing data only for documents where the key is present.
Schema Definition
{
"fieldType": "SPARSE_MAP",
"name": "riskError",
"sparseMapKeyTypes": {
"code": "STRING",
"title": "STRING",
"key": "STRING"
},
"defaultValueType": "STRING"
}
Query Syntax
-- Bracket notation for key access
SELECT count(*) FROM events WHERE riskError['code'] = 'FRAUD_RISK'
SELECT count(*) FROM events WHERE riskError['code'] IS NOT NULL
SELECT riskError['code'] AS code, count(*) FROM events GROUP BY code LIMIT 10
Binary Format
┌─────────────────────────────────────────────┐
│ Header (64 bytes) │
├─────────────────────────────────────────────┤
│ Key Dictionary: key names + types │
├─────────────────────────────────────────────┤
│ Per-Key Data (repeated per key): │
│ ├── Presence Bitmap (RoaringBitmap) │
│ ├── Typed Forward Index (INT/LONG/...) │
│ ├── Inverted Index (value → docId bitmap) │
│ └── DictId Forward Index (bit-packed) │
├─────────────────────────────────────────────┤
│ Value Dictionary (sorted distinct values) │
└─────────────────────────────────────────────┘
Key operations — all O(1):
- Value lookup:
presenceBitmap.rank(docId) - 1 → array index
- Presence check:
presenceBitmap.contains(docId)
- EQ/IN filter: inverted index returns pre-built docId bitmap
- IS NULL/IS NOT NULL: direct bitmap operations, no value reads
Benchmark Evidence
Benchmarked on a production-like cluster with ~32M docs, 40 segments, 4 servers, comparing SPARSE_MAP against both flattened columns and JSON on the same Kafka topic.
Storage: ~40-50%% savings on sparse columns
| Metric |
Flattened |
SPARSE_MAP |
Savings |
| Sparse-equivalent columns (5 segments) |
79.41 MB |
42.96 MB |
45.9% |
Savings scale with sparsity:
| Column Group |
Presence Rate |
Flattened |
SPARSE_MAP |
Savings |
| rpcError (3 keys) |
~0.02% |
30.29 MB |
4.39 MB |
85.5% |
| upfrontCharge (1 key) |
~0.001% |
1.65 KB |
380 B |
77.6% |
| riskError (3 keys) |
~30% |
43.14 MB |
31.64 MB |
26.7% |
Storage savings are consistent at across all segments regardless of segment size, confirming this is a structural advantage.
vs Flattened Columns: Comparable performance
| Category |
Example Queries |
SPARSE_MAP Speedup |
| Equality filters (inverted index) |
WHERE riskError['code'] = 'FRAUD_RISK' |
1.3-1.5x faster |
| IS NOT NULL |
WHERE riskError['code'] IS NOT NULL |
1.2-1.5x faster |
| Point lookups / field selection |
SELECT riskError['code'], riskError['title'] |
1.3-1.6x faster |
| GROUP BY |
GROUP BY riskError['code'] |
1.4-1.9x slower |
GROUP BY is the main gap due to runtime value extraction from the binary format. We can improve this as well.
vs JSON: 3.8x average speedup
| Category |
SPARSE_MAP Speedup |
Why |
| Equality / IN filters |
2.3-4.8x |
Per-key inverted index vs flattened token index |
| GROUP BY |
2.9-4.9x |
Typed columnar access vs blob parsing |
| Cross-column + range |
4.1-4.7x |
JSON pays parsing cost per doc per column accessed |
| Typed aggregations |
2.2-6.0x |
No runtime string-to-number conversion on 31M docs |
With ~31M docs, JSON suffers due to parsing overhead. SPARSE_MAP eliminates this entirely.
Target Use Cases
SPARSE_MAP is ideal for columns with low presence rates where most rows have no value.
SPARSE_MAP is ideal for tables that are consuming from ever evolving source where new columns may get added very frequently.
Comparison Summary
|
SPARSE_MAP |
Flattened |
JSON |
| Storage |
Per-key typed arrays + presence bitmaps |
Individual columns + null vectors |
Raw JSON blob per doc |
| Key access |
column['key'] |
column_key |
JSON_EXTRACT_SCALAR(col, '$.key', ...) |
| EQ/IN filter |
Per-key inverted index |
Per-column inverted index |
JSON token inverted index |
| IS NULL check |
Bitmap operation — O(1) |
Null vector scan |
Full blob scan |
| Storage cost for sparse column (present in <20%-30% of docs) |
40%-50% less vs flattened |
Baseline |
40%-50% less vs flattened |
| Query latency |
1.3-1.5x faster than flattened on filters |
Baseline |
2-6x slower than SPARSE_MAP |
| Schema changes |
Add key to config |
Add column + reload |
No change needed |
Feature Request: SPARSE_MAP — A New Column Type for Sparse Semi-Structured Data
Problem
Many Pinot tables contain sparse, semi-structured fields — error metadata, optional tags/properties — where only <10%-15% of documents contain any given key. Users currently have two options, both with significant trade-offs:
There is no option that provides storage efficiency, fast analytical queries, and schema flexibility together.
Proposal
Introduce
SPARSE_MAP, a new field type that stores map columns using a per-key columnar layout within a single binary index file. Each key gets its own presence bitmap, typed forward index, and optional inverted index — storing data only for documents where the key is present.Schema Definition
{ "fieldType": "SPARSE_MAP", "name": "riskError", "sparseMapKeyTypes": { "code": "STRING", "title": "STRING", "key": "STRING" }, "defaultValueType": "STRING" }Query Syntax
Binary Format
Key operations — all O(1):
presenceBitmap.rank(docId) - 1→ array indexpresenceBitmap.contains(docId)Benchmark Evidence
Benchmarked on a production-like cluster with ~32M docs, 40 segments, 4 servers, comparing SPARSE_MAP against both flattened columns and JSON on the same Kafka topic.
Storage: ~40-50%% savings on sparse columns
Savings scale with sparsity:
Storage savings are consistent at across all segments regardless of segment size, confirming this is a structural advantage.
vs Flattened Columns: Comparable performance
WHERE riskError['code'] = 'FRAUD_RISK'WHERE riskError['code'] IS NOT NULLSELECT riskError['code'], riskError['title']GROUP BY riskError['code']GROUP BY is the main gap due to runtime value extraction from the binary format. We can improve this as well.
vs JSON: 3.8x average speedup
With ~31M docs, JSON suffers due to parsing overhead. SPARSE_MAP eliminates this entirely.
Target Use Cases
SPARSE_MAP is ideal for columns with low presence rates where most rows have no value.
SPARSE_MAP is ideal for tables that are consuming from ever evolving source where new columns may get added very frequently.
Comparison Summary
column['key']column_keyJSON_EXTRACT_SCALAR(col, '$.key', ...)