Feature Request: SPARSE_MAP — A New Column Type for Sparse Semi-Structured Data

# Feature Request: SPARSE_MAP — A New Column Type for Sparse Semi-Structured Data

## Problem

Many Pinot tables contain sparse, semi-structured fields — error metadata, optional tags/properties — where only <10%-15% of documents contain any given key. Users currently have two options, both with significant trade-offs:

| Approach | Storage Cost | Query Performance | Schema Flexibility |
|----------|-------------|-------------------|-------------------|
| **Flattened columns** | High — null vectors + indexes for millions of absent rows | Fast — native columnar access | Poor — schema change per key |
| **JSON columns** | Moderate — raw blob per doc | Slow — full blob parse per doc | Good — schemaless |

There is no option that provides storage efficiency, fast analytical queries, and schema flexibility together.

## Proposal

Introduce `SPARSE_MAP`, a new field type that stores map columns using a per-key columnar layout within a single binary index file. Each key gets its own presence bitmap, typed forward index, and optional inverted index — storing data only for documents where the key is present.

### Schema Definition

```json
{
  "fieldType": "SPARSE_MAP",
  "name": "riskError",
  "sparseMapKeyTypes": {
    "code": "STRING",
    "title": "STRING",
    "key": "STRING"
  },
  "defaultValueType": "STRING"
}
```

### Query Syntax

```sql
-- Bracket notation for key access
SELECT count(*) FROM events WHERE riskError['code'] = 'FRAUD_RISK'
SELECT count(*) FROM events WHERE riskError['code'] IS NOT NULL
SELECT riskError['code'] AS code, count(*) FROM events GROUP BY code LIMIT 10
```

### Binary Format

```
┌─────────────────────────────────────────────┐
│ Header (64 bytes)                           │
├─────────────────────────────────────────────┤
│ Key Dictionary: key names + types           │
├─────────────────────────────────────────────┤
│ Per-Key Data (repeated per key):            │
│   ├── Presence Bitmap (RoaringBitmap)       │
│   ├── Typed Forward Index (INT/LONG/...)    │
│   ├── Inverted Index (value → docId bitmap) │
│   └── DictId Forward Index (bit-packed)     │
├─────────────────────────────────────────────┤
│ Value Dictionary (sorted distinct values)   │
└─────────────────────────────────────────────┘
```

Key operations — all O(1):
- **Value lookup**: `presenceBitmap.rank(docId) - 1` → array index
- **Presence check**: `presenceBitmap.contains(docId)`
- **EQ/IN filter**: inverted index returns pre-built docId bitmap
- **IS NULL/IS NOT NULL**: direct bitmap operations, no value reads

## Benchmark Evidence

Benchmarked on a production-like cluster with ~32M docs, 40 segments, 4 servers, comparing SPARSE_MAP against both flattened columns and JSON on the same Kafka topic.

### Storage: ~40-50%% savings on sparse columns

| Metric | Flattened | SPARSE_MAP | Savings |
|--------|-----------|------------|---------|
| Sparse-equivalent columns (5 segments) | 79.41 MB | 42.96 MB | **45.9%** |

Savings scale with sparsity:

| Column Group | Presence Rate | Flattened | SPARSE_MAP | Savings |
|-------------|---------------|-----------|------------|---------|
| rpcError (3 keys) | ~0.02% | 30.29 MB | 4.39 MB | **85.5%** |
| upfrontCharge (1 key) | ~0.001% | 1.65 KB | 380 B | **77.6%** |
| riskError (3 keys) | ~30% | 43.14 MB | 31.64 MB | **26.7%** |

Storage savings are consistent at across all segments regardless of segment size, confirming this is a structural advantage.

### vs Flattened Columns: Comparable performance  

| Category | Example Queries | SPARSE_MAP Speedup |
|----------|----------------|-------------------|
| Equality filters (inverted index) | `WHERE riskError['code'] = 'FRAUD_RISK'` | **1.3-1.5x faster** |
| IS NOT NULL | `WHERE riskError['code'] IS NOT NULL` | **1.2-1.5x faster** |
| Point lookups / field selection | `SELECT riskError['code'], riskError['title']` | **1.3-1.6x faster** |
| GROUP BY | `GROUP BY riskError['code']` | 1.4-1.9x slower |

GROUP BY is the main gap due to runtime value extraction from the binary format. We can improve this as well.

### vs JSON: 3.8x average speedup

| Category | SPARSE_MAP Speedup | Why |
|----------|-------------------|-----|
| Equality / IN filters | **2.3-4.8x** | Per-key inverted index vs flattened token index |
| GROUP BY | **2.9-4.9x** | Typed columnar access vs blob parsing |
| Cross-column + range | **4.1-4.7x** | JSON pays parsing cost per doc per column accessed |
| Typed aggregations | **2.2-6.0x** | No runtime string-to-number conversion on 31M docs |

With ~31M docs, JSON suffers due to parsing overhead. SPARSE_MAP eliminates this entirely.

## Target Use Cases

SPARSE_MAP is ideal for columns with **low presence rates** where most rows have no value.
SPARSE_MAP is ideal for tables that are consuming from ever evolving source where new columns may get added very frequently. 

## Comparison Summary

|  | SPARSE_MAP | Flattened | JSON |
|--|-----------|-----------|------|
| **Storage** | Per-key typed arrays + presence bitmaps | Individual columns + null vectors | Raw JSON blob per doc |
| **Key access** | `column['key']` | `column_key` | `JSON_EXTRACT_SCALAR(col, '$.key', ...)` |
| **EQ/IN filter** | Per-key inverted index | Per-column inverted index | JSON token inverted index |
| **IS NULL check** | Bitmap operation — O(1) | Null vector scan | Full blob scan |
| **Storage cost for sparse column (present in <20%-30% of docs)** | **40%-50% less** vs flattened | Baseline | **40%-50% less** vs flattened |
| **Query latency** | 1.3-1.5x faster than flattened on filters | Baseline | 2-6x slower than SPARSE_MAP |
| **Schema changes** | Add key to config | Add column + reload | No change needed |

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature Request: SPARSE_MAP — A New Column Type for Sparse Semi-Structured Data #17894

Feature Request: SPARSE_MAP — A New Column Type for Sparse Semi-Structured Data

Problem

Proposal

Schema Definition

Query Syntax

Binary Format

Benchmark Evidence

Storage: ~40-50%% savings on sparse columns

vs Flattened Columns: Comparable performance

vs JSON: 3.8x average speedup

Target Use Cases

Comparison Summary

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Approach	Storage Cost	Query Performance	Schema Flexibility
Flattened columns	High — null vectors + indexes for millions of absent rows	Fast — native columnar access	Poor — schema change per key
JSON columns	Moderate — raw blob per doc	Slow — full blob parse per doc	Good — schemaless

Column Group	Presence Rate	Flattened	SPARSE_MAP	Savings
rpcError (3 keys)	~0.02%	30.29 MB	4.39 MB	85.5%
upfrontCharge (1 key)	~0.001%	1.65 KB	380 B	77.6%
riskError (3 keys)	~30%	43.14 MB	31.64 MB	26.7%

Category	Example Queries	SPARSE_MAP Speedup
Equality filters (inverted index)	`WHERE riskError['code'] = 'FRAUD_RISK'`	1.3-1.5x faster
IS NOT NULL	`WHERE riskError['code'] IS NOT NULL`	1.2-1.5x faster
Point lookups / field selection	`SELECT riskError['code'], riskError['title']`	1.3-1.6x faster
GROUP BY	`GROUP BY riskError['code']`	1.4-1.9x slower

Category	SPARSE_MAP Speedup	Why
Equality / IN filters	2.3-4.8x	Per-key inverted index vs flattened token index
GROUP BY	2.9-4.9x	Typed columnar access vs blob parsing
Cross-column + range	4.1-4.7x	JSON pays parsing cost per doc per column accessed
Typed aggregations	2.2-6.0x	No runtime string-to-number conversion on 31M docs

	SPARSE_MAP	Flattened	JSON
Storage	Per-key typed arrays + presence bitmaps	Individual columns + null vectors	Raw JSON blob per doc
Key access	`column['key']`	`column_key`	`JSON_EXTRACT_SCALAR(col, '$.key', ...)`
EQ/IN filter	Per-key inverted index	Per-column inverted index	JSON token inverted index
IS NULL check	Bitmap operation — O(1)	Null vector scan	Full blob scan
Storage cost for sparse column (present in <20%-30% of docs)	40%-50% less vs flattened	Baseline	40%-50% less vs flattened
Query latency	1.3-1.5x faster than flattened on filters	Baseline	2-6x slower than SPARSE_MAP
Schema changes	Add key to config	Add column + reload	No change needed

Feature Request: SPARSE_MAP — A New Column Type for Sparse Semi-Structured Data #17894

Description

Feature Request: SPARSE_MAP — A New Column Type for Sparse Semi-Structured Data

Problem

Proposal

Schema Definition

Query Syntax

Binary Format

Benchmark Evidence

Storage: ~40-50%% savings on sparse columns

vs Flattened Columns: Comparable performance

vs JSON: 3.8x average speedup

Target Use Cases

Comparison Summary

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions