Skip to content

Feature Request: SPARSE_MAP — A New Column Type for Sparse Semi-Structured Data #17894

@tarun11Mavani

Description

@tarun11Mavani

Feature Request: SPARSE_MAP — A New Column Type for Sparse Semi-Structured Data

Problem

Many Pinot tables contain sparse, semi-structured fields — error metadata, optional tags/properties — where only <10%-15% of documents contain any given key. Users currently have two options, both with significant trade-offs:

Approach Storage Cost Query Performance Schema Flexibility
Flattened columns High — null vectors + indexes for millions of absent rows Fast — native columnar access Poor — schema change per key
JSON columns Moderate — raw blob per doc Slow — full blob parse per doc Good — schemaless

There is no option that provides storage efficiency, fast analytical queries, and schema flexibility together.

Proposal

Introduce SPARSE_MAP, a new field type that stores map columns using a per-key columnar layout within a single binary index file. Each key gets its own presence bitmap, typed forward index, and optional inverted index — storing data only for documents where the key is present.

Schema Definition

{
  "fieldType": "SPARSE_MAP",
  "name": "riskError",
  "sparseMapKeyTypes": {
    "code": "STRING",
    "title": "STRING",
    "key": "STRING"
  },
  "defaultValueType": "STRING"
}

Query Syntax

-- Bracket notation for key access
SELECT count(*) FROM events WHERE riskError['code'] = 'FRAUD_RISK'
SELECT count(*) FROM events WHERE riskError['code'] IS NOT NULL
SELECT riskError['code'] AS code, count(*) FROM events GROUP BY code LIMIT 10

Binary Format

┌─────────────────────────────────────────────┐
│ Header (64 bytes)                           │
├─────────────────────────────────────────────┤
│ Key Dictionary: key names + types           │
├─────────────────────────────────────────────┤
│ Per-Key Data (repeated per key):            │
│   ├── Presence Bitmap (RoaringBitmap)       │
│   ├── Typed Forward Index (INT/LONG/...)    │
│   ├── Inverted Index (value → docId bitmap) │
│   └── DictId Forward Index (bit-packed)     │
├─────────────────────────────────────────────┤
│ Value Dictionary (sorted distinct values)   │
└─────────────────────────────────────────────┘

Key operations — all O(1):

  • Value lookup: presenceBitmap.rank(docId) - 1 → array index
  • Presence check: presenceBitmap.contains(docId)
  • EQ/IN filter: inverted index returns pre-built docId bitmap
  • IS NULL/IS NOT NULL: direct bitmap operations, no value reads

Benchmark Evidence

Benchmarked on a production-like cluster with ~32M docs, 40 segments, 4 servers, comparing SPARSE_MAP against both flattened columns and JSON on the same Kafka topic.

Storage: ~40-50%% savings on sparse columns

Metric Flattened SPARSE_MAP Savings
Sparse-equivalent columns (5 segments) 79.41 MB 42.96 MB 45.9%

Savings scale with sparsity:

Column Group Presence Rate Flattened SPARSE_MAP Savings
rpcError (3 keys) ~0.02% 30.29 MB 4.39 MB 85.5%
upfrontCharge (1 key) ~0.001% 1.65 KB 380 B 77.6%
riskError (3 keys) ~30% 43.14 MB 31.64 MB 26.7%

Storage savings are consistent at across all segments regardless of segment size, confirming this is a structural advantage.

vs Flattened Columns: Comparable performance

Category Example Queries SPARSE_MAP Speedup
Equality filters (inverted index) WHERE riskError['code'] = 'FRAUD_RISK' 1.3-1.5x faster
IS NOT NULL WHERE riskError['code'] IS NOT NULL 1.2-1.5x faster
Point lookups / field selection SELECT riskError['code'], riskError['title'] 1.3-1.6x faster
GROUP BY GROUP BY riskError['code'] 1.4-1.9x slower

GROUP BY is the main gap due to runtime value extraction from the binary format. We can improve this as well.

vs JSON: 3.8x average speedup

Category SPARSE_MAP Speedup Why
Equality / IN filters 2.3-4.8x Per-key inverted index vs flattened token index
GROUP BY 2.9-4.9x Typed columnar access vs blob parsing
Cross-column + range 4.1-4.7x JSON pays parsing cost per doc per column accessed
Typed aggregations 2.2-6.0x No runtime string-to-number conversion on 31M docs

With ~31M docs, JSON suffers due to parsing overhead. SPARSE_MAP eliminates this entirely.

Target Use Cases

SPARSE_MAP is ideal for columns with low presence rates where most rows have no value.
SPARSE_MAP is ideal for tables that are consuming from ever evolving source where new columns may get added very frequently.

Comparison Summary

SPARSE_MAP Flattened JSON
Storage Per-key typed arrays + presence bitmaps Individual columns + null vectors Raw JSON blob per doc
Key access column['key'] column_key JSON_EXTRACT_SCALAR(col, '$.key', ...)
EQ/IN filter Per-key inverted index Per-column inverted index JSON token inverted index
IS NULL check Bitmap operation — O(1) Null vector scan Full blob scan
Storage cost for sparse column (present in <20%-30% of docs) 40%-50% less vs flattened Baseline 40%-50% less vs flattened
Query latency 1.3-1.5x faster than flattened on filters Baseline 2-6x slower than SPARSE_MAP
Schema changes Add key to config Add column + reload No change needed

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions