A distributed key-value store in C++ that backs a full cloud platform — webmail, file storage, and user accounts. Tablets are replicated across nodes with automatic failover, checkpoint-based recovery, and memory-bounded operation under a 10MB-per-node limit.
Four layers, each with a distinct responsibility:
- Load Balancer (port 9000) — stateless round-robin dispatch. After the initial redirect, clients talk directly to their assigned frontend.
- Frontend Servers (ports 8000–8002) — stateless HTTP servers for webmail, file storage, and accounts. Parse requests, issue KV ops against the backend, return responses.
- Backend KVS Nodes (ports 10000–10008) — the store itself. Data lives in tablets (range-partitioned, e.g. A–M, N–Z), each replicated with a primary/secondary model.
- Coordinator (port 9999) — heartbeat-based health monitor, primary elector, and routing authority. Single point of coordination (not a single point of failure for data).
Each tablet is a nested map: `map<row, unordered_map<column, value>>`. Four operations:

| Op | Semantics |
|---|---|
| `PUT row,col=val` | Insert or overwrite |
| `GET row,col` | Read |
| `CPUT row,col=old,new` | Atomic compare-and-swap |
| `DELETE row,col` | Remove column from row |
Row-level locking means concurrent writes to different rows in the same tablet don't block each other.
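The nested-map model with per-row locking can be sketched as follows. This is a minimal illustration, not the project's implementation: class and member names are invented, and `std::mutex` stands in for the project's pthread mutexes. The tablet-level lock here only guards the lock table itself, so writes to disjoint rows proceed in parallel.

```cpp
#include <map>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// Illustrative tablet: nested map plus a per-row mutex so concurrent
// writes to different rows do not block each other.
class Tablet {
public:
    void put(const std::string& row, const std::string& col, const std::string& val) {
        std::lock_guard<std::mutex> g(rowLock(row));
        data_[row][col] = val;  // insert or overwrite
    }
    std::optional<std::string> get(const std::string& row, const std::string& col) {
        std::lock_guard<std::mutex> g(rowLock(row));
        auto r = data_.find(row);
        if (r == data_.end()) return std::nullopt;
        auto c = r->second.find(col);
        if (c == r->second.end()) return std::nullopt;
        return c->second;
    }
    // CPUT: write `nw` only if the current value equals `old`; atomic per row.
    bool cput(const std::string& row, const std::string& col,
              const std::string& old, const std::string& nw) {
        std::lock_guard<std::mutex> g(rowLock(row));
        auto r = data_.find(row);
        if (r == data_.end()) return false;
        auto c = r->second.find(col);
        if (c == r->second.end() || c->second != old) return false;
        c->second = nw;
        return true;
    }
    bool del(const std::string& row, const std::string& col) {
        std::lock_guard<std::mutex> g(rowLock(row));
        auto r = data_.find(row);
        return r != data_.end() && r->second.erase(col) > 0;
    }
private:
    std::mutex& rowLock(const std::string& row) {
        std::lock_guard<std::mutex> g(mapLock_);  // guards the lock table only
        return rowLocks_[row];                    // default-constructed on first use
    }
    std::map<std::string, std::unordered_map<std::string, std::string>> data_;
    std::unordered_map<std::string, std::mutex> rowLocks_;
    std::mutex mapLock_;
};
```

References into `rowLocks_` stay valid across rehashing, which is why handing out `std::mutex&` after releasing `mapLock_` is safe here.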
All writes flow through the primary:
- If a client hits a secondary, the secondary forwards the write to the primary.
- The primary executes locally, then sends `REPLICA_WRITE` to every secondary.
- The primary blocks until all secondaries ACK.
- Only then does the client get `+OK`.
This is synchronous replication — no stale reads, no lost writes under normal operation. The tradeoff is write latency scales with the slowest secondary.
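The primary's write path can be sketched like this. It is a simplification under stated assumptions: secondaries are modeled as callables that return true on ACK, standing in for the real socket round-trips, and all names are illustrative.

```cpp
#include <functional>
#include <string>
#include <vector>

// Sketch of synchronous replication at the primary: execute locally, fan out
// REPLICA_WRITE to every secondary, and report +OK only when all have ACKed.
bool replicated_write(
    const std::string& cmd,
    const std::function<bool(const std::string&)>& apply_local,
    const std::vector<std::function<bool(const std::string&)>>& secondaries) {
    if (!apply_local(cmd)) return false;      // 1. primary executes first
    size_t acks = 0;
    for (const auto& send : secondaries)      // 2. REPLICA_WRITE to each secondary
        if (send(cmd)) ++acks;
    return acks == secondaries.size();        // 3. +OK only if ALL ACKed
}
```

The `acks == secondaries.size()` check is what makes the write latency track the slowest secondary: one missing ACK and the client never sees `+OK`.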
The coordinator pings each node every second. If a heartbeat is missed for 2+ seconds:
- Node is marked DOWN.
- First available secondary is promoted to primary.
- `PRIMARY_CHANGE` is broadcast to all replicas in the group.
- Frontends are redirected away from the dead node.
When the failed node comes back, it loads its latest binary checkpoint, replays the operation log, and rejoins as a secondary.
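The coordinator's failure-detection rule above can be sketched as a pure function over a replica group. Timestamps are plain seconds for clarity (the real coordinator pings over sockets once per second), and the struct and function names are invented for illustration.

```cpp
#include <string>
#include <vector>

// One replica as seen by the coordinator.
struct Replica {
    std::string addr;
    bool primary = false;
    bool down = false;
    long lastHeartbeat = 0;  // seconds
};

// Apply the 2-second rule, promote if needed, and return the current primary.
std::string check_group(std::vector<Replica>& group, long now, long timeout = 2) {
    for (auto& r : group)                       // mark silent nodes DOWN
        if (now - r.lastHeartbeat >= timeout) { r.down = true; r.primary = false; }
    for (auto& r : group)                       // keep a surviving primary
        if (r.primary && !r.down) return r.addr;
    for (auto& r : group)                       // else promote first live secondary
        if (!r.down) { r.primary = true; return r.addr; }  // PRIMARY_CHANGE broadcast here
    return "";                                  // entire group is down
}
```
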
Each node operates under a hard 10MB memory ceiling, 8MB per tablet. Three mechanisms keep it bounded:
- LRU eviction — least-recently-used tablets are checkpointed to disk and evicted when the node hits its limit.
- Tablet splitting — when a tablet exceeds 8MB, it splits at the 50% memory mark (not key count), producing two smaller tablets with updated ranges.
- Precise tracking — every PUT/DELETE accounts for the exact memory delta, including string overhead.
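The precise-tracking mechanism can be sketched as a running byte counter adjusted on every write. The overhead constant below (`sizeof(std::string)` per string object) and the split threshold are illustrative assumptions, not the project's exact accounting.

```cpp
#include <cstddef>
#include <string>

constexpr std::size_t kTabletLimit = 8 * 1024 * 1024;  // 8MB split threshold

// Bytes charged for one (row, col, val) entry: character data plus an assumed
// fixed per-string object overhead.
std::size_t entry_size(const std::string& row, const std::string& col,
                       const std::string& val) {
    return row.size() + col.size() + val.size() + 3 * sizeof(std::string);
}

struct MemTracker {
    std::size_t used = 0;
    // PUT: an overwrite charges only the value-size delta; a fresh insert
    // charges the whole entry.
    void on_put(bool existed, std::size_t oldValBytes, const std::string& row,
                const std::string& col, const std::string& val) {
        if (existed) used += val.size() - oldValBytes;  // modular arithmetic: shrink is safe
        else         used += entry_size(row, col, val);
    }
    void on_delete(const std::string& row, const std::string& col,
                   const std::string& val) {
        used -= entry_size(row, col, val);
    }
    bool needs_split() const { return used > kTabletLimit; }
};
```

Because `used` reflects real bytes rather than key count, both LRU eviction and the 50%-by-memory split decision can key off the same counter.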
Two-tier durability:
- Operation log (text) — every write is appended to a per-tablet log. Cleared on checkpoint.
- Checkpoints (binary) — versioned, full tablet snapshots with size-prefixed serialization. Triggered every 5 writes.
Recovery replays the log from the last checkpoint to reconstruct pre-crash state.
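The recovery path can be sketched as checkpoint-then-replay. The tokenized log format here (`PUT row col val` / `DEL row col`) is an assumption for illustration; the real system uses its own text log and binary, size-prefixed checkpoints on disk.

```cpp
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

using Table = std::map<std::string, std::unordered_map<std::string, std::string>>;

// Rebuild pre-crash state: start from the last checkpoint snapshot, then
// replay every logged write in order.
Table recover(const Table& checkpoint,
              const std::vector<std::vector<std::string>>& log) {
    Table t = checkpoint;                 // 1. load latest binary checkpoint
    for (const auto& op : log) {          // 2. replay the operation log
        if (op[0] == "PUT")
            t[op[1]][op[2]] = op[3];
        else if (op[0] == "DEL")
            t[op[1]].erase(op[2]);
    }
    return t;                             // 3. pre-crash state reconstructed
}
```

Clearing the log on every checkpoint (as described above) keeps this replay short: at most the writes since the last snapshot, i.e. fewer than the 5-write checkpoint interval.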
- Webmail — compose, reply, delete, send to local and external addresses via SMTP.
- File storage — upload/download any file type with chunked transfers and nested folders.
- Admin console — live server health, tablet assignments, remote shutdown/restart.
- Three-level locking without deadlocks — node mutex, tablet mutex, per-row mutex, always acquired in that order. Allows high concurrency on disjoint rows while protecting structural changes.
- Memory-aware splitting — splitting by accumulated byte size rather than key count prevents pathological cases where a few large values dominate a tablet.
- Routing during failover — the window between a node dying and the coordinator electing a new primary is the danger zone. The coordinator gates all routing decisions so frontends never get stale endpoints.
- Synchronous replication over unreliable sockets — retries, connection tracking, and per-command ACK counters ensure writes either fully propagate or cleanly fail.
C++17, POSIX sockets, pthreads, Make.
