vismaychuriwala/PennCloud

PennCloud Backend

A distributed key-value store in C++ that backs a full cloud platform — webmail, file storage, and user accounts. Tablets are replicated across nodes with automatic failover, checkpoint-based recovery, and memory-bounded operation under a 10MB-per-node limit.

Fault & Recovery Demo (Vimeo)

Architecture

PennCloud System Architecture

Four layers, each with a distinct responsibility:

  • Load Balancer (port 9000) — stateless round-robin dispatch. After the initial redirect, clients talk directly to their assigned frontend.
  • Frontend Servers (ports 8000–8002) — stateless HTTP servers for webmail, file storage, and accounts. Parse requests, issue KV ops against the backend, return responses.
  • Backend KVS Nodes (ports 10000–10008) — the store itself. Data lives in tablets (range-partitioned, e.g. A–M, M–Z), each replicated with a primary/secondary model.
  • Coordinator (port 9999) — heartbeat-based health monitor, primary elector, and routing authority. Single point of coordination (not a single point of failure for data).

Data model

Each tablet is a nested map: map<row, unordered_map<column, value>>. Four operations:

Op                      Semantics
PUT row,col=val         Insert or overwrite
GET row,col             Read
CPUT row,col=old,new    Atomic compare-and-swap
DELETE row,col          Remove column from row

Row-level locking means concurrent writes to different rows in the same tablet don't block each other.

Replication

All writes flow through the primary:

  1. If a client hits a secondary, the secondary forwards the write to the primary.
  2. The primary executes locally, then sends REPLICA_WRITE to every secondary.
  3. The primary blocks until all secondaries ACK.
  4. Only then does the client get +OK.

This is synchronous replication: no stale reads and no lost writes under normal operation. The tradeoff is that write latency scales with the slowest secondary.

Fault tolerance

The coordinator pings each node every second. If a heartbeat is missed for 2+ seconds:

  1. Node is marked DOWN.
  2. First available secondary is promoted to primary.
  3. PRIMARY_CHANGE is broadcast to all replicas in the group.
  4. Frontends are redirected away from the dead node.

When the failed node comes back, it loads its latest binary checkpoint, replays the operation log, and rejoins as a secondary.

Memory management

Each node operates under a hard 10MB memory ceiling, with an 8MB cap per tablet. Three mechanisms keep usage bounded:

  • LRU eviction — least-recently-used tablets are checkpointed to disk and evicted when the node hits its limit.
  • Tablet splitting — when a tablet exceeds 8MB, it splits at the 50% memory mark (not key count), producing two smaller tablets with updated ranges.
  • Precise tracking — every PUT/DELETE accounts for the exact memory delta, including string overhead.

Persistence

Two-tier durability:

  • Operation log (text) — every write is appended to a per-tablet log. Cleared on checkpoint.
  • Checkpoints (binary) — versioned, full tablet snapshots with size-prefixed serialization. Triggered every 5 writes.

Recovery replays the log from the last checkpoint to reconstruct pre-crash state.

What it powers

Webmail — compose, reply, delete, send to local and external addresses via SMTP.

Compose Email

Inbox

File storage — upload/download any file type with chunked transfers and nested folders.

Image upload

Audio upload and playback

Admin console — live server health, tablet assignments, remote shutdown/restart.

Admin Console

Hard problems

  • Three-level locking without deadlocks — node mutex, tablet mutex, per-row mutex, always acquired in that order. Allows high concurrency on disjoint rows while protecting structural changes.
  • Memory-aware splitting — splitting by accumulated byte size rather than key count prevents pathological cases where a few large values dominate a tablet.
  • Routing during failover — the window between a node dying and the coordinator electing a new primary is the danger zone. The coordinator gates all routing decisions so frontends never get stale endpoints.
  • Synchronous replication over unreliable sockets — retries, connection tracking, and per-command ACK counters ensure writes either fully propagate or cleanly fail.

Built with

C++17, POSIX sockets, pthreads, Make.

