Welcome to System Design: Building a scalable link shortening service from scratch.
Let's start with the functional requirements. We have four key capabilities. F1: Given a long URL, generate a unique short link. F2: Short URLs redirect to the original long URL. F3: Users can optionally choose a custom short code. And F4: Links expire after a configurable T-T-L. These are the core features our system needs to support.
table
Now the non-functional requirements — these define our performance and reliability targets. N-F-R1: Redirects must complete in under 10 milliseconds at the ninety-ninth percentile. N-F-R2: We need 99.99 percent uptime because reads are absolutely critical. N-F-R3: We must handle 100 million-plus stored URLs and 1000-plus requests per second. And N-F-R4: Short codes should not be sequential or predictable for security. These constraints shape every architectural decision ahead.
table
Here's the overall system architecture. At the top we have the client layer with browsers and mobile apps. Requests flow through a load balancer to multiple API servers for horizontal scaling. Behind that sits a Redis cache layer with primary and replica for hot data. At the bottom is the storage layer with a PostgreSQL write node and read replicas. The load balancer distributes reads across API servers, which check Redis first. On cache miss, they query read replicas. Writes go to the primary database.
mermaid
Let's compare different strategies for generating short codes. M-D5 Hash is deterministic with no collisions on unique input, but it produces long output that needs truncation. Auto-increment counters are simple and guaranteed unique, but they're predictable — a single point of failure. Base62 encoding is compact and U-R-L-safe, producing non-sequential codes, but it needs a unique numeric input. U-U-I-D version 4 requires no coordination, but it's too long for short U-R-Ls. Finally, Nano I-D offers customizable length and collision resistance, but requires collision checking. For our use case, Base62 is the sweet spot.
table
Here's how we implement Base62 encoding in Python. The charset includes digits zero through nine, lowercase a through z, and uppercase A through Z — that's 62 characters total. The encode function takes an integer and repeatedly divides by 62, collecting remainders as indices into our charset, then reverses the result. The comment shows that seven characters gives us 62 to the seventh power, which is 3.5 trillion unique codes — more than enough for 100 million URLs. The example converts 238 million to the Base62 string 'd-n-h-3-K-9'.
code
Looking at the create short U-R-L endpoint design. We P-O-S-T to slash-api-slash-shorten with a request body. The long U-R-L is required. Custom alias and expiration T-T-L are optional. The response includes the full short U-R-L, the short code itself, and an ISO-8601 timestamp for when it expires. This keeps the interface clean and gives clients all the metadata they need.
code
The redirect handler is where the bulk of traffic flows. On a G-E-T request to forward slash followed by the code, we first check the cache. If that misses, we query the database. If the U-R-L doesn't exist or has expired, we return a 404. Otherwise we log the click asynchronously — and this is important, we don't await it, so it never blocks the redirect. Finally we return a 301 permanent redirect to the original U-R-L. This keeps latency under ten milliseconds.
code
The database schema is straightforward. We have an id as the primary key using big serial for future sharding. The code column is unique and indexed — this is our lookup path for every redirect. We store the original U-R-L, creation and expiration timestamps, a click counter for analytics, and a user-id foreign key if the system supports user accounts. We create two indexes: one on code for fast lookups, and a partial index on expires-at to efficiently find expired URLs for cleanup.
code
Here's the sequence diagram for a typical redirect. Client requests forward slash followed by the code. The A-P-I server checks the Redis cache first. If it's a cache hit, we return immediately. If it's a cache miss, we query PostgreSQL, which returns the row data. The API server then populates the cache with a T-T-L — typically 24 hours. Finally we send the 301 redirect back to the client. This two-tier lookup is critical for meeting our sub-10-millisecond latency target.
mermaid
We use the cache-aside pattern, also called lazy loading. The A-P-I server is responsible for cache logic. First it checks Redis. On a cache hit, we return immediately to the application. On a cache miss, the A-P-I queries PostgreSQL directly. Then the A-P-I populates the cache so the next request is a hit. This pattern keeps the cache fresh because we only cache data that's been recently accessed. It's simple to implement and works well for our read-heavy workload.
mermaid
For cache invalidation, we use multiple strategies. Hot U-R-Ls — those with more than 100 clicks per hour — get longer T-T-Ls of 7 days. Expired or deleted U-R-Ls are evicted immediately via Redis pub-slash-sub. Notice that write-through caching is avoided to keep write latency low. We only populate cache on read. For standard U-R-Ls the default T-T-L is 24 hours. User-specified T-T-Ls can range from 1 hour to 30 days. This adaptive strategy balances staleness risk against cache memory.
callout table
Our rate limiting sits at the edge, before requests reach the A-P-I servers. The rate limiter checks whether a client is within their quota. If they're under the limit, the request proceeds to the A-P-I. If they're over the limit, we return a 429 too-many-requests status. This protects the backend from abuse and ensures fair resource allocation across clients.
mermaid
Here's a sliding window rate limiter in TypeScript using Redis. We define different limits per tier — free accounts get 100 requests per minute, while pro accounts get 10 thousand. We calculate a window key based on the current time bucket. We increment a counter in Redis and set an expiration if it's the first request in that window. If the count exceeds the tier limit, we throw a rate limit error that includes how many seconds until retry and how many requests remain. This approach is simple, distributed, and has zero coordination overhead.
code
Token bucket allows short bursts above the average rate, useful if we want to tolerate occasional spikes. Sliding window provides smoother limiting across all time ranges. For a U-R-L shortener, sliding window counters in Redis are the best balance of accuracy and simplicity. Each window key auto-expires, so we don't need background cleanup logic. It's predictable — clients always know their limits — and it's fair across all users.
callout
Our analytics pipeline is completely decoupled from the redirect path. When a redirect happens, we fire a click event to an in-memory buffer. The buffer batches events and publishes them to a Kafka topic called clicks. A stream processor reads from Kafka and aggregates the events — calculating things like total clicks per hour or unique visitors per code. The aggregated data lands in Click House, a columnar database optimized for analytics queries. Finally a dashboard queries Click House to display real-time metrics. This architecture ensures that analytics never impact redirect latency.
mermaid
Looking at the click event structure. We capture the code, a timestamp, the client's I-P address, user agent, and referer. We optionally resolve the country via GeoI-P. Inside the log-click function, we construct the event object and produce it to Kafka. Critically, this is fire-and-forget — we don't await the Kafka produce call. If it fails, we log the error but never block the redirect. This fire-and-forget pattern ensures that analytics infrastructure never causes us to miss our latency targets.
code
The table shows what we aggregate and where. Total clicks are computed via per-minute rollup in Click House and return in under 50 milliseconds. Unique visitors use a HyperLogLog sketch in Redis for sub-5-millisecond latency and constant memory usage. Top referrers are stored as a sorted set in Redis, also sub-5-millisecond. Geographic breakdown requires a daily batch job in Click House and takes up to 200 milliseconds. For real-time dashboards, we favor the Redis metrics.
table
When we eventually shard the database, we use consistent hashing. The comment shows a range-based example: shard zero holds codes starting with digits zero through nine and letters a through f, shard one holds g through v, and shard two holds w through z and uppercase letters. The consistent hashing alternative computes a hash of the code and takes the modulo with the number of shards. Consistent hashing is preferred because adding a new shard only remaps approximately one-over-N keys, not the entire dataset.
code
Here's the shard router class. The constructor takes shard configurations and instantiates a database pool for each one. The get-shard-for-code method uses murmurhash-three to hash the code, then takes modulo with the number of shards to pick a shard I-D. The find-by-code method demonstrates the pattern — it gets the appropriate shard and queries only that shard for the code. This distribution spreads load evenly across shards and enables independent scaling.
code
A single PostgreSQL instance handles 100 million rows comfortably with proper indexing. Start sharding when you approach one billion-plus U-R-Ls or when write throughput exceeds what a single primary can handle — roughly 10 thousand writes per second. Use consistent hashing so adding new shards only remaps a fraction of existing keys. Migration is complex, so delay it as long as your hardware can sustain the load. For most cases, vertical scaling — adding more C-P-U and memory — buys you years of headroom.
callout
Let's compare two cache invalidation strategies. {{step}}First, T-T-L-based expiry: every cache entry gets a T-T-L, say 24 hours. Stale data resolves itself automatically. You may serve outdated U-R-Ls briefly during the window, but there's no coordination overhead — simple to implement. {{step}}Second, event-driven eviction: when you delete or update a U-R-L, publish an event to Redis pub-slash-sub. All A-P-I servers listen and evict the key immediately. Sub-millisecond propagation across the cluster, but requires pub-slash-sub infrastructure and listeners on every server.
cards
Two additional patterns. {{step}}Write-around cache: on write, skip the cache entirely. The next read populates the cache on demand. This avoids double-write complexity and race conditions, but the first read after a write is slower. {{step}}Versioned keys: include a version in the cache key, like 'u-r-l-colon-v-2-colon-a-b-c-123'. On update, increment the version number. Old keys expire naturally via T-T-L. No explicit invalidation logic required — elegant for scenarios where you update U-R-Ls frequently.
cards
For a U-R-L shortener, combine T-T-L-based expiry — 24 hours default — with event-driven eviction for deletes. Most U-R-Ls never change after creation, so cache invalidation is rare. The main concern is evicting expired U-R-Ls promptly. A Redis pub-slash-sub listener handles deletion events with sub-millisecond latency. This hybrid approach gets you 95-plus percent cache hit ratio while keeping the code simple.
callout
We need comprehensive monitoring. {{step}}First, latency tracking: monitor P50, P95, and P99 redirect latency. Alert if P99 exceeds 50 milliseconds sustained for 5 minutes — that usually indicates the cache miss rate is climbing. Check Redis connectivity and read replica lag. {{step}}Second, error rate: track 4-X-X and 5-X-X codes separately. 404 spikes may indicate bot scanning for short codes. 5-X-X spikes mean infrastructure is broken. Page on-call if error rate exceeds one percent for 2 minutes.
cards
Two more critical metrics. {{step}}Cache hit ratio: healthy systems see 93 to 97 percent cache hit ratio. Below 90 percent means your working set exceeds available cache memory. Scale Redis or increase memory allocation. Monitor eviction rate alongside hit ratio. {{step}}Queue depth: monitor Kafka consumer lag for the analytics pipeline. Growing lag means the stream processor can't keep up. Scale consumers or increase partition count. Set alerts on lag exceeding 10 thousand messages.
cards
Here's a summary of alert thresholds. If redirect P99 exceeds 50 milliseconds sustained for 5 minutes, scale read replicas or check Redis. If error rate exceeds one percent for 2 minutes, page on-call and check database connectivity. If cache hit ratio drops below 90 percent for 15 minutes, increase Redis memory or add nodes. If disk usage trends above 80 percent, plan storage expansion or implement data archival. These thresholds are tuned for our scale and latency requirements.
table
Here are the key numbers that drove our design. We're storing 100 million U-R-Ls. Peak writes are 1 thousand requests per second. Reads are about 10 times higher at 10 thousand requests per second — that 10-to-1 ratio drives our caching strategy. Annual storage is approximately 500 gigabytes. These numbers are realistic for a popular U-R-L shortening service handling hundreds of millions of links.
stats
Let's break down the capacity math. Storage per U-R-L is roughly 500 bytes including code, U-R-L, and metadata. Over a year at 1 thousand new U-R-Ls per second, we create about one billion U-R-Ls, consuming roughly 500 gigabytes. At 10 thousand reads per second with 500 bytes per response, we're moving about 5 megabytes per second. For the cache, assuming 20 percent of the 100 million U-R-Ls are hot, we need about 10 gigabytes of Redis memory. These numbers determine our server counts and resource allocation.
table
Here's a quick Python script for back-of-the-envelope calculations. We set 1 thousand U-R-Ls per second as our baseline. We calculate seconds per year — 365 days times 24 hours times 3600 seconds. We assume 500 bytes per U-R-L. The total storage per year comes to 16 gigabytes. That's 31.5 billion U-R-Ls per year. For cache, 20 percent of 100 million U-R-Ls at 500 bytes each is 10 gigabytes. These quick calculations help you estimate infrastructure costs.
terminal
U-R-L shorteners are read-heavy with a 10-to-1 read-write ratio. We favor availability over strong consistency. An eventual consistency window of a few seconds on writes is acceptable. New short U-R-Ls may take one to two seconds to propagate to read replicas, but redirects must never fail. This is a classic example of choosing the right consistency model based on the workload. Reads are critical; writes can be slightly delayed.
callout
Why PostgreSQL over NoSQL? U-R-Ls have a fixed schema — you always have a code, an original U-R-L, timestamps. We need A-C-I-D guarantees for code uniqueness — a hard requirement. And the dataset fits on a single shard for a long time. NoSQL like DynamoDB or Cassandra becomes attractive only at 10 billion-plus U-R-Ls when you need automatic sharding. Start with PostgreSQL — it's simpler, more reliable, and cheaper. Migrate only when the data forces you to.
callout
Here are the key takeaways. Base62 encoding with a distributed counter gives compact, non-guessable short codes. Cache-aside with Redis achieves 95-plus percent hit ratio — most redirects never touch the database. Start with PostgreSQL — the schema is simple, uniqueness constraints are critical, and you won't need to shard for years. Finally, design for reads. That 10-to-1 read-write ratio means caching and read replicas matter more than write optimization. These principles scale this system from millions to billions of U-R-Ls.
list