End-to-end system design — encoding, caching, sharding, rate limiting, and tradeoffs.
Welcome everyone! Today we're going to design a URL shortener from the ground up. We'll build a scalable link shortening service that can handle millions of URLs and thousands of requests per second. This is one of the classic system design problems, and it touches on encoding strategies, caching, database design, and distributed systems principles. By the end of this talk, you'll understand the architecture, implementation details, and the key tradeoffs that go into building a production-ready URL shortener. Let's dive in.
Let's start by defining what we're building. Looking at the table, we have four functional requirements. F1: we need to shorten URLs by generating a unique short link for any long URL. F2: those short URLs must redirect users to the original destination. F3: users can optionally choose custom aliases instead of auto-generated codes. And F4: links can expire after a configurable time-to-live. On the non-functional side, we have some serious performance goals. NFR1 demands redirects complete in under 10 milliseconds at the 99th percentile. NFR2 requires 99.99% uptime since reads are absolutely critical. NFR3 means we need to handle over 100 million stored URLs and process more than a thousand requests per second. And NFR4 is crucial for security: short codes cannot be sequential or predictable, otherwise attackers could enumerate all links.
As you can see in the diagram, our architecture has four main layers. At the top, we have the client layer with browsers and mobile apps. These connect through a load balancer to our API layer, which consists of multiple stateless API servers for horizontal scalability. Below that is our cache layer with Redis running in a primary-replica configuration for high availability. At the bottom is our storage layer with a PostgreSQL writer instance and multiple read replicas. The flow goes from clients through the load balancer, which distributes traffic across API servers. API servers check Redis first for cached URLs. On cache misses, reads go to the database replicas while writes go to the primary, which then replicates to the read replicas. This separation of concerns gives us low-latency reads and high availability.
Now let's talk about how we actually generate those short codes. The table compares five different strategies. MD5 hashing is deterministic but produces long outputs that need truncation. Auto-increment counters are simple but predictable and create a single point of failure. Base62 encoding gives us compact, URL-safe codes but requires a unique numeric input. UUID version 4 needs no coordination but is way too long for short URLs. And Nano ID offers customizable length with good collision resistance. Looking at this code, we're implementing Base62 encoding. Notice the charset has 62 characters: digits, lowercase, and uppercase letters. The encode function converts a number to base 62 by repeatedly dividing by 62 and mapping remainders to characters. With just 7 characters, base 62 gives us 62 to the power of 7, which is 3.5 trillion unique codes. That's more than enough headroom.
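To make the encoding concrete, here is a minimal TypeScript sketch of the Base62 scheme just described (function names like `encodeBase62` are illustrative, not taken from the slides):

```typescript
// 62-character alphabet: digits, lowercase, then uppercase.
const CHARSET =
  "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

// Convert a non-negative integer to Base62 by repeated division,
// mapping each remainder to a character.
function encodeBase62(n: number): string {
  if (n === 0) return CHARSET[0];
  let out = "";
  while (n > 0) {
    out = CHARSET[n % 62] + out;
    n = Math.floor(n / 62);
  }
  return out;
}

// The inverse, for mapping a code back to its numeric ID.
function decodeBase62(code: string): number {
  let n = 0;
  for (const ch of code) n = n * 62 + CHARSET.indexOf(ch);
  return n;
}
```

With this charset, the numeric ID 125 encodes to the two-character code `21`, and seven characters cover 62 to the power of 7, roughly 3.5 trillion values.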
Let's look at the API design. On the left, we have the shorten endpoint. It's a POST request to slash API slash shorten. The request takes a required URL field for the long URL, an optional custom alias if the user wants to choose their own code, and an optional expires-in field for the TTL in seconds. The response returns the full short URL, just the code portion, and an expiration timestamp in ISO 8601 format. On the right is the redirect handler. Notice on line 5, when we receive a GET request to slash code, we first check the cache with await cache dot get. If that returns null, we fall back to the database with await db dot find by code. Line 8 checks if the URL exists and isn't expired. If it's not found or expired, we return a 404. Otherwise, line 13 logs the click for analytics asynchronously, and line 14 sends a redirect to the original URL. One caveat here: the code uses a 301 permanent redirect, which browsers cache aggressively, so repeat clicks from the same browser may never reach our server again. If per-click analytics matter to you, a 302 temporary redirect is the safer choice, at the cost of a little more server load.
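The redirect flow narrated above can be sketched as a plain async function, with the cache, database, and click logger abstracted behind interfaces (all the interface and type names here are my own, for illustration):

```typescript
interface StoredUrl {
  code: string;
  originalUrl: string;
  expiresAt: Date | null;
}

interface UrlCache {
  get(code: string): Promise<StoredUrl | null>;
  set(code: string, url: StoredUrl): Promise<void>;
}

interface UrlStore {
  findByCode(code: string): Promise<StoredUrl | null>;
}

type RedirectResult = { status: 301; location: string } | { status: 404 };

// Cache-aside read path: check Redis first, fall back to the database,
// and 404 if the code is unknown or expired.
async function handleRedirect(
  code: string,
  cache: UrlCache,
  db: UrlStore,
  logClick: (code: string) => void,
): Promise<RedirectResult> {
  let url = await cache.get(code);
  if (!url) {
    url = await db.findByCode(code);
    if (url) await cache.set(code, url); // populate for the next reader
  }
  if (!url || (url.expiresAt && url.expiresAt < new Date())) {
    return { status: 404 };
  }
  logClick(code); // fire-and-forget; analytics never blocks the redirect
  return { status: 301, location: url.originalUrl };
}
```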
Looking at the schema, our URLs table has an auto-incrementing ID, a unique code column with a varchar 10 to store our short codes, the original URL as text, timestamps for creation and expiration, a click count for basic analytics, and an optional user ID foreign key. We have two indexes: one on the code column since that's our primary lookup key, and a partial index on expires at for efficiently finding expired URLs. The sequence diagram below shows the read flow. When a client requests slash abc123, the API server first checks Redis. On a cache hit, we immediately return the original URL and redirect. On a cache miss, we query PostgreSQL, get the row data, then populate the cache with a 24-hour TTL before redirecting. This cache-aside pattern keeps the hot path extremely fast.
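A sketch of that schema in PostgreSQL DDL follows; the exact column and index names are my guesses at what the slide shows:

```sql
CREATE TABLE urls (
  id           BIGSERIAL PRIMARY KEY,
  code         VARCHAR(10) NOT NULL UNIQUE,
  original_url TEXT NOT NULL,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  expires_at   TIMESTAMPTZ,
  click_count  BIGINT NOT NULL DEFAULT 0,
  user_id      BIGINT REFERENCES users(id)
);

-- The UNIQUE constraint on code already gives us the lookup index.
-- The partial index keeps the expired-URL sweep cheap by indexing
-- only rows that can actually expire.
CREATE INDEX idx_urls_expires_at
  ON urls (expires_at)
  WHERE expires_at IS NOT NULL;
```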
The flow chart illustrates our cache-aside pattern in more detail. The API server checks the cache first. If it's a hit, we return immediately. If it's a miss, we query the database, get the row, and populate the cache for next time. The callout explains our cache invalidation strategy. Hot URLs that get more than 100 clicks per hour receive longer TTLs to reduce database load. When URLs expire or get deleted, we evict them from cache immediately using Redis pub-sub for instant propagation across all API servers. We avoid write-through caching because it would add latency to the write path. Since most URLs are read many times but written only once, caching on read is the optimal strategy. This keeps write latency low while still achieving a 95% or better cache hit ratio.
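Here's a small in-process sketch of the two ideas on this slide: TTL selection based on click rate, and pub-sub driven eviction. An in-memory listener list stands in for the Redis pub-sub channel, and the one-week hot TTL is an assumption; the talk only specifies the 24-hour default:

```typescript
// Pick a cache TTL from the observed click rate: hot links stay cached
// longer to reduce database load.
function cacheTtlSeconds(clicksPerHour: number): number {
  const DEFAULT_TTL = 24 * 3600;  // 24h default from the talk
  const HOT_TTL = 7 * 24 * 3600;  // assumed value for hot URLs
  return clicksPerHour > 100 ? HOT_TTL : DEFAULT_TTL;
}

type EvictListener = (code: string) => void;

// In-process stand-in for the Redis pub-sub "evict" channel.
class EvictionBus {
  private listeners: EvictListener[] = [];
  subscribe(listener: EvictListener): void {
    this.listeners.push(listener);
  }
  publish(code: string): void {
    this.listeners.forEach((l) => l(code));
  }
}

// Each API server keeps a local cache view and evicts on notification,
// so a delete propagates to every server at once.
function makeApiServerCache(bus: EvictionBus): Map<string, string> {
  const localCache = new Map<string, string>();
  bus.subscribe((code) => localCache.delete(code));
  return localCache;
}
```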
Rate limiting is critical to prevent abuse. The flow chart shows requests hitting the rate limiter first. If the client is under their limit, the request proceeds to the API server. If they're over the limit, we immediately return a 429 Too Many Requests response. Looking at this code, we're implementing a windowed rate limiter using Redis. The function takes an API key and tier, either free or pro. We define limits: 100 requests per minute for free users, 10,000 for pro. The key is constructed from the current time window rounded to the nearest minute. We increment the counter and set expiration on the first request. If the count exceeds the limit, we throw a rate limit error with a retry-after header. One terminology note: because the key is a discrete minute, this is strictly a fixed-window counter, not a true sliding window, and it can admit up to twice the limit across a window boundary. A genuine sliding window would also weight the previous window's count by how much of it overlaps the current minute. The callout explains our algorithm choice. Token bucket allows bursts above the average rate, while window-based counters provide smoother limiting. For our use case, per-minute counters in Redis are a great fit because each window key auto-expires, so there's no cleanup overhead.
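A minimal in-memory sketch of that counter follows. A Map stands in for Redis, where INCR and EXPIRE would do the real work, and the per-minute key makes this a fixed-window counter:

```typescript
const LIMITS: Record<string, number> = { free: 100, pro: 10_000 };

// In-memory stand-in for the Redis counters; EXPIRE would
// garbage-collect old window keys automatically in production.
const counters = new Map<string, number>();

class RateLimitError extends Error {
  constructor(public retryAfterSeconds: number) {
    super("429 Too Many Requests");
  }
}

// Fixed-window counter: one key per (apiKey, minute). Throws when the
// caller exceeds their tier's per-minute limit.
function checkRateLimit(
  apiKey: string,
  tier: "free" | "pro",
  nowMs: number = Date.now(),
): void {
  const windowId = Math.floor(nowMs / 60_000); // round to the minute
  const key = `rl:${apiKey}:${windowId}`;
  const count = (counters.get(key) ?? 0) + 1;
  counters.set(key, count);
  if (count > LIMITS[tier]) {
    // Tell the client how long until the current window rolls over.
    const retryAfter = 60 - Math.floor((nowMs % 60_000) / 1000);
    throw new RateLimitError(retryAfter);
  }
}
```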
Let's talk about analytics. The flow diagram shows our asynchronous tracking pipeline. The redirect handler buffers click events in memory, which are then sent to a Kafka topic called clicks. A stream processor consumes from Kafka, aggregates the data, and writes it to ClickHouse for efficient analytics queries. Finally, we expose this data through an analytics dashboard. Looking at the code, notice the log click function is fire-and-forget. It creates a click event with the code, timestamp, IP address, user agent, referer, and country resolved via GeoIP lookup. The key line is at the bottom: we produce to Kafka without awaiting the result. We catch errors for logging but never block the redirect. This is crucial because analytics must never slow down the core redirect functionality. The table at the bottom shows our aggregation strategy. Total clicks use per-minute rollups in ClickHouse. Unique visitors use HyperLogLog sketches in Redis for approximate counting. Top referrers and geographic breakdowns are also pre-aggregated for fast dashboard queries.
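The fire-and-forget producer call can be sketched like this; the `Producer` interface is a placeholder for a real Kafka client, and the event field names follow the narration:

```typescript
interface ClickEvent {
  code: string;
  ts: string;          // ISO 8601 timestamp
  ip: string;
  userAgent: string;
  referer: string | null;
  country: string | null; // resolved via GeoIP in production
}

interface Producer {
  send(topic: string, event: ClickEvent): Promise<void>;
}

// Fire-and-forget: produce to the "clicks" topic WITHOUT awaiting, so
// the redirect response is never delayed by analytics. Errors are
// logged and swallowed, never rethrown into the request path.
function logClick(producer: Producer, event: ClickEvent): void {
  producer.send("clicks", event).catch((err) => {
    console.error("click event dropped:", err);
  });
}
```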
Eventually, you'll need to shard your database. The SQL comments show two approaches: range-based sharding by code prefix, or consistent hashing. Looking at the TypeScript code, the ShardRouter class maintains an array of database pools, one per shard. The getShardForCode function hashes the code using MurmurHash3 and takes the modulo of the shard count to determine which shard owns that code. One important caveat: plain hash-mod-N is not consistent hashing. With modulo routing, changing the shard count remaps almost every existing key; true consistent hashing places shards on a hash ring so that adding a shard only remaps about one over N of your keys. The findByCode method routes to the appropriate shard and executes the query there. The callout gives us guidance on when to actually implement sharding. A single PostgreSQL instance easily handles 100 million rows with proper indexing. You only need to shard when approaching a billion URLs or when write throughput exceeds around 10,000 writes per second. Consistent hashing is the right choice at that scale precisely because of the minimal remapping when the topology changes.
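For contrast with hash-mod-N routing, here's a small sketch of a true consistent-hash ring with virtual nodes. FNV-1a stands in for MurmurHash3 so the example needs no hashing library, and all names are illustrative:

```typescript
// FNV-1a 32-bit hash: a simple stand-in for MurmurHash3.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Consistent-hash ring: each shard owns many virtual points on the
// ring, and a code maps to the first point at or after its hash.
class ShardRing {
  private ring: { point: number; shard: string }[] = [];

  constructor(shards: string[], vnodes = 100) {
    for (const shard of shards) {
      for (let v = 0; v < vnodes; v++) {
        this.ring.push({ point: fnv1a(`${shard}#${v}`), shard });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  getShardForCode(code: string): string {
    const h = fnv1a(code);
    // Wrap around to the first point if the hash is past the last one.
    const entry = this.ring.find((e) => e.point >= h) ?? this.ring[0];
    return entry.shard;
  }
}
```

Adding a shard only claims the ring segments adjacent to its new virtual points, which is where the "about one over N of keys remapped" property comes from.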
Cache invalidation is one of the hardest problems in distributed systems. We have four cards here describing different strategies. TTL-based expiry gives every cache entry a time-to-live, like 24 hours. Stale data resolves itself automatically, but you might serve outdated URLs briefly. Event-driven eviction uses Redis pub-sub: when a URL is deleted or updated, all API servers evict the key immediately. Write-around cache skips the cache entirely on writes; the next read populates it. And versioned keys include a version number in the cache key itself. On update, you increment the version and old keys expire naturally. The callout recommends our approach: combine TTL-based expiry with a 24-hour default, plus event-driven eviction for deletes. Since most URLs never change after creation, invalidation is rare. The main concern is evicting expired URLs promptly, which the Redis pub-sub listener handles with sub-millisecond latency across all servers.
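The versioned-keys strategy is simple enough to show in a few lines; the in-memory Map below stands in for wherever the version counter actually lives, which in practice could be Redis itself:

```typescript
// Version registry, keyed by short code.
const versions = new Map<string, number>();

// Build the versioned cache key for a code.
function cacheKey(code: string): string {
  return `url:${code}:v${versions.get(code) ?? 1}`;
}

// On update, bump the version: readers immediately start using the new
// key, and the stale entry simply ages out via its TTL.
function bumpVersion(code: string): void {
  versions.set(code, (versions.get(code) ?? 1) + 1);
}
```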
Observability is crucial for production systems. We have four key metrics to monitor, shown in the cards. Latency tracking focuses on p50, p95, and p99 redirect latency. If p99 exceeds 50 milliseconds, it usually means the cache miss rate is climbing. Error rate tracking separates 4xx and 5xx errors. A spike in 404s might indicate a bot scanning codes, while 5xx errors mean something is actually broken. Cache hit ratio should be between 93 and 97 percent. Below 90 percent means your working set exceeds cache memory and it's time to scale Redis. Queue depth monitors the Kafka consumer lag for analytics. Growing lag means the stream processor can't keep up with click volume. The table defines our alert thresholds. If redirect p99 stays above 50 milliseconds for 5 minutes, we scale read replicas or check Redis. Error rate above 1 percent for 2 minutes means we page the on-call engineer. Cache hit ratio below 90 percent for 15 minutes triggers capacity planning. And disk usage above 80 percent on a trend basis means we need storage expansion or archival.
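The alert table maps naturally to a small declarative rule set. This sketch omits the "sustained for N minutes" condition, which a real alerting system would evaluate over a time series rather than a point value:

```typescript
interface AlertRule {
  metric: string;
  fires: (value: number) => boolean;
  action: string;
}

// Thresholds from the alert table in the talk.
const ALERT_RULES: AlertRule[] = [
  { metric: "redirect_p99_ms",     fires: (v) => v > 50, action: "scale read replicas / check Redis" },
  { metric: "error_rate_pct",      fires: (v) => v > 1,  action: "page on-call" },
  { metric: "cache_hit_ratio_pct", fires: (v) => v < 90, action: "capacity planning" },
  { metric: "disk_usage_pct",      fires: (v) => v > 80, action: "storage expansion / archival" },
];

// Evaluate the current metric snapshot and return the names of firing alerts.
function firingAlerts(metrics: Record<string, number>): string[] {
  return ALERT_RULES
    .filter((r) => r.metric in metrics && r.fires(metrics[r.metric]))
    .map((r) => r.metric);
}
```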
Let's run some numbers. The stats show we're targeting 100 million total URLs stored, with 1,000 writes per second at peak and 10,000 reads per second, giving us a 10-to-1 read-write ratio. Storage grows by about 500 gigabytes per year. Looking at the table, we calculate that each URL takes roughly 500 bytes for the code, original URL, and metadata. At 1,000 writes per second continuously, that's 31.5 billion new URLs per year, though realistically we'd expect around 1 billion. This translates to 500 gigabytes of storage annually. Read bandwidth is 10,000 requests per second times 500 bytes, which is about 5 megabytes per second. And if 20 percent of URLs are hot and frequently accessed, we need about 10 gigabytes of cache memory. The terminal output shows a quick Python calculation confirming these numbers. This back-of-envelope math is essential for capacity planning and helps us right-size our infrastructure before we build.
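The slide shows this as a quick Python calculation; the same back-of-envelope arithmetic, reproduced here in TypeScript for consistency with the rest of the code:

```typescript
// Inputs from the capacity-planning slide.
const BYTES_PER_URL = 500;      // code + original URL + metadata
const URLS_PER_YEAR = 1e9;      // realistic estimate, not the 31.5B ceiling
const READS_PER_SEC = 10_000;
const TOTAL_URLS = 100e6;
const HOT_FRACTION = 0.2;       // share of URLs that are frequently accessed

// Derived figures.
const storagePerYearGB = (URLS_PER_YEAR * BYTES_PER_URL) / 1e9; // 500 GB/year
const readBandwidthMBs = (READS_PER_SEC * BYTES_PER_URL) / 1e6; // 5 MB/s
const cacheMemoryGB =
  (TOTAL_URLS * BYTES_PER_URL * HOT_FRACTION) / 1e9;            // 10 GB
```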
Every system design involves tradeoffs. The first callout discusses consistency versus availability. Since URL shorteners are read-heavy with that 10-to-1 ratio, we favor availability over strong consistency. An eventual consistency window of a few seconds on writes is perfectly acceptable. New short URLs might take 1 or 2 seconds to propagate to read replicas, but redirects must never fail. The second callout addresses the SQL versus NoSQL debate. PostgreSQL wins here for several reasons: URLs have a fixed schema, we absolutely need ACID guarantees for code uniqueness, and the dataset fits on a single shard for a long time. NoSQL databases like DynamoDB or Cassandra become attractive only when you hit 10 billion-plus URLs and need automatic sharding. The principle is to start simple with battle-tested relational databases and only migrate when the data volume or access patterns force you to. Premature optimization toward NoSQL adds complexity without clear benefits.
Let's wrap up with the key takeaways. First, Base62 encoding over a distributed counter gives us compact short codes that can scale to trillions of URLs. But remember NFR4: a raw counter produces sequential, guessable codes, so obfuscate the counter value first, for example by mixing in random bits or passing it through a keyed bijective permutation. Second, the cache-aside pattern with Redis achieves a 95 percent or better hit ratio, meaning most redirects never touch the database at all. This is critical for meeting our sub-10-millisecond latency requirement. Third, start with PostgreSQL. The schema is simple, uniqueness constraints are absolutely critical for preventing code collisions, and you won't need to shard for years. Resist the urge to prematurely adopt NoSQL. And finally, design for reads. That 10-to-1 read-write ratio means caching and read replicas matter far more than write optimization. Invest your effort where it has the most impact. Thank you for joining me today. I hope this deep dive into URL shortener design gives you a solid foundation for tackling similar system design problems.