a URL shortener system
To design a scalable system, you should design a system which can handle up to millions of URLs generated per day. The length of the generated URLs should be as short as possible. They can be a combination of characters (a-z, A-Z) as well as numbers (0 - 9)
In my favorite TV show ever, "Silicon Valley", Gavin Belson said that "small is the new big". Although he is not my favorite character in the movie, I gotta admit that he is right. With the endless stream of information we encounter daily, it getting harder to pay any of our attention to anything. That is a URL shortener is such an essential these days as services like bit.ly or goo.gl are widely popular. These services help shorten an arbitrary URL like "https://www.google.com/search?q=fasdfsa&sxsrf=AOaemvLDZGX" into something a lot shorter like "https://www.bit.ly/dut122"
Requirements
On the surface, this seems like an easy system to design; however, designing it correctly is hard. And the "right" changes as technology evolves daily and keeps on changing. Things like traffic pattern also change, too.
To design a system correctly, you have to ask the right questions. Like:
- What is the traffic volume?
- What characters are allowed in the shortened URL?
- How long are the shortened URLs should be?
To design a scalable system, you should design a system which can handle up to millions of URLs generated per day. The length of the generated URLs should be as short as possible. They can be a combination of characters (a-z, A-Z) as well as numbers (0 - 9)
Use Case
- User input a long URL, the service returns a much shorter URL
- User clicks the shortened URL, they are redirected to the original URL
- The service should be highly available, scalable, fault tolerant.
Quick Estimations
- Write: 100 million URLs shortened per day
- Write Per Second: 100 million / 24/ 3600 = 1160
- Read: Because it is hard to estimate how many times a users will need to use the URL, we can assume the ratio to be 10:1, read to write. Therefore, it will be 1160 * 10 = 11,600
- When you build something, it should be durable. So we can expect the service will up and running for 15 years. This means we have to support 100 million * 365 * 15 = 548 billions records
- An average URL length will be 100. Then the storage requires over 15 years will be 548 billions * 100 bytes = 54.8 TB
High Level Design
The high level design of this system will focus on the API endpoints, URL redirecting.
A URL shortener will require two primary API endpoints. The first one will be a POST request to create new short URL. The second one will be a GET request for URL redirecting.
- POST api/v1/url/shorten - @request parameter: longURL, @return shortURL
- GET api/v1/shorturl - @return longURL for HTTP connection
To redirect a URL, once a user send the above GET request, the service will change the short URL to the long URL with 301 redirect
Low Level Design
- User inputs a longURL
- The service will check if the longURL already existed in the database
- If it does, fetch the shortened URL from the database and return it
- If it doesn't, create a new shortURL
- Store all the attributes in the database with ID, shortURL, and longURL
#Data model
Although there are many ways in which you can store the data for this service, personally I believe that the model in the image below is the best option. It is a simple model, but it is enough for the purpose of this post.
#How to generate the unique shortened link?
I bet that "hash function" is the first thing that came to your mind when you think about converting a long string of information into a shorter, condensed string. But in this post, I will use the "Base-62 Conversion". Even though each mechanism has its own pros and cons, using "hash function" will results in collisions between shortened URLs. Some of you may say that the collisions can be avoided by checking in advanced if the shortened URL already existed or not. If it does, we can generate a new URL. However, do you know how expensive that can be? As in the beginning, we planned out to design a scalable system, that's why we have to come up with the most sustainable solution for our system. We can't risk making the service doing extra work while we have the luxury of designing it differently. Furthermore, base conversion is also a popular algorithm used for URL shorteners.
For those that aren't familiar with base conversion, it is an algorithm that helps convert the same number between its different number representation systems. In this case, the "Base-62 conversion" is used as if there are 62 possible characters for the shortened URL value. For example, the number of 649 in the long URL will be converted to "AT" in the base-62. Therefore, our shortened URL will look like this "bit.ly/AT". Hope things are clearly for everyone.
So the workflow on the service now will be like this.
- A user enters a long URL
- The service will check if the long URL already exists in the database or not. If it does, the shortened URL will be returned
- If not, the database will generate a random ID for the URL *
- We converted the newly generated ID with the base-62 to create a new unique string. Then, we concat that tiny unique string with our domain to become a short URL.
- We store everything in the database according to the model in the image above
*This will not be a problem with you only using one server for the service, which is unlikely. In case the service uses a distributed system, we have to create a ID generator for the system because it can be the case where 2 exactly ID being generated. But in the scope of this post, I will not talk about that. It is for another post.
We already talked about the user's workflow on converting the URL. How about the URL redirecting mechanism?
There is also one small notice here as we design this system. Because there are obviously more reads than writes, to boost the performance, we will store the mapping of <shortURL, longURL> in the cache.
- The user will click the shortened URL link.
- The load balancer will send the request to the web servers.
- If the shortened URL is already in the cache, return the longURL immediately.
- Bad cases will happen if the shortened URL is not in the cache, the service will have to go find it in the database. Worst case scenarios is the user enters an invalid URL.
- The longURL will be returned to the user.
Well, this post has been long enough and I think all the necessities of designing a URL shortener has been addressed. Please let me know at hi@tamnguyen.io or visit my website at tamnguyen.io if you have any question regarding this post.
Thank you everyone and stay safe.