Recently GigaOM carried an article written by Derrick Harris entitled: “Facebook trapped in MySQL fate worse than death”.
The gist of the article is:
- Facebook is trapped using MySQL as their persistence data layer
- This is unfortunate, because MySQL is inappropriate
- It is inappropriate because MySQL is poor at large scales
The greatest counterargument, I suppose is, to quote John Siracusa (of Ars Technica and Hypercritical), that Facebook works. Can you imagine an architecture of greater scale? They have 700 million active users!
The article quotes Michael Stonebraker, who, as the creator of the Ingres database (and later the Postgres database. You’d think the greater eminence of the Postgres DB would ensure it’s presence outside these parentheses, but look how wrong you are!), certainly knows what he’s talking about when it comes to databases.
The tone of the article, however, and many like it, is this issue of painting scalability in an absolute sense – i.e. that data storage technology X either “scales” or “doesn’t scale”, from which one is led to infer that a service can or cannot be implemented upon said technology.
In truth, the scalability of data storage technologies can only be described in a relative sense – they are more scalable, or less scalable. Saying that something “doesn’t scale” is lazy at best and misleading at worst.
The scaling problem in a nutshell
The scalability of distributed systems and web applications in particular is typically talked about on quite vague terms. One can make a reasonable definition with a bit of investigation.
It seems clear that scalability contains within it this notion of what is happening to one’s infrastructure as more clients are requesting resources from it.
As applied to distributed systems, scalability is used in two senses:
- How ability to handle load increases as you add resources
- Ability to handle increasing load in a graceful manner (for a fixed set of resources)
Notice that scaling is not performance.
Performance is about the rate of computation for fixed circumstances. Scalability is what happens as demand upon a system increases. For example, the maxsort algorithm has high performance for 5 numbers but poor scalability. The quicksort algorithm by contrast has high scalability.
Let’s consider a web server, serving static content (e.g. a page containing only the text “Hello World!”) and conduct an experiment – what happens as the number of requests upon it starts to increase?
One sees that two important things happen:
- Amount of time required per-request goes up
- Failed requests start to appear
What is the explanation for this?
Your web server has a maximum number of concurrent client requests it can serve. As connections are filling up this limit – no slowdown is yet noticeable client side. After the limit is reached, the OS initiates a backlog queue, of signals from clients (SYN signals from the TCP protocol) waiting to be processed. The size of this backlog queue is OS dependent.
As this backlog queue starts getting filled, client requests take longer to be fulfilled. Until ultimately, after the queue is filled, requests start to get dropped. Now, this is totally OS and machine context dependent, but typical values for Apache on a commodity machine when this starts happening are ~10^4 connections.
Recall this is simply for requesting a static page. As more complex requests occur, such as dynamically synthesizing the response to an SQL query, you can imagine the time + memory required to fulfill a request grows, and so the number of requests the server can receive before they start being dropped becomes smaller.
In a nutshell, then, the scaling problem is thus: we wish to prevent slow responses / dropped requests by the server. The way to do this is to prevent the request queues from filling up – and to do that one spreads requests across multiple machines. If your storage solution (such as MySQL) is better than another, then it will require fewer nodes to spread requests across. If it is poorer, more nodes to spread requests across.
Now, one can make the argument that were Facebook to use a different storage technology, they might need fewer resources. But this notion that they’re on the brink of collapse is hyperbole. They simply have the requisite resources to make MySQL respond to their experienced load. Facebook pumped a sufficient number of machines into their MySQL based architecture to make their system work.
Technologies are more scalable, or less scalable. They are not scalable in an absolute sense. One can say, “well, we don’t have the hardware to make this data storage system work” or, “Hey, we switched to data storage system X and now we can serve twice as many requests per hour” but it is not meaningful simply to say “X does not scale.” It is a lazy answer.