Thoughts on Varnish
Varnish is getting a lot of attention these days around the internet, and with good reason: it’s a nicely written and speedy cache with a nice DSL for caching rules. It also has great features like hot reloading of cache rules and ESI.
One thing that’s really surprised me, though, is that Varnish uses one thread per connection. Most network programs designed for a high number of connections don’t use one thread per connection anymore, as it has serious drawbacks.
With slow clients, many of your threads spend most of their time doing nothing but blocking in write(). In all internet consumer apps, I believe, slow clients make up the majority of your connections. But even though the threads are doing nothing, the OS still has memory and scheduling overhead in dealing with them. You find yourself with an artificially low ceiling on the number of users you can service with a single machine.
What makes a client slow, though? Both speed and latency. Cell phones, 56k modems, and users on high speed links but not geographically close to your data center can all be classified as ‘slow’.
One design better suited to the slow client problem keeps a pool of worker threads or processes behind the scenes and uses epoll / kqueue / event ports to handle the slow clients, notifying the pool with a change notification when a socket is ready. Your cost still grows with the number of connections, but at a much lower rate, and the number of users you can service increases dramatically.
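To make that design concrete, here is a minimal sketch in Python, not anything taken from Varnish. It assumes a single event-loop thread that owns every socket (via the standard `selectors` module, which picks epoll or kqueue where available) and a small `ThreadPoolExecutor` for the actual work. The names `serve` and `handle_request` are hypothetical, and the uppercase "work" is just a placeholder.

```python
import queue
import selectors
import socket
import threading
from concurrent.futures import ThreadPoolExecutor


def handle_request(data: bytes) -> bytes:
    # Hypothetical stand-in for real work such as a cache lookup.
    return data.upper()


def serve(listener: socket.socket, stop: threading.Event, workers: int = 4) -> None:
    """One event-loop thread waits on all sockets; a small pool does the work."""
    sel = selectors.DefaultSelector()  # epoll, kqueue, etc., chosen per platform
    pool = ThreadPoolExecutor(max_workers=workers)
    done = queue.Queue()  # (socket, response) pairs produced by workers
    outgoing = {}         # socket -> response bytes not yet written
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ)

    def work(conn, data):
        done.put((conn, handle_request(data)))

    def drop(conn):
        outgoing.pop(conn, None)
        sel.unregister(conn)
        conn.close()

    while not stop.is_set():
        for key, events in sel.select(timeout=0.05):
            conn = key.fileobj
            if conn is listener:
                client, _ = listener.accept()
                client.setblocking(False)
                sel.register(client, selectors.EVENT_READ)
                continue
            try:
                if events & selectors.EVENT_READ:
                    data = conn.recv(4096)
                    if not data:
                        drop(conn)
                        continue
                    # Park the socket while a worker computes the response.
                    sel.unregister(conn)
                    pool.submit(work, conn, data)
                else:
                    # Writable: push only as much as this client will take.
                    sent = conn.send(outgoing[conn])
                    outgoing[conn] = outgoing[conn][sent:]
                    if not outgoing[conn]:
                        del outgoing[conn]
                        sel.modify(conn, selectors.EVENT_READ)
            except BlockingIOError:
                pass  # spurious readiness; try again on the next pass
            except OSError:
                drop(conn)
        # Hand finished responses back to the event loop for writing.
        while not done.empty():
            conn, response = done.get()
            outgoing[conn] = response
            sel.register(conn, selectors.EVENT_WRITE)

    pool.shutdown()
    while not done.empty():
        done.get()[0].close()
    for key in list(sel.get_map().values()):
        if key.fileobj is not listener:
            key.fileobj.close()
    sel.close()
```

The point of the sketch is that a slow client only costs an entry in `outgoing` and a registration with the selector. No worker thread sits in write() waiting for it, so the pool can stay much smaller than the number of open connections.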
So why does Varnish use this older, more troublesome model? Probably because most services aren’t going to notice the bottleneck: they simply don’t have enough concurrent connections to worry about using a few extra machines. If you’ve never saturated a load balancer or firewall, you’ve probably never had to seriously consider the C10k issues involved.
Also, unfortunately, most people write load tests that only exercise the All Fast Clients scenario and not a mix of fast clients and slow clients. I’m guilty of this, too.
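As a rough illustration of what a mixed test could look like, here is a small sketch in Python. Everything in it is an assumption for illustration: the helper names `fetch` and `run_mix`, the chunk sizes, and the delays. It also assumes the server closes the connection after responding (for example an HTTP/1.0 request), since that is how the client knows the response is complete. A slow client here is one with a tiny receive buffer that reads a few bytes at a time and sleeps between reads.

```python
import socket
import threading
import time


def fetch(addr, request, chunk=65536, delay=0.0, rcvbuf=None):
    """Send one request, drain the response, and return the bytes received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if rcvbuf is not None:
        # A small receive buffer makes the server feel the slowness sooner.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    sock.connect(addr)
    sock.sendall(request)
    total = 0
    while True:
        data = sock.recv(chunk)
        if not data:
            break
        total += len(data)
        time.sleep(delay)  # zero for fast clients, non-zero for slow ones
    sock.close()
    return total


def run_mix(addr, request, fast=8, slow=2):
    """Run fast and slow clients concurrently; return (kind, bytes, seconds)."""
    results = []
    lock = threading.Lock()

    def client(kind, **options):
        start = time.monotonic()
        received = fetch(addr, request, **options)
        with lock:
            results.append((kind, received, time.monotonic() - start))

    slow_options = dict(chunk=256, delay=0.01, rcvbuf=1024)
    threads = [threading.Thread(target=client, args=("fast",)) for _ in range(fast)]
    threads += [
        threading.Thread(target=client, args=("slow",), kwargs=slow_options)
        for _ in range(slow)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Something like `run_mix(("127.0.0.1", 8080), b"GET / HTTP/1.0\r\n\r\n", fast=50, slow=200)` would then give per-client timings to compare; the address and the counts are placeholders. A real test would want far more connections than threads can comfortably provide, but the idea of deliberately throttled readers carries over.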
My executive summary: Varnish is a nice piece of software, and I hope they spend the time to make it useful for larger sites as well as smaller ones.


7 Comments:
I think this is the first open source thing I've seen that supports Edge Side Includes, not that I've been looking.
6:50 PM
Steve, are you sure about the one-thread-per-connection behavior? The Varnish architecture diagram clearly states that it employs kqueue/epoll/poll/ports for accepting connections and putting them into a pool, where a limited number of worker threads can pick them up and process them. I haven't been familiar with the internals of Varnish for a long time, but they started with Niels Provos' libevent for event-driven async I/O handling and completely removed it 2 years ago in favor of an alternative. I'm still experiencing some increased resource usage problems with Varnish, but I'm not sure if it's all about threading. A recent peak of 3866 concurrent, utterly slow users (10000+ sessions) consumed a significant amount of processor time and memory (other than the hot object cache).
12:57 AM
Yes, I'm pretty certain. The architecture diagram shows using non-blocking reads for accepting but there's still one thread per connection behind the scenes, as searching for pthread_create in their source base shows.
The difference between what I'm suggesting and what they are doing is that they aren't using a smaller pool of workers than connections to do the actual work.
I haven't done this myself, but try using the ps -M (show threads) option and correlating that with ESTABLISHED connections in netstat. It should be nearly 1:1.
9:08 AM
Kind of interesting that this discussion is still going on. Maybe it is time to think about switchflow again. Varnish and Nginx are two reasons I decided to shelve that work.
I suspect a lot of folks start working on async HTTP servers and realize how hugely complicated it is. Maybe that's how Varnish ended up at thread per connection.
On Linux I still think thread per connection isn't optimal for proxies because of reliability concerns when dealing with an unknown stack size.
Running out of stack is a nasty situation in C/C++, and if you are dynamically allocating stack via threads the chance of blowing the stack increases.
In languages like Erlang, where the runtime stack is actually stored on the C heap, they can more easily recover from this, so it makes sense to use a process per connection.
http://baus.net/
10:43 AM
Another reason they might have taken a thread-per-connection approach is that they need to read cache items from disk, which AFAIK cannot be done in a non-blocking way on Linux.
7:00 PM
We also found it wasn't stable under production load. And we weren't able to engage in a productive conversation about why.
10:43 AM
@baus: yeah, it's definitely very complicated. The follow-up letter from PHK that I posted sheds some light on their design decisions.
I need to look again at switchflow.
@kellan: was this with Varnish 1? I've heard that had a lot of problems that mostly seem to be fixed with 2.
9:18 PM