A client on a fast connection will come in and pull the data as fast as the server can spit it out, keeping the process and the buffers occupied for the minimum amount of wall-clock time, and the number of 'poll' cycles needed is very small.
But the slowpokes, the ones on dial-up and on congested lines, will get you every time. They keep the processes busy far longer than you'd want, and you have to hit the 'poll' cycle far more frequently: first to see if they've finally finished sending you a request, then to see if they've finally received the last little bit of data that you sent them.
The impact of this is very easy to underestimate, and if you're benchmarking web servers for real-world conditions you could do a lot worse than to run a test across a line that is congested on purpose.
> the ones on dial up and on congested lines will get you every time.
Do you have numbers on the dial-up users for your server? My understanding is that there are far fewer of them, so this is bogus. Show evidence of high dial-up penetration first.
> They keep the processes busy far longer than you'd want and you have to hit the 'poll' cycle far more frequently
Again, you have no numbers on the active/total ratio in your server, so unless you do, this statement doesn't refute what I found. I've presented evidence that shows the math of O(N=active) / O(N=total) holds up. Simple math. The only way epoll wins for all load types is if it is as fast as poll all the time. My tests show it's not, which stands to reason since it's implemented using more syscalls than poll.
> The impact of this is very easy to underestimate, and if you're benchmarking web servers for real world conditions you could do a lot worse than to run a test across a line that is congested on purpose.
Again, you have no definition of "congestion". If you adopt a simple metric like ATR then we can talk. As it is, you (and everyone else) just throw around latency numbers like those matter, when really the performance break is in the ATR. In addition, my numbers show the performance break being at about 60% ATR, so if you're saying that no server ever goes above 60% activity levels then you're totally wrong. 60% is not completely unreasonable on a loaded server.
But, I think you're missing a key point: You need both in a server like Mongrel2. I never said epoll sucks and poll rocks (since you probably didn't read the article). I said something very exact and measurable:
> epoll is faster than poll when the active/total FD ratio is < 0.6, but poll is faster than epoll when the active/total ratio is > 0.6.
If you don't think that's the case in "the real world" then go measure it and report back. That's the science part. I totally don't believe it yet myself, which is why I'm measuring it and showing the methods to everyone so they can confirm it for me.
The webserver is a custom job called yawwws (yet-another-www-server) that is used to serve up a variety of bits and pieces for a high-traffic website; typically the requests are very short in nature (a 500-byte request followed by a < 10K answer).
After about two hours of running, the active-to-total ratio varied between 10% and 40% over 5-minute intervals, with the majority of the 5-minute buckets around the 30% mark. I'm actually quite surprised at the spread.
The bigger portion of the time seems to be spent waiting for the clients to send the request; most if not all of the output data should fit in the TCP output buffers, so that actually skews the results upwards. For longer-running requests sending more data to the clients, the active-to-total ratios would probably be a bit lower.
So 10% to 40% of all the sockets were active at any given time; the rest were idle, waiting for data to be received or for buffer space to be freed up so data could be written.
In this situation epoll would be faster than poll, because epoll only sends the user process the fds that it actually has to deal with rather than all of them, so the loop that consumes the output of the system call will have fewer iterations.
So, as I wrote before, I think the typical web server, when it is dealing with the client-facing side, is more often than not waiting for the client to do something, and it seems that on my server that hasn't changed since I last looked at it.
This server runs with keepalive off. Switching it on will most likely make the active-to-total ratio dramatically lower but I don't feel like pissing off a large number of users just to see how bad it could get. There is a good chance that my socket pool will turn out to be too small to do this without damage.
Chances are that for different workloads the percentages will vary but this setup is fairly typical (single threaded server, all requests served from memory) so I wouldn't expect to see too much variation on different sites, and if there is variation I'd expect it to go down rather than up.
If I get a chance I'll re-run the test on some other websites to see if the numbers come out comparable or are wildly different.
He doesn't need to show that it's high, only that it's high enough to cause a significant contingent of ordinary webservers' requests to be lingering slow connections.
For instance, you might have a system which has a latency of 1 second, and at a given workload, you have 10,000 connections. In the Java culture, people think you're a genius if you can increase those connections to 100,000 and increase the latency to 10 seconds.
End users, on the other hand, would be happier if you cut the latency to 0.1 seconds, but there are a lot of people who'll then think you're a loser who can only manage to handle 1000 concurrent connections.
Of course, getting that latency down is a holistic process that requires you to think about the client, the server, and what exactly goes over the wire.
As far as I know the only way around this is to use multiple IPs (possibly aliases on the same interface), but that would still require a new process.
So even if your per-process limit for fds can be larger than 64K, the network layer or the mapper that turns fds into socket ids for the network stack to work with may impose a restriction. I don't know enough about the Linux kernel to figure out what exactly causes this.
I use the 64K limit on some high-throughput machines (mostly video and image servers), but when I go over that I need to start another process. Possibly there's a way around that, but the expense of another process is fairly small so I haven't put in much time to see if I can work around it. Socket-to-fd mapping presumably takes into account the address as well as the port, so it shouldn't be a problem, but on the kernels of the machines where I have to resort to these tricks it appears to be a limit.
Maybe someone with more knowledge of the guts of the Linux kernel can point out why this happens.
I wonder how kqueue behaves compared to poll and epoll. Kqueue has a less stupid interface because it allows you to perform batch updates with a single syscall.
http://www.xmailserver.org/linux-patches/nio-improve.html
And as jacquesm points out, in a web-facing server, that's the case you should care about. A 15-20% performance hit in a situation a web-facing server is never going to see doesn't matter when you consider that the 'faster' method is 80% slower (or worse) in lots of real world scenarios.
I'll be interested to see how the superpoll approach ends up working, but my first impression is 'more complexity, not much more benefit'.
Yes, but where's the evidence of what people see for active/total ratios in the real world? I'm showing that unless it's below about 60% (probably more like 50%), poll is the way to go.
60% active isn't entirely unrealistic at all. I can see quite a few servers hitting those thresholds, so in those cases, poll vs. epoll doesn't matter.
I think what's more important in what I'm finding is that you really need both. It's entirely possible that you have servers that are at 80-90% ATR all the time. Others that are 10% ATR. The key is either you have to measure that, which nobody does, or you have to make a server that can adapt.
Yes Zed, where the fuck is it? You're claiming SCIENCE! based on your worst-case synthetic localhost benchmarks, and then turning around and wildly guessing at real-world performance characteristics with internet latencies.
Worse, your whole thesis hinges on ATR, but you made no effort to measure it anywhere; instead you're passive-aggressively berating us to do it.
I'd be curious if you have any evidence that this occurs in practice. Even for a busy server with clients of uniformly low latency, I'd intuitively expect fairly low ATRs.
> I think what's more important in what I'm finding is that you really need both.
I'm not sure you do: the performance advantage of poll seems marginal at best. When ATR is high, you're presumably doing enough real work that the slight overhead of epoll vs. poll is probably not super important.
If a site gets spiked with the typical 'read-and-leave' traffic a link from reddit or huffpo or wherever generates, how does superpoll compare to straight epoll? Based on your description so far, I can only see it hurting - you're not just wasting time on dead connections in your poll bin, you're now also incurring the overhead of managing the migration over to the epoll bin.
What exactly is the definition of an "active" file descriptor in this context?
My best guess after reading the man pages is that poll() takes an array of file descriptors to monitor and sets flags in the relevant array entries, which your code then needs to scan linearly for changes, whereas epoll_wait() gives you an array of events, thus avoiding checking file descriptors which haven't received any events. Active file descriptors would therefore be those that did indeed receive an event during the call.
EDIT: thanks for pointing out Zed's "superpoll" idea. I somehow completely missed that paragraph in the article, which makes the following paragraph redundant.
If this is correct, it sounds to me (naive as I am) as if some kind of hybrid approach would be the most efficient: stuff the idling/lagging connections into an epoll pool and add the pool's file descriptor to the array of "live" connections you use with poll(). That of course assumes you can identify a set of fds which are indeed most active.
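That kind of nesting is possible because an epoll instance is itself a pollable file descriptor. A minimal sketch of the idea on Linux, using Python's select module with pipes standing in for connections (the hot/idle split here is purely illustrative):

```python
import os
import select

hot_r, hot_w = os.pipe()      # a "live" connection, watched by poll directly
idle_r, idle_w = os.pipe()    # a "lagging" connection, parked in an epoll pool

ep = select.epoll()
ep.register(idle_r, select.EPOLLIN)

p = select.poll()
p.register(hot_r, select.POLLIN)
p.register(ep.fileno(), select.POLLIN)   # the whole epoll pool costs one poll slot

# Data arriving on the parked fd makes the epoll fd itself readable,
# so it surfaces through the ordinary poll() call.
os.write(idle_w, b"x")
events = dict(p.poll(100))
parked_ready = [fd for fd, _ in ep.poll(0)] if ep.fileno() in events else []
```

The appeal is that an arbitrary number of slow connections collapses into a single entry in the poll array.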
The difference between poll and epoll is that, given an input of N file descriptors, poll hands you back the whole array of N file descriptors and you need to loop through each one of them to check whether its 'active' flag is set. epoll returns just the active file descriptors, so you don't need to loop through the inactive ones.
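Roughly, in code (Linux, Python's select module; note that Python's poll wrapper does the revents scan for you in C, but the per-call cost over all N registered fds is still paid in the kernel):

```python
import os
import select

pipes = [os.pipe() for _ in range(4)]
os.write(pipes[2][1], b"x")           # exactly one fd is "active"

# poll(2): the interest set is handed to the kernel on every call,
# and the kernel examines all N descriptors each time.
p = select.poll()
for r, _ in pipes:
    p.register(r, select.POLLIN)
poll_ready = p.poll(0)                # [(fd, eventmask), ...]

# epoll(7): the interest set lives in the kernel; epoll_wait only
# reports the ready descriptors, without rescanning the whole set.
ep = select.epoll()
for r, _ in pipes:
    ep.register(r, select.EPOLLIN)
epoll_ready = ep.poll(0)
```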
A hybrid approach, as Zed has suggested, would appear to be more efficient on the surface. It remains to be seen whether it can actually be implemented efficiently because migrating fds from/to epoll is extremely expensive, requiring a single syscall per fd.
But if you ask me, the real solution is to have the kernel team fix their epoll implementation performance issues instead of forcing people to work around it with hybrid approaches. Other than the stupid single-syscall-per-fd requirement, there's nothing in epoll's interface that would force it to perform worse than poll when the active/total ratio is high.
> But if you ask me, the real solution is to have the kernel team fix their epoll implementation performance issues instead of forcing people to work around it with hybrid approaches.
That does indeed sound like a better conclusion.
> Other than the stupid single-syscall-per-fd requirement, there's nothing in epoll's interface that would force it to perform worse than poll when the active/total ratio is high.
I don't see a reason why the syscall-per-fd couldn't easily be replaced/augmented with a single mass add/remove syscall which takes an array. The worse performance seems similarly baffling; it almost sounds as if they had some kind of inefficient data structure holding the file descriptor pool; considering poll() uses a flat array and epoll uses set operations I assume it's pretty tricky to make it perform well, even with a hash table. Maybe set operations aren't the best way to handle this data structure; but only some profiling in the kernel code can tell us that.
Obviously it'll take until 2.6.37 at least for any changes to enter the mainstream kernel, and until then a hybrid approach sounds sensible for those unwilling to patch. But still, fixing the root problem seems like a worthwhile cause.
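The mass add/remove interface suggested above doesn't exist in the mainline kernel (epoll_ctl takes one fd per call), but its shape can be sketched as a hypothetical userland wrapper. Each entry in the batch still costs a syscall today, which is exactly what a kernel-side version would amortize:

```python
import os
import select

def epoll_ctl_batch(ep, ops):
    """Hypothetical batch interface: ops is a list of
    ('add' | 'mod' | 'del', fd, eventmask) tuples. Today this loop is
    one epoll_ctl syscall per entry; a kernel version taking the whole
    array would make it one syscall total."""
    for op, fd, mask in ops:
        if op == 'add':
            ep.register(fd, mask)
        elif op == 'mod':
            ep.modify(fd, mask)
        else:
            ep.unregister(fd)

ep = select.epoll()
pipes = [os.pipe() for _ in range(3)]
epoll_ctl_batch(ep, [('add', r, select.EPOLLIN) for r, _ in pipes])
os.write(pipes[0][1], b"x")
ready = [fd for fd, _ in ep.poll(0)]
```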
    active_fds = poll(big_ass_array_of_fds, total_fds)
epoll is slightly different but the same concept. You have a total number of FDs you want to know about, and each call returns the number that have had activity.
And that's it. You then just compute active_fds/total_fds and that gives the ATR. If this is < 0.6 after your call to poll, then that call would have been better done with epoll. If active_fds/total_fds is > 0.6, then it's better to stick with poll.
Of course, it's more complicated than that, but this gives you a simple metric of the break point where one is better than another.
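The ATR computation spelled out, with pipes standing in for sockets (the 0.6 cutoff is the article's measured break point, not a universal constant):

```python
import os
import select

def poll_with_atr(p, total_fds, timeout_ms=0):
    """One poll cycle, returning the ready fds and that cycle's ATR."""
    active = p.poll(timeout_ms)
    return active, len(active) / total_fds

pipes = [os.pipe() for _ in range(10)]
p = select.poll()
for r, _ in pipes:
    p.register(r, select.POLLIN)
for _, w in pipes[:7]:                 # make 7 of the 10 fds active
    os.write(w, b"x")

active, atr = poll_with_atr(p, len(pipes))
# Here atr is 0.7 -- above the ~0.6 break point, so by the article's
# measurements this cycle was cheaper under poll than under epoll.
```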
To put it another way: if you were to use blocking IO, then an operation on an active descriptor would not block. Of course poll and epoll are all about asynchronous IO (so non-blocking by definition), but that's a good way to describe the difference.
Zed's 'superpoll' is precisely what you suggest.
> Zed's 'superpoll' is precisely what you suggest.
Facepalm. Thanks, I mysteriously missed that part of the article.
That's just plain wrong. Premature optimisation does not refer to having to measure before you optimise, it refers to optimising things that in practice may have little or no effect on the actual performance of the program.
By doing these tests in isolation instead of while running on a profiling kernel under production load it is very well possible that the bottleneck will not be the polling code at all but something entirely different. I'd say that this is a textbook example of what premature optimisation is all about.
Assuming you have a finite budget of time to spend on a project, any optimisations that take time out of that budget which could have been spent more effectively elsewhere are premature.
Now there is a chance that this would have been the bottleneck in the completed system, but before you've got a complete system you can't really tell. My guess, based on real-world experience with lots of system-level code that used both (web servers, video servers, streaming audio servers and so on), is that the overhead of poll/epoll will be relatively minor compared to other parts of the code and the massive amount of IO that typically follows a poll or epoll call.
If you have 10K sockets open then typically poll/epoll will return a large number of 'active' descriptors, you'll then be doing IO on all of those for that single call to poll/epoll.
Each of those IO calls is probably going to be as much or more work to process than the poll call was.
Maybe Java does some of this cool stuff already so perhaps I'm shielded from the pain of dealing with things directly.
In the past I've written Java NIO code that dealt with around 60,000 concurrent connections pretty well. The time spent doing poll seemed to be completely insignificant. CPU usage was negligible.
It'd be good to see some numbers though - for example:
For the average Mongrel application, 40% of CPU time is spent in poll, or an average of 30ms of latency is due to poll, etc.
But I'm skeptical those numbers are true. That was my point.
If you don't start with those numbers and measurements, optimizations like this, whilst interesting, may end up being of no real use to anyone.
You're probably right that when you actually use Mongrel2 as your app server your app-specific code higher up will be a larger bottleneck, but that's code that you have to deal with and this is code that he has to deal with so optimizing the hell out of it doesn't sound like a bad idea.
That's 0.5% of your total time being spent here. So even if it's made twice as fast, your app will only speed up from, say, 100ms to 99.75ms.
Find the big things that matter and optimize them. Adding extra complexity to small things that don't matter is a recipe for more bugs and more issues.
Fortunately, Zed is the right guy to find this out. I'm certainly looking forward to the results of this--which I bet we'll have an initial answer to by tomorrow.
Depending on what you are doing, you might not even need to track these booleans. For example, on the read side you can ignore read events when you are not interested in reading. When you switch back to read interest, you can read the socket to see if data arrived while you ignored events. A similar strategy can be used on the write side.
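A sketch of that strategy with epoll on Linux: instead of keeping an "interested in reads" boolean and filtering events in userland, drop EPOLLIN from the event mask while you don't care, and re-arm it later:

```python
import os
import select

r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN)

# Stop caring about reads: clear the mask rather than tracking a flag.
ep.modify(r, 0)
os.write(w, b"hello")            # data arrives while we're not listening
ignored = ep.poll(0)             # no wakeup for the ignored fd

# Interest returns: re-arm EPOLLIN; the data that queued up meanwhile
# is reported immediately (level-triggered).
ep.modify(r, select.EPOLLIN)
resumed = ep.poll(0)
```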
I can't even find a reference to his OS configuration and version details that he's developing on, which seems to me like a critical detail.
Today I'm writing up how I ran the tests, releasing all the code, and asking everyone to test my results. I am completely assuming I am wrong, so I'm looking for other people to test it.
Incidentally, if you google for "pipetest.c" you'll see it's kind of the gold standard for this comparison, so if that code is wrong, then the entire assumption that epoll is better needs to be redone.
To make your process scientific, I'd like to suggest you add the following things to the post when you find it convenient:
1. A detailed explanation of your methodology, preferably with source code. This is so we can reproduce the tests. The ability to reproduce your work is a critical part of any process calling itself science.
2. A detailed list of the hardware you used & its deployment. (For reasons listed above).
3. Your raw data should be made available upon request so other people can work it as well.
P.S., aren't you concerned about I/O overhead with your superpoll proposal? It seems like the added resource allocation and the time spent in zeromq is going to eat up the small advantages you gain?
It seems superior to both *poll minions. It would be great if you proved or falsified this thesis as well.
There are probably hordes of people who will be willing to run Mongrel2 on *BSD platforms, precisely because of the performance reasons. And Zed is a famous tinkerer rather than a religious zealot, so very probably he could be interested in checking kqueue as well.
"Why not" is also a good reason for a hacker when he's lacking other reasons.
In the case of poll(), you have to transfer this array of FDs from userland to the kernel each time you call poll(). Now compare this with epoll (let's assume we are using the EPOLLET trigger), where you only have to transfer the file descriptors once.
You might say the copying won't matter, but it will matter when you have a lot of events coming in on the 20k FDs, which eventually leads to calling poll() at a higher rate, hence more copying of data between userland and the kernel (sizeof(struct pollfd) is 8 bytes, so 8 bytes * 20k is about 160 KB each call).
Also, your assumption of EPOLLET is potentially wrong. I think (unproven) that the extra overhead and complexity of using edge triggering correctly makes EPOLLET pointless.
I think it might even be faster, kernel-side. From what I remember of the implementation, both modes have to walk the same list of ready fds, but that list is shorter in edge triggered mode, because they get removed from the list as it goes.
Edge triggered might have more overhead if many fds change between ready/not-ready quickly, but that's quite the wacky situation (and if it has an even distribution, would ensure your ATR is about 0.5, so probably still winning).
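The report-once behaviour is visible from userland: an edge-triggered fd is reported on the not-ready-to-ready transition, not on every wait. A quick check on Linux:

```python
import os
import select

r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN | select.EPOLLET)

os.write(w, b"x")
first = ep.poll(0)    # the not-ready -> ready edge is reported once
second = ep.poll(0)   # fd is still readable, but there was no new edge
```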
Just kidding. It's always nice to see science in action. Great work! I suspect there's an impact on ZeroMQ's own poll/epoll strategy.
Of course, that's often Kinda Hard To Do (tm). ;-)
So if you have a thousand fds, and they're all active, you have to deal with a thousand fds, which would make the difference between poll and epoll insignificant (only twice as fast, not even an order of magnitude!)?
This would make the micro-benchmark quite micro! Annoyingly enough, I think that means that the real way to find out would be an httperf run with each backend. A lot more work...
hint: nginx/src/event/modules/ngx_epoll_module.c
Maybe one should learn how to use epoll and, perhaps, how to program? ^_^