A client on a fast connection will come in and pull the data as fast as the server can spit it out, keeping the process and the buffers occupied for the minimum amount of wall-clock time, and the number of 'poll' cycles needed is very small.
But the slowpokes, the ones on dial-up and on congested lines, will get you every time. They keep the processes busy far longer than you'd want, and you have to hit the 'poll' cycle far more frequently: first to see if they've finally finished sending you a request, then to see if they've finally received the last little bit of data that you sent them.
The impact of this is very easy to underestimate, and if you're benchmarking web servers for real-world conditions you could do a lot worse than to run a test across a line that is congested on purpose.
> the ones on dial up and on congested lines will get you every time.
Do you have numbers on the dial-up users for your server? My understanding is that there are far fewer of them, so this is bogus. Show evidence of high dial-up penetration first.
> They keep the processes busy far longer than you'd want and you have to hit the 'poll' cycle far more frequently
Again, you have no numbers on the active/total ratio in your server, so unless you do, this statement doesn't refute what I found. I've presented evidence that shows the math of O(N=active) / O(N=total) holds up. Simple math. The only way epoll wins for all load types is if it is as fast as poll all the time. My tests show it's not, which stands to reason since it's implemented using more syscalls than poll.
> The impact of this is very easy to underestimate, and if you're benchmarking web servers for real world conditions you could do a lot worse than to run a test across a line that is congested on purpose.
Again, you have no definition of "congestion". If you adopt a simple metric like ATR then we can talk. As it is, you (and everyone else) just throw around latency numbers like those matter, when really the performance break is in the ATR. In addition, my numbers show the performance break being at about 60% ATR, so if you're saying that no server ever goes above 60% activity levels then you're totally wrong. 60% is not completely unreasonable on a loaded server.
But, I think you're missing a key point: You need both in a server like Mongrel2. I never said epoll sucks and poll rocks (since you probably didn't read the article). I said something very exact and measurable:
> epoll is faster than poll when the active/total FD ratio is < 0.6, but poll is faster than epoll when the active/total ratio is > 0.6.
If you don't think that's the case in "the real world" then go measure it and report back. That's the science part. I totally don't believe it yet myself, which is why I'm measuring it and showing the methods to everyone so they can confirm it for me.
The webserver is a custom job called yawwws (yet-another-www-server) that is used to serve up a variety of bits and pieces for a high-traffic website; typically the requests are very short in nature (a 500-byte request followed by a < 10K answer).
After about two hours of running, the active-to-total ratio varied between 10% and 40% over 5-minute intervals, with the majority of the 5-minute buckets around the 30% mark. I'm actually quite surprised at the spread.
The bigger portion of the time seems to be spent waiting for the clients to send the request; most if not all of the output data should fit in the TCP output buffers, so that actually skews the results upwards. For longer-running requests sending more data to the clients, the active-to-total ratios would probably be a bit lower.
So 10% to 40% of all the sockets were active at any given time; the rest were idle, waiting for data to be received or for buffer space to be freed up so data could be written.
In this situation epoll would be faster than poll, because epoll only sends the user process the fds that it actually has to deal with rather than all of them, so the loop that consumes the output of the system call will have fewer iterations.
So, as I wrote before, I think the typical web server, when it is dealing with the client-facing side, is more often than not waiting for the client to do something, and it seems that on my server that hasn't changed since I last looked at it.
This server runs with keepalive off. Switching it on will most likely make the active-to-total ratio dramatically lower but I don't feel like pissing off a large number of users just to see how bad it could get. There is a good chance that my socket pool will turn out to be too small to do this without damage.
Chances are that for different workloads the percentages will vary but this setup is fairly typical (single threaded server, all requests served from memory) so I wouldn't expect to see too much variation on different sites, and if there is variation I'd expect it to go down rather than up.
If I get a chance I'll re-run the test on some other websites to see if the numbers come out comparable or are wildly different.
He doesn't need to show that it's high, only that it's high enough to cause a significant contingent of ordinary webservers' requests to be lingering slow connections.
For instance, you might have a system which has a latency of 1 second, and at a given workload, you have 10,000 connections. In the Java culture, people think you're a genius if you can increase those connections to 100,000 and increase the latency to 10 seconds.
End users, on the other hand, would be happier if you cut the latency to 0.1 seconds, but there are a lot of people who'll then think you're a loser who can only manage to handle 1000 concurrent connections.
Of course, getting that latency down is a holistic process that requires you to think about the client, the server, and what exactly goes over the wire.
As far as I know the only way around this is to use multiple IPs (possibly aliases on the same interface), but that would still require a new process.
So even if your per-process limit for fds can be larger than 64K, the network layer or the mapper that turns fds into socket ids for the network stack to work with may impose a restriction. I don't know enough about the Linux kernel to figure out what exactly causes this.
I use the 64K limit on some high-throughput machines (mostly video and image servers), but when I go over that I need to start another process. Possibly there's a way around that, but the expense of another process is fairly small so I haven't put in much time to see if I can work around it. Socket-to-fd mapping presumably takes into account the address as well as the port, so it shouldn't be a problem, but on the kernels of the machines where I have to resort to these tricks it appears to be a limit.
Maybe someone with more knowledge of the guts of the Linux kernel can point out why this happens.
I wonder how kqueue behaves compared to poll and epoll. Kqueue has a less stupid interface because it allows you to perform batch updates with a single syscall.
http://www.xmailserver.org/linux-patches/nio-improve.html
And as jacquesm points out, in a web-facing server, that's the case you should care about. A 15-20% performance hit in a situation a web-facing server is never going to see doesn't matter when you consider that the 'faster' method is 80% slower (or worse) in lots of real world scenarios.
I'll be interested to see how the superpoll approach ends up working, but my first impression is 'more complexity, not much more benefit'.
Yes, but where's the evidence of what people see for active/total ratios in the real world? I'm showing that unless it's below about 60% (probably more like 50%), poll is the way to go.
60% active isn't entirely unrealistic at all. I can see quite a few servers hitting those thresholds, so in those cases, poll vs. epoll doesn't matter.
I think what's more important in what I'm finding is that you really need both. It's entirely possible that you have servers that are at 80-90% ATR all the time. Others that are 10% ATR. The key is either you have to measure that, which nobody does, or you have to make a server that can adapt.
Yes Zed, where the fuck is it? You're claiming SCIENCE! based on your worst-case synthetic localhost benchmarks, and then turning around and wildly guessing at real-world performance characteristics with internet latencies.
Worse, your whole thesis hinges on ATR, but you made no effort to measure it anywhere; instead you're passive-aggressively berating us to do it.
I'd be curious if you have any evidence that this occurs in practice. Even for a busy server with clients of uniformly low latency, I'd intuitively expect fairly low ATRs.
> I think what's more important in what I'm finding is that you really need both.
I'm not sure you do: the performance advantage of poll seems marginal at best. When ATR is high, you're presumably doing enough real work that the slight overhead of epoll vs. poll is probably not super important.
If a site gets spiked with the typical 'read-and-leave' traffic a link from reddit or huffpo or wherever generates, how does superpoll compare to straight epoll? Based on your description so far, I can only see it hurting - you're not just wasting time on dead connections in your poll bin, you're now also incurring the overhead of managing the migration over to the epoll bin.
What exactly is the definition of an "active" file descriptor in this context?
My best guess after reading the man pages is that poll() takes an array of file descriptors to monitor and sets flags in the relevant array entries, which your code then needs to scan linearly for changes, whereas epoll_wait() gives you an array of events, thus avoiding checking file descriptors which haven't received any events. Active file descriptors would therefore be those that did indeed receive an event during the call.
EDIT: thanks for pointing out Zed's "superpoll" idea. I somehow completely missed that paragraph in the article, which makes the following paragraph redundant.
If this is correct, it sounds to me (naive as I am) as if some kind of hybrid approach would be the most efficient: stuff the idling/lagging connections into an epoll pool and add the pool's file descriptor to the array of "live" connections you use with poll(). That of course assumes you can identify a set of fds which are indeed most active.
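That kind of nesting is possible because an epoll instance is itself a pollable file descriptor. A minimal sketch of the idea on Linux, using Python's select module with pipes standing in for connections (the hot/idle split here is purely illustrative):

```python
import os
import select

hot_r, hot_w = os.pipe()      # a "live" connection, watched by poll directly
idle_r, idle_w = os.pipe()    # a "lagging" connection, parked in an epoll pool

ep = select.epoll()
ep.register(idle_r, select.EPOLLIN)

p = select.poll()
p.register(hot_r, select.POLLIN)
p.register(ep.fileno(), select.POLLIN)   # the whole epoll pool costs one poll slot

# Data arriving on the parked fd makes the epoll fd itself readable,
# so it surfaces through the ordinary poll() call.
os.write(idle_w, b"x")
events = dict(p.poll(100))
parked_ready = [fd for fd, _ in ep.poll(0)] if ep.fileno() in events else []
```

The appeal is that an arbitrary number of slow connections collapses into a single entry in the poll array.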
The difference between poll and epoll is that, given an input of N file descriptors, poll hands you back the whole array of N file descriptors and you need to loop through each one of them to check whether its 'active' flag is set. epoll returns just the active file descriptors, so you don't need to loop through the inactive ones.
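Roughly, in code (Linux, Python's select module; note that Python's poll wrapper does the revents scan for you in C, but the per-call cost over all N registered fds is still paid in the kernel):

```python
import os
import select

pipes = [os.pipe() for _ in range(4)]
os.write(pipes[2][1], b"x")           # exactly one fd is "active"

# poll(2): the interest set is handed to the kernel on every call,
# and the kernel examines all N descriptors each time.
p = select.poll()
for r, _ in pipes:
    p.register(r, select.POLLIN)
poll_ready = p.poll(0)                # [(fd, eventmask), ...]

# epoll(7): the interest set lives in the kernel; epoll_wait only
# reports the ready descriptors, without rescanning the whole set.
ep = select.epoll()
for r, _ in pipes:
    ep.register(r, select.EPOLLIN)
epoll_ready = ep.poll(0)
```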
A hybrid approach, as Zed has suggested, would appear to be more efficient on the surface. It remains to be seen whether it can actually be implemented efficiently because migrating fds from/to epoll is extremely expensive, requiring a single syscall per fd.
But if you ask me, the real solution is to have the kernel team fix their epoll implementation performance issues instead of forcing people to work around it with hybrid approaches. Other than the stupid single-syscall-per-fd requirement, there's nothing in epoll's interface that would force it to perform worse than poll when the active/total ratio is high.
> But if you ask me, the real solution is to have the kernel team fix their epoll implementation performance issues instead of forcing people to work around it with hybrid approaches.
That does indeed sound like a better conclusion.
> Other than the stupid single-syscall-per-fd requirement, there's nothing in epoll's interface that would force it to perform worse than poll when the active/total ratio is high.
I don't see a reason why the syscall-per-fd couldn't easily be replaced/augmented with a single mass add/remove syscall which takes an array. The worse performance seems similarly baffling; it almost sounds as if they had some kind of inefficient data structure holding the file descriptor pool; considering poll() uses a flat array and epoll uses set operations I assume it's pretty tricky to make it perform well, even with a hash table. Maybe set operations aren't the best way to handle this data structure; but only some profiling in the kernel code can tell us that.
Obviously it'll take until 2.6.37 at least for any changes to enter the mainstream kernel, and until then a hybrid approach sounds sensible for those unwilling to patch. But still, fixing the root problem seems like a worthwhile cause.
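The mass add/remove interface suggested above doesn't exist in the mainline kernel (epoll_ctl takes one fd per call), but its shape can be sketched as a hypothetical userland wrapper. Each entry in the batch still costs a syscall today, which is exactly what a kernel-side version would amortize:

```python
import os
import select

def epoll_ctl_batch(ep, ops):
    """Hypothetical batch interface: ops is a list of
    ('add' | 'mod' | 'del', fd, eventmask) tuples. Today this loop is
    one epoll_ctl syscall per entry; a kernel version taking the whole
    array would make it one syscall total."""
    for op, fd, mask in ops:
        if op == 'add':
            ep.register(fd, mask)
        elif op == 'mod':
            ep.modify(fd, mask)
        else:
            ep.unregister(fd)

ep = select.epoll()
pipes = [os.pipe() for _ in range(3)]
epoll_ctl_batch(ep, [('add', r, select.EPOLLIN) for r, _ in pipes])
os.write(pipes[0][1], b"x")
ready = [fd for fd, _ in ep.poll(0)]
```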
    active_fds = poll(big_ass_array_of_fds, total_fds)
epoll is slightly different but the same concept. You have a total number of FDs you want to know about, and each call returns the number that have had activity.
And that's it. You then just compute active_fds/total_fds and that gives the ATR. If this is < 0.6 after your call to poll, then that call would have been better done with epoll. If active_fds/total_fds is > 0.6, then it's better to stick with poll.
Of course, it's more complicated than that, but this gives you a simple metric of the break point where one is better than another.
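The ATR computation spelled out, with pipes standing in for sockets (the 0.6 cutoff is the article's measured break point, not a universal constant):

```python
import os
import select

def poll_with_atr(p, total_fds, timeout_ms=0):
    """One poll cycle, returning the ready fds and that cycle's ATR."""
    active = p.poll(timeout_ms)
    return active, len(active) / total_fds

pipes = [os.pipe() for _ in range(10)]
p = select.poll()
for r, _ in pipes:
    p.register(r, select.POLLIN)
for _, w in pipes[:7]:                 # make 7 of the 10 fds active
    os.write(w, b"x")

active, atr = poll_with_atr(p, len(pipes))
# Here atr is 0.7 -- above the ~0.6 break point, so by the article's
# measurements this cycle was cheaper under poll than under epoll.
```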
To put it another way: if you were to use blocking IO, then an operation on an active descriptor would not block. Of course poll and epoll are all about asynchronous IO (so non-blocking by definition), but that's a good way to describe the difference.
Zed's 'superpoll' is precisely what you suggest.
> Zed's 'superpoll' is precisely what you suggest.
Facepalm. Thanks, I mysteriously missed that part of the article.
That's just plain wrong. Premature optimisation does not refer to having to measure before you optimise, it refers to optimising things that in practice may have little or no effect on the actual performance of the program.
By doing these tests in isolation instead of while running on a profiling kernel under production load it is very well possible that the bottleneck will not be the polling code at all but something entirely different. I'd say that this is a textbook example of what premature optimisation is all about.
Assuming you have a finite budget of time to spend on a project, any optimisations that take time out of that budget which could have been spent more effectively elsewhere are premature.
Now there is a chance that this would have been the bottleneck in the completed system, but before you've got a complete system you can't really tell. My guess, based on real-world experience with lots of system-level code that used both (web servers, video servers, streaming audio servers and so on), is that the overhead of poll/epoll will be relatively minor compared to other parts of the code and the massive amount of IO that typically follows a poll or epoll call.
If you have 10K sockets open then typically poll/epoll will return a large number of 'active' descriptors, you'll then be doing IO on all of those for that single call to poll/epoll.
Each of those IO calls is probably going to be as much or more work to process than the poll call was.
Maybe Java does some of this cool stuff already so perhaps I'm shielded from the pain of dealing with things directly.
In the past I've written Java NIO code that dealt with around 60,000 concurrent connections pretty well. The time spent doing poll seemed to be completely insignificant. CPU usage was negligible.
It'd be good to see some numbers though - for example:
For the average Mongrel application, 40% of CPU time is spent in poll, or an average of 30ms of latency is due to poll, etc.
But I'm skeptical those numbers are true. That was my point.
If you don't start with those numbers and measurements, optimizations like this, whilst interesting, may end up being of no real use to anyone.
You're probably right that when you actually use Mongrel2 as your app server your app-specific code higher up will be a larger bottleneck, but that's code that you have to deal with and this is code that he has to deal with so optimizing the hell out of it doesn't sound like a bad idea.
That's 0.5% of your total time being spent here. So even if it's made twice as fast, your app will only speed up from, say, 100ms to 99.75ms.
Find the big things that matter and optimize them. Adding extra complexity to small things that don't matter is a recipe for more bugs and more issues.
Fortunately, Zed is the right guy to find this out. I'm certainly looking forward to the results of this--which I bet we'll have an initial answer to by tomorrow.
Depending on what you are doing, you might not even need to track these booleans. For example, on the read side you can ignore read events when you are not interested in reading. When you switch back to read interest, you can read the socket to see if data arrived while you ignored events. A similar strategy can be used on the write side.
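A sketch of that strategy with epoll on Linux: instead of keeping an "interested in reads" boolean and filtering events in userland, drop EPOLLIN from the event mask while you don't care, and re-arm it later:

```python
import os
import select

r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN)

# Stop caring about reads: clear the mask rather than tracking a flag.
ep.modify(r, 0)
os.write(w, b"hello")            # data arrives while we're not listening
ignored = ep.poll(0)             # no wakeup for the ignored fd

# Interest returns: re-arm EPOLLIN; the data that queued up meanwhile
# is reported immediately (level-triggered).
ep.modify(r, select.EPOLLIN)
resumed = ep.poll(0)
```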
I can't even find a reference to his OS configuration and version details that he's developing on, which seems to me like a critical detail.
Today I'm writing up how I ran the tests, releasing all the code, and asking everyone to test my results. I am completely assuming I am wrong, so I'm looking for other people to test it.
Incidentally, if you google for "pipetest.c" you'll see it's kind of the gold standard for this comparison, so if that code is wrong, then the entire assumption that epoll is better needs to be redone.
To make your process scientific, I'd like to suggest you add the following things to the post when you find it convenient:
1. A detailed explanation of your methodology, preferably with source code. This is so we can reproduce the tests. The ability to reproduce your work is a critical part of any process calling itself science.
2. A detailed list of the hardware you used & its deployment. (For reasons listed above).
3. Your raw data should be made available upon request so other people can work it as well.
P.S., aren't you concerned about I/O overhead with your superpoll proposal? It seems like the added resource allocation and the time spent in zeromq is going to eat up the small advantages you gain?
It seems superior to both *poll minions. It would be great if you proved or falsified this thesis as well.
There are probably hordes of people who will be willing to run Mongrel2 on *BSD platforms, precisely because of the performance reasons. And Zed is a famous tinkerer rather than a religious zealot, so very probably he could be interested in checking kqueue as well.
"Why not" is also a good reason for a hacker when he's lacking other reasons.
In the case of poll(), you have to transfer this array of FDs from userland to the kernel each time you call poll(). Now compare this with epoll (let's assume we are using the EPOLLET trigger), where you only have to transfer the file descriptors once.
You might say the copying won't matter, but it will matter when you have a lot of events coming in on the 20k FDs, which eventually leads to calling poll() at a higher rate, hence more copying of data between userland and the kernel (sizeof(struct pollfd) is 8 bytes, so 8 bytes * 20k is about 160 KB each call).
Also, your assumption of EPOLLET is potentially wrong. I think (unproven) that the extra overhead and complexity of using edge triggering correctly makes EPOLLET pointless.
I think it might even be faster, kernel-side. From what I remember of the implementation, both modes have to walk the same list of ready fds, but that list is shorter in edge triggered mode, because they get removed from the list as it goes.
Edge triggered might have more overhead if many fds change between ready/not-ready quickly, but that's quite the wacky situation (and if it has an even distribution, would ensure your ATR is about 0.5, so probably still winning).
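The report-once behaviour is visible from userland: an edge-triggered fd is reported on the not-ready-to-ready transition, not on every wait. A quick check on Linux:

```python
import os
import select

r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN | select.EPOLLET)

os.write(w, b"x")
first = ep.poll(0)    # the not-ready -> ready edge is reported once
second = ep.poll(0)   # fd is still readable, but there was no new edge
```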
Just kidding. It's always nice to see science in action. Great work! I suspect there's an impact on ZeroMQ's own poll/epoll strategy.
Of course, that's often Kinda Hard To Do (tm). ;-)
So if you have a thousand fds, and they're all active, you have to deal with a thousand fds, which would make the difference between poll and epoll insignificant (only twice as fast, not even an order of magnitude!)?
This would make the micro-benchmark quite micro! Annoyingly enough, I think that means that the real way to find out would be an httperf run with each backend. A lot more work...
hint: nginx/src/event/modules/ngx_epoll_module.c
Maybe one should learn how to use epoll and, perhaps, how to program? ^_^