> The project returns with significant restructuring of the toolset and Travis CI integration. Fierce battles raged between the Compiled Empire and the Dynamic Rebellion and many requests died to bring us this data. Yes, there is some comic relief, but do not fear—the only jar-jars here are Java archives.
Brilliant!
Also, by its nature EC2 should be terrible for serious benchmarking, since you have no control whatsoever over the infrastructure.
That's why the bare metal results are the only relevant ones. The playing field is fair and stable. Let the battle begin ;-)
I expect the cause is that too much code is being loaded for every request. PHP tears down and rebuilds the world for every request, and the popular frameworks load a lot of code and instantiate a lot of objects for every request.
Very few do. Just look at Symfony's complexity: the devs managed to shove proxies into the IoC container to "lazy load services", so you can inject them into controller class constructors without actually instantiating them until a route handler's controller method is called... Using complex class hierarchies in PHP has a big cost, while "raw PHP" barely does more than execute the underlying C code. To be fair, people using heavy frameworks come for the batteries included first, not really for the speed of the router. And since devs want more batteries, it's unlikely it will get faster.
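For readers unfamiliar with the lazy-service-proxy trick mentioned above, here's a minimal sketch of the idea in Python (the class names are hypothetical; this is not Symfony's actual implementation, just the general technique):

```python
class LazyProxy:
    """Defers constructing an expensive service until first attribute access."""

    def __init__(self, factory):
        self._factory = factory
        self._instance = None

    def __getattr__(self, name):
        # Only called for attributes not found on the proxy itself,
        # i.e. anything belonging to the real service.
        if self._instance is None:
            self._instance = self._factory()
        return getattr(self._instance, name)


class MailerService:
    """Stand-in for a service with expensive setup (connections, config)."""

    def __init__(self):
        self.ready = True

    def send(self, to):
        return f"sent to {to}"


# The container injects the proxy into a controller's constructor;
# MailerService is only built when a handler actually calls a method.
mailer = LazyProxy(MailerService)
result = mailer.send("user@example.com")  # construction happens here
```

The payoff is that a controller can declare a dozen service dependencies while a given route only pays the construction cost for the ones it actually touches.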
Why this is the status quo is something I also question.
And because of the shared-nothing architecture, it scales like a mofo before we have to start jumping through hoops (and 99% of us never reach that scale).
The lack of raw speed is a minor inconvenience that's easily fixed in practice.
Also, I doubt big PHP frameworks are the only ones with a disadvantage that only shows up in these kinds of benchmarks. It's like testing cars solely for straight-line speed.
[1] https://github.com/TechEmpower/FrameworkBenchmarks/pull/1510
* Impressive Dart
* JRuby > MRI (I'd like to see JRuby 9k)
* Padrino, which offers basically everything that Rails does, performs impressively well. [Shameless Plug]
That said, it's still a huge step up from the 1.7 series, and once the team starts knocking out performance problems it should be pretty magnificent.
Does anyone have experience with the Nim web stack? Is it ready for prime time? How much effort is required to create a simple CRUD JSON API?
On a side note, I am really looking forward to comparing Rust and Elixir results in the next round of benchmarks.
I recommend reading the Nim code used for the benchmark [1] [2] [3] [4], and if you like what you see, starting with the Nim tutorial [5].
[0]: https://github.com/transfuturist/outlet
[1]: https://github.com/TechEmpower/FrameworkBenchmarks/blob/mast...
[2]: https://github.com/TechEmpower/FrameworkBenchmarks/blob/mast...
[3]: https://github.com/TechEmpower/FrameworkBenchmarks/blob/mast...
[4]: https://github.com/TechEmpower/FrameworkBenchmarks/blob/mast...
https://github.com/TechEmpower/TFB-Round-10/blob/master/peak...
What am I missing?
The test passes in the preview runs, in the TechEmpower continuous integration tests, and in the EC2 tests, so it's probably some transient error that only occurred in the final bare metal test. Maybe there's a race condition in the Play 2 test scripts that only shows up sometimes.
I've spent a fair bit of time maintaining the Play 2 benchmark tests so it's very frustrating to get no result on the final test. Oh well!
Though I didn't think to check the classpath when I was poking around the TechEmpower github repo. I wonder if a logback.xml slipped in somewhere that's siphoning off stderr to some unknown destination?
https://github.com/TechEmpower/TFB-Round-10/blob/master/peak...
https://github.com/TechEmpower/FrameworkBenchmarks/blob/mast...
Here's the corresponding error from the log in Play[2]. Not sure what else it could be...
[1] https://github.com/TechEmpower/FrameworkBenchmarks/blob/mast...
[2] https://github.com/playframework/playframework/blob/2.2.x/fr...
Consistently orders of magnitude faster than everything else out there in the Python landscape.
That's why benchmarks like these have to be scrutinized. I like looking them over, but in reality they're not apples to apples.
Mono is an implementation of the .NET runtime and framework.
1) Statistics
2) Running Overhead
3) Travis-CI
4) Memory/Bandwidth/Other info
5) Windows
6) IRC
7) Ease of Contributing
1) Currently, the TFB results are not statistically sound in any sense: for each round you're looking at a single data point. EC2 has higher variability in performance, so that one data point is worth less than the bare metal data point. If we re-ran this round, I would expect to see at least a 5 to 10% difference for each framework permutation. See point (2) to understand why we're not yet running 30 iterations and averaging (or something similar).

2) Running a benchmark "round" takes >24 hours, and still (sadly) a nontrivial amount of manpower. It's currently really tough to do lots of previews before an official round, and therefore tough to let framework contributors "optimize" their frameworks iteratively. I'm working on continuous benchmarking over at https://github.com/hamiltont/webjuice - it's a bit early for PRs, but open an issue if you want to chat
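To make the statistics point concrete, the "30 iterations and averaging" idea amounts to reporting a mean and spread instead of a single number. A minimal Python sketch, with made-up RPS figures for one hypothetical framework test:

```python
import statistics

# Hypothetical requests/sec results from repeated runs of one test.
# A real round would collect these automatically across iterations.
runs = [41200, 39800, 42100, 40500, 41700]

mean = statistics.mean(runs)
stdev = statistics.stdev(runs)

# Report mean +/- sample standard deviation instead of one data point,
# so a 5-10% run-to-run swing is visible rather than hidden.
print(f"{mean:.0f} req/s +/- {stdev:.0f}")
```

Even this simple summary would make it obvious when two frameworks' results are within noise of each other.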
3) As you can imagine, our resource usage on Travis-CI is much higher than other open source projects. They have been nothing but amazing, and even reached out to chat about mutual solutions to potentially reduce our usage. Really great team
4) We do record a lot of this using the dstat tool. dstat outputs a huge amount of data, and no one has sent in a PR to help us aggregate that data into something easy to visualize. If you want this info, it's available in raw form in the results GitHub repo.
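For anyone considering that aggregation PR, the core task is just summarizing columns of dstat's CSV output. A toy sketch in Python (the sample data and column names are illustrative; real dstat files have many more columns and a multi-line header that needs skipping):

```python
import csv
import io
import statistics

# A tiny, hypothetical slice of dstat CSV output: CPU user/system/idle.
raw = """usr,sys,idl
12,3,85
45,10,45
80,15,5
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Aggregate per-sample CPU busy time (user + system) into one average.
avg_cpu_busy = statistics.mean(
    float(r["usr"]) + float(r["sys"]) for r in rows
)
print(f"average CPU busy: {avg_cpu_busy:.1f}%")
```

The same pattern (parse, select columns, reduce to mean/max) would apply to the memory, disk, and network columns dstat records during a run.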
5) Sadly, Windows support is struggling at the moment. We need something set up like Travis-CI but for our Windows system. Currently, Windows PRs have to be tested manually, and few of the contributors have either a) time to do it manually in a responsive manner or b) Windows setups (a few do, but many of us don't). Any takers to help set something up? FYI, we have put a ton of work into keeping Mono support just so we can at least verify that changes to the C# tests run and pass verification, but naturally that isn't as nice as having native Windows support.
6) Join us on Freenode at #techempower-fwbm - it's really fun meeting the brilliant people behind the frameworks
7) If I had to pick one big thing that happened between R9 and R10, it would be the drastically reduced barrier to entry. Running these benchmarks requires configuring three computers, which is much harder than something like pip install. Adding Vagrant support that can set up a development environment in one command, or deploy to a benchmarking-ready AWS EC2 environment, has really reduced the barrier to getting involved. Adding Travis-CI made it better - it automatically verifies that your changes check out! Adding documentation at https://frameworkbenchmarks.readthedocs.org/en/latest/Projec... made it even easier. Having a stable IRC community is even better! All of these changes add up to make it easier than ever for someone to get involved.
EDIT: Actually, let me just link everyone to github:
Here are the Windows compatibility issues - https://github.com/TechEmpower/FrameworkBenchmarks/issues?q=...
Here is the specific issue asking for advice on which CI we should use to support Windows: https://github.com/TechEmpower/FrameworkBenchmarks/issues/10...
They are also single-node, which is great if your entire system is only ever going to need one machine's worth of capacity (e.g., vertical scaling).
I've noticed a weird trend: Amazon created its various slices of instance types a long time ago, and people's mental models have grown into larger instances far more slowly than Moore's law adds cores. So people will refer to something with 2 cores as "middle-of-the-road" and 32 as "extremely high-end", when in my brain those are "a cell phone" and "a two-year-old server".
This project gives you RPS/latency metrics for many frameworks, on a few hardware setups. This enables a rough comparison of "how does my framework perform relative to all these other well-known or established frameworks". Naturally, the comparison is not perfect - there are a ton of reasons that measuring just requests/sec and latency doesn't allow a complete comparison between two frameworks. However, once you accept that it is basically impossible to fully compare any two frameworks using just quantitative methods, and that these numbers should inform your choice of framework (rather than totally control it), we can talk about why it's valuable.
Want to run a low-cost server in a language X that you happen to love? This project can provide guidance about which frameworks written in language X are performing the best. Want to ensure your service can support 50k requests per second without latency suffering? This project provides latency numbers you can examine to see which frameworks appear to maintain acceptable latency even under high load.
If you wanted to, you could re-create this project by running ab against 100+ frameworks - that's the cornerstone of what is happening here. Granted, we currently use https://github.com/wg/wrk instead of ab, but the principle is the same: start up a framework, run load generation, capture result data. Most of the codebase is dedicated to ensuring that these 100+ frameworks don't interfere with each other, setting up pseudo-production environments with separate application, database, and load-generation servers, and other concerns that have to be addressed.
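That start-framework / run-load / capture-results loop can be sketched in a few lines of Python. The commands below are placeholders, not the project's real scripts, and a real harness would also handle warmup, timeouts, and result parsing:

```python
import subprocess

# Hypothetical per-framework commands; the real project drives these
# through its own setup scripts and wrk configurations.
frameworks = {
    "examplefw": {
        "start": ["./start_server.sh"],
        "bench": ["wrk", "-t8", "-c256", "-d30s", "http://server:8080/json"],
    },
}


def run_round(name):
    """Start one framework, run the load generator, return raw output."""
    cfg = frameworks[name]
    server = subprocess.Popen(cfg["start"])  # start the framework's server
    try:
        result = subprocess.run(
            cfg["bench"], capture_output=True, text=True
        )  # generate load and capture wrk's stdout
        return result.stdout
    finally:
        server.terminate()  # tear down so the next framework starts clean
```

The "don't interfere with each other" work the comment describes is essentially everything that `finally` block hand-waves over: killing stray processes, resetting databases, and isolating ports between tests.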
Over time, this project has started to collect more statistics than just requests/second and latency, which makes it more valuable than simply running ab. As more metrics and more frameworks are added, this becomes a really valuable project for understanding how frameworks perform relative to one another.