A summary for the lazy readers:
* Cloudflare Workers outperforms the rest by a big margin (~80ms avg)
* Fly.io cold starts are not great for the Hono use case (~1.5s avg)
* Koyeb wraps the requests behind Cloudflare to optimize latency (150ms best case)
* Railway keeps the app in one region (80ms best case, ~400ms for the rest)
* Render has some challenges scaling from cold (~600ms avg)
In my opinion, this shows that the platform providers that use Docker containers under the hood (Fly, Koyeb, Railway, Render) only achieve good cold starts by never shutting the app down. The ones that do shut it down can only achieve ~600ms startup times at best. That's... no longer a cold start, right?
>The primary region of our server is Amsterdam, and the fly instances is getting paused after a period of inactivity.
After they configured Fly to run nonstop, it outperformed everyone by 3x. But it seems like they're running the measurement from Fly's infrastructure, which biases the results in Fly's favor.
Also weird that they report p75, p90, p95, p99, but not median.
Looking at Google's SRE book, they use p50, p85, p95, and p99, so it's possible I'm misremembering or that Google uses unusual metrics:
https://sre.google/sre-book/service-level-objectives/#fig_sl...
I'm not aware of P50 ever having been a relevant latency metric. The focus of these latency measurements was always the expected value for most customers, and that means P90-ish.
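For what it's worth, the p50/p90 distinction being argued here is easy to see on a toy distribution: the median ignores the cold-start tail entirely, while p90 is dominated by it. A quick sketch using nearest-rank percentiles (the latency samples below are made up):

```typescript
// Hypothetical latency samples (ms): mostly warm requests, a few cold starts.
const samples = [80, 85, 90, 95, 100, 105, 110, 600, 900, 1500];

// Nearest-rank percentile: the value at position ceil(p/100 * n) in sorted order.
function percentile(xs: number[], p: number): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

console.log(percentile(samples, 50)); // 100  — the typical request
console.log(percentile(samples, 90)); // 900  — what tail users actually see
console.log(percentile(samples, 99)); // 1500 — the worst cold start
```

So both camps have a point: p50 tells you what the platform feels like when warm, p90+ tells you what cold starts cost.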
Is that the case though? AWS is upfront about their Node.js Lambdas being the preferred choice for low-overhead, low-latency, millisecond workloads, and since they also control the runtime, I'd be surprised if they followed the naive path you're implying of just running Node.js instances in a dumb VM.
Hell, didn't AWS just update the way they handle JVM Lambdas to basically not charge for their preemptive starts, making them look as performant as Node.js ones?
But it's not all about latency; a real-world application will be different, for sure!
send us an email at ping@openstatus.dev :)
Something is definitely up with Cloudflare's Johannesburg data centre. On particularly bad days, TTFB routinely reaches 1-3 seconds. Bypassing Cloudflare immediately drops this to sub 100ms.
In the past, I would have emailed support@cloudflare.com, but it seems that this channel is no longer available for free tier users. What is the recommended approach these days for reporting issues such as this?
I don't recall Cloudflare routing requests from their free clients differently than paid ones, but I've read multiple reports of that happening recently. Change in policy, or fallout from something else?
Users in the UK having their traffic sent to Australian Cloudflare DC Workers was quite the round-trip/tromboning-a-go-go...
Nice tip, thanks.
Submitters: If you want to say what you think is important about an article, that's fine, but do it by adding a comment to the thread. Then your view will be on a level playing field with everyone else's: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...
On the other hand, if you want the lowest possible latency and the ability to run normal Linux applications (rather than being confined to the Cloudflare Worker limits, e.g. a maximum worker size of 10MB), something like Fly.io is pretty nice: it's even lower latency than Cloudflare, assuming you keep the machines running, and scaling up/down is relatively quick, although not something you'd generally be doing every few seconds.
E.g. if you add Prisma connecting to Postgres, presumably there's extra latency to create the client. For the Fly app, you have a server reusing the client while it's warm; presumably the Cloudflare Worker recreates the client per request, but I'm not 100% sure on that. How would the latency change then, cold vs warm, and on the other platforms?
https://developers.cloudflare.com/hyperdrive/configuration/h...
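The warm-reuse pattern being asked about can be sketched like this. Everything here is hypothetical: `createClient` is a stand-in for something expensive like `new PrismaClient()` plus a Postgres handshake, not a real API:

```typescript
// Stand-in for an expensive database client (e.g. Prisma + Postgres).
type Client = { query: (sql: string) => string };

let connects = 0; // counts how often the expensive handshake runs

// Hypothetical factory; imagine TCP + TLS + auth happening here.
function createClient(): Client {
  connects++;
  return { query: (sql) => `rows for: ${sql}` };
}

// Module scope survives across requests in a long-lived server process
// (the Fly case), so the handshake cost is paid once. A per-request
// environment pays it on every invocation instead.
let cached: Client | null = null;

function getClient(): Client {
  if (!cached) cached = createClient();
  return cached;
}

// Two "requests": only the first pays the connection cost.
getClient().query("select 1");
getClient().query("select 2");
console.log(connects); // 1
```

Services like Hyperdrive (linked above) move that pooling/handshake cost out of the request path for Workers, which is roughly the managed version of this caching trick.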