Amazon Managed Service for Prometheus

(aws.amazon.com)

168 pointspdelgallego5y ago66 comments

66 comments

slyall5y ago

The pricing just for the ingest seems way off. $0.002 for 10,000 metrics might not seem like much, but even a simple node_exporter will grab 700 metrics every 15 seconds.

That's $24/month just to ingest the cpu/ram/diskspace data from each server. Plus storage and query costs.

At work I have a single r4.xlarge instance handling 1.3 million metrics every 15 seconds. Storage is not clustered but cost is only $500/month. It would cost me $45k/month just for the ingest with the new managed service.
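
For anyone who wants to check the arithmetic above, here is a rough sketch. It assumes a flat $0.002 per 10,000 samples (the rate quoted in this comment, with no volume tiers), a 15-second scrape interval, and a 30-day month; the function name is made up for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month


def monthly_ingest_cost(samples_per_scrape, scrape_interval_s=15,
                        price_per_10k=0.002):
    """Flat-rate ingest cost in USD per month (assumed pricing, no tiers)."""
    samples_per_month = samples_per_scrape / scrape_interval_s * SECONDS_PER_MONTH
    return samples_per_month / 10_000 * price_per_10k


# One node_exporter host exposing 700 metrics per scrape.
node_exporter = monthly_ingest_cost(700)
# The single instance handling 1.3 million metrics per scrape.
big_instance = monthly_ingest_cost(1_300_000)

print(f"node_exporter host: ${node_exporter:,.2f}/month")
print(f"1.3M-metric instance: ${big_instance:,.0f}/month")
```

Under those assumptions the two figures land at roughly $24 and $45k per month, matching the numbers in the comment.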

throwaway3432335y ago

Pricing makes sense if you consider how Amazon operates at this point.

You put basically an MVP out there with abnormal pricing. Your enterprise customers that are drowning in money can start using it, and using that money you can grow your org by hiring more engineers. At this point you start working on adding new features and do cost optimization. Since your whole architecture was designed based on "we have to ship this ASAP", you deliver some real nice cost reduction easily. Then you reflect this to your customers and gain goodwill and good PR.

andrewstuart25y ago

And let's be honest. We all know a company or two that would throw _way_ more than 45k/year at a global metrics solution to handle that volume, and still wind up with a flaming scrap heap. And a promotion or two.

1 more reply

ec1096855y ago

I don’t think any company is drowning in money. Everyone has a budget they are working against. At the end of the day, you can hire an engineer or pay AWS more. It’s all a trade-off.

mchusma5y ago

Their pricing for these managed services used to be "no brainer" (something like the cost of compute only, or maybe a <30% upcharge). Managed airflow was similarly very expensive (maybe 3x the cost). Just not worth it. Bummer.

wpietri5y ago

Yeah, it turns out there's a lot of money to be made from people who don't have a good grasp of the fundamentals. We got a marketing email from Huggingface recently about their ML-models-as-a-service offering: https://huggingface.co/pricing

One of my colleagues asked if it might be better than creating our own infrastructure for that. I ran the numbers for one of our recent jobs, feeding a million tweets to two ML models to see which worked better. That would have cost about $1800 on Huggingface. Using AWS spot instances, it was maybe $25 for us to run ourselves.

Of course, we can do it at that price because we are paying for engineers and plan on classifying enormous amounts of text, so it works out for us. Plenty of other people probably should just use Huggingface. But I can't help looking at that 70x markup and think, "Fuck me? No, fuck you!"

gautamdivgi5y ago

Pricing makes sense for enterprises. Considering that you may need a team (or a part of one) to maintain a self-hosted cluster at possibly 0.995 reliability, do upgrades, manage devs, run all the mandatory security scans, justify why some enterprise scan tool throwing errors isn't an issue, etc. Oh also justify why you need the manpower to do it, at which point your VP will tell you to just use the managed service.

qz25y ago

It doesn’t though. I just did a cost projection on our estate and hiring two engineers to look after it on bare metal VMs is 30% cheaper than using the managed service. Plus it doesn’t require a lot of maintenance so we can use those guys on improving the product as well which actually gives direct customer benefits.

kobalsky5y ago

both google and amazon are insane with their observability services.

we ran away screaming from stackdriver when we saw how costs started piling up.

thank god for prometheus and grafana.

potamic5y ago

That's probably less than your team's payroll budget :) Their positioning is that you can reduce the staff needed to operate and maintain these instances.

pickledish5y ago

14 cents per "query processing minute" sounds like it could add up very fast. Prom queries can get somewhat complex and it's not rare at all IME to have a dashboard making several multi-second queries per load (whether that falls into "you're using Prometheus wrong" being a separate discussion of course)

Edit: The example from their pricing page:

> We will assume you have 1 end user monitoring a dashboard for an average of 2 hours per day refreshing it every 60 seconds with 20 chart widgets per dashboard (assuming 1 PromQL query per widget)... assuming 18ms per query for this example.

Comes out to over $3 per month in query costs. Replace this 1 person with a TV showing the dashboard all day, and the cost jumps to $36, for just one dashboard and (again IME) overly fast query estimates... o.O
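
A quick sketch of that calculation, using only the numbers quoted from the pricing page ($0.14 per query processing minute, 20 widgets, 60-second refresh, 18 ms per query) and assuming a 30-day month. The function name and defaults are mine, not anything from AWS.

```python
def monthly_query_cost(hours_per_day, widgets=20, refresh_s=60,
                       query_s=0.018, price_per_qpm=0.14, days=30):
    """Dashboard query cost in USD per month under the quoted assumptions."""
    refreshes_per_day = hours_per_day * 3600 / refresh_s
    queries_per_month = refreshes_per_day * widgets * days
    processing_minutes = queries_per_month * query_s / 60
    return processing_minutes * price_per_qpm


one_person = monthly_query_cost(hours_per_day=2)   # the pricing-page example
wall_tv = monthly_query_cost(hours_per_day=24)     # dashboard left on all day

print(f"2 h/day viewer: ${one_person:.2f}/month")
print(f"24 h/day TV:    ${wall_tv:.2f}/month")
```

The cost scales linearly with `query_s`, so a dashboard whose queries take 1 second instead of 18 ms would cost about 55 times as much under the same model.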

gravypod5y ago

Does it put any limits on cardinality of metrics? Grafana cloud's offering was absolutely awful for my use cases. They charge per-series so if you have metrics with a "pod=..." label your prices go through the roof.

kasey_junk5y ago

Every managed metrics system will put a limit on cardinality because all mainstream available metrics systems cost more per cardinality to query and store. If they don’t limit that you can assume you or some other customer is going to use up the clusters resources and cause an outage.

Like most metrics systems, under the covers in Prometheus each unique combination of dimensions is the same as a new metric line.
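
To make that concrete, here is a small sketch of how label combinations multiply into series. The metric name and label values are hypothetical, and the product is an upper bound: a real server only creates a series for combinations it has actually observed.

```python
from itertools import product
from math import prod


def series_count(label_values):
    """Upper bound on series for one metric: product of label cardinalities."""
    return prod(len(values) for values in label_values.values())


def enumerate_series(metric, label_values):
    """Yield every possible series identifier for a metric (hypothetical data)."""
    names = sorted(label_values)
    for combo in product(*(label_values[name] for name in names)):
        labels = ",".join(f'{n}="{v}"' for n, v in zip(names, combo))
        yield f"{metric}{{{labels}}}"


# Hypothetical labels: two low-cardinality ones plus a per-pod label.
labels = {
    "method": ["GET", "POST"],
    "status": ["200", "500"],
    "pod": [f"pod-{i}" for i in range(50)],
}

without_pod = {k: v for k, v in labels.items() if k != "pod"}
print(series_count(without_pod))  # series without the pod label
print(series_count(labels))       # series once every pod gets its own label value
print(next(enumerate_series("http_requests_total", labels)))
```

Adding the 50-value `pod` label turns 4 potential series into 200, which is why per-series billing reacts so badly to labels like `pod=...` or `device=...`.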

1 more reply

heliodor5y ago

Plenty has been written about not using the server/container/pod id as a label because it leads to high cardinality which leads to poor performance (cost aside). Time series databases have been purpose-built for certain workloads and you can consider this their weakness.

gravypod5y ago

Plenty has also been written about the bugs/issues that have cropped up that are only visible when inspecting what regions/nodes/cgroups an issue is coming from [0]. My use case wasn't exactly `pod=...` but it was very similar. It was more like `device=...`. Also, for a huge application, it's not uncommon to have 100s or even 1000s of metrics that are important to application health/performance. Constantly saying "do you really need X? It will cost us Y" will lead to an extremely under-monitored application.

[0] - https://cloud.google.com/blog/products/management-tools/sre-...

1 more reply

webo5y ago

I like Weave Cloud’s Prometheus hosting model — it’s per host, which is predictable and forecastable.

edoceo5y ago

Now do six dashboards, 10 widgets each, multiple viewers, 18h/day and one slowish query on each dashboard. Seems like we get to hundred+ pretty quick

bboreham5y ago

Caching means that multiple viewers cost very little extra.

(I am a Cortex maintainer)

pram5y ago

Yeah I dunno about this, and the grafana service. They’re not exactly complicated to run on their own. At this pricing you may as well be on Datadog.

nrmitchi5y ago

I've commented fairly heavily in the related Grafana thread.

Prometheus is a bit of a different story. It does have some operational overhead when you get to a certain point, and scaling it out is not always trivial.

Assuming it works, there is value-add on this one, and the pricing is more in line with active use (ie, a cost+ model, which is more typical of AWS services)

2 more replies

stevekemp5y ago

This seems the more interesting of the two; Grafana is pretty simple to set up and maintain. The harder part is handling the metrics themselves, be it with influxdb, prometheus, or something else.

zander3125y ago

Scaling prometheus across multiple separate Kubernetes clusters is a fking nightmare.

codeduck5y ago

Use Victoria metrics. One lightweight agent per cluster pushing to a centralised metrics store makes it so much easier.

markcartertm5y ago

setting up one Prometheus server is easy. scaling, HA, Metrics retention for more than 3 days not so much.

1 more reply

Thaxll5y ago

Prometheus is not easy to run at scale on the storage side.

pram5y ago

This is all relative but I don't personally think so. Not on EC2+EBS, anyway. Certainly not as difficult as running/scaling an ES or Kafka cluster.

Thaxll5y ago

It's a completely different problem because by default Prometheus does not shard anything, so you're bound to a single instance, whereas ES and Kafka are cluster based.

smcleod5y ago

Out of interest what do you find hard about running ElasticSearch clusters?

In my experience ES has been one of the easiest clustered / highly available and sharded systems I've ever run - especially for how incredibly performant and reliable it is.

I've generally found that beyond right sizing your nodes, indexes and shard configuration - it pretty much just works without ever really having issues.

shitloadofbooks5y ago

Victoria Metrics is an absolutely superb drop in replacement.

jrv5y ago

It's not a drop-in replacement (even though it tries to sell itself as such), it's incompatible in a significant number of ways and throws away part of your data.

5 more replies

0xbadcafebee5y ago

You could say the same about any SaaS based on open source, but people still find it useful

eminence325y ago

From the pricing section:

> AMP counts each metric sample ingested to the secured Prometheus-compatible endpoint. AMP also calculates the stored metric samples and metric metadata in gigabytes (GB), where 1GB is 230 bytes.

Surely that's a typo, right?

biot5y ago

Likely a casualty of copy and paste that left out the superscript formatting. 1GB is 2^30 bytes.

WoahNoun5y ago

Everyone here complaining about the pricing on the managed Grafana and Prometheus services have clearly never worked at a shop using SumoLogic. Log/metric processing/querying is expensive for a reason.

aluminussoma5y ago

I very much dislike Prometheus, but the fact that AWS is offering it as a managed service means I am in the minority. I attribute much of Prometheus' success to the influence of ex-Googlers. They joined other companies, had a lot of clout, and sought out a tool that was similar to what they once used.

I understand that the Google version of Prometheus is deprecated but there is no commercial equivalent.

dilyevsky5y ago

What is in your opinion a better open source alternative to prometheus?

Borgmon was the inspiration for Prometheus, but it was a totally different project, so Prometheus is a complete rewrite.

codeduck5y ago

I feel like a broken record, but we are having great success with Victoria metrics as a drop in replacement.

heliodor5y ago

Promscale looks interesting. Keep the architecture of Prometheus while storing the data in TimescaleDB and using SQL as the query language (together with the TimescaleDB-specific extensions to it). Does anyone actually like PromQL?

notesinafield5y ago

It's nearly weekly that we bump up against the limits of timeseries aggregation. I'd take anything else FOSS at this point.

alexhf5y ago

I don't see any mention of Pushgateway. They'll need to add that or I won't be able to monitor ephemeral jobs.

mchene5y ago

Hey... Marc here from AWS. I'm the PM lead for this service. Thank you for the feedback. Pushgateway is important for our customers and it is a feature we are looking to support as part of our roadmap. For the time being, you can continue to use the Pushgateway as you do today and remote write the metrics to AMP for long term storage and querying!

latchkey5y ago

I just went through the "process" of installing Grafana, Loki, Promtail and Prometheus on an ubuntu box and it is almost like the company behind all of this has gone out of their way to make it hard. It isn't really _that_ difficult to get set up, but it also isn't 'apt install' easy (you really want me to create my own startup scripts?) and required me to build my own documentation on how I installed everything.

john_moscow5y ago

It's almost like the company behind it wants to see some profit after pouring millions of dollars into developing these tools. Except, in 2020 you cannot just have a closed-source easy-to-use documented and supported product with a license fee. Not in the server market, at least. Everything must be free and open-source, and you are expected to make money by offering a hosted service. Except, good luck competing with Big Cloud.

RocketSyntax5y ago

It's extremely worrisome. The incentive to spend your early mornings, nights, and weekends building something awesome to free yourself from corporate life is fading away. They need to institute some kind of royalty program or at least dedicate engineers to helping maintain the projects they make into services.

Almost have to change gears and get into a scientific field that isn't computer science.

rfratto5y ago

One of the Loki maintainers here (though I mostly work on other stuff now). I promise it's not difficult on purpose.

We've put so much effort into optimizing the Kubernetes experience that non-containerized installations haven't been getting as much attention. We'd be thrilled to have system packages for Loki that also set it up as a service, it's just not something we've been able to spend time doing ourselves yet.

jitl5y ago

Honestly I mostly throw out the Debian service definitions anyways - when clustering or interacting with Chef or Ansible or whatever, you end up building a lot of ‘smarts’ around a custom supervisor like Runit or skarnet or systemd

latchkey5y ago

It isn't just loki, but the whole stack. Grafana is the only project mentioned that has a debian installer.

The expectation that someone doing greenfield development is going to jump into k8s just to use the software is kind of weird.

qz25y ago

I’m deploying it (prom, alertmanager, pushgateway, grafana) on native hardware via ansible and it’s not difficult. Not Loki (yet). It’s all just go binaries you fire up with systemd with a single config file.

I find it harder to deploy reliably on kubernetes with persistent volumes etc.

0xbadcafebee5y ago

All of those who have spent their free time contributing to Linux distributions are why 'apt install' is easy. You can contribute too.

latchkey5y ago

As the co-founder of Apache Java and a 20+ year member of the ASF, creator and contributor to hundreds of projects over the years, I think I've contributed enough of my time to OSS. I'm more than happy to let the new kids jump in. Thanks for the 'advice'.

0xbadcafebee5y ago

What percentage of ASF projects use 'apt install' at all? Did the Apache folks themselves make the packages? Should we complain about ASF for not making an 'apt install' for each of their projects?

Of a random sampling of install instructions for different ASF projects, the instructions generally are "1. Install java" then "2. Download this binary" and "3. Run the binary with java". Not quite 'apt install', is it?

AYBABTME5y ago

I wonder how AWS is supporting the development of Prometheus. Are they financing the OSS developers who are spending countless hours dedicated to the project?

bboreham5y ago

AWS is an investor in Weaveworks where the implementation (Cortex) was first created. Weaveworks had two Prometheus maintainers on staff at the time.

In the announcement it says AWS have a commercial relationship with Grafana Labs, where several Prometheus maintainers, community managers, etc. currently work.

(I work for Weaveworks)

vishuk5y ago

Do we know which scalable Prometheus backend they are running? Chronosphere? Thanos?

bmurphy19765y ago

The Grafana blog post mentions Cortex, something I'm not familiar with:

https://grafana.com/blog/2020/12/15/announcing-amazon-manage...

bboreham5y ago

It’s Cortex, though the particular configuration shares a lot of code with Thanos.

(I am a Cortex maintainer)

hagen17785y ago

If you know technical details, are there any metrics cardinality limitations?

bboreham5y ago

There are soft limits _everywhere_, to stop people shooting themselves in the foot. Those can be raised by admins after checking the user knows what they are doing.

I do not know what the practical limits are right now; especially I do not know what size hardware AWS run it on.

If you were to search the Cortex Slack you would find people talking about instances with 100 million series, also people talking about work to improve scalability.

AzzieElbab5y ago

Now that Aws ate the world, can we get some useable gui or consistent cli?

hagen17785y ago

I wonder if it will be possible to migrate your data somewhere else once it becomes too expensive.

backing5y ago

How can I hide all Amazon and Google news on HN? Do you know an alternative to HN without the big tech lobby? Thanks.

joana0355y ago

I'm interested in this too. The more I see AWS dominating every aspect of our life, the more depressed I become.
