there are a lot of good questions, and some confusion, in this thread. here is my view. note: i'm definitely biased; i'm the co-founder/ceo at grafana labs.
- at grafana labs we are huge fans of prometheus. it has become the most popular metrics backend for grafana. we view cortex and prometheus as complementary. we are also very active contributors to the prometheus project itself. in fact, cortex vendors in prometheus.
- you can think of cortex as a scale-out, multi-tenant, highly available "implementation" of prometheus itself.
- the reason grafana labs puts so many resources into cortex is that it powers our grafana cloud product (which offers a prometheus backend). like grafana itself, we are also actively working on an enterprise edition of cortex that is designed to meet the security and feature requirements of the largest companies in the world.
- yes, cortex was born at weaveworks in 2016. tom wilkie (vp of product at grafana labs) co-created it while he worked there. after tom joined grafana labs in 2018, we decided to pour a lot more resources into the project, and managed to convince weave.works to move it to the cncf. this was a great move for the project and the community, and cortex has come a long long way in the last 2 years.
once again, a big hat tip to everyone who made this release possible. a big day for the project, and for prometheus users in general!
[edit: typos]
> Local storage is explicitly not production ready at this time.
https://cortexmetrics.io/docs/getting-started/getting-starte...
But I want a scale-out, multitenant implementation of Prometheus with local storage that's ready for prod. What are my options then? VictoriaMetrics?
Having said that, both Thanos and Cortex have experimental local-storage modes that are pretty good. You could try them for now while they work toward production readiness.
The local Cortex storage works pretty well, but we have a very high bar for production worthiness. Right now I'd recommend using Bigtable or DynamoDB, and if you're on-premise, Cassandra. In the future the block storage will allow you to run MinIO.
The former means free of charge, or expressing praise or a compliment.
The latter means that disparate things go well together and enhance each other's qualities.
Can you outline how Cortex differs from some of the other available Prometheus backends?
[1] https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/FAQ#...
[2] https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/FAQ#...
- The OSS product
- The Storage Format (I guess)
- The Interface for pulling metrics (https://github.com/OpenObservability/OpenMetrics)
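For readers unfamiliar with that pull interface: it is a plain-text exposition format that targets expose over HTTP and Prometheus scrapes. An illustrative (made-up) sample of what a scrape returns:

```text
# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="400"} 3
```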
I haven't dug into Cortex even a little, but the other comments suggest it's API compatible, and they're essentially claiming it's production ready because it gives you things the OSS project won't give you out of the box, e.g. long-term storage and RBAC.
Looks like a good thing.
No! Prometheus is and has been production ready for many years. Cortex is a clustered/horizontally scalable implementation of the Prometheus APIs, and Cortex has just gone production ready. Sorry for the confusion.
However, Prometheus can use different storage backends. The TSDB that it comes with is horrible.
I mean, it's workable. And can store an impressive amount of data points. If you don't care about historical data or scale, it may be all you need.
However, if your scale is really large, or if you care about the data, it may not be the right solution, and you'll need something like Cortex.
For instance, Prometheus' own TSDB has no 'fsck'-like tool. From time to time, it runs compaction operations. If your process (or pod in K8s) dies during one, you may be left with duplicate time series. And now you have to delete some (or a lot!) of your data to recover.
Prometheus documentation, last I checked, even says it is not suitable for long-term storage.
The docs say Prometheus is not intended for long-term storage because, without a remote_write configuration, all data is persisted locally, so you will eventually hit limits on how much can be stored and queried on one node. However, that is a limitation of how Prometheus is designed, not of how the TSDB is designed, and it can be overcome by using a remote_write adapter.
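As a sketch, that remote_write configuration is a couple of lines in prometheus.yml; the endpoint below is a placeholder for whatever adapter or backend you run:

```yaml
# prometheus.yml (fragment) -- forward samples to long-term storage.
remote_write:
  - url: "https://long-term-store.example.com/api/v1/write"
```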
The TSDB in Prometheus since 2.0 is excellent for its use case.
CNCF's Cortex v1.0: scalable, fast Prometheus API implementation ready for prod (grafana.com)
saves 1 char.
It's like looking at the module interdependencies of a reasonably large piece of software; of course it's going to look complicated.
[1] https://cortexmetrics.io/docs/configuration/single-process-c...
If you're looking at scaling your Prometheus setup, also check out VictoriaMetrics.
Operational simplicity and scalability/robustness are what drive me to it.
I use it to send metrics from multiple Kubernetes clusters: each cluster runs Prometheus with a remote_write directive that sends metrics to a central VictoriaMetrics service.
That way my "edge" Prometheus installations are practically "stateless" and easily set up using prometheus-operator. You don't even need to add persistent storage to them.
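A minimal sketch of that setup, assuming prometheus-operator's Prometheus custom resource and VictoriaMetrics' usual remote-write endpoint; the names and hostname are made up:

```yaml
# Edge-cluster Prometheus (fragment): no persistent volume claimed;
# everything is forwarded to a central VictoriaMetrics instance.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: edge
spec:
  remoteWrite:
    - url: "http://victoriametrics.central.example:8428/api/v1/write"
```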