there are a lot of good questions, and some confusion, in this thread. here is my view. note: i'm definitely biased; i'm the co-founder/ceo at grafana labs.
- at grafana labs we are huge fans of prometheus. it has become the most popular metrics backend for grafana. we view cortex and prometheus as complementary. we are also very active contributors to the prometheus project itself. in fact, cortex vendors in prometheus.
- you can think of cortex as a scale-out, multi-tenant, highly available "implementation" of prometheus itself.
- the reason grafana labs puts so many resources into cortex is that it powers our grafana cloud product (which offers a prometheus backend). like grafana itself, we are also actively working on an enterprise edition of cortex that is designed to meet the security and feature requirements of the largest companies in the world.
- yes, cortex was born at weaveworks in 2016. tom wilkie (vp of product at grafana labs) co-created it while he worked there. after tom joined grafana labs in 2018, we decided to pour a lot more resources into the project, and managed to convince weave.works to move it to the cncf. this was a great move for the project and the community, and cortex has come a long long way in the last 2 years.
once again, a big hat tip to everyone who made this release possible. a big day for the project, and for prometheus users in general!
[edit: typos]
> Local storage is explicitly not production ready at this time.
https://cortexmetrics.io/docs/getting-started/getting-starte...
But I want a scale-out, multitenant implementation of Prometheus with local storage that's ready for prod. What are my options then? VictoriaMetrics?
Having said that, both Thanos and Cortex have experimental local-storage modes that are pretty good. You could try them for now while they work toward production readiness.
The local Cortex storage works pretty well, but we have a very high bar for production worthiness. Right now I'd recommend using Bigtable or DynamoDB, and if you're on-premise, Cassandra. In the future the block storage will allow you to run MinIO.
The former means free of charge, or expressing praise or a compliment.
The latter means that disparate things go well together and enhance each other's qualities.
Can you outline how Cortex differs from some of the other available Prometheus backends?
[1] https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/FAQ#...
[2] https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/FAQ#...
- The OSS product
- The Storage Format (I guess)
- The Interface for pulling metrics (https://github.com/OpenObservability/OpenMetrics)
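For readers unfamiliar with that pull interface: it is a plain-text exposition format that targets expose over HTTP and Prometheus scrapes. An illustrative (made-up) sample of what a scrape returns:

```text
# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="400"} 3
```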
I haven't dug into Cortex even a little, but the other comments suggest it's API compatible, and they're essentially claiming it's production ready because it gives you things the OSS project won't give you out of the box, e.g. long-term storage and RBAC.
Looks like a good thing.
No! Prometheus is and has been production ready for many years. Cortex is a clustered/horizontally scalable implementation of the Prometheus APIs, and Cortex has just gone production ready. Sorry for the confusion.
However, Prometheus can use different storage backends. The TSDB that it comes with is horrible.
I mean, it's workable. And can store an impressive amount of data points. If you don't care about historical data or scale, it may be all you need.
However, if your scale is really large, or if you care about the data, it may not be the right solution, and you'll need something like Cortex.
For instance, Prometheus' own TSDB has no 'fsck'-like tool. From time to time, it runs compaction operations. If your process (or pod in K8s) dies during one, you may be left with duplicate time series. And now you have to delete some (or a lot!) of your data to recover.
Prometheus documentation, last I checked, even says it is not suitable for long-term storage.
The docs say Prometheus is not intended for long-term storage because, without a remote_write configuration, all data is persisted locally, so you will eventually hit limits on how much can be stored and queried on one node. However, that is a limitation of how Prometheus is designed, not of how the TSDB is designed, and it can be overcome by using a remote_write adapter.
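As a sketch, that remote_write configuration is a couple of lines in prometheus.yml; the endpoint below is a placeholder for whatever adapter or backend you run:

```yaml
# prometheus.yml (fragment) -- forward samples to long-term storage.
remote_write:
  - url: "https://long-term-store.example.com/api/v1/write"
```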
The TSDB in Prometheus since 2.0 is excellent for its use case.
CNCF's Cortex v1.0: scalable, fast Prometheus API implementation ready for prod (grafana.com)
saves 1 char.
It's like looking at the module interdependencies of a reasonably large piece of software; of course it's going to look complicated.
[1] https://cortexmetrics.io/docs/configuration/single-process-c...
If you're looking at scaling your Prometheus setup, also check out VictoriaMetrics.
Operational simplicity and scalability/robustness are what drive me to it.
I use it to send metrics from multiple Kubernetes clusters: each cluster runs Prometheus with a remote_write directive that sends metrics to a central VictoriaMetrics service.
That way my "edge" Prometheus installations are practically "stateless" and easily set up using prometheus-operator. You don't even need to add persistent storage to them.
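A minimal sketch of that setup, assuming prometheus-operator's Prometheus custom resource and VictoriaMetrics' usual remote-write endpoint; the names and hostname are made up:

```yaml
# Edge-cluster Prometheus (fragment): no persistent volume claimed;
# everything is forwarded to a central VictoriaMetrics instance.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: edge
spec:
  remoteWrite:
    - url: "http://victoriametrics.central.example:8428/api/v1/write"
```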