"Business should focus on its core competency"
* Outsource in-house infra to cloud. This begets lock-in as every engineer is doing heaven knows what with Lambda. Still need that huge infra team to manage AWS.
* Outsource in-house metrics and visibility to SignalFx, Splunk, DataDog, NewRelic, etc. Still need a team to manage it. Costs more than double because we're beholden, so now we need to fund 20+ engineer-quarters to migrate everything ASAP.
* Feature flagging system built in house works like a charm and needs one engineer for maintenance. Let's fund a team to migrate it all to LaunchDarkly. Year+ later and we still don't have proper support or rollout and their stuff doesn't work as expected.
Madness.
Expensive madness.
SaaS won't magically reduce your staffing needs. Open source solutions won't reduce your staffing needs either, but they'll make costs predictable. As these tools become more prevalent and standard, you can even hire experts for them.
For anyone out there in the same spot, I'll say that I switched my last company to Atlassian's OpsGenie and it was a 10x cost savings for the same feature set.
If cost is the only measure: I understand. But time lost in various areas of the software package (performance alone! before we even get into weird UX paradigms, esoteric query languages, shoddy search systems, etc.) surely has an impact on cost. Having your employees spend a lot of time navigating janky software has a cost too.
I'm generally not a fan of Atlassian anything, but Opsgenie was really good before Atlassian purchased them and they're still really good.
On the flip side in the 8ish months I used Opsgenie I saw a litany of issues, like the mobile app silently logging people out (and thus not delivering push notifications) and the app failing to send SMS notifications. We had pages get dropped by the primary because they never got a notification.
I want exactly one feature from my pager, delivering 100% of my pages. It's not a situation where I'm ok with 99% success rate, and that seemed to be the tradeoff and what you are paying for with Pagerduty.
I really hope this project gets good enough to ditch PD. PD should literally lay off most of its staff and just maintain the existing product, cut costs and focus mostly on integrations. There is no way they have any other future.
For example, the auto-merge functionality is somewhat useful, but it sometimes gets it wrong. Both merging and splitting alerts are extremely clunky. I'd also love:
* Policies like "this alert is low-priority outside of business hours".
* The ability to re-open an accidentally closed alert.
* The ability to edit the alerting rules (the Global Ruleset) without needing to be a super-admin, since the Powers That Be are reluctant to hand that out.
* Failing that, the ability to just read the Global Ruleset with my peon privileges.
… the sort of spit & polish that doesn't happen, IMO, anymore, because everything is an MVP feature, before the agile scrum PM moves on to the next MVP feature. "Polish" just accumulates in the JIRA graveyard.
Matvey Kukuy, ex-CEO of Amixr and head of the OnCall project here. We've been working hard for a few months to make this OSS release happen. I believe it should make incident response features (on-call rotations, escalations, multi-channel notifications) and best practices more accessible to the wider audience of SRE and DevOps engineers.
I hope some of you will finally be able to sleep well at night, sure that OnCall will handle escalations and alert the right person :)
Please join our community on GitHub! The whole Grafana OnCall team is there to help you and to make this thing better.
[0] Spoiler alert: it's not.
Seems like the /main is the culprit.
Is that a net positive?
It doesn't have anything to do, of course, with the fact that this morning we suddenly found that all our dashboards had stopped working because we were upgraded to Grafana v9, for which there is neither a stable release nor documentation of the breaking changes.
Luckily they rolled back our account.
I just wouldn't mind being the last to upgrade to a newer version :)
AGPL 3.0
Is there some hidden complexity or is it just a consequence of engineers building a product for other engineers? Also, any tips what worked for you?
Of course, in a few months we may have some new people having joined, some quit, or other circumstances. A single misclick when fixing that can invalidate the whole schedule and generate another. Infuriating.
Or the UI itself; it might have become better in the last two years, but having to click "next week" tens of times to see when I was scheduled (since I wasn't just interested in my next scheduled time but in all of them) was annoying.
I'm curious why/if this architecture was chosen. I get that it started as a standalone product (Amixr), but in the current state it is hard to rationalize deploying this next to Grafana in my current containerless setting.
Helm (https://github.com/grafana/oncall/tree/dev/helm/oncall), docker-composes for hobby and dev environments.
Besides deployment, there are two main priorities for OnCall architecture: 1) It should be as "default" as possible. No fancy tech, no hacking around 2) It should deliver notifications no matter what.
We chose the most "boring" (no offense Django community, that's a great quality for a framework) stack we know well: Django, Rabbit, Celery, MySQL, Redis. It's mature, reliable, and allows us to build a message bus-based pipeline with reliable and predictable migrations.
It's important for such a tool to be based on a message bus because it should have no single point of failure. If a worker dies, another will pick up the task and deliver the alert. If Slack goes down, you won't lose your data: OnCall will continue delivering to other destinations and will deliver to Slack once it's back up.
The architecture you see in the repo has been live for 3+ years now. We were able to perform a few hundred data migrations without downtime, and had no major outages or data loss. So I'm pretty happy with this choice.
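The "deliver to other destinations even while one is down" property described above can be illustrated with a toy in-memory sketch. The class and destination names here are invented for illustration; the real OnCall pipeline does this with Celery tasks and RabbitMQ, not with in-process queues:

```python
# Toy fan-out-with-retry bus: each destination gets its own queue, so one
# destination being down neither blocks nor loses deliveries to the others.
from collections import deque

class NotificationBus:
    def __init__(self, destinations):
        # One independent queue per destination -> no single point of failure.
        self.queues = {name: deque() for name in destinations}
        self.delivered = {name: [] for name in destinations}

    def publish(self, alert):
        for q in self.queues.values():
            q.append(alert)

    def drain(self, senders):
        """Attempt delivery; failed sends stay queued for the next drain."""
        for name, q in self.queues.items():
            retry = deque()
            while q:
                alert = q.popleft()
                try:
                    senders[name](alert)
                    self.delivered[name].append(alert)
                except ConnectionError:
                    retry.append(alert)  # keep it, redeliver later
            q.extend(retry)

bus = NotificationBus(["slack", "sms"])
bus.publish("disk full on db-1")

slack_up = False
def send_slack(alert):
    if not slack_up:
        raise ConnectionError("Slack is down")

bus.drain({"slack": send_slack, "sms": lambda alert: None})
# SMS got the page; the Slack copy is still queued, not lost.
slack_up = True
bus.drain({"slack": send_slack, "sms": lambda alert: None})
# Now Slack has it too, and the queues are empty.
```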
To be fair, even in its current form, it should be possible to operate this system with SQLite (i.e. no DB server) and in-process Celery workers (i.e. no RabbitMQ) if configured correctly, assuming they're not using MySQL-specific features in the app.
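A minimal sketch of that configuration, using the stock Django/Celery setting names. Whether OnCall itself tolerates this (i.e. avoids MySQL-specific features, as the comment assumes) is an open question, so treat this as illustrative:

```python
# Hypothetical Django settings fragment for a "no extra servers" deployment.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",  # no MySQL server
        "NAME": "oncall.sqlite3",
    }
}

# Run Celery tasks synchronously in-process instead of via RabbitMQ workers
# (assumes the app loads settings with the CELERY_ namespace).
CELERY_TASK_ALWAYS_EAGER = True
CELERY_TASK_EAGER_PROPAGATES = True  # surface task exceptions immediately
```

Note that eager mode trades away exactly the at-least-once delivery guarantees discussed elsewhere in this thread, so it only fits small or hobby deployments.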
Using a message bus, a persistent data store behind a SQL interface, and a caching layer are all good design choices. I think the OP's concern is less with your particular implementations, and more with the principle of preventing operators from bringing their own preferred implementation of those interfaces to the table.
They mentioned that it makes sense because you were a standalone product, so stack portability was less of a concern. But as FOSS, you're opening yourself up to different standards on portability.
It requires some work from the maintainer to make the application tolerant to different fulfillments of the same interfaces. But it's good work. It usually results in cleaner separation of concerns between application logic and caching/message bus/persistence logic, for one. It also allows your app to serve a wider audience: for example, those who are locked in to using Postgres/Kafka/Memcached.
I get why you would add a relational DB to the mix. Personally, I'd like a Rabbit-free option.
Sorry but why is rabbitmq really necessary?
https://prometheus.io/docs/introduction/overview/#architectu...
When comparing to OnCall you need OnCall AND still the rest of that Prometheus picture.
Compare with this picture where everything in the leftmost Alert Detection box is what you see in the Prometheus architecture. https://grafana.com/docs/oncall/latest/getting-started/
MySQL is a weird if not slightly disturbing choice. Other than that it's a boring, battle-tested stack that is relatively easy to scale. I agree that Go is nicer, but I'm biased by several years of dealing with horrific Flask / Django projects.
I think you misspelled "beautiful"
Complexity comes at a steep price when something critical (e.g. OnCall) breaks and you have to debug it in a hurry.
Shoving everything in a container and closing the lid does not help.
IMHO, OP just stated that one could solve this with fewer dependencies and have the same (if not a better) result.
Far less scalable, but it is dramatically simpler to deploy. Often gets you surprisingly far though. Would be interesting to know how many monitored integrations could be supported by that flow.
Python(Django)+Redis+[SQLite]
Python(Django)+Postgres
[Compiled Go binary]+[SQLite]
SQLite barely even counts as an architectural dependency TBH :)
Just think about Gitea vs GitLab.
As for python, at least getting a dockerfile helps a lot. Otherwise it's a huge mess to get running, yes.
Python is still a hassle anyways, since the lack of true multithreading means that you often need multiple deployments, which the Celery usage here for instance shows.
Maintaining tens of binaries pulled from random GitHub projects over the years is a nightmare.
(Not to mention all the issues around supply chain management, licensing, phoning home, and so on.)
It even ships in containers along with Docker Compose files and Helm charts, which would suit the deployment use cases of 99% of users. I understand that you're not using containers, but I don't think that's a limitation that many are inflicting upon themselves as of late, and if pressed, installing Docker Compose takes about 5 minutes and you don't have to think about it again.
Except when you need to pin and repin versions to comply with a security policy, which may be why you aren't even running containers in the first place
i love django projects but mysql, celery and rabbitmq -- no thanks.
Redis itself has had solid support for building reliable distributed task streaming for nearly 4 years (Redis Streams consumer groups were introduced with Redis 5.0 in 2018).
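The reason consumer groups are enough for reliable task delivery is their pending-entries mechanism: an entry handed to a consumer stays "pending" until it is acknowledged, so a crashed consumer's work can be claimed by another one. A toy in-memory model of those semantics (the method names mirror the real XADD/XREADGROUP/XACK/XCLAIM commands, but this is an illustration, not a Redis client — real Redis uses string IDs like "1526919030474-0", not integers):

```python
# Toy model of Redis Streams consumer-group semantics.
import itertools

class Stream:
    def __init__(self):
        self.entries = []             # (id, payload), append-only log
        self.cursor = 0               # next entry the group hasn't seen
        self.pending = {}             # id -> consumer currently holding it
        self._ids = itertools.count(1)

    def xadd(self, payload):
        entry_id = next(self._ids)
        self.entries.append((entry_id, payload))
        return entry_id

    def xreadgroup(self, consumer):
        """Hand the next new entry to `consumer`; it becomes pending."""
        if self.cursor >= len(self.entries):
            return None
        entry_id, payload = self.entries[self.cursor]
        self.cursor += 1
        self.pending[entry_id] = consumer
        return entry_id, payload

    def xack(self, entry_id):
        self.pending.pop(entry_id, None)  # delivery confirmed

    def xclaim(self, entry_id, new_consumer):
        """Reassign a stuck pending entry to another consumer."""
        self.pending[entry_id] = new_consumer

s = Stream()
page = s.xadd("page: db-1 is down")
eid, _ = s.xreadgroup("worker-a")   # worker-a takes the page...
# ...worker-a crashes before acking; the page is still pending, so:
s.xclaim(eid, "worker-b")           # worker-b claims and delivers it
s.xack(eid)                         # nothing is lost
```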
A minor note, if anyone from Grafana is around - a bunch of the links on the bottom of the announcement go to a 404.
A quick look at OnCall suggests it is more for managing fired alerts than firing alerts.
Their own screenshot has AlertManager as an alert source.
Having to run alertmanager and configure it in addition to Grafana was bad enough, now you need to run and configure another service if you want some extra functionalities for those alerts? Are they going to keep maintaining acknowledgements and scheduled silence in AlertManager now that OnCall exists? Are we going to have "legacy notifications" in AlertManager when not running OnCall, the same way there are "legacy alerts" in Grafana when updating from Grafana 7 (pre-AlertManager)?
For example, we need to run Loki ourselves, for security / privacy reasons, but wouldn't mind using hosted versions of Tempo, Prometheus and OnCall.
Right now it isn't super-easy to link e.g. self-hosted loki search queries with SaaS-Prometheus.
Do you mind if I ask what isn't super-easy about linking self-hosted loki search queries with SaaS-Prometheus? You should be e.g. able to add a Prometheus data source to your local Grafana (or securely expose your Loki to the internet and add a Loki data source to your Cloud Grafana)
In our particular scenario, we'd probably want to run Loki + Grafana locally, and then hosted Prometheus + hosted Grafana for metrics.
But would be great if we could just tell the two about each other, and under which domains they exist. That way, Prometheus-grafana could construct URLs that linked straight into Loki-grafana (that we host) for e.g. the same interval, or the same label filter (GET params).
But it would only work if I (the end-user) had access to both. That way, we don't have to expose Loki to the internet. But linking would still work.
There are quite a lot of services that do this with GitHub and commits. You can link from e.g. Bugsnag to GitHub by only telling Bugsnag your org and repo names. But Bugsnag won't have read access to GitHub (they also have another integration method which does require access, but that's not the one I'm talking about here).
Those types of "linking into a known URL pattern of another service" integrations are easy to set up and very easy to secure.
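As a sketch of the idea: given only the base URL of the Loki-backed Grafana plus the current label filter and time range, the metrics side could construct a deep link with no API access to the other side at all. The exact Explore URL state format varies by Grafana version, and the hostname and datasource name here are made up, so treat this as illustrative:

```python
# Build a deep link into another Grafana's Explore view from a known URL
# pattern -- no credentials or read access to that Grafana required.
import json
from urllib.parse import quote

def loki_explore_link(base_url, datasource, label_filter, start_ms, end_ms):
    # Grafana Explore encodes its state as URL-encoded JSON in `left=`.
    state = {
        "datasource": datasource,
        "queries": [{"expr": "{%s}" % label_filter}],
        "range": {"from": str(start_ms), "to": str(end_ms)},
    }
    return f"{base_url}/explore?left={quote(json.dumps(state))}"

url = loki_explore_link(
    "https://grafana.internal.example.com",  # hypothetical self-hosted Loki-Grafana
    "Loki",
    'app="checkout"',
    1650000000000,
    1650003600000,
)
```

Only a user who can already reach both Grafanas can follow the link, which is exactly the security property described above.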
I'm annoyed by their license choice.
But apparently when you are grafana everything looks like a dashboard UI?
Joke aside I will have a look but I didn't like the screenshots before already. I like the dashboardy thing for dashboards but otherwise it's not a really good UI system for everything else.
Of course the article isn't much better. It reads like a joke, the joke being that "on-call management" doesn't mean anything.
<<We offered Grafana OnCall to users as a SaaS tool first for a few reasons. It’s a commonly shared belief that the more independent your on-call management system is, the better it will be for your entire operation. If something goes wrong, there will be a “designated survivor” outside of your infrastructure to help identify any issues. >>
They tried to ensure that you use their SaaS offering because they care more about your own good than yourself. So humanist...
I mean, obviously they chose to address the segment of the market they could get more money out of first; I'm not contesting that. But the bit you quoted is low-grade bullshit at best. Hardly award-winning.
For someone at Grafana: I noticed a dead link in the post: https://grafana.com/docs/oncall/main/