"Business should focus on its core competency"
* Outsource in-house infra to cloud. This begets lock-in as every engineer is doing heaven knows what with Lambda. Still need that huge infra team to manage AWS.
* Outsource in-house metrics and visibility to SignalFx, Splunk, DataDog, NewRelic, etc. Still need a team to manage it. Costs more than double because we're beholden, so now we need to fund 20+ engineer-quarters to migrate everything ASAP.
* Feature flagging system built in house works like a charm and needs one engineer for maintenance. Let's fund a team to migrate it all to LaunchDarkly. Year+ later and we still don't have proper support or rollout and their stuff doesn't work as expected.
Madness.
Expensive madness.
SaaS won't magically reduce your staffing needs. Open source solutions won't reduce your staffing needs either, but they'll make costs predictable. As these tools become more prevalent and standard, you can even hire experts for them.
For anyone out there in the same spot, I'll say that I switched my last company to Atlassian's OpsGenie and it was a 10x cost savings for the same feature set.
If cost is the only measure: I understand. But time lost in various areas of the software package (performance alone! before we even get into weird UX paradigms, esoteric query languages, shoddy search systems, etc.) surely has an impact on cost. Having your employees spend a lot of time navigating janky software has a cost too.
I'm generally not a fan of Atlassian anything, but Opsgenie was really good before Atlassian purchased them and they're still really good.
On the flip side in the 8ish months I used Opsgenie I saw a litany of issues, like the mobile app silently logging people out (and thus not delivering push notifications) and the app failing to send SMS notifications. We had pages get dropped by the primary because they never got a notification.
I want exactly one feature from my pager, delivering 100% of my pages. It's not a situation where I'm ok with 99% success rate, and that seemed to be the tradeoff and what you are paying for with Pagerduty.
I really hope this project gets good enough to ditch PD. PD should literally lay off most of its staff and just maintain the existing product, cut costs and focus mostly on integrations. There is no way they have any other future.
For example, the auto-merge functionality is somewhat useful, but it sometimes gets it wrong. Both merging and splitting alerts are extremely clunky. I'd also love:
* Policies like "this alert is low-priority outside of business hours".
* The ability to re-open an accidentally closed alert.
* The ability to edit the alerting rules (the Global Ruleset) without needing to be a super-admin, since the Powers That Be are reluctant to hand that out.
* Failing that, the ability to just read the Global Ruleset with my peon privileges.
… the sort of spit & polish that doesn't happen, IMO, anymore, because everything is an MVP feature, before the agile scrum PM moves on to the next MVP feature. "Polish" just accumulates in the JIRA graveyard.
Matvey Kukuy, ex-CEO of Amixr and head of the OnCall project here. We've been working hard for a few months to make this OSS release happen. I believe it should make incident response features (on-call rotations, escalations, multi-channel notifications) and best practices more accessible to the wider audience of SRE and DevOps engineers.
I hope some of you will finally be able to sleep well at night, sure that OnCall will handle escalations and alert the right person :)
Please join our community on GitHub! The whole Grafana OnCall team is there to help you and to make this thing better.
[0] Spoiler alert: it's not.
Seems like the /main is the culprit.
Is that a net positive?
It doesn't have anything to do, of course, with the fact that this morning we suddenly found that all our dashboards had stopped working because we were upgraded to Grafana v9, for which there is neither a stable release nor documentation of the breaking changes.
Luckily they rolled back our account.
I just wouldn't mind being the last to upgrade to a newer version :)
AGPL 3.0
Is there some hidden complexity or is it just a consequence of engineers building a product for other engineers? Also, any tips what worked for you?
Of course, in a few months we may have some new people having joined, some quit, or other circumstances. A single misclick when fixing that can invalidate the whole schedule and generate another. Infuriating.
Or the UI itself; it might have become better in the last two years, but having to click "next week" tens of times to see when I was scheduled (since I wasn't just interested in my next scheduled time but in all of them) was annoying.
I'm curious why/if this architecture was chosen. I get that it started as a standalone product (Amixr), but in the current state it is hard to rationalize deploying this next to Grafana in my current containerless setting.
Helm (https://github.com/grafana/oncall/tree/dev/helm/oncall), docker-composes for hobby and dev environments.
Besides deployment, there are two main priorities for OnCall architecture: 1) It should be as "default" as possible. No fancy tech, no hacking around 2) It should deliver notifications no matter what.
We chose the most "boring" (no offense Django community, that's a great quality for a framework) stack we know well: Django, Rabbit, Celery, MySQL, Redis. It's mature, reliable, and allows us to build a message bus-based pipeline with reliable and predictable migrations.
It's important for such a tool to be based on a message bus because it should have no single point of failure. If a worker dies, another will pick up the task and deliver the alert. If Slack goes down, you won't lose your data: OnCall will continue delivering to other destinations and will deliver to Slack once it's back up.
The architecture you see in the repo has been live for 3+ years now. We were able to perform a few hundred data migrations without downtime, and had no major outages or data loss. So I'm pretty happy with this choice.
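The "deliver to other destinations even while one is down" property described above can be illustrated with a toy in-memory sketch. The class and destination names here are invented for illustration; the real OnCall pipeline does this with Celery tasks and RabbitMQ, not with in-process queues:

```python
# Toy fan-out-with-retry bus: each destination gets its own queue, so one
# destination being down neither blocks nor loses deliveries to the others.
from collections import deque

class NotificationBus:
    def __init__(self, destinations):
        # One independent queue per destination -> no single point of failure.
        self.queues = {name: deque() for name in destinations}
        self.delivered = {name: [] for name in destinations}

    def publish(self, alert):
        for q in self.queues.values():
            q.append(alert)

    def drain(self, senders):
        """Attempt delivery; failed sends stay queued for the next drain."""
        for name, q in self.queues.items():
            retry = deque()
            while q:
                alert = q.popleft()
                try:
                    senders[name](alert)
                    self.delivered[name].append(alert)
                except ConnectionError:
                    retry.append(alert)  # keep it, redeliver later
            q.extend(retry)

bus = NotificationBus(["slack", "sms"])
bus.publish("disk full on db-1")

slack_up = False
def send_slack(alert):
    if not slack_up:
        raise ConnectionError("Slack is down")

bus.drain({"slack": send_slack, "sms": lambda alert: None})
# SMS got the page; the Slack copy is still queued, not lost.
slack_up = True
bus.drain({"slack": send_slack, "sms": lambda alert: None})
# Now Slack has it too, and the queues are empty.
```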
To be fair, even in its current form, it should be possible to operate this system with SQLite (i.e. no DB server) and in-process Celery workers (i.e. no RabbitMQ) if configured correctly, assuming they're not using MySQL-specific features in the app.
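A minimal sketch of that configuration, using the stock Django/Celery setting names. Whether OnCall itself tolerates this (i.e. avoids MySQL-specific features, as the comment assumes) is an open question, so treat this as illustrative:

```python
# Hypothetical Django settings fragment for a "no extra servers" deployment.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",  # no MySQL server
        "NAME": "oncall.sqlite3",
    }
}

# Run Celery tasks synchronously in-process instead of via RabbitMQ workers
# (assumes the app loads settings with the CELERY_ namespace).
CELERY_TASK_ALWAYS_EAGER = True
CELERY_TASK_EAGER_PROPAGATES = True  # surface task exceptions immediately
```

Note that eager mode trades away exactly the at-least-once delivery guarantees discussed elsewhere in this thread, so it only fits small or hobby deployments.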
Using a message bus, a persistent data store behind a SQL interface, and a caching layer are all good design choices. I think the OP's concern is less with your particular implementations, and more with the principle of preventing operators from bringing their own preferred implementation of those interfaces to the table.
They mentioned that it makes sense because you were a standalone product, so stack portability was less of a concern. But as FOSS, you're opening yourself up to different standards on portability.
It requires some work from the maintainer to make the application tolerant to different fulfillments of the same interfaces. But it's good work. It usually results in cleaner separation of concerns between application logic and caching/message bus/persistence logic, for one. It also allows your app to serve a wider audience: for example, those who are locked in to using Postgres/Kafka/Memcached.
I get why you would add a relational DB to the mix. Personally, I'd like a Rabbit-free option.
Sorry but why is rabbitmq really necessary?
https://prometheus.io/docs/introduction/overview/#architectu...
When comparing to OnCall you need OnCall AND still the rest of that Prometheus picture.
Compare with this picture where everything in the leftmost Alert Detection box is what you see in the Prometheus architecture. https://grafana.com/docs/oncall/latest/getting-started/
MySQL is a weird if not slightly disturbing choice. Other than that it's a boring, battle-tested stack that is relatively easy to scale. I agree that Go is nicer, but I'm biased by several years of dealing with horrific Flask / Django projects.
I think you misspelled "beautiful"
Complexity comes at a steep price when something critical (e.g. OnCall) breaks and you have to debug it in a hurry.
Shoving everything in a container and closing the lid does not help.
IMHO, OP just stated that one could solve this with fewer dependencies and have the same (if not a better) result.
Far less scalable, but it is dramatically simpler to deploy. Often gets you surprisingly far though. Would be interesting to know how many monitored integrations could be supported by that flow.
Python(Django)+Redis+[SQLite]
Python(Django)+Postgres
[Compiled Go binary]+[SQLite]
SQLite barely even counts as an architectural dependency TBH :)
Just think about Gitea vs GitLab.
As for python, at least getting a dockerfile helps a lot. Otherwise it's a huge mess to get running, yes.
Python is still a hassle anyways, since the lack of true multithreading means that you often need multiple deployments, which the Celery usage here for instance shows.
Maintaining tens of binaries pulled from random GitHub projects over the years is a nightmare.
(Not to mention all the issues around supply chain management, licensing, phoning home, and so on.)
It even ships in containers along with Docker Compose files and Helm charts, which would suit the deployment use cases of 99% of users. I understand that you're not using containers, but I don't think that's a limitation that many are inflicting upon themselves as of late, and if pressed, installing Docker Compose takes about 5 minutes and you don't have to think about it again.
Except when you need to pin and repin versions to comply with a security policy, which may be why you aren't even running containers in the first place
i love django projects but mysql, celery and rabbitmq -- no thanks.
Redis itself has had solid support for building reliable distributed task streaming for nearly 4 years (Redis Streams consumer groups were introduced with Redis 5.0 in 2018).
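The reason consumer groups are enough for reliable task delivery is their pending-entries mechanism: an entry handed to a consumer stays "pending" until it is acknowledged, so a crashed consumer's work can be claimed by another one. A toy in-memory model of those semantics (the method names mirror the real XADD/XREADGROUP/XACK/XCLAIM commands, but this is an illustration, not a Redis client — real Redis uses string IDs like "1526919030474-0", not integers):

```python
# Toy model of Redis Streams consumer-group semantics.
import itertools

class Stream:
    def __init__(self):
        self.entries = []             # (id, payload), append-only log
        self.cursor = 0               # next entry the group hasn't seen
        self.pending = {}             # id -> consumer currently holding it
        self._ids = itertools.count(1)

    def xadd(self, payload):
        entry_id = next(self._ids)
        self.entries.append((entry_id, payload))
        return entry_id

    def xreadgroup(self, consumer):
        """Hand the next new entry to `consumer`; it becomes pending."""
        if self.cursor >= len(self.entries):
            return None
        entry_id, payload = self.entries[self.cursor]
        self.cursor += 1
        self.pending[entry_id] = consumer
        return entry_id, payload

    def xack(self, entry_id):
        self.pending.pop(entry_id, None)  # delivery confirmed

    def xclaim(self, entry_id, new_consumer):
        """Reassign a stuck pending entry to another consumer."""
        self.pending[entry_id] = new_consumer

s = Stream()
page = s.xadd("page: db-1 is down")
eid, _ = s.xreadgroup("worker-a")   # worker-a takes the page...
# ...worker-a crashes before acking; the page is still pending, so:
s.xclaim(eid, "worker-b")           # worker-b claims and delivers it
s.xack(eid)                         # nothing is lost
```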
A minor note, if anyone from Grafana is around - a bunch of the links on the bottom of the announcement go to a 404.
A quick look at OnCall suggests it is more for managing fired alerts than firing alerts.
Their own screenshot has AlertManager as an alert source.
Having to run alertmanager and configure it in addition to Grafana was bad enough, now you need to run and configure another service if you want some extra functionalities for those alerts? Are they going to keep maintaining acknowledgements and scheduled silence in AlertManager now that OnCall exists? Are we going to have "legacy notifications" in AlertManager when not running OnCall, the same way there are "legacy alerts" in Grafana when updating from Grafana 7 (pre-AlertManager)?
For example, we need to run Loki ourselves, for security / privacy reasons, but wouldn't mind using hosted versions of Tempo, Prometheus and OnCall.
Right now it isn't super-easy to link e.g. self-hosted loki search queries with SaaS-Prometheus.
Do you mind if I ask what isn't super-easy about linking self-hosted loki search queries with SaaS-Prometheus? You should be e.g. able to add a Prometheus data source to your local Grafana (or securely expose your Loki to the internet and add a Loki data source to your Cloud Grafana)
In our particular scenario, we'd probably want to run Loki + Grafana locally, and then hosted Prometheus + hosted Grafana for metrics.
But would be great if we could just tell the two about each other, and under which domains they exist. That way, Prometheus-grafana could construct URLs that linked straight into Loki-grafana (that we host) for e.g. the same interval, or the same label filter (GET params).
But it would only work if I (the end-user) had access to both. That way, we don't have to expose Loki to the internet. But linking would still work.
There are quite a lot of services that do this with GitHub and commits. You can link from e.g. Bugsnag to GitHub by only telling Bugsnag your org and repo names. But Bugsnag won't have read access to GitHub (they also have another integration method which does require access, but that's not the one I'm talking about here).
Those types of "linking into a known URL pattern of another service" integrations are easy to set up and very easy to secure.
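As a sketch of the idea: given only the base URL of the Loki-backed Grafana plus the current label filter and time range, the metrics side could construct a deep link with no API access to the other side at all. The exact Explore URL state format varies by Grafana version, and the hostname and datasource name here are made up, so treat this as illustrative:

```python
# Build a deep link into another Grafana's Explore view from a known URL
# pattern -- no credentials or read access to that Grafana required.
import json
from urllib.parse import quote

def loki_explore_link(base_url, datasource, label_filter, start_ms, end_ms):
    # Grafana Explore encodes its state as URL-encoded JSON in `left=`.
    state = {
        "datasource": datasource,
        "queries": [{"expr": "{%s}" % label_filter}],
        "range": {"from": str(start_ms), "to": str(end_ms)},
    }
    return f"{base_url}/explore?left={quote(json.dumps(state))}"

url = loki_explore_link(
    "https://grafana.internal.example.com",  # hypothetical self-hosted Loki-Grafana
    "Loki",
    'app="checkout"',
    1650000000000,
    1650003600000,
)
```

Only a user who can already reach both Grafanas can follow the link, which is exactly the security property described above.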
I'm annoyed by their license choice.
But apparently when you are grafana everything looks like a dashboard UI?
Joke aside I will have a look but I didn't like the screenshots before already. I like the dashboardy thing for dashboards but otherwise it's not a really good UI system for everything else.
Of course the article isn't much better. It reads like a joke, the joke being that "on-call management" doesn't mean anything.
<<We offered Grafana OnCall to users as a SaaS tool first for a few reasons. It’s a commonly shared belief that the more independent your on-call management system is, the better it will be for your entire operation. If something goes wrong, there will be a “designated survivor” outside of your infrastructure to help identify any issues. >>
They tried to ensure that you use their SaaS offering because they care more about your own good than yourself. So humanist...
I mean, obviously they chose to address the segment of the market they could get more money out of first; I'm not contesting that. But the bit you quoted is low-grade bullshit at best. Hardly award-winning.
For someone at Grafana: I noticed a dead link in the post: https://grafana.com/docs/oncall/main/