It sounds like they were in a place a lot of companies are in, where they don't have a single pane of glass for observability. One of the biggest benefits I've gotten out of Datadog, if not the main one, is having everything in one place so that it's all connected and I can easily jump from a trace to its logs, for instance.
One of the terrible mistakes I see companies make with this tooling is fragmenting like this. Everyone has their own preferred tool, and ultimately the collective experience ends up significantly worse than the sum of its parts.
But I've also been in terrible vendor lock-in situations, being bent over the barrel because switching to a better solution is so damn expensive.
At least now with OTel you have an open standard that makes switching easier, but even then I'd rather have two solutions that meet my exact observability requirements than a single solution that does everything OK-ish.
Vendor lock-in is hopefully a thing of the past now with OTel. And now that more observability solutions are going open source, it's hopefully no longer true that a single tool must be mediocre across all use cases: DD and the like are inherently limited by their own engineering teams, whereas OSS products can take community/customer contributions that expand the surface area over time on top of the core maintainers' work.
At SigNoz, we have metrics, traces and logs in a single application, which helps you correlate across signals much more easily. Being natively based on OpenTelemetry makes this correlation easier still, as it leverages the standard data format.
Though this might take some time, as many teams have proprietary SDKs in their code, which are not easy to rip out. OpenTelemetry auto-instrumentation[2] makes it much easier, and I think that's the path people will follow to get started.
[1]https://github.com/SigNoz/signoz [2]https://opentelemetry.io/docs/instrumentation/java/automatic...
It annoys me that logs are overlooked in Honeycomb (and metrics are... fine). But given the choice between a single pane of glass in Grafana, or doing logs (and sometimes metrics) in CloudWatch while spending 95% of my time in Honeycomb, I'd pick Honeycomb every time.
Hopefully someone that works there is reading this.
It seems to miss some aggregation stuff, but it's also improving every time I check. I wonder if anyone's used it in anger yet, and how far it is from replacing Datadog or Honeycomb.
One of the biggest features of AWS, and one that is very easy to take for granted, is Amazon CloudWatch. It supports metrics, logging, alarms, metrics from alarms, alarms from alarms, querying historical logs, triggering actions, and so on, and it covers every service AWS provides, including metaservices like AWS Config and CloudTrail.
And you barely notice it. It's just there, and you can see everything.
> One of the terrible mistakes I see companies make with this tooling is fragmenting like this.
So much this. It's not fun at all to have to go through logs and metrics on any application, and much less so if for some reason their maintainers scattered their metrics emission to the four winds. However, with AWS all roads lead to CloudWatch, and everything is so much better.
Most of my clients are not in the product-market fit for AWS CloudWatch, because most of their developers don't have the development, testing and operational maturity/discipline to use CloudWatch cost-effectively (this is at root an organization problem, but let's not go off onto that giant tangent). So the only realistic tracing strategy we converged upon to recommend for them is "grab everything, and retain it up to the point in time we won't be blamed for not knowing root cause" (which in some specific cases can be up to years!), while we undertake the long journey with them to upskill their teams.
This would make using CloudWatch everywhere rapidly climb into the top three largest line items in the AWS bill, easily justifying spinning up that tracing functionality in-house. So we wind up opting for self-managed tooling like Elastic Observability, or for Honeycomb, where the pricing is friendlier to teams in unfortunate situations that need to start with everything for CYA, much as I would like to stay within CloudWatch.
Has anyone found a better solution for these use cases where the development maturity level is more prosaic, or is this really the best local maximum at the industry's current state of the art?
Some part of the value of Datadog etc is having a single pane of glass over many aws accounts.
The collector (which processes and ships metrics) can be installed in K8S through Helm or an operator, and we just added a variable to our charts so the agent can be pointed at the collector. The collector speaks OTLP which is the fancy combined metrics/traces/logs protocol the OTEL SDKs/agents use, but it also speaks Prometheus, Zipkin, etc to give you an easy migration path. We currently ship to Datadog as well as an internal service, with the end goal being migrating off of Datadog gradually.
I know this isn't a DataDog post, and I'm a bit off topic, but I try to do my best to warn against DD these days.
I looked at the OTel DD stuff and did not see any support for this, fwiw, maybe it doesn't work b/c the agent expects more context from the pod (e.g. app and label?)
If a vendor pulled shit like this on me, that's when I would cancel them. Of course, most big orgs would rather not do the legwork to actually become portable and migrate off a vendor, so they'll just pay the bill.
Vendors love the custom shit they build because they know once it’s infiltrated the stack then it’s basically like gangrene (have to cut off the appendage to save the host)
We're definitely open to doing more consolidation in the future, especially if we can save money by doing that, but from a usability standpoint we've been pretty happy with Honeycomb for traces and Datadog for everything else so far. And, that seems to be aligned with what each vendor is best at at the moment.
https://www.honeycomb.io/pricing
https://www.datadoghq.com/pricing/
Am I wrong to say that having two is "expensive"? Maybe not if 50% of your stuff goes to Honeycomb and 50% to Datadog. Could you save money/complexity (fewer places to look for things) by having just Datadog or just Honeycomb?
Partly this lets us easily re-route & duplicate telemetry, partly it means changes to backend products in the future won't be a big disruption.
For metrics we're mostly a telegraf -> Prometheus -> Grafana Mimir shop: telegraf because it's rock solid and feature-rich, Prometheus because there's no real competition in that tier, and Mimir because of scale and self-hosting options.
Our scale problem means most online pricing calculators generate overflow errors.
Our non-security log destination preference is Loki - for similar reasons to Mimir - though a SIEM it definitely is not.
Tracing goes to a vendor for now, but we're looking to bring that back to Grafana Tempo. Its product maturity is a long way off commercial APM offerings, but the feature set feels about 70% there and is converging rapidly. Off-the-shelf tracing products have an appealingly low cost of entry, which only briefly defers lock-in and pricing shocks.
If you are looking for an open-source backend for OpenTelemetry, you can explore SigNoz[2] (I am one of the founders). We have quite a decent product for APM/tracing, leveraging the OpenTelemetry-native data format and semantic conventions.
[1]https://opentelemetry.io/docs/collector/ [2]https://github.com/SigNoz/signoz
Have you looked at VictoriaMetrics [0] before opting for Mimir?
Luckily the Prometheus exporters have a switch to enable this behaviour, but there's talk of removing that functionality because it breaks the spec. If you send the OpenTelemetry protocol into something like Mimir, you don't have the option of enabling that behaviour unless you use Prometheus remote write.
Our developers aren't a fan of that.
https://opentelemetry.io/docs/specs/otel/compatibility/prome...
They may be trying to address label cardinality, but their approach seems to throw the baby out with the bathwater. The developer experience suffers as a result: from a dev's point of view, resource attributes are added to the metric, yet this relationship isn't carried over when translating to Prometheus metrics.
With the advantage that you get only the specific attributes you want, thus avoiding a cardinality explosion.
https://github.com/open-telemetry/opentelemetry-collector-co...
The transform processor could be useful if they ever deprecate the resource_to_telemetry_conversion flag, but it's still a pain point because it hinders a developer's autonomy and requires a whitelist of labels to be maintained on the collectors by another team.
Have encountered this a lot from teams attempting to use the metrics SDK.
Are you open to commenting on specifics here, and on what kind of shim you had to put in front of the SDK? It would be great to keep gathering feedback so that we as a community have a good idea of what remains before the SDK can be used in anger for real-world production use cases. Just wiring up the setup in your app used to be fairly painful, though that has gotten somewhat better over the last 12-24 months. I'd also love to hear what is currently causing compatibility issues with the metric types themselves that requires a shim, and what the shim is doing to achieve compatibility.
Our main issue was the lack of a synchronous gauge. The officially supported asynchronous API of registering a callback function to report a gauge metric is very different from how we were doing things before, and would have required lots of refactoring of our code. Instead, we wrote a wrapper that exposes a synchronous-like API: https://gist.github.com/yolken-airplane/027867b753840f7d15d6....
It seems like this is a common feature request across many of the SDKs, and it's in the process of being fixed in some of them (https://github.com/open-telemetry/opentelemetry-specificatio...)? I'm not sure what the plans are for the golang SDK specifically.
Another, more minor issue, is the lack of support for "constant" attributes that are applied to all observations of a metric. We use these to identify the app, among other use cases, so we added wrappers around the various "Add", "Record", "Observe", etc. calls that automatically add these. (It's totally possible that this is supported and I missed it, in which case please let me know.)
Overall, the SDK was generally well-written and well-documented, we just needed some extra work to make the interfaces more similar to the ones we were using before.
I am surprised there is no gauge update API yet (instead of callback only), this is a common use case and I don't think folks should be expected to implement their own. Especially since it will lead to potentially allocation heavy bespoke implementations, depending on use case given mutex+callback+other structures that likely need to be heap allocated (vs a simple int64 wrapper with atomic update/load APIs).
Also, the fact that the APIs differ a lot from the more popular Prometheus client libraries raises the question of whether we need APIs this complicated, which folks have a harder time using. Now is the time to modernize them, before everyone is instrumented with some generation of a client library that would need to change and evolve. The whole idea of an OTel SDK is to instrument once and then avoid re-instrumenting when you change your observability pipeline and where it points. That becomes a hard sell if the OTel SDK has to shift significantly to support popular, common use cases with more typical APIs, leaving a whole bunch of OTel-instrumented code that needs to be migrated to a different-looking API.
For const attributes, these should generally be defined at the resource / provider level: https://pkg.go.dev/go.opentelemetry.io/otel/sdk/metric#WithR...
Congrats too! As I understand it from stories I've heard from others, migrating to OTel is no easy undertaking.
So, someone says, "let's make something smaller and more portable than logs. We need to track numerical data over time more easily, so that we can see pretty charts of when these values are outside of where they should be." This ends up being metrics and a time-series database (TSDB), built to handle not arbitrary lines of text but instead meant to parse out metadata and append numerical data to existing time-series based on that metadata.
Between metrics and logs, you end up with a good idea of what's going on with your infrastructure, but logs are still too verbose to understand what's happening with your applications past a certain point. If you have an application crashing repeatedly, or if you've got applications running slowly, metrics and logs can't really help you there. So companies built out Application Performance Monitoring, meant to tap directly into the processes running on the box and spit out all sorts of interesting runtime metrics and events about not just the applications, but the specific methods and calls those applications are utilizing within their stack/code.
Initially, this works great if you're running these APM tools on a single box within monolithic stacks, but as the world moved toward Cloud Service Providers and containerized/ephemeral infrastructure, APM stopped being as effective. When a transaction starts to go through multiple machines and microservices, APM deployed on those boxes individually can't give you the context of how these disparate calls relate to a holistic transaction.
So someone says, "hey, what if we include transaction IDs in these service calls, so that we can post-hoc stitch together these individual transaction lines into a whole transaction, end-to-end?" Which is how you end up with the concept of spans and traces, taking what worked well with Application Performance Monitoring and generalizing that out into the modern microservices architectures that are more common today.
In the future, we'll probably switch these logs to also go through our collector, and it shouldn't be super hard (because we already implemented a golang OTel log handler for the external case), but we just haven't gotten around to it yet.
[1] https://docs.datadoghq.com/integrations/google_cloud_platfor...
My guess is this is to save on costs. GCP logging is probably cheaper than Datadog, and infrastructure logs may not be needed as frequently as application logs.
In theory you can send telemetry data with OTel to CloudWatch, but I've struggled to connect the dots with the frontend application (e.g. React/Next.js).
This is endemic now. Doesn't matter what someone is writing about there'll be some pointless stock photo taking up half the page. There'll probably be some more throughout the page. Stop it please.
Anything you register to keep track of your environment has the form of either logs or metrics. The difference is about the contents of such logs and metrics.