It sounds like they were in a place a lot of companies are in, where they don't have a single pane of glass for observability. One of the biggest benefits I've gotten out of Datadog, if not the main one, is having everything in one place so that it's all connected and I can easily jump from a trace to its logs, for instance.
One of the terrible mistakes I see companies make with this tooling is fragmenting like this. Everyone has their own preferred tool, and ultimately the collective experience ends up significantly worse than the sum of its parts.
But I've also been in terrible vendor lock-in situations, being bent over the barrel because switching to a better solution is so damn expensive.
At least now with OTel you have an open standard that makes switching easier, but even then I'd rather have two solutions that meet my exact observability requirements than a single solution that does everything OK-ish.
Vendor lock-in is hopefully a thing of the past now with OTel. And now that more observability solutions are going open source, it's hopefully no longer true that a single tool must be mediocre across all use cases: DD and the like are inherently limited by their own engineering teams, whereas OSS products can take community/customer contributions that expand the surface area over time on top of the core maintainers' work.
At SigNoz, we have metrics, traces and logs in a single application, which helps you correlate across signals much more easily. Being natively based on OpenTelemetry makes this correlation easier still, as it leverages the standard data format.
Though this might take some time, as many teams have proprietary SDKs in their code, which are not easy to rip out. OpenTelemetry auto-instrumentation[2] makes it much easier, and I think that's the path people will follow to get started.
[1]https://github.com/SigNoz/signoz [2]https://opentelemetry.io/docs/instrumentation/java/automatic...
It annoys me that logs are overlooked in Honeycomb (and metrics are... fine). But given the choice between a single pane of glass in Grafana, or doing logs (and sometimes metrics) in CloudWatch while spending 95% of my time in Honeycomb, I'd pick Honeycomb every time.
Hopefully someone that works there is reading this.
It seems to miss some aggregation stuff, but it's also improving every time I check. I wonder if anyone's used it in anger yet, and how far it is from replacing Datadog or Honeycomb.
One of the biggest features of AWS, and one that is very easy to take for granted, is Amazon CloudWatch. It supports metrics, logging, alarms, metrics from alarms, alarms from alarms, querying historical logs, triggering actions, and so on, and it covers every service AWS provides, including metaservices like AWS Config and CloudTrail.
And you barely notice it. It's just there, and you can see everything.
> One of the terrible mistakes I see companies make with this tooling is fragmenting like this.
So much this. It's not fun at all to have to go through logs and metrics on any application, and much less so if for some reason their maintainers scattered their metrics emission to the four winds. However, with AWS all roads lead to CloudWatch, and everything is so much better.
Most of my clients are not in the product-market fit for AWS CloudWatch, because most of their developers don't have the development, testing and operational maturity/discipline to use CloudWatch cost-effectively (this is at root an organization problem, but let's not go off onto that giant tangent). So the only realistic tracing strategy we converged upon to recommend for them is "grab everything, and retain it up to the point in time we won't be blamed for not knowing root cause" (which in some specific cases can be up to years!), while we undertake the long journey with them to upskill their teams.
This would make using CloudWatch everywhere rapidly climb into the top three largest line items in the AWS bill, easily justifying spinning up that tracing functionality in-house. So we wind up opting for self-managed tooling like Elastic Observability, or for Honeycomb, where the pricing is friendlier to teams in unfortunate situations that need to start with everything for CYA, much as I would like to stay within CloudWatch.
Has anyone found a better solution for these use cases where the development maturity level is more prosaic, or is this really the best local maximum at the industry's current state of the art?
Some part of the value of Datadog etc is having a single pane of glass over many aws accounts.
The collector (which processes and ships metrics) can be installed in K8S through Helm or an operator, and we just added a variable to our charts so the agent can be pointed at the collector. The collector speaks OTLP which is the fancy combined metrics/traces/logs protocol the OTEL SDKs/agents use, but it also speaks Prometheus, Zipkin, etc to give you an easy migration path. We currently ship to Datadog as well as an internal service, with the end goal being migrating off of Datadog gradually.
I know this isn't a DataDog post, and I'm a bit off topic, but I try to do my best to warn against DD these days.
I looked at the OTel DD stuff and did not see any support for this, fwiw, maybe it doesn't work b/c the agent expects more context from the pod (e.g. app and label?)
If a vendor pulled shit like this on me, that's when I would cancel them. Of course, most big orgs would rather not do the legwork to actually become portable and migrate off a vendor, so they'll just pay the bill.
Vendors love the custom shit they build because they know once it’s infiltrated the stack then it’s basically like gangrene (have to cut off the appendage to save the host)
We're definitely open to doing more consolidation in the future, especially if we can save money by doing that, but from a usability standpoint we've been pretty happy with Honeycomb for traces and Datadog for everything else so far. And, that seems to be aligned with what each vendor is best at at the moment.
https://www.honeycomb.io/pricing
https://www.datadoghq.com/pricing/
Am I wrong to say that having two is "expensive"? Maybe not if 50% of your stuff goes to Honeycomb and 50% to Datadog. Could you save money/complexity (fewer places to look for things) by having just Datadog or just Honeycomb?
Partly this lets us easily re-route & duplicate telemetry, partly it means changes to backend products in the future won't be a big disruption.
For metrics we're mostly a telegraf -> Prometheus -> Grafana Mimir shop: telegraf because it's rock solid and feature-rich, Prometheus because there's no real competition in that tier, and Mimir because of scale and self-hosting options.
Our scale problem means most online pricing calculators generate overflow errors.
Our non-security log destination preference is Loki - for similar reasons to Mimir - though a SIEM it definitely is not.
Tracing goes to a vendor for now, but we're looking to bring that back to Grafana Tempo. Its product maturity is a long way off commercial APM offerings, but the feature set feels about 70% there and is converging rapidly. Off-the-shelf tracing products have an appealingly low cost of entry, which only briefly defers lock-in and pricing shocks.
If you are looking for an open-source backend for OpenTelemetry, you can explore SigNoz[2] (I am one of the founders). We have quite a decent product for APM/tracing, leveraging the OpenTelemetry-native data format and semantic conventions.
[1]https://opentelemetry.io/docs/collector/ [2]https://github.com/SigNoz/signoz
Have you looked at VictoriaMetrics [0] before opting for Mimir?
Luckily the Prometheus exporters have a switch to enable this behaviour, but there's talk of removing that functionality because it breaks the spec. If you send the OpenTelemetry protocol into something like Mimir, you don't have the option of enabling that behaviour unless you use Prometheus remote write.
Our developers aren't a fan of that.
https://opentelemetry.io/docs/specs/otel/compatibility/prome...
They may be trying to address label cardinality, but their approach seems to throw the baby out with the bathwater. The developer experience suffers as a result: from a dev's point of view, resource attributes are added to the metric, yet this relationship isn't carried over when translating to Prometheus metrics.
With the advantage that you get only the specific attributes you want, thus avoiding a cardinality explosion.
https://github.com/open-telemetry/opentelemetry-collector-co...
The transform processor could be useful if they ever deprecate the resource_to_telemetry_conversion flag, but it's still a pain point because it hinders a developer's autonomy and requires a whitelist of labels to be maintained on the collectors by another team.
Have encountered this a lot from teams attempting to use the metrics SDK.
Are you open to commenting on specifics here, and on what kind of shim you had to put in front of the SDK? It would be great to keep gathering feedback so that we as a community have a good idea of what remains before the SDK can be used in anger for real-world production use cases. Just wiring up the setup in your app used to be fairly painful, though that has gotten somewhat better over the last 12-24 months. I'd also love to hear what is currently causing compatibility issues with the metric types themselves that requires a shim, and what the shim is doing to achieve compatibility.
Our main issue was the lack of a synchronous gauge. The officially supported asynchronous API of registering a callback function to report a gauge metric is very different from how we were doing things before, and would have required lots of refactoring of our code. Instead, we wrote a wrapper that exposes a synchronous-like API: https://gist.github.com/yolken-airplane/027867b753840f7d15d6....
It seems like this is a common feature request across many of the SDKs, and it's in the process of being fixed in some of them (https://github.com/open-telemetry/opentelemetry-specificatio...)? I'm not sure what the plans are for the golang SDK specifically.
Another, more minor issue, is the lack of support for "constant" attributes that are applied to all observations of a metric. We use these to identify the app, among other use cases, so we added wrappers around the various "Add", "Record", "Observe", etc. calls that automatically add these. (It's totally possible that this is supported and I missed it, in which case please let me know.)
Overall, the SDK was generally well-written and well-documented, we just needed some extra work to make the interfaces more similar to the ones we were using before.
I am surprised there is no gauge update API yet (instead of callback only), this is a common use case and I don't think folks should be expected to implement their own. Especially since it will lead to potentially allocation heavy bespoke implementations, depending on use case given mutex+callback+other structures that likely need to be heap allocated (vs a simple int64 wrapper with atomic update/load APIs).
Also, the fact that the APIs differ a lot from the more popular Prometheus client libraries raises the question of whether we need APIs this complicated, which folks have a harder time using. Now is the time to modernize them, before everyone is instrumented with some generation of a client library that would need to change and evolve. The whole idea of an OTel SDK is to instrument once and then avoid re-instrumenting when you change your observability pipeline and where it points. That becomes a hard sell if the OTel SDK has to shift significantly to support popular, common use cases with more typical APIs, leaving a whole bunch of OTel-instrumented code that needs to be migrated to a different-looking API.
For const attributes, these should generally be defined at the resource / provider level: https://pkg.go.dev/go.opentelemetry.io/otel/sdk/metric#WithR...
Congrats too! As I understand it from stories I've heard from others, migrating to OTel is no easy undertaking.
So, someone says, "let's make something smaller and more portable than logs. We need to track numerical data over time more easily, so that we can see pretty charts of when these values are outside of where they should be." This ends up being metrics and a time-series database (TSDB), built to handle not arbitrary lines of text but instead meant to parse out metadata and append numerical data to existing time-series based on that metadata.
Between metrics and logs, you end up with a good idea of what's going on with your infrastructure, but logs are still too verbose to understand what's happening with your applications past a certain point. If you have an application crashing repeatedly, or if you've got applications running slowly, metrics and logs can't really help you there. So companies built out Application Performance Monitoring, meant to tap directly into the processes running on the box and spit out all sorts of interesting runtime metrics and events about not just the applications, but the specific methods and calls those applications are utilizing within their stack/code.
Initially, this works great if you're running these APM tools on a single box within monolithic stacks, but as the world moved toward Cloud Service Providers and containerized/ephemeral infrastructure, APM stopped being as effective. When a transaction starts to go through multiple machines and microservices, APM deployed on those boxes individually can't give you the context of how these disparate calls relate to a holistic transaction.
So someone says, "hey, what if we include transaction IDs in these service calls, so that we can post-hoc stitch together these individual transaction lines into a whole transaction, end-to-end?" Which is how you end up with the concept of spans and traces, taking what worked well with Application Performance Monitoring and generalizing that out into the modern microservices architectures that are more common today.
In the future, we'll probably switch these logs to also go through our collector, and it shouldn't be super hard (because we already implemented a golang OTel log handler for the external case), but we just haven't gotten around to it yet.
[1] https://docs.datadoghq.com/integrations/google_cloud_platfor...
My guess is this is to save on costs. GCP logging is probably cheaper than Datadog, and infrastructure logs may not be needed as frequently as application logs.
In theory you can send telemetry data with OTel to CloudWatch, but I've struggled to connect the dots with the frontend application (e.g. React/Next.js).
This is endemic now. Doesn't matter what someone is writing about there'll be some pointless stock photo taking up half the page. There'll probably be some more throughout the page. Stop it please.
Anything you register to keep track of your environment has the form of either logs or metrics. The difference is about the contents of such logs and metrics.