I love observability probably more than most. And my initial reaction to this article is the obvious: why not both?
In fact, I tend to think more in terms of "events" when writing both logs and tracing code. How that event is notified, stored, transmitted, etc. is in some ways divorced from the activity. I don't care if it is going to stdout, or over udp to an aggregator, or turning into trace statements, or ending up in Kafka, etc.
But inevitably I bump up against cost. For even medium sized systems, the amount of data I would like to track gets quite expensive. For example, many tracing services charge for the tags you add to traces. So doing `trace.String("key", value)` becomes something I think about from a cost perspective. I worked at a place that had a $250k/year New Relic bill and we were avoiding any kind of custom attributes. Just getting APM metrics for servers and databases was enough to get to that cost.
Logs are cheap, easy, reliable and don't lock me in to an expensive service to start. I mean, maybe you end up integrating splunk or perhaps self-hosting kibana, but you can get 90% of the benefits just by dumping the logs into Cloudwatch or even S3 for a much cheaper price.
Luckily, some of the larger incumbents are also moving away from this model, especially as OpenTelemetry is making tracing more widespread as a baseline of sorts for data. And you can definitely bet they're hearing about it from their customers right now, and they want to keep their customers.
Cost is still a concern but it's getting addressed as well. Right now every vendor has different approaches (e.g., the one I work for has a robust sampling proxy you can use), but that too is going the way of standardization. OTel is defining how to propagate sampling metadata in signals so that downstream tools can use the metadata about population representativeness to show accurate counts for things and so on.
What newer tools/companies are in this category? Any that you recommend?
Cost is a real issue, and not just in terms of what the vendor charges you. When tracing becomes a noticeable fraction of CPU or memory usage relative to the application, it's time to rethink 100% sampling. In practice, if you are sampling thousands of requests per second, you're very unlikely to actually look through each one of those thousands (thousands of req/s may not be a lot for some sites, but it already exceeds human scale without tooling). To keep accurate, useful statistics under sampling, you end up recording metrics from traces before the sampling decision is made.
They are events[1]. For my text editor, KeenWrite, events can be logged either to the console when run from the command-line or displayed in a dialog when running in GUI mode. By changing "logger.log()" statements to "event.publish()" statements, a number of practical benefits are realized, including:
* Decoupled logging implementation from the system (swap one line of code to change loggers).
* Publish events on a message bus (e.g., D-Bus) to allow extending system functionality without modifying the existing code base.
* Standard logging format, which can be machine parsed, to help trace in-field production problems.
* Ability to assign unique identifiers to each event, allowing for publication of problem/solution documentation based on those IDs (possibly even seeding LLMs these days).
[1]: https://dave.autonoma.ca/blog/2022/01/08/logging-code-smell/
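A minimal sketch of that decoupling in Python (names are illustrative, not KeenWrite's actual API): call sites only publish events, and swapping the sink is a one-line registration.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Event:
    id: str           # stable identifier, usable in problem/solution docs
    message: str
    data: dict = field(default_factory=dict)

class EventBus:
    """Call sites only know publish(); sinks are registered elsewhere."""
    def __init__(self) -> None:
        self._subscribers: list[Callable[[Event], None]] = []

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: Event) -> None:
        for handler in self._subscribers:
            handler(event)

bus = EventBus()
# Console sink for CLI mode; a GUI dialog or a D-Bus bridge would
# register here instead, with no change to the publishing code.
bus.subscribe(lambda e: print(f"{e.id} {e.message} {e.data}"))
bus.publish(Event(id="KW-0042", message="document saved", data={"path": "a.md"}))
```

Because each event carries a stable `id`, the same stream can feed a machine-parsed log format or an index of problem/solution documentation.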
Honeycomb at least charges per event, which in this case means per span - however they don't charge per span attribute, and each span can be pretty large (100kb / 2000 attributes).
I run all my personal services in their free tier, which has plenty of capacity, and that's before I do any sampling.
Logs for the long term, traces for short-term debugging and analysis, is a fine compromise.
I think this is the issue. Both Splunk and OpenSearch (even self-hosted OpenSearch) get really pricey as well, especially with large volumes of log data. CloudWatch can also get ludicrously expensive: they charge something like $0.50 per GB (!) and another $0.03 per GB to store. I've seen situations at a previous employer where someone accidentally deployed a Lambda function with debug logging and ran up a few thousand dollars in CloudWatch bills overnight.
You should look at Coralogix (disclaimer: I work there). We've built a platform that allows you to store your observability data in S3 and query it through our infrastructure. It can be dramatically more cost-effective than other providers in this space.
I partly agree and disagree. In terms of severity, there are only three levels:
– info: not a problem
– warning: potential problem
– error: actual problem (operational failure)
Other levels like “debug” are not about severity, but about level of detail.
In addition, something that is an error in a subcomponent may only be a warning or even just an info on the level of the superordinate component. Thus the severity has to be interpreted relative to the source component.
The latter can be an issue if the severity is only interpreted globally. Either it will be wrong for the global level, or subcomponents have to know the global context they are running in to use the severity appropriate for that context. That causes undesirable dependencies on a global context: the developer of a lower-level subcomponent would have to know the exact context in which that component is used in order to choose the appropriate log level. And what if the component is used in different contexts entailing different severities?
So one might conclude that the severity indication is useless after all, but IMO one should rather conclude that severity needs to be interpreted relative to the component. This also means that a lower-level error may have to be logged again in the higher-level context if it’s still an error there, so that it doesn’t get ignored if e.g. monitoring only looks at errors on the higher-level context.
Differences between “fatal” and “error” are really nesting differences between components/contexts. An error is always fatal on the level where it originates.
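A contrived sketch of that relativity (hypothetical component and exception names): the cache treats a failed lookup as an error, while the caller, which can recompute the value, re-logs the same event as a mere warning in its own context.

```python
import logging

cache_log = logging.getLogger("app.cache")
app_log = logging.getLogger("app")

class CacheMissError(Exception):
    """Operational failure from the cache's point of view."""

def read_cache(key: str) -> str:
    # At this level the failed lookup is an error: the component
    # could not do the one thing it was asked to do.
    cache_log.error("lookup failed for %r", key)
    raise CacheMissError(key)

def compute_value(key: str) -> str:
    return f"value-for-{key}"

def get_value(key: str) -> str:
    try:
        return read_cache(key)
    except CacheMissError:
        # One level up, the same event is only a warning: the caller
        # can recover by recomputing from the source of truth.
        app_log.warning("cache miss for %r, recomputing", key)
        return compute_value(key)
```

Monitoring that only watches `app.*` errors then sees exactly what is an actual problem at that level, while the subcomponent's error remains available for analysis.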
Here's a classic problem as an illustration: the storage cost of your logs is prohibitive. You would like to cut some of your logs from storage but cannot lower retention below some threshold (say, two weeks). For this example, assume that tracing is also enabled and every log has a traceId.
A good answer is to run a compaction job that inspects each trace. If it contains an error, preserve it. Remove X% of all other traces.
Log levels make the ergonomics for this excellent and it can save millions of dollars a year at sufficient scale.
Or, keep it simple.
- error means someone is alerted urgently to look at the problem
- warning means someone should be looking into it eventually, with a view to reclassifying as info/debug or resolving it.
IMO many people don't care much about their logs, until the shit hits the fan. Only then, in production, do they realise just how much harder their overly verbose (or inadequate) logging is making things.
The simple filter of "all errors send an alert" can go a long way to encouraging a bit of ownership and correctness on logging.
The issue is that the code that encounters the problem may not have the knowledge/context to decide whether it warrants alerting. The code higher up that does have the knowledge, on the other hand, often doesn’t have the lower-level information that is useful to have in the log for analyzing the failure. So how do you link the two? When you write modular code that minimizes assumptions about its context, that situation is a common occurrence.
Info is things like “processing X”
Debug is things like “variable is Y” or “made it to this point”
And then "error" as - "things are not okay, a developer is going to need to intervene"
And errors then split roughly between "must be fixed sometime", and "must be fixed now/ASAP"
It was handled safely at the level where it occurred, but because it was unusual/unexpected, the underlying cause may cause issues later on or higher up.
If one were sure it would 100% not indicate any issue, one wouldn’t need to warn about it.
That said, for the spaces where tracing works well, it works unreasonably well.
[1] https://opentelemetry.io/docs/concepts/signals/traces/#span-...
The really hard things, which we had reasonable answers for, but never quite perfect:
* Rails websockets (ActionCable)
* very long running background jobs (we stopped collecting at some limit, to prevent unbounded memory)
* trying to profile code; we used a modified version of Stackprof to do sampling instead of exact profiling. That worked surprisingly well at finding hotspots, with low overhead.
All sorts of other tricks came along too. I should go look at that codebase again to remind me. That'd be good for my resume.... :)
Suppose you have a long data pipeline that you want to trace jobs across. There are not an enormous number of jobs, but each one takes 12 hours across many phases. In theory tracing works great here, but in practice most tracing platforms can't handle it. This is especially true with tail-based sampling, since traces can be unbounded and the platform has to assume that at some point they time out. You can certainly build your own, but most of the value of tracing solutions is the user experience, which is also the hardest part.
On stream processing I’ve generally found it too expensive to instrument stream processors with tracing. Also there’s generally not enough variability to make it interesting. Context stitching and span management as well as sweeping and shipping of traces can be expensive in a lot of implementations and stream processing is often cpu bound.
A simple transaction id annotated log makes a lot more sense in both, queried in a log analytic platform.
That’s when you want a log and that’s what the big traditional log frameworks were designed to handle.
A web backend/service is basically the opposite. End users don’t have access to the log, those who analyze it can cross reference with system internals like source code or db state and the log is basically infinite. In that situation a structured log and querying obviously wins.
It’s honestly not even clear that these systems are that closely related.
For a web app, serving lots of concurrent users, they are essentially unreadable without tools, so you may as well optimise the logs for tool based consumption.
I too use this bait statement.
Then I follow it up with (the short version):
1) Rewrite your log statements so that they're machine readable
2) Prove they're machine-readable by having the downstream services read them instead of the REST call you would otherwise have sent.
3) Switch out log4j for Kafka, which will handle the persistence & multiplexing for you.
Voila, you got yourself a reactive, event-driven system with accurate "logs".
If you're like me and you read the article thinking "I like the result but I hate polluting my business code with all that tracing code", well now you can create an independent reader of your kafka events which just focuses on turning events into traces.
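A toy stand-in for steps 1 and 2 (a `StringIO` plays the role of the Kafka topic here; in step 3 you'd swap in a real producer/consumer client): the "log" line is a machine-readable event record, and a downstream reader dispatches on it instead of receiving a REST call.

```python
import io
import json

def emit(stream, event_type: str, **payload) -> None:
    # Step 1: the log line is a machine-readable event record.
    stream.write(json.dumps({"type": event_type, **payload}) + "\n")

def consume(stream, handlers: dict) -> None:
    # Step 2: a downstream service reads the events instead of
    # being called over REST. In step 3 this stream would be a
    # Kafka topic rather than a file-like object.
    for line in stream:
        event = json.loads(line)
        handler = handlers.get(event["type"])
        if handler:
            handler(event)

buf = io.StringIO()
emit(buf, "order_created", order_id=42, total=9.99)
buf.seek(0)
consume(buf, {"order_created": lambda e: print("fulfilling", e["order_id"])})
```

The event names and fields are made up for illustration; the point is that once the lines are structured, the same records serve as logs, as integration events, and as raw material for traces.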
I don't think this is a reasonable statement. There are already a few logging agents that support structured logging without dragging in heavyweight dependencies such as Kafka. Bringing up Kafka sounds like a case of a solution looking for a problem.
If it's data you care about then you put it in Kafka, unless you're big enough to use something like Cassandra or rich enough to pay a cloud provider to make redundant data storage their problem. Logs are something that you need to write durably and reliably when shit is hitting the fan and your networks are flaking and machines are crashing - so ephemeral disks are out, NFS is out, ad-hoc log collector gossip protocols are out, and anything that relies on single master -> read replica and "promoting" that replica is definitely out.
Kafka is about as lightweight as it gets for anything that can't be single-machine/SPOF. It's a lot simpler and more consistent than any RDBMS. What else would you use? HDFS (or maybe OpenAFS if your ops team is really good) is the only half-reasonable alternative I can think of.
What are they? Because admittedly I've lost a little love for the operational side of Kafka, and I wish the client-side were a little "dumber", so I could match it better to my uses cases.
That said, I've had conflicts with a previous team-mate about this. He couldn't wrap his head around Kafka being a source of truth. But when I asked him whether he'd trust our Kafka or our Postgres if they disagreed, he conceded that he'd believe Kafka's side of things.
Who on Earth does that? Logs are almost always written to stderr... In part to prevent other problems author is talking about (eg. mixing with the output generated by the application).
I don't understand why this has to be either or... If you store the trace output somewhere you get a log... (let's call it "un-annotated" log, since trace won't have the human-readable message part). Trace is great when examining the application interactively, but if you use the same exact tool and save the results for later you get logs, with all the same problems the author ascribes to logs.
Like, I develop cli apps, so like, what else would go to stdout that you suppose will interfere?
But why would you write your own logs instead of using something built into your language's library? I believe Python's logging module writes to stderr by default. Go's log package always goes to stderr.
But... today I've learned that console.log() in NodeJS writes to stdout... well, I've lost another tiny bit of faith in humanity.
And while my historical gripes are largely still the status quo: stack traces in multi-threaded, evented/async code that actually show real line numbers? Span-based tracing that makes concurrent introspection possible by default?
I’m in. I apologize for everything bad I ever said and don’t care whatever other annoying thing.
That’s the whole show. Unless it deletes my hard drive I don’t really care about anything else by comparison.
- we collectively realized that logs, events, traces, metrics, and errors are actually all just logs
- we agreed on a single format that encapsulated all that information in a structured manner
- we built firehose/stream processing tooling to provide modern o11y creature comforts
I can't tell if that universe is better than this one, or worse.
Honestly it sounds like you're pitching opentelemetry/otlp but where you only trace and leave all the other bits for later inside your opentelemetry collector, which can turn traces into metrics or traces into logs.
So I'm imagining something more like:
{"level":"info", "otlp": { "trace": { ... }}}
{"level":"info", "otlp": { "error": { ... }}}
{"level":"info", "otlp": { "log": { ... }}}
{"level":"info", "otlp": { "metric": { ... }}}
(standardizing this format would be non-trivial of course, but I could imagine a really minimal standard)
Your downstream collector only needs one API endpoint/ingestion mechanism -- unpacking the actual type of telemetry that came in (and persisting where necessary) can be left to other systems.
Basically I think the systems could have been massively simpler in most UNIX-y environments -- just hook up STDOUT (or scrape it, or syslog or whatever), and you're done -- no allowing ports out for jaeger, dealing with complicated buffering, etc -- just log and forget.
Yeah I think the worst case you basically just exfiltrate metrics out to other subsystems (honestly, you could kind of exfiltrate all of this), but the default is pipe heavily compressed stuff to short and long term storage, and some processors for real time... blah blah blah.
Obviously Honeycomb is actually doing the thing and it's not as easy as it sounds, but it feels like if we had all thought like this earlier we might have skipped making a few protocols (zipkin, jaeger, etc), and focused on just data layout (JSON vs protobuf vs GELF, etc) and figuring out what shapes to expect across tools.
There's a big gap between what it takes for the engineering to work and what all these companies charge.
My point is really more about the engineering time wasted on different protocols and stuff when we could have stuffed everything into minimally structured log lines (and figured out the rest of the insight machinery later). Concretely, that zipkin/jaeger/prometheus protocols and stuff may not have needed to exist, etc.
Off-the-shelf tracing libraries on the other hand are pretty expensive. You have one additional mandatory read of the system clock, to establish the span duration, plus you are still paying for a clock read on every span event, if you use span events. Every span has a PRNG call, too. Distributed tracing is worthless if you don't send the spans somewhere, so you have to budget for encoding your span into json, msgpack, protobuf, or whatever. It's a completely different ball game in terms of efficiency.
Adding timestamps and UUIDs and an encoding is par for the course in logging these days, I don't think that is the right angle to criticize efficiency.
Tracing can be very cheap if you "simply" (and I'm glossing over a lot here) search for all messages in a liberal window matching each "span start" message and index the result sets. Offering a way to view results as a tree is just a bonus.
Of course, in practice this ends up meaning something completely different, and far costlier. Why that is I cannot fathom.
Structured logging has existed for years, e.g. in Java: https://github.com/logfellow/logstash-logback-encoder
1. application logs, emitted multiple times per request and serve as breadcrumbs
2. request logs emitted once per request and include latencies, counters and metadata about the request and response
The application logs were useless to me except during development. However the request logs I could run aggregations on which made them far more useful for answering questions. What the author explains very well is that the problem with application logs is they aren't very human-readable which is where visualizing a request with tracing shines. If you don't have tracing, creating request logs will get you most of the way there, it's certainly better than application logs. https://speedrun.nobackspacecrew.com/blog/2023/09/08/logging...
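The request-log idea is sometimes called a "canonical log line": one wide record per request that you can run aggregations over. A minimal sketch (field names are hypothetical):

```python
import json
import logging
import time
import uuid

log = logging.getLogger("request")

def handle_request(path: str, handler) -> dict:
    """Run a handler and emit exactly one wide record for the request,
    regardless of how many breadcrumb logs the handler itself writes."""
    record = {"requestId": str(uuid.uuid4()), "path": path}
    start = time.monotonic()
    try:
        record["status"] = handler()
    except Exception as exc:
        record["status"] = 500
        record["error"] = repr(exc)
    record["latency_ms"] = round((time.monotonic() - start) * 1000, 2)
    log.info(json.dumps(record))  # one aggregatable line per request
    return record
```

Because every request produces exactly one record with the same fields, questions like "p99 latency by path" or "error rate last hour" become simple group-bys rather than log archaeology.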
In any case, the post itself (which is not long) illustrates and marks out many of the differences.
On my side I have opted to mixed structured/text, a generic message that can be easily understood while glancing over logs, and a data object attached for more details.
And OpenTelemetry has a very questionable implementation. For a nested trace, spans are emitted when they close, meaning that a parent's ID is referenced in the stream before the parent span itself appears. That can't be good for processing. It would be better to emit a leading-edge event when a span opens (which also helps when an error is thrown and the parent is never reported).
Kind of a bummer. Needs work.
The nice thing about OpenTelemetry is that it's a standard. The questionable implementation you're referencing isn't a source of truth. There isn't some canonical "questionable" implementation.
There are many, slightly different, questionable implementations.
Here's this log of every frame of compute going on, plus data or metadata about the frame... but AFAIK we have yet to start using the same stream of computation for business processes as we do for its excellent observability.
1. At the start of a request, generate a globally unique traceId
2. Pass this traceId through the whole call stack.
3. Whenever logging, log the traceId as a parameter
Now you have a log with many of the plusses of a trace. The only additional cost to the log is the storage of the traceId on every message.
If you want to read a trace, search through your logs for "traceId: xyz123". If you use plain text storage you can grep. If you use some indexed storage, search for the key-value pair.
This way, you can retrieve something that looks like a trace from a log.
This does not solve all the issues named in the article. However, it is a decent tradeoff that I've used successfully in the past. Call it "poor man's tracing".
A migration path I could see might be:
- replace current logging lib with otel logging (sending to same output)
- set up tracing
- replace logging with tracing over time (I prefer moving the most painful areas of code first)
Context: the last thing I wrote used Deno and Deno Deploy.
opentelemetry has a service you can run that will collect the telemetry data and you can export it to something like prometheus which can store it and let you query it. Example here https://github.com/open-telemetry/opentelemetry-collector-co...
Typically in dev environments trace spans are just emitted to stdout just like logs. I sometimes turn that off too though because it gets noisy.
While there are better tools for alerting, metrics, or aggregations, it helps a lot in debugging and troubleshooting.
Traces can be aggregated or sampled to provide all of the information available from logs, but in a more flexible way.
* Certain traces can be retained at 100%. This is equivalent to logs.
* Certain trace attributes can be converted to timeseries data. This is equivalent to metrics.
* Certain traces can be sampled and/or queried with streaming infrastructure. This is a way to observe high-cardinality data without hitting the high cost.
Probably the biggest tradeoff with traces is that, in practice, you are not retaining 100% of all traces. In order to keep accurate statistics, data generally gets ingested as metrics before sampling. The other is that traces are not stored in a way that makes it easy to see what was happening at a point in time, which is what logging does well. If I want to ensure I have execution context for logging, I make the effort to add trace and span ids so that traces and logging can be correlated.
To be fair, I live in the devops world more often than not, and my colleagues on the dev teams rarely have to venture outside of traces.
I don't mind the points this author is making. My main criticism is that it is scoped to the world of applications -- which is fine -- but then taken as universal for all of software engineering.
The cool thing about logs is that they're just a text file and don't need to be sent over the internet to someone else. But yes, I've encountered some problems just using text logs and I'd like to solve them.
Is there an OpenTelemetry solution that is capable of being self-hosted (and preferably OS) that anyone recommends?
Note to author: all but the last code block have a very odd mixture of rather large font sizes (at least on mobile) which vary line to line that make them pretty difficult to read.
Also the link to "Observability Driven Development." was a blank slide deck AFAICT
It's all statically rendered html, and I don't see anything weird in the html either.
Do you have a screenshot and some device info so I can look a bit more? Thanks
This should not require code at the application level, but it should be implemented at the tooling level.
Unless you are talking about profilers, that measure execution time and memory only, but traces are a lot more than only that.
Annotating the code with logs and traces is a UX activity, not for the end users, but for the ops-team. They don't have knowledge of the internals of the code. Logs should be written in the context of levers that ops have control over.
Take the example from the OP: nr of cache hits. It's something ops can control by configuring the cache size, it is something ops can observe and correlate with request-time and network bandwidth. It would require an immensely sophisticated debugger to make all these correlations automatically.
Perhaps really performance critical stuff could have a "notrace" annotation.
How accurate and useful these are vs. doing this manually will depend on the use case, but I reckon the automatic approach gets you most of the way there, and you can add the missing traces yourself, so if nothing else it saves a lot of work.
For Java: https://opentelemetry.io/docs/instrumentation/java/automatic...
However, tracing literally every method call would probably be prohibitively expensive so typically you have either:
1. Instrumentation which "understands" common frameworks/libraries and knows what to instrument (e.g. request handlers in web frameworks)
2. Full opt-in. They make it easy to add a trace for a method invocation with a simple annotation but nothing gets instrumented by default
However, no automatic instrumentation can do everything for you; it can't know which properties are interesting to add as attributes. But adding tracing automatically to SQL clients, web frameworks, etc. is very valuable.
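The opt-in pattern (option 2) boils down to an annotation or decorator that instruments only what you mark. A dependency-free Python sketch of the idea (a real agent, such as OpenTelemetry's annotation-based instrumentation, would emit proper spans instead of appending to the hypothetical `SPANS` list):

```python
import functools
import time

# Collected span records; a real agent would export these instead.
SPANS: list[dict] = []

def traced(fn):
    """Opt-in, annotation-style instrumentation: only decorated
    functions produce spans, nothing is traced by default."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            SPANS.append({
                "name": fn.__qualname__,
                "duration_ms": (time.monotonic() - start) * 1000,
            })
    return wrapper

@traced
def fetch_user(user_id: int) -> dict:
    return {"id": user_id}
```

This keeps the overhead proportional to what you explicitly chose to instrument, which is exactly why full automatic tracing of every method call is usually avoided.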