I can't count the number of times users (or I) discovered a bug after many weeks because something gradually failed over time. It also saves a lot of time to be able to pinpoint the exact day a behavior changed, so you can check that day's deploy and quickly find the bug. Sometimes a trend isn't obvious right after a deploy but is clearly visible on the graph over a long period of time.
And for business intelligence, it's always when you badly need a metric that you realize you never tracked it.
But let's take the case of metrics as an example--do we need full sample granularity for "old" data? Do we need full tag cardinality? Sample granularity could be reduced with a transform to rollups at a coarser time granularity. That's a 60x reduction going from 1 Hz to 1/min. You might lose a bunch of frequency information this way, but maybe that's OK?
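To make the rollup idea concrete, here's a minimal sketch of downsampling 1 Hz samples into per-minute buckets. The `(min, max, mean, count)` summary is just one illustrative choice of aggregates; a real TSDB would pick its own rollup functions.

```python
from statistics import mean

def rollup(samples, window=60):
    """Reduce per-second samples to per-minute rollups.

    samples: list of (unix_ts, value) at ~1 Hz.
    Returns one (bucket_start_ts, min, max, mean, count) row per minute --
    a 60x reduction in sample count, at the cost of sub-minute detail.
    """
    buckets = {}
    for ts, v in samples:
        buckets.setdefault(ts - ts % window, []).append(v)
    return [(b, min(vs), max(vs), mean(vs), len(vs))
            for b, vs in sorted(buckets.items())]

# 120 seconds of fake data -> 2 rollup rows
data = [(t, float(t % 7)) for t in range(120)]
print(rollup(data))
```

Anything you didn't keep as an aggregate (e.g. spectral content, exact spike timing) is gone for good, which is exactly the trade-off being described.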
Numbers are really nice in ways that text is not.
I keep harping on this, but compressed UTF-8 text (or, even worse, compressed JSON) is a horribly wasteful way to store this data. See [1]. Putting a small amount of thought into storing telemetry data seems like it could yield incredible savings at scale.
[1] https://lists.w3.org/Archives/Public/www-logging/1996May/000...
One great side effect of this was service developers weren't afraid to write logs. We logged excessively, and it didn't cost too much. If we'd been indexing everything in ES it would have bankrupted us.
These days with S3 and the cloud, hadoop (or the EMR suite) per se probably isn't the way to go, but I'd sure like to see observability solutions giving me a first-class programming model that I as a user can interact with--not some bespoke "query DSL", and for them to accept that instantaneous indexed retrieval isn't important.
This paper is really interesting: https://www.usenix.org/system/files/osdi21-rodrigues.pdf
Stuff like this gives me hope we can have it both ways. With highly tuned compression and programmatic access the user is empowered and the cost is minimized.
- converting every number into its sequence of digits in decimal notation,
- writing those one character at a time,
- writing the string representation of each value's label repeatedly, once per record,
- compress all this with a structure-unaware generic text compression algorithm based on longest match search.
Each time you want to read that data, undo all of the above in reverse order.
You can optimize to some degree, but that's basically it.
I expect that not doing any of this saves the time spent doing it. I also expect data type aware compression to be much more efficient than text compressing the text expansion.
In numbers, I expect a 2 to 3 order-of-magnitude difference in both time and space (for non-random data).
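A quick sketch of the comparison (the metric name and the delta + varint layout are made up for illustration): store a counter-style series once as repeated text lines, and once as a label plus delta-encoded LEB128 varints, then compress both.

```python
import random
import zlib

random.seed(0)
# A counter-like metric: monotonically increasing with small jitter.
counts, c = [], 0
for _ in range(10_000):
    c += random.randint(0, 5)
    counts.append(c)

# The "text expansion" route: label + decimal digits for every record.
text = "\n".join(f"requests_total {n}" for n in counts).encode()

def varint(n):
    """Unsigned LEB128: 7 bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

# The type-aware route: label stored once, then varint-encoded deltas.
deltas = bytearray(b"requests_total")
prev = 0
for n in counts:
    deltas += varint(n - prev)
    prev = n

print(len(text), len(zlib.compress(text, 9)),
      len(deltas), len(zlib.compress(bytes(deltas), 9)))
```

On this kind of non-random data the binary route is dramatically smaller both before and after compression, and there's no decimal formatting/parsing on either side of the round trip.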
But jcgrillo was talking about storage (at least his link was). And when parsing for analysis or for storing millions of points daily, there's no doubt that a binary format is simply a lot more CPU and disk efficient.
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
As written this is 99 bytes (792 bits), but how much information is actually in it? We have an IP address taking up 9 bytes that needs at most 4 (fewer in cases like this, where two of the bytes are zero, if we employ varint encoding). Across log lines the ident and user will likely be very repetitive, so storing each unique occurrence more than once is really wasteful. The timestamp takes up 28 bytes but needs only 13--far fewer if that field is delta-encoded between log lines. The HTTP method takes up 5+ bytes but is only worth 1. The URLs are also super redundant--no need to store a copy in each line. The HTTP version is 1 byte of information taking up 8. The status code takes up 3 bytes but is only worth 1--there are only 63 "real" HTTP status codes. The content length takes up 4 bytes when it needs only 2. So I guess this log line only really has ~33 bytes of information in it (assuming a 32-bit pointer for each string--ident, user, URL), much less if amortized across many lines.

So maybe by naively parsing this log line and throwing a bunch of them into columnar, packed protobuf fields (where we get varint encoding for free), delta-encoding the timestamps, and maintaining a dictionary for all the strings, we might achieve something like a ~5x compression ratio. Playing around with gzip -9 on some test data[2] (not exactly CLF, but maybe similar entropy) I'm getting like ~1.9x compression.
Obviously if I parse this log line into a JSON blob, that blob will compress with a much higher ratio due to the repetitive nature of JSON, but it'll still be larger than the equivalent compressed CLF.
I'm working on a demo of my "protobuf + fst[3]" idea, so I'm not sure if my "maybe ~5x" claim is totally off the mark or not. But I'm confident we can do way better than JSON.
[1] https://en.wikipedia.org/wiki/Common_Log_Format [2] https://www.sec.gov/about/data/edgar-log-file-data-sets [3] https://crates.io/crates/fst
EDIT: I guess maybe another way to state my conjecture is "telemetry compression is not general purpose text compression". These data have a schema, and by ignoring that fact and treating them always as schemaless data (employing general purpose text compression methods) we're leaving something on the table.
I bet you a full dollar that both in-house and open source solutions, on average, are way more stingy with resources. As they should be.
I think this is a good idea when storage is a concern for high-volume logs / production. Persisting the buffer when high error rates or unusual system behavior are observed would be a cool idea.
OpenTelemetry recently-ish gained the Open Agent Management Protocol (OpAMP), which allows some runtime control over things generating telemetry. The ability to stay fairly low but then scale up as needed sounds tempting, but gee, it also sends shivers down my spine thinking of having such elastic demands on one's telemetry infrastructure, as engineers turn telemetry up while problems are occurring. https://opentelemetry.io/docs/specs/opamp/
The idea of having a local circular buffer sounds excellent to me. Being able to run local queries & aggregate would be sweet. Are there any open otel issues discussing these ideas?
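One way to picture the local circular buffer: keep cheap debug detail in a fixed-size ring in memory, and only ship it somewhere when something goes wrong. This sketch uses Python's stdlib logging, not any OpenTelemetry API, and a real collector would flush to disk or a remote sink rather than stderr.

```python
import collections
import logging

class RingBufferHandler(logging.Handler):
    """Keep the last `capacity` records in memory; replay them only when
    an ERROR arrives, so the context leading up to a failure survives
    without every debug line being shipped."""
    def __init__(self, capacity=1000, target=None):
        super().__init__()
        self.buffer = collections.deque(maxlen=capacity)
        self.target = target or logging.StreamHandler()  # stand-in sink

    def emit(self, record):
        self.buffer.append(record)
        if record.levelno >= logging.ERROR:
            for r in self.buffer:        # flush the buffered context
                self.target.emit(r)
            self.buffer.clear()

log = logging.getLogger("demo")
log.addHandler(RingBufferHandler(capacity=100))
log.setLevel(logging.DEBUG)
log.debug("cheap detail, buffered locally")  # retained, not shipped
log.error("boom")                            # triggers replay of the buffer
```

The steady-state cost is one bounded deque per process; the elastic cost only appears when error rates spike, which is when you want the data anyway.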
It's also possible they collect too much in the wrong formats.
But the ability to vet a hypothesis (I bet our users are confused about feature X, which we can test by looking at how many times they go to page X, then Y, then X again in 30 second window) in an hour versus 2 sprints is vastly underappreciated/underutilized.
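A hypothesis like that only needs a few lines once you have programmatic access to the events. Sketch under assumed names: events are `(user_id, page, unix_ts)` tuples sorted by time, and we count users who hit X, then Y, then X again within 30 seconds.

```python
def confused_users(events, a="X", b="Y", window=30):
    """events: iterable of (user_id, page, unix_ts), sorted by ts.
    Returns users with three consecutive visits a -> b -> a inside
    `window` seconds. Schema and page names are illustrative."""
    by_user = {}
    for user, page, ts in events:
        by_user.setdefault(user, []).append((page, ts))
    hits = set()
    for user, visits in by_user.items():
        for i in range(len(visits) - 2):
            (p1, t1), (p2, _), (p3, t3) = visits[i:i + 3]
            if (p1, p2, p3) == (a, b, a) and t3 - t1 <= window:
                hits.add(user)
    return hits

events = [("u1", "X", 0), ("u1", "Y", 10), ("u1", "X", 20),
          ("u2", "X", 0), ("u2", "Y", 10), ("u2", "X", 100)]
print(confused_users(events))  # u2's round trip takes too long to count
```

That's the hour-not-two-sprints version: no new instrumentation, just a question asked of data you already had.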
I feel like this article paints with too broad a brush.
From a systems standpoint do you need to have all resources stored centrally in order to do centralized reporting? No, of course not. Admittedly it's handy if bandwidth and storage are free. The alternative is distributed storage, with or without summarization at the edge (and aggregating from distributed storage for reporting).
Having it distributed does raise access issues: access needs to be controlled, and that access control needs to be managed. Philosophically, the cloud solutions sell centralized management, but federation is a perfectly viable option. The choice is largely dictated by organizational structure, not technology.
There is also a difference between diagnostic and evaluative indicators. Trying to evaluate from diagnostics causes fatigue because humans aren't built that way; evaluatives can and should be built from diagnostics. Diagnostics can't be built from evaluatives.
The logging/telemetry stack that I propose is:
1) Ephemeral logging at the limits of whatever observability you can build. E.g.: systemd journal with a small backing store, similar to a ring buffer.
2) Your compliance framework may require shipping some classes of events off of the local host, but I don't think any of them require shipping it to the cloud.
3) Build evaluatives locally in Redis.
4) Use DNS to query those evaluatives from elsewhere for ad hoc as well as historical purposes. This could be a centralized location, or it could be true federation where each site accesses all other sites' evaluatives.
I wouldn't put Redis on the internet, but I don't worry too much about DNS; and there are well-understood ways of securing DNS from tampering, unauthorized access, and even observation. By the way, DNS will handle hundreds or thousands of queries per second; you just have to build for it.
curl http://athena.m3047.net/grafitti.html
dig @athena.m3047.net grafitti\;*.keys.redis.athena.m3047 txt

Has Matt read any prior art in this field? https://research.google/pubs/monarch-googles-planet-scale-in...
I couldn't agree with the author more. Keeping historical records of business metrics makes a ton of sense. But historical telemetry (CPU, memory, network, error logs) makes little sense.
If an issue occurs, then turn on telemetry around that issue until you track it down. If an issue occurs once and never again, did it really matter? This obviously does not apply to security, I'm just speaking of operational issues.
Keeping all of your application logs and telemetry forever is expensive, and I can't recall a single time when having more than a day's worth of history was ever useful in tracking down an operational issue.
If your CPU and memory aren't affecting the business metrics, then it's not super relevant.
Also consider the potential risks of handling personal data and leaks.
A day is a pretty small window; I'd say a week or a bit more is good enough for most orgs. That way you can compare specific endpoints/code between deploys, answering questions like "was this endpoint this slow last week too, or did I break it?". Some issues take a few days to brew, and having historical data is important in debugging them. Many orgs don't do load testing at all or have any real performance analysis done before things crash.
Log retention is also directly tied to how fast and easily can you detect and recover from issues.
I disagree. Every issue I've ever debugged, I did a tail -f on the logs. I can't recall ever searching the old logs.
Even if it takes a few days for an issue to brew, usually the logs right now will show the issue. Or if they don't, then you can turn on the logs and have them in a few days time. It's so rare that it's almost never worth keeping the logs around just for that one case where an old log might lead to resolution, and rarely does one have time during an active incident to look at old logs anyway.
User writes into support 3 days after the problem occurred, and support goes back and forth covering level 1 possibilities for an additional 2 days before escalating. It's common for 1 support complaint to represent some larger factor of users who never complain, so it would be useful to understand how common the issue is once it has been identified in the observability data. Having one day isn't sufficient in this scenario.
Forever is probably too much, but keeping a month or so is totally sane.
Nobody needs to retain metrics like CPU and memory for weeks, but I may want to see their numbers during an incident, or not long after it's over.