127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
As written this is 99 bytes (792 bits), but how much information is actually in it? The IP address takes up 9 bytes but needs at most 4 (fewer in cases like this, where two of the octets are zero, if we employ varint encoding). Across log lines the ident and user fields will likely be very repetitive, so storing each unique occurrence more than once is really wasteful. The timestamp takes up 28 bytes but needs only 13--far fewer if that field is delta-encoded between log lines. The HTTP method takes up 5+ bytes but is only worth 1. The URLs are also super redundant--no need to store a copy in each line. The HTTP version is worth 1 byte but takes up 8. The status code takes up 3 bytes but is only worth 1--there are only 63 "real" HTTP status codes. The content length takes up 4 bytes when it needs only 2. So I guess this log line only really has ~33 bytes of information in it (assuming a 32-bit pointer for each string--ident, user, URL), and much less if amortized across many lines. So maybe by naively parsing this log line and throwing a bunch of them into columnar, packed protobuf fields (where we get varint encoding for free), delta-encoding the timestamps, and maintaining a dictionary for all the strings, we might achieve something like a ~5x compression ratio.

Playing around with gzip -9 on some test data[2] (not exactly CLF, but maybe similar entropy), I'm getting something like ~1.9x compression.
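To make that accounting concrete, here's a minimal Python sketch of the packed layout I have in mind. Everything here is hypothetical--the field order, the one-byte method/version/status indices, and the shared string dictionary are illustrations, not an actual protobuf schema:

```python
def varint(n: int) -> bytes:
    """Protobuf-style varint: 7 bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        if n >> 7:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        else:
            out.append(n)
            return bytes(out)

def pack_line(ip, ident_id, user_id, ts_delta, method_id, url_id,
              version_id, status_idx, length):
    """Hypothetical packed CLF line: string fields become varint indices
    into a dictionary shared across lines; the timestamp is stored as the
    delta (seconds) from the previous line."""
    return b"".join([
        bytes(int(octet) for octet in ip.split(".")),  # 4 fixed bytes
        varint(ident_id),
        varint(user_id),
        varint(ts_delta),
        bytes([method_id]),    # GET/POST/... fits in one byte
        varint(url_id),
        bytes([version_id]),   # HTTP/1.0, 1.1, 2, 3
        bytes([status_idx]),   # index into the ~63 "real" status codes
        varint(length),
    ])

packed = pack_line("127.0.0.1", 0, 0, 3, 0, 0, 0, 0, 2326)
print(len(packed))  # 13 bytes vs. the original 99
```

Note the dictionary references come out even smaller than the 32-bit pointers in the ~33-byte estimate, since small varint indices need only a byte or two each.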
Obviously if I parse this log line into a JSON blob, that blob will compress with a much higher ratio due to the repetitive nature of JSON, but it'll still be larger than the equivalent compressed CLF.
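A quick way to sanity-check that with zlib (the deflate algorithm gzip uses) on synthetic data--the line contents and JSON record shape here are made up for illustration:

```python
import json
import random
import zlib

random.seed(0)
urls = ["/apache_pb.gif", "/index.html", "/style.css", "/logo.png"]
clf_lines, json_recs = [], []
for _ in range(1000):
    url = random.choice(urls)
    length = random.randrange(200, 5000)
    clf_lines.append(
        '127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] '
        f'"GET {url} HTTP/1.0" 200 {length}')
    json_recs.append({"ip": "127.0.0.1", "ident": "user-identifier",
                      "user": "frank", "time": "10/Oct/2000:13:55:36 -0700",
                      "method": "GET", "url": url, "version": "HTTP/1.0",
                      "status": 200, "length": length})

clf = "\n".join(clf_lines).encode()
js = json.dumps(json_recs).encode()
clf_z = len(zlib.compress(clf, 9))
js_z = len(zlib.compress(js, 9))

# JSON starts out roughly twice as large, so its ratio looks better...
print(len(js) / js_z, len(clf) / clf_z)
# ...but compare the absolute compressed sizes:
print(js_z, clf_z)
```

On toy data like this the JSON's higher ratio is mostly the compressor eating the repeated keys; the absolute compressed sizes end up in the same ballpark, which is the point--the better ratio doesn't buy you a smaller file.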
I'm still working on a demo of my "protobuf + fst[3]" idea, so I'm not sure yet whether my "maybe ~5x" claim is totally off the mark. But I'm confident we can do way better than JSON.
[1] https://en.wikipedia.org/wiki/Common_Log_Format

[2] https://www.sec.gov/about/data/edgar-log-file-data-sets

[3] https://crates.io/crates/fst
EDIT: I guess another way to state my conjecture is: "telemetry compression is not general-purpose text compression". These data have a schema, and by ignoring that fact and always treating them as schemaless (i.e., employing general-purpose text compression methods), we're leaving something on the table.
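One cheap way to see the schema effect without building the full pipeline: split the same kind of synthetic lines into fields and lay the values out column by column before compressing, so each column's redundancy sits contiguously (the line shape here is invented for the demo):

```python
import random
import zlib

random.seed(1)
urls = ["/apache_pb.gif", "/index.html", "/style.css"]
lines = []
for i in range(2000):
    lines.append('127.0.0.1 user-identifier frank '
                 f'[10/Oct/2000:13:55:{i % 60:02d} -0700] '
                 f'"GET {random.choice(urls)} HTTP/1.0" 200 '
                 f'{random.randrange(100, 9999)}')
raw = "\n".join(lines).encode()

# Schema-aware pass: whitespace-split each line (the field count is
# constant for this fixed format) and regroup values column by column.
cols = list(zip(*(line.split() for line in lines)))
columnar = b"\n".join("\n".join(col).encode() for col in cols)

raw_z = len(zlib.compress(raw, 9))
col_z = len(zlib.compress(columnar, 9))
print(raw_z, col_z)  # columnar compresses smaller from the same bytes
```

Same bytes, same general-purpose compressor--the only change is using the schema to reorder the data, and deflate does noticeably better on the columnar layout. A real implementation (typed columns, delta-encoded timestamps, dictionaries) should widen the gap further.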