Github project for CLP: https://github.com/y-scope/clp
The interesting part of the article isn't that structured data is easier to compress and store; it's that there's a relatively new way to efficiently transform unstructured logs into structured data. For those shipping unstructured logs to an observability backend, this could be a way to save significant money.
2) Organize the log data into tables with columns, and then compress by column (so each column has its own dictionary). This lets the compression algorithm perform optimally, since now all the similar data is right next to each other. (This reminds me of the Burrows-Wheeler transform, except much more straightforward, thanks to how similar log lines are.)
3) Search is performed without decompression. Somehow they use the dictionary to index into the table, which is very clever. I would have just compressed the search term using the same dictionary and done a binary search for that, but I think that would only work for exact matches.
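A toy sketch of both ideas, per-column dictionary encoding and exact-match search that consults the dictionary instead of decompressing the data (this is my own illustration, not CLP's actual format):

```python
# Toy illustration of per-column dictionary encoding and of searching
# without decompression. My own sketch, not CLP's real on-disk layout.
rows = [
    ("2023-01-01 00:00:01", "INFO", "Began connection"),
    ("2023-01-01 00:00:02", "INFO", "Ended connection"),
    ("2023-01-01 00:00:03", "WARN", "Began connection"),
]

def encode_column(values):
    """Map each distinct value in a column to a small integer ID."""
    dictionary = {}
    ids = [dictionary.setdefault(v, len(dictionary)) for v in values]
    return dictionary, ids

# Column-major layout: similar values now sit next to each other, so a
# general-purpose compressor does far better per column than on rows.
columns = [encode_column(col) for col in zip(*rows)]
msg_dict, msg_ids = columns[2]

def search_exact(term):
    """Consult the (small) dictionary first; only rows whose ID matches
    can possibly contain the term, so the data is never decompressed."""
    term_id = msg_dict.get(term)
    if term_id is None:
        return []  # not in the dictionary -> no matches anywhere
    return [row for row, i in enumerate(msg_ids) if i == term_id]

assert msg_ids == [0, 1, 0]
assert search_exact("Began connection") == [0, 2]
assert search_exact("no such message") == []
```

For non-exact matches, one could scan the dictionary's keys (which are few) for substrings instead of scanning every record, which may be roughly what the paper's index does.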
log(“Began {x} connected to {y}”)
log(“Ended {x} connected to {y}”)
We can label the first one logging operation 1 and the second one logging operation 2.
Then, if logging operation 1 occurs we can write out:
1, {x}, {y} instead of “Began {x} connected to {y}”, because we can reconstruct the message as long as we know which operation occurred (1) and the values of all the variables in the message. This general strategy can be extended to any number of logging operations by just giving them all a unique ID.
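The scheme just described can be sketched in a few lines (the template IDs here are hypothetical, and real implementations would store the variables in a compact binary form):

```python
# Sketch of the template-ID scheme: each logging call site gets a
# unique ID, and only the ID plus the variable values are written out.
templates = {
    1: "Began {x} connected to {y}",
    2: "Ended {x} connected to {y}",
}

def encode(op_id, *variables):
    # What actually gets stored: the ID and the variables, not the text.
    return (op_id, *variables)

def decode(record):
    op_id, x, y = record
    return templates[op_id].format(x=x, y=y)

record = encode(1, "client-7", "db-3")
assert record == (1, "client-7", "db-3")
assert decode(record) == "Began client-7 connected to db-3"
```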
That is basically the source of their entire improvement. The only other thing that may yield a non-trivial improvement is that they delta-encode their timestamps instead of writing out what looks to be a 23-character timestamp string.
The columnar storage of data and dictionary deduplication, what is called Phase 2 in the article, is still not fully implemented according to the article authors and is only expected to result in a 2x improvement. In contrast, the elements I mentioned previously, Phase 1, were responsible for a 169x(!) improvement in storage density.
Having structured data and an additional json/jsonb column where it makes sense can be very powerful. There's a reason every new release of Postgres improves on the performance and features available for the json data type. (https://www.postgresql.org/docs/9.5/functions-json.html)
Of course. If there was, postgres wouldn't even support it.
The GP's rant is usually aimed at people who default to JSON instead of thinking about the data and coming up with an adequate structure. There are way too many of those people.
It's nonsensical most of the time: use a proper table, and transform values on the way out of the DB or in the consumer.
The reason Postgres does it is that lazy developers overused JSON columns, got burned, and then said Postgres is slow (talking from repeated experience here). Yeah, searching a random unstructured blob is slow; surprise.
I don't dislike the idea of storing JSON and structured data together, but... then you don't need performance. Transferring a binary representation of a table and having a binary-to-object converter in your consumer (even Chrome) is several orders of magnitude faster than parsing strings, especially with JSON vomiting the schema at every value.
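A quick toy measurement of that point (illustrative, not a benchmark, and the record shape is invented):

```python
import json
import struct

# Toy comparison: a fixed binary record layout vs. JSON, which repeats
# the "schema" (the key names) for every single record.
rows = [(i, i * 2.5) for i in range(1000)]

as_json = json.dumps([{"id": i, "value": v} for i, v in rows]).encode()
# "<id" = little-endian int32 + float64: 12 bytes per row, no key names.
as_binary = b"".join(struct.pack("<id", i, v) for i, v in rows)

# Decoding the binary form is a fixed-offset unpack, no string parsing.
decoded = [struct.unpack_from("<id", as_binary, off)
           for off in range(0, len(as_binary), 12)]

assert decoded[3] == (3, 7.5)
assert len(as_binary) < len(as_json)  # 12 KB vs. roughly twice that in JSON
```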
Next up: Uber discovers column oriented databases are more efficient for data warehouses.
- their application parses and generates JSON already, so it's low-effort.
- the JSON can have various shapes: database records generally don't do that.
- even if it has the same shape, it can change over time; they don't want to deal with the insane hassle of upgrade-time DB schema changes in existing installations
The alternative to JSON-in-DB is to have a persistent object store. That has downsides too.
You either don't care about the past and can't read it anymore, version your writer and reader each time you realize an address in a new country has yet another frigging field, or parse each JSON value to add the new shape in place in your store.
JSON doesn't solve the problem of shape evolution, but it tempts you very strongly into thinking you can ignore it.
This means we can encode objects in your program directly into objects in the database. It also means we can natively encode documents, sub-documents, arrays, geo-spatial coordinates, floats, ints and decimals. This is a primary function of the driver.
This also allows us to efficiently index these fields, even sub-documents and arrays.
All MongoDB collections are compressed on disk by default.
(I work for MongoDB)
Apparently the Uber site noticed I'm not in the USA and automatically redirects to a localized version, which doesn't exist. If their web-development capabilities are any indication I'll skip their development tips.
This is a very complicated and sophisticated architecture that leverages the JVM to the hilt. The "big data" architecture that Java and the JVM ecosystem present is really something to be admired, and it can definitely move big data.
I know that competition to this architecture must exist in other frameworks or platforms. But what exactly would replace the HDFS, Spark, Yarn configuration described by the article? Are there equivalents of this stack in other non-JVM deployments, or to other big data projects, like Storm, Hive, Flink, Cassandra?
And granted, Hadoop is somewhat "old" at this point. But I think it (and Google's original map-reduce paper) significantly moved the needle in terms of architecture. Hadoop's Map-Reduce might be dated, but HDFS is still being used very successfully in big data centers. Has the cloud and/or Kubernetes completely replaced the described style of architecture at this point?
Honest questions above, interested in other thoughts.
With cloud operating costs dominating expenses at many companies, one can expect more migration away from JVM setups toward simpler (Go) and closer-to-the-metal architectures (Rust, C++).
COBOL has exotic syntax and runs on fringe/exotic hardware (mainframes, minicomputers, IBM iron).
Java has a C-like syntax and runs everywhere people are shoehorning in Go and Node.js. Syntax arguments are bikeshedding, but Java was a "step forward" for non-systems coding from C, and it has fundamental design, architectural, breadth-of-library, interop, modernization, and familiarity advantages over COBOL.
Go is a syntax power stepback, with possibly some GC advantages, and Javascript even with Typescript is still a messed up ecosystem with worse GC and performance issues.
One thing that was interesting was watching the Ruby on Rails stack explode in complexity to encompass an acronym soup nearly as bad as Java's as the years moved forward and it matured. Java isn't as complex an ecosystem as it is due to any failing of the language. It simply has to be, as a mature ecosystem.
Syntax complaints I'll listen to; after all, I do all my JVM programming in Groovy. But if you complain about Java syntax, why would you think Go is "better"?
I think a meta-language will emerge that will have Rust power and checking underlying it but a lot simpler, kind of like elixir and erlang, or typescript and javascript, or, well, Groovy and Java.
But I do get your points and don't necessarily disagree with them. I just don't see this as "legacy" technology, but maybe more like "mature"?
My best effort in finding replacements of those tools that don't leverage the JVM:
HDFS: Any cloud object store like S3/AzBlob, really. In some workloads data locality provided by HDFS may be important. Alluxio can help here (but I cheat, it's a JVM product)
Spark: Different approach, but you could use Dask, Ray, or dbt plus any analytical SQL DB like ClickHouse. If you're in the cloud and not processing tens of TB at a time, spinning up an ephemeral HUGE VM and using something in-memory like DuckDB, Polars or DataFrame.jl is much faster.
Yarn: Kubernetes Jobs. Period. At this point I don't see any advantage of Yarn, including running Spark workloads.
Hive: Maybe Clickhouse for some SQL-like experience. Faster but likely not at the same scale.
Storm/Flink/Cassandra: no clue.
My preferred "modern" FOSS stack (for many reasons) is Python based, with the occasional Julia/Rust thrown in. For a medium scale (ie. few TB daily ingestion), I would go with:
Kubernetes + Airflow + ad-hoc Python jobs + Polars + Huge ephemeral VMs.
And yes, I'm with you. I'm super excited about the changes to the Java language, and the JVM continues to be superior for many workloads. Hotspot is arguably one of the best virtual machines that exists today.
But there are plenty of "Java is dead" blog posts and comments here on HN to substantiate my original viewpoint. Maybe because I make a living with Java, I have a bias towards those articles but filter out others, so I don't have a clean picture of this sentiment and it's more in my head.
I clicked, found out, and was disappointed that this wasn't about wood.
Maybe I should start that woodworking career change already.
I'm curious, are there any managed services / simple to use setups to take advantage of something like this for massive log storage and search? (Most hosted log aggregators I've looked at charge by the raw text GB processed)
IIRC Elasticsearch compresses by default with LZ4
Compressing logs has been a thing since the mid-1990s.
Minimizing writes to disk, or setting up a way to coalesce the writes, has also been around for as long as we have had disk drives. If you don't have enough RAM on your system to buffer the writes so that more of the writes get turned into sequential writes, your disk performance will suffer - this too has been known since the 1990s.
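The coalescing idea can be sketched in a few lines (toy code, ignoring durability and fsync concerns):

```python
import io

# Toy write coalescing: buffer many small writes in RAM and flush them
# as one large sequential write instead of many scattered small ones.
class CoalescingWriter:
    def __init__(self, sink, flush_bytes=64 * 1024):
        self.sink = sink
        self.flush_bytes = flush_bytes
        self.buffer = []
        self.buffered = 0
        self.flushes = 0  # how many actual writes reached the sink

    def write(self, data: bytes):
        self.buffer.append(data)
        self.buffered += len(data)
        if self.buffered >= self.flush_bytes:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.write(b"".join(self.buffer))  # one sequential write
            self.buffer, self.buffered = [], 0
            self.flushes += 1

sink = io.BytesIO()
w = CoalescingWriter(sink, flush_bytes=1024)
for _ in range(100):
    w.write(b"x" * 100)          # 100 small writes from the application...
w.flush()
assert 1 < w.flushes < 100       # ...coalesced into a handful of big ones
assert len(sink.getvalue()) == 100 * 100
```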
> Zstandard or Gzip do not allow gaps in the repetitive pattern; therefore when a log type is interleaved by variable values, they can only identify the multiple substrings of the log type as repetitive.
See for instance this discussion of creating a custom DEFLATE dictionary: https://blog.cloudflare.com/improving-compression-with-prese...
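Python's zlib exposes preset dictionaries via the zdict parameter; a small sketch (the dictionary contents and the message are invented for illustration):

```python
import zlib

# A preset DEFLATE dictionary hands the compressor likely substrings up
# front, so even a single short message can back-reference them.
preset = b"Began X connected to Y; Ended X connected to Y"
msg = b"Began client-7 connected to db-3"

plain = zlib.compress(msg, 9)

c = zlib.compressobj(level=9, zdict=preset)
with_dict = c.compress(msg) + c.flush()

# The decompressor must be given the exact same dictionary.
d = zlib.decompressobj(zdict=preset)
assert d.decompress(with_dict) + d.flush() == msg

# With most of the message covered by the dictionary, the preset-dict
# stream comes out smaller than the plain one.
assert len(with_dict) < len(plain)
```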
There is another way to tackle the problem for most normal, back-end applications: Dynamic Logging[0].
Instead of adding a large amount of logging during development (and then having to deal with compressing and transforming it later), one can instead choose to add only the logs required at runtime.
This is a workflow shift, and as such should be handled with care. But for the majority of logs used for troubleshooting, it's actually a saner approach: Don't make a priori assumptions about what you might need in production, then try and "massage" the right parts out of it when the problem rears its head.
Instead, when facing an issue, add logs where and when you need them to almost "surgically" only get the bits you want. This way, logging cost reduction happens naturally - because you're never writing many of the logs to begin with.
Note: we're not talking about removing logs needed for compliance, forensics or other regulatory reasons here, of course. We're talking about those logs that are used by developers to better understand what's going on inside the application: the "print this variable" or "show this user's state" or "show me which path the execution took" type logs, the ones you look at once and then forget about (while their cost piles on and on).
We call this workflow "Dynamic Logging", and have a fully-featured version of the product available for use at the website with up to 3 live instances.
On a personal - albeit obviously biased - note, I was an SRE before I joined the company, and saw an early demo of the product. I remember uttering a very verbal f-word during the demonstration, and thinking that I want me one of these nice little IDE thingies this company makes. It's a different way to think about logging - I'll give you that - but it makes a world of sense to me.
I would also at least try a cluster-local log buffering system that forwards INFO and above as received, but buffers DEBUG and below, optionally allowing someone to uncork them if required, getting the "time traveling logging" you were describing. The risk, of course, is the more chains in that transmission flow the more opportunities for something to go sideways and take out all logs which would be :-(
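The stdlib already implements a single-process version of this "uncork on error" idea in logging.handlers.MemoryHandler, which buffers records and only forwards them when something at or above flushLevel arrives; a minimal sketch (logger names and messages are made up):

```python
import logging
import logging.handlers

# MemoryHandler holds records in memory and forwards them to its target
# only when a record at or above flushLevel arrives: DEBUG lines stay
# buffered until an ERROR "uncorks" them.
captured = []

class ListHandler(logging.Handler):
    def emit(self, record):
        captured.append(record.getMessage())

target = ListHandler()
buffering = logging.handlers.MemoryHandler(
    capacity=10_000, flushLevel=logging.ERROR, target=target)

log = logging.getLogger("demo")
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(buffering)

log.debug("connection attempt 1")   # buffered, not forwarded yet
log.debug("connection attempt 2")   # buffered
assert captured == []

log.error("connection failed")      # flushes everything buffered so far
assert captured == ["connection attempt 1", "connection attempt 2",
                    "connection failed"]
```

The caveat from the comment above applies here too: the buffer lives in the process, so a crash can take the buffered DEBUG lines with it.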
We're more after ongoing situations, where the issue is either hard to reproduce locally or requires very specific state - APIs returning wrong data, vague API 500 errors, application transactions issues, misbehaving caches, 3rd party library errors - that kind of stuff.
If you're looking at the app and your approach would normally be to add another hotfix with logging because some specific piece of information is missing, this approach works beautifully.
We had a dashboard where you could flip on certain logging only as needed.
I tried finding you on twitter but no go since DMs are closed.
Would be happy to pick your brain about the topic - tom@granot.dev is where I’m at if you have the time!
Sounds interesting, now I want to read up on CLP. Not that we have much log texts to worry about.
Says the person who at work just added structured logging to our new product.