Github project for CLP: https://github.com/y-scope/clp
The interesting part of the article isn't that structured data is easier to compress and store; it's that there's a relatively new way to efficiently transform unstructured logs into structured data. For those shipping unstructured logs to an observability backend, this could be a way to save significant money.
2) Organize the log data into tables with columns, and then compress by column (so each column has its own dictionary). This lets the compression algorithm perform optimally, since now all the similar data is right next to each other. (This reminds me of the Burrows-Wheeler transform, except much more straightforward, thanks to how similar log lines are.)
3) Search is performed without decompression. Somehow they use the dictionary to index into the table, which is very clever. I would have just compressed the search term using the same dictionary and done a binary search for that, but I think that would only work for exact matches.
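A toy sketch of both ideas, per-column dictionary encoding and exact-match search that consults the dictionary instead of decompressing the data (this is my own illustration, not CLP's actual format):

```python
# Toy illustration of per-column dictionary encoding and of searching
# without decompression. My own sketch, not CLP's real on-disk layout.
rows = [
    ("2023-01-01 00:00:01", "INFO", "Began connection"),
    ("2023-01-01 00:00:02", "INFO", "Ended connection"),
    ("2023-01-01 00:00:03", "WARN", "Began connection"),
]

def encode_column(values):
    """Map each distinct value in a column to a small integer ID."""
    dictionary = {}
    ids = [dictionary.setdefault(v, len(dictionary)) for v in values]
    return dictionary, ids

# Column-major layout: similar values now sit next to each other, so a
# general-purpose compressor does far better per column than on rows.
columns = [encode_column(col) for col in zip(*rows)]
msg_dict, msg_ids = columns[2]

def search_exact(term):
    """Consult the (small) dictionary first; only rows whose ID matches
    can possibly contain the term, so the data is never decompressed."""
    term_id = msg_dict.get(term)
    if term_id is None:
        return []  # not in the dictionary -> no matches anywhere
    return [row for row, i in enumerate(msg_ids) if i == term_id]

assert msg_ids == [0, 1, 0]
assert search_exact("Began connection") == [0, 2]
assert search_exact("no such message") == []
```

For non-exact matches, one could scan the dictionary's keys (which are few) for substrings instead of scanning every record, which may be roughly what the paper's index does.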
log(“Began {x} connected to {y}”)
log(“Ended {x} connected to {y}”)
We can label the first one logging operation 1 and the second one logging operation 2.
Then, if logging operation 1 occurs we can write out:
1, {x}, {y} instead of “Began {x} connected to {y}”, because we can reconstruct the message as long as we know which operation occurred (1) and the values of all the variables in the message. This general strategy can be extended to any number of logging operations by just giving them all a unique ID.
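The scheme just described can be sketched in a few lines (the template IDs here are hypothetical, and real implementations would store the variables in a compact binary form):

```python
# Sketch of the template-ID scheme: each logging call site gets a
# unique ID, and only the ID plus the variable values are written out.
templates = {
    1: "Began {x} connected to {y}",
    2: "Ended {x} connected to {y}",
}

def encode(op_id, *variables):
    # What actually gets stored: the ID and the variables, not the text.
    return (op_id, *variables)

def decode(record):
    op_id, x, y = record
    return templates[op_id].format(x=x, y=y)

record = encode(1, "client-7", "db-3")
assert record == (1, "client-7", "db-3")
assert decode(record) == "Began client-7 connected to db-3"
```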
That is basically the source of their entire improvement. The only other thing that may yield a non-trivial improvement is that they delta-encode their timestamps instead of writing out what looks to be a 23-character timestamp string.
The columnar storage of data and dictionary deduplication, what is called Phase 2 in the article, is still not fully implemented according to the article authors and is only expected to result in a 2x improvement. In contrast, the elements I mentioned previously, Phase 1, were responsible for a 169x(!) improvement in storage density.
Having structured data and an additional json/jsonb column where it makes sense can be very powerful. There's a reason every new release of Postgres improves on the performance and features available for the json data type. (https://www.postgresql.org/docs/9.5/functions-json.html)
Of course. If there was, postgres wouldn't even support it.
The GP's rant is usually aimed at people who default to JSON instead of thinking about the data and coming up with an adequate structure. There are way too many of those people.
It's nonsensical most of the time: use a proper table, and transform values on the way out of the DB or in the consumer.
The reason Postgres does it is that lazy developers overused JSON columns, got burned, and then said Postgres is slow (talking from repeated experience here). Yeah, searching a random unstructured blob is slow; surprise.
I don't dislike the idea of storing JSON and structured data together, but... then you don't need performance. Transferring a binary representation of a table and having a binary-to-object converter in your consumer (even Chrome) is several orders of magnitude faster than parsing strings, especially with JSON vomiting the schema at every value.
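A quick toy measurement of that point (illustrative, not a benchmark, and the record shape is invented):

```python
import json
import struct

# Toy comparison: a fixed binary record layout vs. JSON, which repeats
# the "schema" (the key names) for every single record.
rows = [(i, i * 2.5) for i in range(1000)]

as_json = json.dumps([{"id": i, "value": v} for i, v in rows]).encode()
# "<id" = little-endian int32 + float64: 12 bytes per row, no key names.
as_binary = b"".join(struct.pack("<id", i, v) for i, v in rows)

# Decoding the binary form is a fixed-offset unpack, no string parsing.
decoded = [struct.unpack_from("<id", as_binary, off)
           for off in range(0, len(as_binary), 12)]

assert decoded[3] == (3, 7.5)
assert len(as_binary) < len(as_json)  # 12 KB vs. roughly twice that in JSON
```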
Next up: Uber discovers column oriented databases are more efficient for data warehouses.
- their application parses and generates JSON already, so it's low-effort.
- the JSON can have various shapes: database records generally don't do that.
- even if it has the same shape, it can change over time; they don't want to deal with the insane hassle of upgrade-time DB schema changes in existing installations
The alternative to JSON-in-DB is to have a persistent object store. That has downsides too.
You either don't care about the past and can't read it anymore, version your writer and reader each time you realize an address in a new country has yet another frigging field, or parse each JSON value to add the new shape in place in your store.
JSON doesn't solve the problem of shape evolution, but it tempts you very strongly into thinking you can ignore it.
This means we can encode objects in your program directly into objects in the database. It also means we can natively encode documents, sub-documents, arrays, geo-spatial coordinates, floats, ints and decimals. This is a primary function of the driver.
This also allows us to efficiently index these fields, even sub-documents and arrays.
All MongoDB collections are compressed on disk by default.
(I work for MongoDB)
Apparently the Uber site noticed I'm not in the USA and automatically redirects to a localized version, which doesn't exist. If their web-development capabilities are any indication I'll skip their development tips.
This is a very complicated and sophisticated architecture that leverages the JVM to the hilt. The "big data" architecture that Java and the JVM ecosystem present is really something to be admired, and it can definitely move big data.
I know that competition to this architecture must exist in other frameworks or platforms. But what exactly would replace the HDFS, Spark, Yarn configuration described by the article? Are there equivalents of this stack in other non-JVM deployments, or to other big data projects, like Storm, Hive, Flink, Cassandra?
And granted, Hadoop is somewhat "old" at this point. But I think it (and Google's original map-reduce paper) significantly moved the needle in terms of architecture. Hadoop's Map-Reduce might be dated, but HDFS is still being used very successfully in big data centers. Has the cloud and/or Kubernetes completely replaced the described style of architecture at this point?
Honest questions above, interested in other thoughts.
With cloud operating costs dominating expenses at many companies, one can expect more migration away from JVM setups toward simpler (Go) and closer-to-the-metal architectures (Rust, C++).
COBOL has exotic syntax and runs on fringe/exotic hardware (mainframes, minicomputers, IBM iron).
Java has a C-like syntax and runs everywhere people are shoehorning in Go and Node.js. Syntax arguments are bikeshedding, but Java was a "step forward" for non-systems coding from C, and it has fundamental design, architectural, breadth-of-library, interop, modernization, and familiarity advantages over COBOL.
Go is a syntax power stepback, with possibly some GC advantages, and Javascript even with Typescript is still a messed up ecosystem with worse GC and performance issues.
One thing that was interesting was watching the Ruby on Rails stack explode in complexity to encompass an acronym soup nearly as bad as Java's as the years moved forward and it matured. Java isn't as complex an ecosystem as it is due to any failing of the language. It simply has to be, as a mature ecosystem.
Syntax complaints I'll listen to; after all, I do all my JVM programming in Groovy. But if you complain about Java syntax, why would you think Go is "better"?
I think a meta-language will emerge that will have Rust power and checking underlying it but a lot simpler, kind of like elixir and erlang, or typescript and javascript, or, well, Groovy and Java.
But I do get your points and don't necessarily disagree with them. I just don't see this as "legacy" technology, but maybe more like "mature"?
My best effort in finding replacements of those tools that don't leverage the JVM:
HDFS: Any cloud object store like S3/AzBlob, really. In some workloads data locality provided by HDFS may be important. Alluxio can help here (but I cheat, it's a JVM product)
Spark: Different approach, but you could use Dask, Ray, or dbt plus any analytical SQL DB like ClickHouse. If you're in the cloud and not processing tens of TB at a time, spinning up an ephemeral HUGE VM and using something in-memory like DuckDB, Polars or DataFrame.jl is much faster.
Yarn: Kubernetes Jobs. Period. At this point I don't see any advantage of Yarn, including running Spark workloads.
Hive: Maybe Clickhouse for some SQL-like experience. Faster but likely not at the same scale.
Storm/Flink/Cassandra: no clue.
My preferred "modern" FOSS stack (for many reasons) is Python based, with the occasional Julia/Rust thrown in. For a medium scale (ie. few TB daily ingestion), I would go with:
Kubernetes + Airflow + ad-hoc Python jobs + Polars + Huge ephemeral VMs.
And yes, I'm with you. I'm super excited about the changes to the Java language, and the JVM continues to be superior for many workloads. Hotspot is arguably one of the best virtual machines that exists today.
But there are plenty of "Java is dead" blog posts and comments here on HN to substantiate my original viewpoint. Maybe because I make a living with Java, I have a bias towards those articles but filter out others, so I don't have a clean picture of this sentiment and it's more in my head.
I clicked, found out, and was disappointed that this wasn't about wood.
Maybe I should start that woodworking career change already.
I'm curious, are there any managed services / simple to use setups to take advantage of something like this for massive log storage and search? (Most hosted log aggregators I've looked at charge by the raw text GB processed)
IIRC Elasticsearch compresses by default with LZ4
Compressing logs has been a thing since the mid-1990s.
Minimizing writes to disk, or setting up a way to coalesce the writes, has also been around for as long as we have had disk drives. If you don't have enough RAM on your system to buffer the writes so that more of the writes get turned into sequential writes, your disk performance will suffer - this too has been known since the 1990s.
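The coalescing idea can be sketched in a few lines (toy code, ignoring durability and fsync concerns):

```python
import io

# Toy write coalescing: buffer many small writes in RAM and flush them
# as one large sequential write instead of many scattered small ones.
class CoalescingWriter:
    def __init__(self, sink, flush_bytes=64 * 1024):
        self.sink = sink
        self.flush_bytes = flush_bytes
        self.buffer = []
        self.buffered = 0
        self.flushes = 0  # how many actual writes reached the sink

    def write(self, data: bytes):
        self.buffer.append(data)
        self.buffered += len(data)
        if self.buffered >= self.flush_bytes:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.write(b"".join(self.buffer))  # one sequential write
            self.buffer, self.buffered = [], 0
            self.flushes += 1

sink = io.BytesIO()
w = CoalescingWriter(sink, flush_bytes=1024)
for _ in range(100):
    w.write(b"x" * 100)          # 100 small writes from the application...
w.flush()
assert 1 < w.flushes < 100       # ...coalesced into a handful of big ones
assert len(sink.getvalue()) == 100 * 100
```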
> Zstandard or Gzip do not allow gaps in the repetitive pattern; therefore when a log type is interleaved by variable values, they can only identify the multiple substrings of the log type as repetitive.
See for instance this discussion of creating a custom DEFLATE dictionary: https://blog.cloudflare.com/improving-compression-with-prese...
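Python's zlib exposes preset dictionaries via the zdict parameter; a small sketch (the dictionary contents and the message are invented for illustration):

```python
import zlib

# A preset DEFLATE dictionary hands the compressor likely substrings up
# front, so even a single short message can back-reference them.
preset = b"Began X connected to Y; Ended X connected to Y"
msg = b"Began client-7 connected to db-3"

plain = zlib.compress(msg, 9)

c = zlib.compressobj(level=9, zdict=preset)
with_dict = c.compress(msg) + c.flush()

# The decompressor must be given the exact same dictionary.
d = zlib.decompressobj(zdict=preset)
assert d.decompress(with_dict) + d.flush() == msg

# With most of the message covered by the dictionary, the preset-dict
# stream comes out smaller than the plain one.
assert len(with_dict) < len(plain)
```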
There is another way to tackle the problem for most normal, back-end applications: Dynamic Logging[0].
Instead of adding a large amount of logging during development (and then having to deal with compressing and transforming it later), one can instead choose to add only the logs required at runtime.
This is a workflow shift, and as such should be handled with care. But for the majority of logs used for troubleshooting, it's actually a saner approach: Don't make a priori assumptions about what you might need in production, then try and "massage" the right parts out of it when the problem rears its head.
Instead, when facing an issue, add logs where and when you need them to almost "surgically" only get the bits you want. This way, logging cost reduction happens naturally - because you're never writing many of the logs to begin with.
Note: we're not talking about removing logs needed for compliance, forensics or other regulatory reasons here, of course. We're talking about those logs that are used by developers to better understand what's going on inside the application: the "print this variable" or "show this user's state" or "show me which path the execution took" type logs, the ones you look at once and then forget about (while their cost piles on and on).
We call this workflow "Dynamic Logging", and have a fully-featured version of the product available for use at the website with up to 3 live instances.
On a personal - albeit obviously biased - note, I was an SRE before I joined the company, and saw an early demo of the product. I remember uttering a very verbal f-word during the demonstration, and thinking that I want me one of these nice little IDE thingies this company makes. It's a different way to think about logging - I'll give you that - but it makes a world of sense to me.
I would also at least try a cluster-local log buffering system that forwards INFO and above as received, but buffers DEBUG and below, optionally allowing someone to uncork them if required, getting the "time traveling logging" you were describing. The risk, of course, is the more chains in that transmission flow the more opportunities for something to go sideways and take out all logs which would be :-(
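The stdlib already implements a single-process version of this "uncork on error" idea in logging.handlers.MemoryHandler, which buffers records and only forwards them when something at or above flushLevel arrives; a minimal sketch (logger names and messages are made up):

```python
import logging
import logging.handlers

# MemoryHandler holds records in memory and forwards them to its target
# only when a record at or above flushLevel arrives: DEBUG lines stay
# buffered until an ERROR "uncorks" them.
captured = []

class ListHandler(logging.Handler):
    def emit(self, record):
        captured.append(record.getMessage())

target = ListHandler()
buffering = logging.handlers.MemoryHandler(
    capacity=10_000, flushLevel=logging.ERROR, target=target)

log = logging.getLogger("demo")
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(buffering)

log.debug("connection attempt 1")   # buffered, not forwarded yet
log.debug("connection attempt 2")   # buffered
assert captured == []

log.error("connection failed")      # flushes everything buffered so far
assert captured == ["connection attempt 1", "connection attempt 2",
                    "connection failed"]
```

The caveat from the comment above applies here too: the buffer lives in the process, so a crash can take the buffered DEBUG lines with it.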
We're more after ongoing situations, where the issue is either hard to reproduce locally or requires very specific state - APIs returning wrong data, vague API 500 errors, application transactions issues, misbehaving caches, 3rd party library errors - that kind of stuff.
If you're looking at the app and your approach would normally be to add another hotfix with logging because some specific piece of information is missing, this approach works beautifully.
We had a dashboard where you could flip on certain logging only as needed.
I tried finding you on twitter but no go since DMs are closed.
Would be happy to pick your brain about the topic - tom@granot.dev is where I’m at if you have the time!
Sounds interesting, now I want to read up on CLP. Not that we have much log texts to worry about.
Says the person who at work just added structured logging to our new product.