Additionally, what happens when we want to correlate these logs with tens of other systems?
I guess I don't agree that distributed log analysis simplifies the problem any more than centralized log analysis does. If the primary concern is cost, then you can save equivalent amounts of money with a different lifecycle policy for centralized logs.
EDIT: Btw, don't get me wrong, you are asking the right questions, the ones HubSpot's performance team should be asking. The first phase of a cost savings program should weigh benefits against cost, or stated another way, requirements vs. cost. The key question is: how do we actually use this data after we log it? I find it striking that this cost analysis didn't say anything about the end-user's use cases or benefits. Sure, we can optimize a system and save 40% of the cost, but what if no one is using the system? Then we could save 100% of the cost.
Setting aside that the human time required for the investigation was probably close to $40-50, it was still not a slam dunk to get the business to shrink retention to a few days for critical debug.
I get that modern tech companies log every movement and interaction a user has with an app, far beyond any amount that is reasonable, but surely at some point you can go “we probably don’t need this”.
It shouldn’t be a matter of “let’s compress the logs”; it should be “are we even using these logs?”
At 20% of storage costs, this makes a lot of sense to focus on. Once it becomes 1% of storage costs, it’s maybe not as problematic. The degree to which “let’s compress the logs” changes how much something like “am I logging too much” matters is important. Taking it to the absurd: if logging storage were free, why not retain all logs? And if logging is cheap, why invest in complicated guardrails for what qualifies as important logs?
A specific consideration for us is organizational inertia. We have a lot of teams using infrastructure in a lot of ways, both intended and unintended. One thing that has been emphasized for us, for better or worse, is developer velocity, which includes abstracting “do I need to log this?” away from most engineers. We have some guardrails that alert if you log an egregious volume.
I think we do often opt for non-invasive infra solutions first because they have much shorter delivery times and less risk of stalling on long-tail outliers. They avoid very expensive organizational costs of buy in and team-level migration. I’m not suggesting this is the best organizational model, but it also transcends one team’s influence.
That circles back to the start of the problem. If we can transparently reduce the cost burden of some heavily-used internal infrastructure by roughly the same magnitude as a paradigm shift in how that infrastructure is used would, the former wins out.
I'm looking forward to the next post.
If each service has a unique IAM role, which it definitely should, wouldn’t you be able to track this via a combination of CloudTrail and proper resource tags?
The post doesn't mention that this system also tracks internal data dimensions like customer IDs, so we can also use this sampled data to estimate per-customer cost (and join that with customer tiers, and so forth).
I'm also not sure that would let us attribute the cost of our datastores, since those aren't AWS-hosted versions but ones we run ourselves. The traffic interception lets us say that Application A is using 75% of database cluster XYZ, and therefore that application/product group is most likely responsible for that share of the database's cost.
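The proportional-attribution idea above can be sketched in a few lines. This is a toy illustration with invented app names and costs, not the actual implementation:

```python
# Hypothetical sketch: split a shared cluster's monthly cost across
# applications by their share of sampled (intercepted) calls.
# All names and numbers below are made up for illustration.

def attribute_cost(monthly_cost, sampled_calls_by_app):
    """Return each app's attributed share of a shared resource's cost."""
    total = sum(sampled_calls_by_app.values())
    return {app: monthly_cost * calls / total
            for app, calls in sampled_calls_by_app.items()}

# "app-a" accounts for 75% of sampled traffic to the cluster,
# so it is attributed 75% of the cluster's cost.
shares = attribute_cost(10_000, {"app-a": 750, "app-b": 200, "app-c": 50})
```

With these made-up samples, `shares["app-a"]` comes out to 7500.0, i.e. 75% of the cluster cost.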
The last thing I'll mention is that CloudTrail has the potential to be expensive on its own, I believe more so than us storing the raw data in S3 for something like Athena to read. I don't think I'll be writing about it, but we've also done work this last year to trim down what we track in CloudTrail due to the cost of events (for example, tracking everything in S3 ends up being pretty expensive).
Is this always true? Typically the shared resources you care about are CPU, memory, and disk. I would say an application issuing fewer, much heavier queries is using the shared resource more than an application that issues many really simple queries. And that doesn’t correlate much with disk usage, right?
There isn’t really a good solution to this. You can use a combination of query sampling and per-app databases to correlate this better.
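One way query sampling can account for the heavy-vs-light problem is to weight each sampled query by its execution time instead of counting queries. A minimal sketch, with invented apps and timings:

```python
# Hypothetical sketch: weight sampled queries by execution time rather
# than raw query count, so a few heavy queries outweigh many cheap ones.
# App names and timings are invented for illustration.
from collections import defaultdict

def weighted_shares(sampled_queries):
    """sampled_queries: list of (app, exec_time_ms). Returns each app's
    share of total sampled execution time."""
    totals = defaultdict(float)
    for app, exec_ms in sampled_queries:
        totals[app] += exec_ms
    grand = sum(totals.values())
    return {app: t / grand for app, t in totals.items()}

# app-b issues ten cheap queries; app-a issues two heavy ones.
samples = [("app-a", 500.0), ("app-a", 450.0)] + [("app-b", 5.0)] * 10
shares = weighted_shares(samples)
```

Here `app-a` ends up with a 95% share despite issuing far fewer queries, which matches the intuition about heavy queries; it still doesn't say anything about disk usage, though.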
Great post though, this is something we’ve been dealing with and experimenting with recently.
Also, it's not clear to me how intercepting calls helped you figure out the offending services.
As for the call sampling/interception, that did not factor into discovering the high cost buckets in the logging case study. It was mostly relevant to generally describing how we track costs and it ends up being useful in other scenarios. For example it could be used to assess the estimated unit economics of customers subscribed into a specific product tier.
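As a toy illustration of that unit-economics idea, joining per-customer attributed cost against a customer-to-tier mapping (all names and numbers invented):

```python
# Hypothetical sketch: average attributed cost per customer within each
# product tier, given per-customer sampled cost and a tier mapping.
# Customer IDs, tiers, and costs are invented for illustration.
from collections import defaultdict

def cost_per_tier(cost_by_customer, tier_by_customer):
    """Return average attributed cost per customer, grouped by tier."""
    totals, counts = defaultdict(float), defaultdict(int)
    for customer, cost in cost_by_customer.items():
        tier = tier_by_customer[customer]
        totals[tier] += cost
        counts[tier] += 1
    return {tier: totals[tier] / counts[tier] for tier in totals}

avg = cost_per_tier({"c1": 120.0, "c2": 80.0, "c3": 10.0},
                    {"c1": "enterprise", "c2": "enterprise", "c3": "free"})
```

With these made-up inputs, the average attributed cost is 100.0 per enterprise customer and 10.0 per free customer, the kind of tier-level unit economics described above.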
We also have the death-star microservice model, so even relatively simple attribution can be helpful when you want to run a query like “for my team, which owns 30 applications, show me the monthly attributed cost grouped by resource” and get back all the associated database and cloud costs.