Additionally, what happens when we want to correlate these logs with tens of other systems?
I guess I don't agree that distributed log analysis simplifies the problem any more than centralized log analysis does. If the primary concern is cost, then you can save equivalent amounts of money with a different lifecycle policy for centralized logs.
EDIT: Btw, don't get me wrong, you are asking the right questions, the ones HubSpot's performance team should be asking. The first phase of a cost savings program should weigh benefits against cost, or stated another way, requirements vs. cost. The key question is: how do we actually use this data after we log it? I find it striking that this cost analysis didn't say anything about the end-user's use cases or benefits. Sure, we can optimize a system and save 40% of the cost, but what if no one is using the system? Then we could save 100% of the cost.
Setting aside that the human time required for the investigation was probably close to $40-50, it was still not a slam dunk to get the business to shrink retention to a few days for critical debug.
I get that modern tech companies log every movement and interaction a user has with an app, far beyond any amount that is reasonable, but surely at some point you can go “we probably don’t need this”.
It shouldn’t be a matter of “let’s compress the logs”; it should be “are we even using these logs?”
At 20% of storage costs, this makes a lot of sense to focus on. Once it becomes 1% of storage costs, it’s maybe not as problematic. The degree to which “let’s compress the logs” changes how much something like “am I logging too much” matters is important. Taking it to the absurd: if logging storage were free, why not retain all logs? And if logging is cheap, why invest in complicated guardrails for what qualifies as important logs?
A specific consideration for us is organizational inertia. We have a lot of teams using infrastructure in a lot of ways, both intended and unintended. One thing that has been emphasized for us, for better or worse, is developer velocity, which includes abstracting “do I need to log this?” away from most engineers. We have some guardrails that alert if you log an egregious volume.
I think we do often opt for non-invasive infra solutions first because they have much shorter delivery times and less risk of stalling on long-tail outliers. They avoid very expensive organizational costs of buy in and team-level migration. I’m not suggesting this is the best organizational model, but it also transcends one team’s influence.
That circles back to the start of the problem. If we can transparently reduce the cost burden of some heavily-used internal infrastructure by roughly the same magnitude as a paradigm shift in how that infrastructure is used would, the former wins out.
I'm looking forward to the next post.
If each service has a unique IAM role, which it definitely should, wouldn’t you be able to track this via a combination of CloudTrail and proper resource tags?
The post doesn't mention that this system also tracks internal data dimensions like customer IDs, so we can also use this sampled data to estimate per-customer cost (and join that with customer tiers, and so forth).
I'm also not sure that would let us attribute the cost of our datastores, since those aren't AWS-hosted versions but ones we run ourselves. The traffic interception lets us say that Application A is using 75% of database cluster XYZ, and therefore that application/product group is most likely responsible for that share of the database's cost.
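The proportional-attribution idea above can be sketched in a few lines. This is a toy illustration with invented app names and costs, not the actual implementation:

```python
# Hypothetical sketch: split a shared cluster's monthly cost across
# applications by their share of sampled (intercepted) calls.
# All names and numbers below are made up for illustration.

def attribute_cost(monthly_cost, sampled_calls_by_app):
    """Return each app's attributed share of a shared resource's cost."""
    total = sum(sampled_calls_by_app.values())
    return {app: monthly_cost * calls / total
            for app, calls in sampled_calls_by_app.items()}

# "app-a" accounts for 75% of sampled traffic to the cluster,
# so it is attributed 75% of the cluster's cost.
shares = attribute_cost(10_000, {"app-a": 750, "app-b": 200, "app-c": 50})
```

With these made-up samples, `shares["app-a"]` comes out to 7500.0, i.e. 75% of the cluster cost.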
The last thing I'll mention is that CloudTrail has the potential to be expensive on its own, I believe more so than us storing the raw data in S3 for something like Athena to read. I don't think I'll be writing about it, but we've also done work this last year to trim down what we track in CloudTrail due to the cost of events (for example, tracking everything in S3 ends up being pretty expensive).
Is this always true? Typically the shared resources you care about are CPU, memory, and disk. I would say an application issuing fewer, much heavier queries is using the shared resource more than an application that issues many really simple queries. And that doesn’t correlate much with disk usage, right?
There isn’t really a good solution to this. You can use a combination of query sampling and per-app databases to correlate this better.
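One way query sampling can account for the heavy-vs-light problem is to weight each sampled query by its execution time instead of counting queries. A minimal sketch, with invented apps and timings:

```python
# Hypothetical sketch: weight sampled queries by execution time rather
# than raw query count, so a few heavy queries outweigh many cheap ones.
# App names and timings are invented for illustration.
from collections import defaultdict

def weighted_shares(sampled_queries):
    """sampled_queries: list of (app, exec_time_ms). Returns each app's
    share of total sampled execution time."""
    totals = defaultdict(float)
    for app, exec_ms in sampled_queries:
        totals[app] += exec_ms
    grand = sum(totals.values())
    return {app: t / grand for app, t in totals.items()}

# app-b issues ten cheap queries; app-a issues two heavy ones.
samples = [("app-a", 500.0), ("app-a", 450.0)] + [("app-b", 5.0)] * 10
shares = weighted_shares(samples)
```

Here `app-a` ends up with a 95% share despite issuing far fewer queries, which matches the intuition about heavy queries; it still doesn't say anything about disk usage, though.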
Great post though, this is something we’ve been dealing with and experimenting with recently.
Also, it's not clear to me how intercepting calls helped you figure out the offending services.
As for the call sampling/interception, that did not factor into discovering the high cost buckets in the logging case study. It was mostly relevant to generally describing how we track costs and it ends up being useful in other scenarios. For example it could be used to assess the estimated unit economics of customers subscribed into a specific product tier.
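As a toy illustration of that unit-economics idea, joining per-customer attributed cost against a customer-to-tier mapping (all names and numbers invented):

```python
# Hypothetical sketch: average attributed cost per customer within each
# product tier, given per-customer sampled cost and a tier mapping.
# Customer IDs, tiers, and costs are invented for illustration.
from collections import defaultdict

def cost_per_tier(cost_by_customer, tier_by_customer):
    """Return average attributed cost per customer, grouped by tier."""
    totals, counts = defaultdict(float), defaultdict(int)
    for customer, cost in cost_by_customer.items():
        tier = tier_by_customer[customer]
        totals[tier] += cost
        counts[tier] += 1
    return {tier: totals[tier] / counts[tier] for tier in totals}

avg = cost_per_tier({"c1": 120.0, "c2": 80.0, "c3": 10.0},
                    {"c1": "enterprise", "c2": "enterprise", "c3": "free"})
```

With these made-up inputs, the average attributed cost is 100.0 per enterprise customer and 10.0 per free customer, the kind of tier-level unit economics described above.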
We also have the death-star microservice model, so even relatively simple attribution can be helpful when you want to run a query like “for my team, which owns 30 applications, show me the monthly attributed cost grouped by resource” and get back all the associated database and cloud costs.