collecting user-identified telemetry is debatable (depends on the case), but not collecting anything at all is just plain stupid.
When the business metrics start to fail. You don't need constant metics on storage, you can poll it every so often. If your app is constrained by CPU or RAM, then the business metrics will reflect that, and then you can turn on collection of those metrics to identify the problem.
> i feel like you have never touched servers/backend in anything more than simple projects (or at all). with full storage/memory there could be an issue that you won't be able to ssh to the server, so it speaks about your knowledge in this matter.
I ran all of ops for reddit for four years and headed up SRE at Netflix, so I have some experience in large scale systems. Not that it should matter.
After having annoyed how many users and lost how much revenue? Having metrics to identify brewing problems before issues start to arise (be they on arriving CPU, memory, disk, network constraints or increasing network latency which will soon but not yet show up in the business metrics) is valuable.
> I ran all of ops for reddit for four years and headed up SRE at Netflix, so I have some experience in large scale systems. Not that it should matter.
I have a hard time believing at either of those it was acceptable to have a problem ongoing for days without any idea what's happening because logs and metrics weren't enabled in the first place.
…but why incur that round trip on my feedback loop? Having those metrics on doesn’t cost me much.
This feels potentially like the perspective of a large organisation with both mature monitoring systems and quite steady state user base activity (through scale). When I have a customer who had an issue yesterday because they had an unusual workload that won’t be repeated often, I can’t afford not to have had the basic metrics turned on, in case they point us in the right direction.
not having cpu/mem/hdd metrics is just plain bogus and sounds like fantasy world, where everything works like we expect it to work, and there is no bugs at all. ridiculous