tptacek4mo ago

I don't think this system is best thought of as "deployment" in the sense of CI/CD; it's a control channel for a distributed bot detection system that (apparently) happens to be actuated by published config files (it has a consul-template vibe to it, though I don't know if that's what it is).

EvanAnderson4mo ago

That's why I likened it to Crowdstrike. It's a signature database that blew up the consumer of said database. (You probably caught my post mid-edit, too. You may be replying to the snarky paragraph I thought better of and removed.)

Edit: Similar to Crowdstrike, the bot detector should have fallen back to its last-known-good signature database after panicking, instead of just continuing to panic.
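
To make the fallback idea concrete, here is a minimal sketch in Python. It is not how the actual bot detector works; `SignatureStore`, `validate_rules`, the `rules` key, and the `MAX_RULES` limit are all hypothetical names invented for illustration. The point is only that a bad update gets rejected and the previously accepted set stays active.

```python
import json
from typing import Callable, Optional

MAX_RULES = 100  # hypothetical capacity limit of the consumer


def validate_rules(sigs: object) -> None:
    """Raise ValueError if the candidate signature set is unusable."""
    if not isinstance(sigs, dict) or not isinstance(sigs.get("rules"), list):
        raise ValueError("expected an object with a 'rules' list")
    if len(sigs["rules"]) > MAX_RULES:
        raise ValueError("too many rules for this consumer")


class SignatureStore:
    """Holds the active signature set; a failed update never replaces it."""

    def __init__(self, validate: Callable[[object], None]) -> None:
        self._validate = validate
        self._active: Optional[dict] = None

    def load(self, raw: str) -> bool:
        """Try to activate a new signature file; return True if accepted."""
        try:
            candidate = json.loads(raw)
            self._validate(candidate)
        except ValueError:
            # json.JSONDecodeError is a ValueError subclass, so parse and
            # validation failures both land here. Keep the last-known-good set.
            return False
        self._active = candidate
        return True

    @property
    def active(self) -> Optional[dict]:
        return self._active
```

A consumer built this way keeps serving with stale signatures after a bad push, which is usually a better failure mode than crashing on every request.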

JB_Dev4mo ago

Code and config should be treated similarly. If you would use a ring-based rollout, canaries, etc. for safely changing your code, then any config that can have the same impact must also use safe rollout techniques.
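
A rough sketch of what a ring-based rollout for config could look like, in Python. Everything here is an assumption for illustration: `ring_rollout` and the `apply`, `healthy`, and `rollback` callbacks are hypothetical, and a real system would also need timeouts, soak periods, and partial-failure handling.

```python
from typing import Callable, List, Sequence


def ring_rollout(
    rings: Sequence[Sequence[str]],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Apply a config ring by ring; undo every touched host on first failure."""
    touched: List[str] = []
    for ring in rings:
        for host in ring:
            apply(host)
            touched.append(host)
        # Gate: do not widen the blast radius until this ring looks healthy.
        if not all(healthy(host) for host in ring):
            for host in reversed(touched):
                rollback(host)
            return False
    return True
```

Called with rings such as `[["canary"], ["a", "b"], ["c", "d", "e"]]`, a failed health check in the second ring rolls back the first two rings and never touches the third. The health gate between rings adds latency to every change, which matters for data that updates many times per second.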

tptacekOP4mo ago

You're the nth person on this thread to say that and it doesn't make sense. Events that happen multiple times per second change data that you would call "configuration" in systems like these. This isn't `sendmail.cf`.

If you want to say that systems that light up hundreds of customers, or propagate new reactive bot rules, or notify a routing system that a service has gone down are intrinsically too complicated, that's one thing. By all means: "don't build modern systems! computers are garbage!". I have that sticker on my laptop already.

But like: handling these problems is basically the premise of large-scale cloud services. You can't just define it away.

EvanAnderson4mo ago

I'm sorry to belabor this, but I'm genuinely not understanding what you're saying in this reply. I haven't operated large-scale systems. I'm just an IT generalist and casual coder. I acknowledge I'm too inexperienced to even know what I don't know re: running large systems.

I read the parent poster as broadly suggesting configuration updates should have fitness tests applied and be deployed to minimize the blast radius when an update causes a malfunction. That makes intuitive sense to me. It seems like software should be subject to health checks after configuration updates, even if it's just to stop a deployment before it's widely distributed (let alone rolling back to last-working configurations, etc.).

Am I being thick-headed in thinking defensive strategies like those are a good idea? I'm reading your reply as arguing against those types of strategies. I'm also not understanding what you're suggesting as an alternative.

Again, I'm sorry to belabor this. I've replied once, deleted it, tried writing this a couple more times and given up, and now I'm finally pulling the trigger. It's really eating at me. I feel as though I must be deep down the Dunning-Kruger rabbit hole and really thinking "outside my lane".

eastdakota4mo ago

That’s correct.

tptacekOP4mo ago

Is it actually consul-template? (I have post-consul-template stress disorder).

threatofrain4mo ago

I'd love to hear any commentary on Consul if anyone else has it.

mh-4mo ago

Did you know: PCTSD affects more than 2 in 5 engineers.
