undefined | Better HN

0 pointsrented_mule1y ago0 comments

The kind of margin you indicate would have been plenty for our use cases. But, we were already processing all these log entries for multiple other purposes in a single pass (not one pass per thing computed). With this single pass approach, the median calculation could happen with the same single-pass parsing of the logs (they were JSON and that parsing was most of our cost), roughly for free.

Uniform sampling also wasn't obviously simple, at least to me. There were thousands of log files involved, coming from hundreds of computers. Any single log file only had timings from a single computer. What kind of bias would be introduced by different approaches to distributing those log files to a cluster for the median calculation? Once the solution outlined in the previous comment was identified, that seemed simpler that trying to understand if we were talking about 49-51% or 40-50%. And if it was too big a margin, restructuring our infra to allow different log file distribution algorithms would have been far more complicated.

0 comments

jiggawatts1y ago

Speaking of "single pass", one of the criticisms I have of the "enumerator" patterns in modern programming languages is that they encourage multiple passes.

As an example: computing the .min() and .max() of an enumerable is two passes even though it could be done with one pass.

I'd love to see a language embrace a more efficient style similar to how a SQL does it, where you can elegantly request this as a single pass over the data: "SELECT min(x), max(x) FROM y"

_dain_1y ago

What's wrong with doing it in two passes? N iterations each doing 2 operations is exactly the same cost as 2*N iterations each doing 1 operation. Because multiplication is commutative.

buildbot1y ago

This is true, but a big advantage of the single pass method is data reuse. Instead of loading each element twice, you load each element once. Per bit, reading/writing to external memory is massively more energy intensive than any compute op. Orders of magnitude more.

srcreigh1y ago

It’s more like 2NM where M is loading the data from disk/memory. One pass is 2N+M.

Why go to the store and back twice to buy two things instead of buying two things in one trip? ;p

1 more reply

davidmurdoch1y ago

Does C# have the plumbing for this built in? It's been 7 years since using it so I might not be remembering correctly.

neonsunset1y ago

IEnumerable<T>.Max and .Min are the same as they were just significantly faster through the use of SIMD: https://github.com/dotnet/runtime/blob/ebbebaca1184940f06df6...

You could implement a similar (and simpler) fast path for types with contiguous memory by performing min and max per iteration instead.

1 more reply

cerved1y ago

what's preventing you from doing it in one pass?

sroussey1y ago

Actually, seeking the bias numbers can be quite illuminating.

rented_muleOP1y ago

You're absolutely right! Some learning that we came to later that isn't unrelated to what you're saying... don't just look at metrics (in the case I've described above, it was timings of operations in a large system), but look at histograms for them. You should be able to explain why they have the shape they do. The distributions are so often multimodal, and understanding the modes helps you understand a lot more nuance about your system.

We were an infra-focused team building up a scalable data handling / computation platform in preparation to apply ML at a non-trivial scale. When we finally hired some people with deep stats knowledge 10 or 12 years ago, there was a lot of great learning for everyone about how each of our areas of expertise helped the others. I wish we'd found the right stats people earlier to help us understand things like this more deeply, more quickly.

sroussey1y ago

Even some regular engineers with experience dealing with many servers will have built up an intuition to exploit.

In fact, this is often where decisions on what metrics to have in the first place come from. Ask why. You can go far without deep stats knowledge!

j / k navigate · click thread line to collapse

0 pointsrented_mule1y ago0 comments

0 comments

jiggawatts1y ago

Speaking of "single pass", one of the criticisms I have of the "enumerator" patterns in modern programming languages is that they encourage multiple passes.

As an example: computing the .min() and .max() of an enumerable is two passes even though it could be done with one pass.

I'd love to see a language embrace a more efficient style similar to how a SQL does it, where you can elegantly request this as a single pass over the data: "SELECT min(x), max(x) FROM y"

_dain_1y ago

What's wrong with doing it in two passes? N iterations each doing 2 operations is exactly the same cost as 2*N iterations each doing 1 operation. Because multiplication is commutative.

buildbot1y ago

srcreigh1y ago

It’s more like 2NM where M is loading the data from disk/memory. One pass is 2N+M.

Why go to the store and back twice to buy two things instead of buying two things in one trip? ;p

1 more reply

davidmurdoch1y ago

Does C# have the plumbing for this built in? It's been 7 years since using it so I might not be remembering correctly.

neonsunset1y ago

IEnumerable<T>.Max and .Min are the same as they were just significantly faster through the use of SIMD: https://github.com/dotnet/runtime/blob/ebbebaca1184940f06df6...

You could implement a similar (and simpler) fast path for types with contiguous memory by performing min and max per iteration instead.

1 more reply

cerved1y ago

what's preventing you from doing it in one pass?

sroussey1y ago

Actually, seeking the bias numbers can be quite illuminating.

rented_muleOP1y ago

sroussey1y ago

Even some regular engineers with experience dealing with many servers will have built up an intuition to exploit.

In fact, this is often where decisions on what metrics to have in the first place come from. Ask why. You can go far without deep stats knowledge!

j / k navigate · click thread line to collapse