Summing ASCII encoded integers on Haswell at almost the speed of memcpy

(blog.mattstuchlik.com)

132 points · iliekcomputers · 1y ago · 36 comments

ashleyn · 1y ago

Knew it'd be SIMD. Such an underrated feature of modern CPUs. Hopefully with cross-platform SIMD in Rust and Golang, it'll be more commonly used.

Thinking parallel gets you enormous speed benefits for any number of arbitrary algorithms: https://mcyoung.xyz/2023/11/27/simd-base64/

neonsunset · 1y ago

Here's the tracking issue for Go if you're interested: https://github.com/golang/go/issues/67520

I wouldn't hold my breath though: proper support for a high-level portable SIMD abstraction requires quite a lot of compiler complexity, both because of how wide (heh) the API surface of SIMD extensions is in most ISAs and because of the details you have to get right to keep data in the appropriate (vector and/or mask) registers. This, naturally, goes in the complete opposite direction to the design philosophy of Go's compiler. Instead, you are supposed to write in Go's custom ASM syntax, with byte literals used to encode opcodes that are not natively supported (which is common).

If you're interested in what a high-effort SIMD implementation in this kind of language looks like, take a look at C#'s cross-platform Vector API: https://github.com/dotnet/runtime/blob/main/docs/coding-guid...

https://lemire.me/blog/2024/07/05/scan-html-faster-with-simd... (uses platform intrinsics, but showcases that you can go one abstraction level lower, retaining the same Vector128<T> type, if you need to specialize a particular part of your algorithm for a platform, without having to maintain a separate copy for each one)

Here's a high-effort vectorized CRC64 implementation that uses these: https://github.com/dotnet/runtime/blob/283de5b5adf08c42d4945... (performs as fast as the C++-based mnemonic variant)

ziofill · 1y ago

Mojo should get a mention since we are on the topic of SIMD.

neonsunset · 1y ago

Corrective upvote from me :)

Mojo's effort in bringing a portable SIMD abstraction to a Python audience is commendable. I'm looking forward to it being open-sourced so I can try it out!

For anyone curious, the reason I'm mostly talking about C# above is that its Vector API is the most accessible and mature portable SIMD abstraction that is part of a standard library and works out of the box. Compared to the more complex alternatives, you really only need to install the SDK and run `dotnet new console` to start working with it.

dist1ll · 1y ago

First time I've heard about HighLoad. It seems really interesting at first glance. I personally find SIMD and ISA/μarch-specific optimizations more rewarding than pure algorithmic challenges (codeforces and such).

Though Haswell seems like a pretty obsolete platform to optimize for at this point. Even Skylake will be a decade old next year.

sgerenser · 1y ago

Realistically, beyond Haswell there hasn’t been a ton of advancement in SIMD. Haswell introduced AVX2, which is what this blog post uses. AVX512 is certainly more powerful, but that’s not even available in the majority of Intel CPUs, even brand new ones.

jandrewrogers · 1y ago

AVX-512 has been ubiquitous on Intel server CPUs for a long time. Most people don't run high-performance throughput-oriented codes on consumer-grade CPUs with no ECC, which is the primary application for AVX-512. AVX-512 is a markedly better ISA than AVX2, aside from being wider.

adrian_b · 1y ago

"has been" => "had been"

AVX-512 is no longer ubiquitous on Intel servers, but only on new AMD servers.

Even earlier, there were cheap Intel servers with CPUs using Atom cores, for example the Denverton, Snow Ridge, Denverton Refresh, Snow Ridge Refresh, Parker Ridge and Arizona Beach series of server CPUs. None of these supported AVX-512 and many did not support even AVX.

However, now, after the launch of the Sierra Forest server CPUs, which will be followed next year by the Clearwater Forest server CPUs, the Atom cores have expanded up to the biggest Intel server CPUs. While such server CPUs are intended for applications where computations using array operations are less important, like Web servers or the hosting of many small virtual machines, the fragmentation of the Intel ISA is extremely annoying, especially when AMD demonstrates how they can implement the same ISA, but at different levels of performance (by varying the number of vector pipelines and the maximum achievable clock frequency) both in laptop CPUs and in desktop/server CPUs and both in compact cores with low clock frequency and in big cores with high clock frequency.

At least for me, the lack of AVX-512 support is what made me stop buying Intel CPUs some years ago. There are some features of Intel CPUs that I prefer over AMD CPUs (like TSC deadline), but none of them can compensate for the lack of AVX-512 support.

The greater width of AVX-512 is not its greatest advantage; the mask registers and a more complete set of instructions, which simplify many algorithms, are. Therefore, when Intel supports AVX10/256 across all their CPUs, that will partially restore the competitiveness of Intel CPUs, but that is not likely to happen before 2026.

hedgehog · 1y ago

As part of that Haswell brought FMA support which was a boon to those of us doing a lot of multiplication and addition (made those workloads twice as fast).

hansvm · 1y ago

It does depend a little on what ratio of additions to multiplies you had. Haswell dropped down to one execution unit capable of floating point addition, so for addition-heavy workloads you basically had to replace half the additions with fma instructions just to keep your old performance from dropping by 2x.

xipix · 1y ago

The correct solution to this optimization problem is to write the integers raw, not as ASCII.

wolf550e · 1y ago

I think the trick with dereferencing unmapped memory is cool, but I only really care about techniques that work reliably and I can use in production.

sYnfo · 1y ago

To be clear, it’s not dereferencing unmapped memory. I just haven’t shown how it’s being mapped, because it’s a little complex. As I note in the post, you can imagine that I mmap all the necessary addresses at the start of the program.

camel-cdr · 1y ago

Given that the input is "integers uniformly sampled from [0, 2³¹−1]", couldn't you use a LUT for the 99.99% case of just 10/9/8 digit numbers instead, and have a cold branch to handle the very rare smaller numbers?

anonymoushn · 1y ago

Yes, maybe if one is clever and lucky this could cost only a popcnt and a branch? Not sure.

raldi · 1y ago

Is there an explanation of why it sometimes gives the wrong answer?

sYnfo · 1y ago

1) if you set BATCH_SIZE > 14 sums_acc may overflow

2) chunks with too many small numbers cannot be processed with just 2 shuffle-adds

3) (not mentioned in the post) HighLoad limits the size of the source code you can submit, so you can't put all possible values in the look-up table

camel-cdr · 1y ago

Couldn't you organize the accumulators in 8-byte chunks and leave the upper byte unused? Then you map consecutive digits to those chunks and use 64-bit addition for the accumulation. Overflow between the bytes would then keep the result correct if you do the shuffles correctly, and you have a full byte of overflow buffer.

Dwedit · 1y ago

Gaps in the numbers are often enough to do some kind of "SIMD" even on ordinary 32-bit processors.
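
A minimal sketch of this kind of register-level "SIMD", assuming 4 ASCII digits are packed into one 32-bit word and added with a single integer addition. The helper name `swar_digit_sums` and the 28-word limit are illustrative assumptions, not something from the post: a byte lane can absorb 28 digits of at most 9 (28 * 9 = 252) before it could carry into its neighbour.

```python
import struct

ASCII_ZERO_X4 = 0x30303030  # ASCII '0' replicated into all four byte lanes
WORD_MASK = 0xFFFFFFFF      # emulate a 32-bit register in Python
MAX_WORDS = 28              # 28 * 9 = 252 still fits in one unsigned byte lane


def swar_digit_sums(data: bytes) -> list[int]:
    """Return four per-lane digit sums, consuming 4 ASCII digits per add."""
    if len(data) % 4 != 0 or not data.isdigit():
        raise ValueError("expected ASCII digits, length a multiple of 4")
    if len(data) // 4 > MAX_WORDS:
        raise ValueError("a byte lane could overflow into its neighbour")

    acc = 0
    for (word,) in struct.iter_unpack("<I", data):
        # Every byte is 0x30..0x39, so the subtraction never borrows
        # across lanes and leaves the digit values 0..9 in each byte.
        acc = (acc + (word - ASCII_ZERO_X4)) & WORD_MASK

    # Lane 0 is the lowest byte, i.e. the first character of each word.
    return [(acc >> (8 * lane)) & 0xFF for lane in range(4)]
```

For the input `b"12345678"` the four lanes hold 1+5, 2+6, 3+7 and 4+8.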

rurban · 1y ago

So SIMD would need to set the overflow flag also to catch em.

Which would be much faster than the checked add (adc). Does any hardware support such checked SIMD arithmetic already?

Or can you still assume that most arithmetic is broken in most languages/libraries?

adrian_b · 1y ago

AVX-512 has evolved from the Larrabee New Instructions (2009), passing through Knights Ferry (2010), Knights Corner (2012) and Knights Landing (2016), to reach Skylake Server (2017), whose set of AVX-512 instructions has remained a subset of the instruction sets of all later CPUs with AVX-512 support.

At each step from Larrabee to Skylake Server, some instructions were lost: the initial set was more complete in order to enable the writing of efficient GPU algorithms, while Intel later believed that for a general-purpose CPU it could reduce costs by omitting some of those instructions.

(Nevertheless, later they have added many other instructions, some of which may be less useful and more expensive than the original instructions that have been removed.)

Among the original Larrabee instructions that have been deleted, was addition with unsigned overflow (a.k.a. carry), where the output overflow flags were stored in a mask register, enabling their use in a later conditional SIMD instruction.

Signed overflow can be implemented in hardware with negligible additional complexity (a single gate per result number), so it would have been easy to also add to Larrabee/AVX-512 an addition instruction with signed overflow flags stored in a mask register. Even when only unsigned overflow is available, it is possible to preprocess the operands so that signed overflow can be detected from the unsigned overflow bits, though that requires multiple instructions, which slows the algorithm down a lot.

However in this problem the numbers that are added are non-negative, so the addition with unsigned overflow of the original Larrabee ISA would have been sufficient, had Intel not removed it from AVX-512.
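
To make the idea concrete, here is a scalar emulation of a lane-wise unsigned add that records carries in a mask. It is only a sketch of the concept described above, not the actual semantics of any Larrabee instruction; the function name `add_with_carry_mask` and the 32-bit lane width are assumptions chosen for illustration.

```python
LANE_BITS = 32
LANE_MAX = (1 << LANE_BITS) - 1


def add_with_carry_mask(a: list[int], b: list[int]) -> tuple[list[int], int]:
    """Add unsigned lanes pairwise; bit i of the mask is set if lane i carried."""
    if len(a) != len(b):
        raise ValueError("both vectors need the same number of lanes")

    sums = []
    mask = 0
    for lane, (x, y) in enumerate(zip(a, b)):
        total = x + y
        sums.append(total & LANE_MAX)   # wrapped result, as hardware would store it
        if total > LANE_MAX:            # unsigned overflow (carry out of the lane)
            mask |= 1 << lane
    return sums, mask
```

A later "conditional" step could then consult the mask, for example to add 1 to a separate high-word accumulator only in the lanes that carried.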

kardos · 1y ago

For 1, can you raise that to 28 with unsigned accumulators?

sYnfo · 1y ago

14 already assumes unsigned accumulator! 255 [accumulator capacity] / (2 [shuffle-adds] * 9 [highest digit value]) ~= 14
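
The bound can be checked with a few lines of arithmetic. This is just the calculation from the comment above written out; the function name `max_safe_batch_size` is made up for illustration, and the signed case (capacity 127) is added only for comparison.

```python
SHUFFLE_ADDS = 2   # shuffle-adds per chunk, as stated in the comment
MAX_DIGIT = 9      # largest value a single decimal digit can contribute


def max_safe_batch_size(capacity: int) -> int:
    """Largest batch size whose worst-case lane sum still fits in `capacity`."""
    return capacity // (SHUFFLE_ADDS * MAX_DIGIT)


unsigned_limit = max_safe_batch_size(255)  # unsigned 8-bit accumulator
signed_limit = max_safe_batch_size(127)    # signed 8-bit accumulator
```

With an unsigned byte the worst case at 14 batches is 14 * 18 = 252, which fits, while 15 * 18 = 270 does not. A signed accumulator would halve the limit to 7 rather than allow 28.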

genter · 1y ago

> will only produce correct results with probability < 1, though very close to 1

That's terrifying

madars · 1y ago

It's worse: Pr[correct output | hard input] = 0, even though they estimate that Pr[correct output | random input] ~ 1. This means that you can't, for example, amplify your success probability by repeating the algorithm a bunch of times and taking the majority vote.
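
A toy illustration of why majority voting does not help here. The routine `wrapping_sum` is a hypothetical stand-in, not the post's algorithm: it is deterministic and simply wrong whenever the true sum does not fit in 8 bits, so every repetition on a "hard" input returns the same wrong value.

```python
from collections import Counter


def wrapping_sum(values: list[int]) -> int:
    """Hypothetical fast-but-lossy sum: silently wraps at 8 bits."""
    return sum(values) % 256


def majority_vote(fn, values: list[int], runs: int = 101) -> int:
    """Run `fn` repeatedly on the same input and return the most common result."""
    results = Counter(fn(values) for _ in range(runs))
    return results.most_common(1)[0][0]


easy = [1, 2, 3]     # true sum 6 fits in a byte
hard = [200, 100]    # true sum 300 does not
```

Voting only amplifies correctness when the errors are independent across runs; here the failure depends on the input alone, so 101 runs on `hard` agree on the same wrong answer.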

_a_a_a_ · 1y ago

Why? So long as you know the probabilities and they are tolerable, why?

kardos · 1y ago

The challenge is to get the right answer. It's much less interesting if you relax the challenge to no longer require the right answer. Here's a really fast approximate answer: 50000000*(2^31-1)/2
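
For reference, the approximate answer above is just the expected value of the sum, assuming 50,000,000 inputs drawn uniformly from [0, 2^31 - 1] as the formula implies. A quick sketch of the arithmetic (note that `^` is XOR in Python, so the power is written `**`); the relative spread uses the continuous-uniform approximation and is only a rough estimate.

```python
import math

N = 50_000_000          # number of integers, taken from the formula above
MAX_VALUE = 2**31 - 1   # upper end of the sampling range

# Expected sum: N times the mean of a uniform distribution on [0, MAX_VALUE].
expected_sum = N * MAX_VALUE // 2

# Relative standard deviation of the sum, approximately 1 / sqrt(3 * N)
# when each input is treated as continuous uniform on [0, MAX_VALUE].
relative_sd = 1 / math.sqrt(3 * N)
```

So the constant is typically off by something on the order of 0.008% of the true sum, which is fast but not the exact answer the challenge asks for.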
