{n} times faster than C (opens in new tab)

(owen.cafe)

447 points414owen2y ago237 comments

237 comments

I'm not so sure that the right take-away is "hand-written assembler is 6x faster than C." It's more like "jumps are a lot slower than conditional arithmetic." And that can [edit:often] be achieved easily in C by simply not using switch statements when an if statement or two will work fine.

Rewriting the C function as follows got a 5.5x speedup:

    int run_switches(char *input) {
        int r = 0;
        char c; 
        while (1) {
            c = *input++;
            if (c == 's') r++;
            if (c == 'p') r--;
            if (c == '\0') break;
        }
        return r;
    }

Results:

    [16:50:14 user@boxer ~/looptest] $ gcc -O3 bench.c loop1.c -o lone
    [16:50:37 user@boxer ~/looptest] $ gcc -O3 bench.c loop2.c -o ltwo
    [16:50:47 user@boxer ~/looptest] $ time ./lone 1000 1
    449000
    ./lone 1000 1  3.58s user 0.00s system 99% cpu 3.589 total
    [16:50:57 user@boxer ~/looptest] $ time ./ltwo 1000 1
    449000
    ./ltwo 1000 1  0.65s user 0.00s system 99% cpu 0.658 total

414owenOP2y ago

Nice! There's a part two in which I rewrote the C. I got a 12x speedup :)

https://owen.cafe/posts/the-same-speed-as-c/

And as others have pointed out, you can tweak the input, then vectorize the algo, if you want to go that route.

I considered this a pedagogical exercise and I sincerely hope nobody will start dropping down to assembly without a very good reason to.

sriku2y ago

Wondering how res += (c=='s')-(c=='p') might do. I sure there is some C undefined behaviour relevant there. Curious but too lazy to check it myself!

Tempest19812y ago

While `false` evaluates to 0, not sure `true` always evaluate to 1 in C... maybe compiler dependent. Maybe add `? 1 : 0`

2 more replies

romnon2y ago

ive seen people doing += !!(c=='s')-!!(c=='p') for that

2 more replies

rajnathani2y ago

Great post!

One thought: If the code is rewritten using bit arithmetic, then potentially the result could be even faster as there need not be a pointer look-up.

A bit arithmetic solution would have a mask created for the characters ‘p’ and ‘s’, and then the result could be AND-ed, and then with more bit arithmetic this all 1s value can be translated to a 1 if and only if all the bits are 1. Following which, there would be a no conditional check and simply be both an add and a subtract operation but where the value to be added will only be 1 if the mask for ‘p’ matches and 1 to be subtracted if the mask for ‘s’ matches respectively. I’m not fully sure if this would necessarily be faster than the pointer look-up solution, but it would be interested to try this version of the code and see how fast it performs.

Update: The bit arithmetic could also be done with an XOR on the mask, and following which the ‘popcnt’ x86 instruction could be used to figure out if all are 0 bits.

JohnMakin2y ago

Thank you for your post and reply but I fear with a post + title like this you may just be chumming the waters.

jdsalaro2y ago

What do you mean by "chumming the waters" in this context?

2 more replies

haberman2y ago

> jumps are a lot slower than conditional arithmetic.

This statement is true if the jumps are unpredictable. If the jumps are predictable, then jumps will be faster.

Linus had a whole rant about this back in the day, arguing that cmov is not useful if branches are predictable: https://yarchive.net/comp/linux/cmov.html

torstenvl2y ago

I haven't run any benchmarks, but jump-if-equal and set-if-equal would seem to have the same level of predictability.

My naive, untested intuition is that there's only one meaningful difference: the former has to dump the entire pipeline on a miss, and the latter only has to nop a single instruction on a miss.

But maybe I'm missing something. I'll re-read his rant.

EDIT:

Linus rants a lot, but makes one concrete claim:

    You can always replace it by
    
      j<negated condition> forward
      mov ..., %reg
     forward:
    
    and assuming the branch is AT ALL predictable (and 95+% of all branches
    are), *the branch-over will actually be a LOT better for a CPU.*

So, I decided to test that.

    [18:50:14 user@boxer ~/src/looptest] $ diff -u loop2.s loop4.s
    --- loop2.s 2023-07-06 18:40:11.000000000 -0400
    +++ loop4.s 2023-07-06 18:46:58.000000000 -0400
    @@ -17,11 +17,15 @@
      incq %rdi
      xorl %edx, %edx
      cmpb $115, %cl
    - sete %dl
    + jne _run_switches_jmptgt1
    + mov $1,   %dl
    +_run_switches_jmptgt1:  
      addl %edx, %eax
      xorl %edx, %edx
      cmpb $112, %cl
    - sete %dl
    + jne _run_switches_jmptgt2
    + mov $1,   %dl
    +_run_switches_jmptgt2:  
      subl %edx, %eax
      testb %cl, %cl
      jne LBB0_1
    [18:50:29 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop2.s -o l2
    [18:50:57 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop4.s -o l4
    [18:51:02 user@boxer ~/src/looptest] $ time ./l2 1000 1
    449000
    ./l2 1000 1  0.69s user 0.00s system 99% cpu 0.697 total
    [18:51:09 user@boxer ~/src/looptest] $ time ./l4 1000 1
    449000
    ./l4 1000 1  4.53s user 0.01s system 99% cpu 4.542 total

I feel pretty confident that Linus has made a poor prediction about poor prediction here. Jumps are indeed slower.

To be fair to Linus, since Clang and I are using sete here, not cmov, I also tested cmov, and the difference was insignificant:

    [19:53:12 user@boxer ~/src/looptest] $ time ./l2 1000 1            
    449000
    ./l2 1000 1  0.69s user 0.00s system 99% cpu 0.700 total
    [19:53:15 user@boxer ~/src/looptest] $ time ./l5 1000 1            
    449000
    ./l5 1000 1  0.68s user 0.00s system 99% cpu 0.683 total

Jumps are slower.

haberman2y ago

> jump-if-equal and set-if-equal would seem to have the same level of predictability.

The difference is that branches have dedicated hardware (branch predictors) that will speculatively execute subsequent instructions based on their best guess about which way the branch will go. Whereas conditional moves cannot execute any subsequent instructions until the correct value is available.

Put another way, CPUs have control flow speculation, but not conditional move speculation. I don't know if conditional move speculation would be a feasible thing to implement or not, but I'm pretty sure that no mainstream CPUs have such a feature.

2 more replies

seventhson2y ago

Linus' post is 15+ years old. Much has changed in Intel hardware since then. He was probably right on the money re the hardware available at the time.

2 more replies

AnotherGoodName2y ago

The inputs here are random which is the problem and why this isn't demonstrating that. Create an input of all 's' and compare it.

1 more reply

MobiusHorizons2y ago

Jumps are slower on completely random input. If I understand Linus’s point correctly, he is suggesting that random inputs like this are unusual (although a good way to measure worst case performance)

zaxomi2y ago

Did you test this on a Pentium 4, the processor that Linus is talking about?

1 more reply

EVa5I7bHFq9mnYK2y ago

Is this the reason I dont usually see any speed up if I eliminate array boundary checking in C#? The jump condition is almost always false, is this what "predictable" means?

gpderetta2y ago

Indeed.

The cost of bound checking is second order effects like making vectorization harder, slightly higher instruction (and possibly data) cache pressure, or requiring higher decode bandwidth. For the vast majority of programs these bottlenecks do not really matter.

1 more reply

BoppreH2y ago

What version of GCC are you using? For me both versions perform the same, both on Ubuntu and Windows:

    $ time ./lone 1000 1
        851000

        real    0m3.578s
        user    0m3.574s
        sys     0m0.004s
        
    $ time ./ltwo 1000 1
        851000

        real    0m3.583s
        user    0m3.583s
        sys     0m0.000s

    $ gcc --version
        gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
        Copyright (C) 2019 Free Software Foundation, Inc.
        This is free software; see the source for copying conditions.  There is NO
        warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

torstenvl2y ago

Sorry, I write 'gcc' purely out of force of habit. I'm using Clang/LLVM.

    [17:23:00 user@boxer ~/looptest] $ uname -a
    Darwin boxer.local 21.6.0 Darwin Kernel Version 21.6.0: Thu Jun  8 23:57:12 PDT 2023; root:xnu-8020.240.18.701.6~1/RELEASE_X86_64 x86_64
    [17:23:47 user@boxer ~/looptest] $ cc -v
    Apple clang version 14.0.0 (clang-1400.0.29.202)
    Target: x86_64-apple-darwin21.6.0
    Thread model: posix
    InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Clang generates the sete instruction for me with the above code:

    [17:23:49 user@boxer ~/looptest] $ gcc -c -O3 loop2.c   
    [17:25:00 user@boxer ~/looptest] $ objdump -d --symbolize-operands --x86-asm-syntax=intel --no-show-raw-insn loop2.o
    
    loop2.o: file format mach-o 64-bit x86-64
    
    Disassembly of section __TEXT,__text:
    
    0000000000000000 <_run_switches>:
           0:       push rbp
           1:       mov rbp, rsp
           4:       xor eax, eax
           6:       nop word ptr cs:[rax + rax]
    <L0>:
          10:       movzx ecx, byte ptr [rdi]
          13:       add rdi, 1
          17:       xor edx, edx
          19:       cmp cl, 115
          1c:       sete dl
          1f:       add eax, edx
          21:       xor edx, edx
          23:       cmp cl, 112
          26:       sete dl
          29:       sub eax, edx
          2b:       test cl, cl
          2d:       jne  <L0>
          2f:       pop rbp
          30:       ret

p1necone2y ago

Is rewriting switch statements to a bunch of ifs always faster? Or is there some number of cases where the switch is faster? Seems like it should be added as a compiler optimization if it's consistent.

Karellen2y ago

It shouldn't be.

If the fastest way to implement a particular `switch` in assembly is with the equivalent of a set of `if`s, a reasonably smart compiler "should" be able to output the assembly to do that. And I thought that gcc and clang at least have actually been smart enough to do that for a while now.

But if the number of `if`s is high and the distribution sufficiently dense, where a jump table is better than a bunch of `if`s, then a `switch` should output that.

OTOH, a sufficiently smart compiler could theoretically turn a bunch of `if`s into a `switch`-like jump table - but it's much harder to reason that case through correctly than it is the other way, so I'm not sure any current compilers are sufficiently smart to do that.

kevin_thibedeau2y ago

On x64, GCC is supposed to use a a jump table with more than 4 cases when the cases are dense enough (gaps of 9 or less) to minimize wasted memory, otherwise it generates sequential comparisons. Testing on Godbolt, it looks like GCC 13.1 uses 10 cases as the jump table threshold for ARM64.

flohofwoe2y ago

It shouldn't if the control flow is actually identical.

E.g. note how both the switch- and if-based functions generate the same code using a lookup table here:

https://godbolt.org/z/qoT3M7r5G

DeathArrow2y ago

Shouldn't the compiler be able to do that, too?

AnotherGoodName2y ago

It's not true in general.

In general branching code is faster than branchless code and there's many many places that will demonstrate this with a quick Google. You know how many cycles a correctly predicted branch takes? 0.

On the other hand branchless code has to wait for each calculation to reach a certain stage in the pipeline since the thing to be output is dependent on the result. The CPU will have a whole lot of halts.

So why is this faster? Because the input is literally random(). The branch predictor will be wrong. This isn't normal though. The compiler is creating code that will be faster in most normal use cases.

It's an artificial benchmark that works against the output the compiler produces.

rendaw2y ago

You're saying that humans have information to context that allows them to provide use-case-specific optimizations which a compiler which must anticipate general usage couldn't. That's what profile guided optimizers are for though, right?

thomasahle2y ago

I tried https://godbolt.org/, and neither Clang nor GCC trunk give the same code for the two programs.

Pretty shocking for such a simple program.

ModernMech2y ago

Yes, there’s always the “sufficiently smart compiler” that can generate this code. Question is, does that compiler exist?

Groxx2y ago

I sure hope so. The semantics are trivially identical, the optimizations should be as well, by default - they should depend on semantics, not syntax. And GCC in another comment under this thread seems to be doing similar or identical optimizations in both cases.

I wholly admit that this implies nothing about all optimizers. But it's a pretty reasonable one to expect.

charcircuit2y ago

>does that compiler exist?

and if so are the compile times worth it

icsa2y ago

These days, I's put my money on zig cc. I.E. zig cc -Os or zig cc -O3

1 more reply

okaleniuk2y ago

Yes, but this will backfire on ARM, where jumps are as roughly fast as conditional arithmetic.

The whole point of using C is not to think about the underlying architecture. As soon as you start taking "jumps are a lot slower than conditional arithmetic on x86" into account, you're not writing in C, you're writing in assembly with extra steps :-)

dzaima2y ago

Note, that's only ARMv7; ARMv8 dropped most of the conditionally executed instruction stuff. And so it isn't even jumps that's fast on ARMv7, it's specifically cases that can be (and are) converted to predicated instrs; jumps are still gonna be slow in general on anything high-perf enough that it needs speculation, which can include the actual jumps of ARMv7.

If a compiler can convert jumpy code to the predicated instrs, it should be able to trivially convert conditional arith to such too (even easier & more consistently than branches I'd say).

bruce3434342y ago

Is that because cjumps on ARM are faster, or cmovs on ARM are slower?

BbzzbB2y ago

Is there any good article comparing performance across programming languages? Seems like everytime I see one they're broken because the tested logic is poorly implemented in language(s) XYZ.

MisterTea2y ago

You can shorten the loop to just tree conditionals:

  while (c = *input++) {
      if (c == 's') r++;
      if (c == 'p') r--;
  }

nwallin2y ago

IMHO the original code wasn't written in a way that's particularly friendly to compilers. If you write it like this:

    int run_switches_branchless(const char* s) {
        int result = 0;
        for (; *s; ++s) {
            result += *s == 's';
            result -= *s == 'p';
        }
        return result;
    }

...the compiler will do all the branchless sete/cmov stuff as it sees fit. It will be the same speed as the optimized assembly in the post, +/- something insignificant. However it won't unroll and vectorize the loop. If you write it like this:

    int run_switches_vectorized(const char* s, size_t size) {
        int result = 0;
        for (; size--; ++s) {
            result += *s == 's';
            result -= *s == 'p';
        }
        return result;
    }

It will know the size of the loop, and will unroll it and use AVX-512 instructions if they're available. This will be substantially faster than the first loop for large inputs, although I'm too lazy to benchmark just how much faster it is.

Now, this requires knowing the size of your string in advance, and maybe you're the sort of C programmer who doesn't keep track of how big your strings are. I'm not your coworker, I don't review your code. Do what you want. But you really really probably shouldn't.

https://godbolt.org/z/rde51zMd8

414owenOP2y ago

The version that's friendly to the compiler is described in part two: https://owen.cafe/posts/the-same-speed-as-c/

It achieves 3.88GiB/s

I intentionally didn't go down the route of vectorizing. I wanted to keep the scope of the problem small, and show off the assembly tips and tricks in the post, but maybe there's potential for a future post, where I pad the input string and vectorize the algorithm :)

nwallin2y ago

So I downloaded your code. On my desktop, with loop-9 gcc I got ~4.5GB/s, and with loop-7 I got ~4.4GB/s. With the following code:

    #include <stddef.h>
    
    int run_switches(const char *s, size_t n) {
      int res = 0;
      for (; n--; ++s)
        res += (*s == 's') - (*s == 'p');
      return res;
    }

I got ~31GB/s in GCC and ~33GB/s in Clang. This is without any padding, or SIMD intrinsics, or any such nonsense. This is just untying the compiler's hands and giving it permission to do its job properly.

Don't want to pass the string length? That's fine, we can figure that out for ourselves. This code:

    #include <stddef.h>
    #include <string.h>

    int run_switches(const char *s) {
      int res = 0;
      for (size_t n = strlen(s); n--; ++s)
        res += (*s == 's') - (*s == 'p');

      return res;
    }

Is 27GB/s. With a little bit of blocking:

    #include <stddef.h>
    
    int run_switches(const char *s, size_t n) {
      int res = 0;
      char tmp = 0;
      for (size_t i = n & 63; i--; ++s)
        tmp += (*s == 's') - (*s == 'p');
      res += tmp;
    
      for (n >>= 6; n--;) {
        tmp = 0;
        for (size_t i = 64; i--; ++s)
          tmp += (*s == 's') - (*s == 'p');
        res += tmp;
      }
    
      return res;
    }

That's ~55GB/s.

Anyway, the point is, you're pretty far from the point where you ought to give up on C and dive into assembly.

utopcell2y ago

Indeed. I suppose the two lessons are, stick with C, and don't forget the semantics of your original problem when optimizing.

    int run_switches(const char *s) {
      int res = 0;
      uint8_t tmp = 0;
      size_t n = strlen(s);
      for (size_t i = n & 127; i--; ++s)
        tmp += (*s == 's');
      res += tmp;
    
      for (size_t j = n >> 7; j--;) {
        tmp = 0;
        for (size_t i = 128; i--; ++s)
          tmp += (*s == 's');
        res += tmp;
      }
    
      return 2 * res - n;
    }

3 more replies

magicalhippo2y ago

Another good reason to write optimization-friendly C (or similar) over assembly code, especially in libraries, is that the compiler will evolve with CPUs, while the assembly code will not.

I've seen plenty of cases where replacing hand-written assembly with C (or similar) lead to a substantial performance increase because the assembly code was written for some old CPU and not the best way of doing things on current CPUs.

rajnathani2y ago

This seems like the most efficient solution. I have a neighboring comment on this post which suggests using bit arithmetic, but the above solution is more efficient than that. Here’s what the assembly code for the body of the first loop compiles down to (I had to use ChatGPT-4 as godbolt unfortunately doesn’t work on mobile):

    cmp dl, 's'    ; Compare the character with 's'
    sete dl        ; If the character is 's', set dl to 1. Otherwise, set dl to 0.
    sub al, dl     ; Subtract the result from res

    cmp dl, 'p'    ; Compare the character with 'p'
    sete dl        ; If the character is 'p', set dl to 1. Otherwise, set dl to 0.
    add al, dl     ; Add the result to res

SleepyMyroslav2y ago

>Anyway, the point is, you're pretty far from the point where you ought to give up on C and dive into assembly.

Thank you. I hope people who post random assembly listings on HN written in some extinct ISA will read your posts.

shusaku2y ago

You forgot an important line of the code:

/* DON’T REFACTOR THIS FOR READABILITY IT WILL SLOW DOWN */

elcritch2y ago

Nice! I tried it in Nim and it appears to trigger it with:

    {.overflowChecks:off.}
    proc run_switches*(input: cstring): int {.exportc.} =
      result = 0
      for c in input:
        result.inc int('s' == c)
        result.dec int('p' == c)

That gives a ~5x speedup on an Apple M1. Keeping overflow checks on only gets it up to ~2x the default C version. Always nice to know good ways to trigger SIMD opts.

jonny_eh2y ago

> But you really really probably shouldn't.

Shouldn't "not" keep track of string length?

nwallin2y ago

Err... yes. You shouldn't not keep track of string/buffer sizes.

Const-me2y ago

I’m probably an optimization expert, and I would solve that problem completely differently.

On my computer, the initial C version runs at 389 MB / second. I haven’t tested the assembly versions, but if they deliver the same 6.2x speedup, would result in 2.4 GB/second here.

Here’s C++ version which for long buffers exceeds 24 GB/second on my computer: https://gist.github.com/Const-me/3ade77faad47f0fbb0538965ae7... That’s 61x speedup compared to the original version, without any assembly, based on AVX2 intrinsics.

xoranth2y ago

Interesting. I think you can vectorize the prologue using movemask + popcnt instead of keeping a counter in the ymm registers (warning: untested code, still need to benchmark it):

    const __m256i zero = _mm256_setzero_si256();
    const __m256i s = _mm256_set1_epi8( 's' );
    const __m256i p = _mm256_set1_epi8( 'p' );

    const size_t a = (size_t)input;
    const size_t rem = a % 32;
    const char* aligned = input - rem;

    const __m256i v = _mm256_load_si256(( const __m256i*) input);
    const __m256i z = _mm256_cmpeq_epi8( v, zero );

    size_t m_plus = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, s));
    size_t m_minus = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, p));
    size_t m_zero = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, z));
    size_t offset_zero = _mm_tzcnt_64(m_zero >> rem);

    m_plus = _bzhi_u64(m_plus >> rem, offset_zero);
    m_minus = _bzhi_u64(m_minus >> rem, offset_zero);

    // Skip loop we already found the end of the string...
    while (m_zero == 0) {
        // ...
    }
    
    // ...
    
    return m_plus + res - m_minus;

Const-me2y ago

Good idea, and it can be used for both prologue and epilogue pieces. Updated that gist.

However, this is only relevant for very small inputs. For longer inputs the vectorized portion of the function gonna dominate the performance.

gavinray2y ago

Do you know if this is possible using "std::experimental::simd" out of curiosity?

https://en.cppreference.com/w/cpp/experimental/simd

Const-me2y ago

I don’t have any experience with that library.

Still, based on the documentation you have linked, I’m not sure it could possibly generate some code similar to my version. I could be wrong but I don’t see APIs which aggregate or accumulate the `simd_mask` vectors they output for results of vector comparisons.

utopcell2y ago

you might want to rewrite this in a form that is compatible with @414owen's repo.

bluedevilzn2y ago

What’s a good source to learn and practice AVX?

Const-me2y ago

For a starting point, I wrote that article couple years ago: http://const.me/articles/simd/simd.pdf

I don’t recommend assembly. Intrinsics are typically good enough performance wise, and writing correct assembly is hard. For instance, Chromium has non-trivial dependencies with code written in assembly, and they caused tons of fun debugging issues like that https://bugs.chromium.org/p/chromium/issues/detail?id=121838... Using intrinsics would have solved that because modern compilers follow ABI conventions of the target platforms very carefully.

About that highway, I don’t have any experience but based on the documentation I don’t like it too much. They say that’s a thin wrapper over intrinsics, but I believe it still breaks things. Specifically, Intel and ARM don’t document highway, but they have decent documentation on their intrinsics. Similarly, stackoverflow has thousands of questions and answers with tags like SSE and AVX, most of them are related to intrinsics, but nothing related to highway.

utopcell2y ago

You could study Google's Highway library [1].

[1] https://github.com/google/highway

kierank2y ago

FFmpeg has a lot of assembly language that can be added to:

https://blogs.gnome.org/rbultje/2017/07/14/writing-x86-simd-...

Sesse__2y ago

This code screams for SIMD! If you can change the prototype to take an explicit length, you could easily read and process 16 bytes at a time (the compares will give you values you can just add and subtract directly). Heck, even calling strlen() at the function's start to get the explicit length would probably be worth it.

camel-cdr2y ago

I threw together a quick risc-v vectorized implementation:

    size_t run(char *str) {
            uint8_t *p = (uint8_t*)str;
            long end = 0;
            size_t res = 0, vl;
            while (1) {
             vl = __riscv_vsetvlmax_e8m8();
                    vuint8m8_t v = __riscv_vle8ff_v_u8m8(p, &vl, vl);
                    end = __riscv_vfirst_m_b1(__riscv_vmseq_vx_u8m8_b1(v, '\0', vl), vl);
                    if (end >= 0)
                            break;
                    res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
                    res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
                    p += vl;
            }
            vl = __riscv_vsetvl_e8m8(end);
            vuint8m8_t v = __riscv_vle8_v_u8m8(p, vl);
            res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
            res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
            return res;
    }

Here are the results from the above, the switch and the table c implementation, ran on my mangopi mq pro (C906, in order rv64gc with rvv 0.7.1, and a 128 bit vector length):

    switch: 0.19 Bytes/Cycle
    tbl:    0.17 Bytes/Cycle
    rvv:    1.57 Bytes/Cycle (dips down to 1.35 after ~30 KiB)

Edit: you can go up to 2/1.7 Bytes/Cycle, if you make sure the pointer is page aligned (and vl isn't larger than the page size), see comments

dzaima2y ago

To be fully correct, you'd need the load to be a fault-only-first load (which rvv does have), otherwise that could fail if the null byte was just before the end of allocated memory.

camel-cdr2y ago

I just found your rvv intrinsics-viewer [0], that'll be so helpful.

I tried building one, my self, but my miserable web skills didn't allow me to lazily load the instructions, which made it too slow for actual use.

Can I share your project on lemmy?

[0] https://dzaima.github.io/intrinsics-viewer

dzaima2y ago

Go ahead! I'm not much of a web dev either, but decided to struggle through it to, mainly, just have better searching. (originally for intel & ARM intrinsics, which are also available if downloaded offline)

camel-cdr2y ago

I'm not sure I fully understand fault-only-first load, but reading the description of vle8ff.v I think I only need to exchange the load inside of the loop?

How does the normal load deal with faults?

I'll update the parent comment, it slowed down the speed from 2/1.7 to 1.57/1.36 Bytes/Cycle.

dzaima2y ago

You'd probably want to have a new __riscv_vsetvlmax_e8m8 at the start of each loop iteration, as otherwise an earlier iteration could cut off the vl (e.g. page unloaded by the OS), and thus the loop continues with the truncated vl.

The normal load should just segfault if any loaded byte is outside of readable memory, same as with a scalar load which is similarly partly outside.

1 more reply

okaleniuk2y ago

I think, it's a particular quirk of x86 architecture. Branching is expensive in comparison because not doing branching is super cheap. https://wordsandbuttons.online/challenge_your_performance_in...

However, on other processors, this might not be the case. https://wordsandbuttons.online/using_logical_operators_for_l...

The good question is what do we need C for in general? Of course, we can hand-tailor our code to run best on one particular piece of hardware. And we don't need C for that, it would be the wrong tool. We need assembly (and a decent macro system for some sugar)

But the original goal of C was to make translating system-level code from one platform to another easier. And we're expected to lose efficiency on this operation. It's like instead of writing a poem in Hindi and translating it in Urdu, we write one in Esperanto and then translate to whatever language we want automatically. You don't get two brilliant poems, you only get two poor translations, but you get them fast. That's what C is for.

eklitzke2y ago

Rearranging branches (and perhaps blocks too?) will definitely be done if you are building using FDO, because without FDO (or PGO) the compiler has no idea how likely each branch is to be taken. Cmov can also be enabled by FDO in some cases.

However, whether or not using cmov is effective compared to a regular test/jump is highly dependent on how predictable the branch is, with cmov typically performing better when the branch is very unpredictable. Since they got a 6x speedup with cmov, I assume that their test input (which isn't described in the post, and is also not in their GitHub repo) consists of random strings consisting almost entirely of s and p characters. There's nothing wrong with this, but it does make the post seem a little misleading to me, as their clever speedup is mostly about exploiting an unmentioned property of the data that is highly specific to their benchmark.

nwallin2y ago

> I assume that their test input (which isn't described in the post, and is also not in their GitHub repo) consists of random strings consisting almost entirely of s and p characters.

test code is here: https://github.com/414owen/blog-code/blob/master/02-the-same... it randomly selects between 's' or 'p'. The characters can't be anything other than 's', 'p', or the terminating null. Knowing that particular fact about our input gives us this ...clever... optimization:

    int run_switches(const char* s) {
        int result = 0;
        while (*s)
            result += (1 | *s++) - 'r';
        return result;
    }

which compiles to:

    run_switches:
            movzx   eax, BYTE PTR [rdi]
            xor     edx, edx
            test    al, al
            je      .L1
    .L3:
            or      eax, 1
            inc     rdi
            movsx   eax, al
            lea     edx, [rdx-114+rax]
            movzx   eax, BYTE PTR [rdi]
            test    al, al
            jne     .L3
    .L1:
            mov     eax, edx
            ret

This is too clever by half, of course, but it perfectly illustrates your point about exploiting properties of the data.

gpderetta2y ago

You do not even need to subtract 'r'.

    int run_switches(const char* s) {
        int s_count = 0;
        const char *begin = s;
        while(*s) {
            s_count += (1 & *s++);
        }
        int count = s-begin;
        return count - s_count;
    }

which compiles to:

    .L49:
        and     edx, 1
        add     rax, 1
        add     ecx, edx
        movzx   edx, BYTE PTR [rax]
        test    dl, dl
        jne     .L49

edit: other variant

    int run_switches2(const char* s) {
        const char *begin = s;
        int sum = 0;
        while(*s) {
            sum += *s++;
        }
        int count = s-begin;
        int s_count = sum - ('s'*count)/('p'-'s');
        int p_count = count - s_count;
        return p_count - s_count;
   }

which compiles to:

   run_switches2(char const*):
        movsx   eax, BYTE PTR [rdi]
        test    al, al
        je      .L56
        mov     rdx, rdi
        xor     ecx, ecx
   .L55:
        add     rdx, 1
        add     ecx, eax
        movsx   eax, BYTE PTR [rdx]
        test    al, al
        jne     .L55
        sub     rdx, rdi
        imul    esi, edx, 115
        movsx   rax, esi
        sar     esi, 31
        imul    rax, rax, 1431655766
        shr     rax, 32
        sub     eax, esi
        add     ecx, eax
        sub     edx, ecx
        mov     eax, edx
        sub     eax, ecx
        ret
   .L56:
        xor     eax, eax
        ret

None of these will beat the clever blocked SIMD someone showed elsethread.

414owenOP2y ago

> because without FDO (or PGO) the compiler has no idea how likely each branch is to be taken

So, the maximum amount of times you can hit '\0' is once in the string, because then the function returns, but you can hit the other characters many times, which seems to be information a compiler has access to without PGO.

PGO does help, of course, and on my machine gives me 2.80s, which is better than the code at the end of the `Rearranging blocks` section :)

> I assume that their test input (which isn't described in the post, and is also not in their GitHub repo)

It's described under `Benchmarking setup`, and is in the repository here: https://github.com/414owen/blog-code/blob/master/01-six-time...

Side note: There's a part two to this post (linked at the bottom) where I make the C code as fast as I possibly can, and it beats all the assembly in this post.

I never said writing assembly is (necessarily) a good idea, I just find optimizing it, and deciphering compiler output, an interesting challenge, and a good learning opportunity.

aengvs2y ago

Imagine a scenario where most of the strings being processed contain a single null character, with no other characters. In that case checking for the null character first would be optimal.

Does the compiler know that this isn't true? No, it doesn't. The author of the article is making an assumption about the contents of the data that might seem reasonable but isn't necessarily true.

zokier2y ago

But because in the single-null case the loop body is executed only once, the gains of arrangement that prefers nulls are pretty slim compared to long-string cases where the loop body is executed many times. For example if your dataset contains 99 cases of single null strings and one case of 100 chars long string, it might still be optimal on aggregate to use the long-string optimizing arrangement.

Of course there are still some cases where non-zero strings are extremely rare and as such optimizing for those makes sense.

xoranth2y ago

I think I managed to improve on both this post, and its sequel, at the cost of specializing the function for the case of a string made only of 's' and 'p'.

The benchmark only tests strings made of 's' and 'p', so I think it is fair.

The idea is as follow. We want to increase `res` by one when the next character is `s`. Naively, we might try something like this:

    res += (c - 'r');  // is `res += 1` when c == 's'

This doesn't work, as `'p' - 'r' == -2`, and we'd need it to be -1.

But `'p' - 'r'`, when viewer as an unsigned integer, underflows, setting the carry flag. Turns out x64 has an instruction (adc) that adds two registers _plus_ the carry flag.

Therefore we can replace two `cmp, cmov` with one `sub, adc`:

    run_switches:
            xor    eax, eax            # res = 0
    loop:
            movsx  ecx, byte ptr [rdi]
            test   ecx, ecx
            je     ret
            inc    rdi
            sub    ecx, 'r'
            adc    eax, ecx     # Magic happens here
            jmp    loop
    ret:
            ret

Benchmarks are as follows (`bench-x64-8` is the asm above):

    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.08 ± 0.00 times faster than '02-the-same-speed-as-c/bench-c-4-clang 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'

Of course, one could improve things further using SWAR/SIMD...

414owenOP2y ago

Very interesting approach. I should probably have specified that the somewhat naive assembly in `02-the-same-speed-as-c/loop-5.x64.s` is the fastest version I have.

On my machine I'm getting 0.244s for `loop-5.x64.s` and 0.422s for your implementation above.

I'm not sure why exactly we're seeing this discrepancy, and for what it's worth your implementation looks faster to me. I guess this is why you need to always benchmark on the hardware you're going to be running the code on...

xoranth2y ago

I rerun the benchmark vs loop-5 and loop-7 from the second post. Runtime is basically the same on my machine.

I would have expected yours to be faster given that it needs to execute fewer instructions per loop iteration. Though maybe the CPU can run `adc` on more ports compared to a load from memory?

    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.00 ± 0.00 times faster than '02-the-same-speed-as-c/bench-x64-7 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'

    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.01 ± 0.00 times faster than '02-the-same-speed-as-c/bench-x64-5 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'

gpderetta2y ago

Even simpler: just sum all elements of the array. Then at the end subtract 'p'*len from the sum, then divide by ('s'-'p') to get the s count. The 'p' count is len minus the 's' count.

The initial sum is easily vectorized as well.

If I've not made any mistakes it should work. Only issue is possible overflow on the running sum.

Can't be bothered to benchmark it though:).

edit: missed the decrement when you see 's'. So the final result is p_count - s_count.

xoranth2y ago

That's likely the fastest way to do that without vectorization. But you'd need to upcast 's' to an uint64 (or at least an uint32). That means that vectorization would operate on 32/64 bit lanes.

With vectorization, I think the way to go is to have two nested loops, an outer advances by 32 * 255 elements at a time, and an inner one that loads 32 bytes, compares each character to 's', and accumulates on 8 bit lanes.

Then in the outer loop you do an horizontal sum of the 8 bit accumulators.

fuber20182y ago

My SWAR version almost does what your vectorization algorithm description does - just that the SWAR-code looks rather gnarly because the compiler isn't auto-generating the vector code for you, it's hand-coded in C by me and I'm limited to 64 bits at a time.

gpderetta2y ago

Indeed, the blocked vectorization with 8 bits accumulators shown elsethread is going to be faster and there reducing the sum to 1 bit per iteration is worth it.

fuber20182y ago

I took the 64-bit SWAR ('S'IMD-'W'ithin-'A'-'R'egister) road and passed in the string length - the calling code has the length "right there"!!!

Using the original run_switches function, app took 3.554s (average of 10 runs).

With the SWAR-version with the string length passed in, app took 0.117s (average of 10 runs).

That's an overall 27.6x speedup.

fuber20182y ago

If I unroll the main while loop to handle 4x as much each time through the loop in the SWAR-version, the runtime drops to 0.0562s (average 10 runs).

That's an overall 57.5x speedup.

1 more reply

sltkr2y ago

How much faster is this:

    int run_switches(const char *buf) {
       size_t len = strlen(buf);
       int res = 0;
       for (size_t i = 0; i < len; ++i) {
         res += (buf[i] == 's') - (buf[i] == 'p');
       }
       return res;
    }

strlen() should be implemented in a pretty fast way, and after the buffer size is known, the compiler can autovectorize the inner loop, which does happen in practice: https://gcc.godbolt.org/z/qYfadPYoq

aidenn02y ago

A while back, I wrote a UTF-8 decoder in Common Lisp, targeting SBCL (it already has one built in, this was an exercise). Pretty much all of the optimization win (after the obvious low-hanging fruit) was structuring the code so that the compiler would generate cmov* instructions rather than branches.

whartung2y ago

What's some examples of the code changes that you made? And did you just do repeated disassemblies of the functions to see that it was using the correct instructions, or did you do some benchmarking to show your changes were actual improvements?

aidenn02y ago

Gosh, I'd have to see if I can dig it up this was a few years ago.

I did all of the above, plus profiling (sb-sprof combined with disassemble will show assembly level profiling).

moonchild2y ago

Branches are prone to be faster than conditional moves if they are correctly predicted, because they do not increase the critical path length. And utf-8 decoders are commonly run on all-ascii input. What were you benchmarking on?

aidenn02y ago

I ran separate benchmarks on all-ASCII, BMP-only, and ascii with non-BMP. ASCII was not slower on the low-branch version.

fefe232y ago

First, before optimizing you should consider correctness and security. input should be const and the return value should be ssize_t (so you don't have numeric overflow on 64-bit).

Second, consider this replacement function:

  ssize_t test(const char \*input) {
    ssize_t res = 0;
    size_t l = strlen(input);
    size_t i;
    for (i=0; i<l; ++i) {
      res += (input[i] == 's') - (input[i] == 'p');
    }
    return res;
  }

The timings are (using gcc -O3 -march=native): your function 640 cycles, mine 128 cycles. How can that be? I'm reading the memory twice! I have one call to strlen in there, and memory is slow. Shouldn't this be much slower?

No. strlen is a hack that uses vector instructions even though it may technically read beyond the string length. It makes sure not to cross page boundaries so it will not cause adverse reactions, but valgrind needs a suppression exception to not complain about it.

If you know the length beforehand, the compiler can vectorize and unroll the loop, which it happens to do here. To great effect, if I may say so.

The art of writing fast code is usually to get out of the way of the compiler, which will do a perfectly fine job if you let it.

If you really wanted to, you could get rid of the strlen by hacking your logic into what strlen does. That would make the C code much less readable and not actually help that much. My test string is "abcdefghijklmnopqrstuvxyz", so it's all in the l1 cache.

torstenvl2y ago

There's an error in the pseudocode.

      cmp     ecx, 's'            #   if (c == 's')
      jne     loop                #     continue
      add     eax, 1              #   res++
      jmp     loop                #   continue

should be

      cmp     ecx, 's'            #   if (c != 's')
      jne     loop                #     continue
      add     eax, 1              #   res++
      jmp     loop                #   continue

agumonkey2y ago

I believe the first `jne` should be `je`, right ?

torstenvl2y ago

No, the assembler is correct. Jump (early) back to the beginning of the loop if not equal to s; otherwise, continue executing the next instruction (add eax, 1) and then unconditionally jump back to the beginning of the loop.

agumonkey2y ago

well then there's a magical bit somewhere since both assembly listing are identical

1 more reply

414owenOP2y ago

A clickbait title for an in-depth look at hand-optimizing a very simple loop.

ftxbro2y ago

I'm not a compiler expert but if it's a "very simple loop" is it still too complex for the compiler to make good machine code? Did they use a bad compiler on purpose? Or are computers just not yet fast enough to do a good job with very simple loops in practical compilers?

moonchild2y ago

> are computers just not yet fast enough to do a good job with very simple loops in practical compilers?

The short answer to this question is 'yes', but there are some extenuating factors:

- Although we could do interesting things with unlimited computational resources, the current crop of c compilers is simply not very good, compared with what's possible today.

- Performance is always workload-dependent; the compiler has been somewhat shafted here because it doesn't know what sorts of inputs the function usually receives. The compiler output is better than the 'improved' code for some inputs. (It's possible you could get a better result from the existing compilers and c code just by using profile-guided optimisation.)

- The difference is prone to be more pronounced in simple loops than large ones. This is a contrived use-case. There is not a factor of 6 of performance hiding in optimised c code which could be recovered by doing the sorts of optimisations done by the op. Probably something more like 10-20%.

bruce3434342y ago

> the current crop of c compilers is simply not very good, compared with what's possible today.

That's quite dismissive. What exactly "is possible today" and why aren't these top compilers using them?

1 more reply

cjensen2y ago

The problem is the author of the article is making some huge implicit assumptions that the compiler can't possibly know about.

Consider this statement: "However, we know some things about this loop. We know that the only time we break out of it is when we hit the null terminator (’\0’). The code clang generates checks for the null terminator first, but this makes no sense."

This statement contains huge assumptions about the lengths of the input strings and the frequency of the letters 's' and 'p' in the input. And then has the chutzpah to call the compiler's failure to read his mind about this as "making no sense."

Good first effort by the author, but has not sufficiently thought through the problem.

414owenOP2y ago

That's the thing, a C compiler has all the information it needs to know that the maximum amount of times a '\0' can be processed in the loop is once (because the function returns), but there's no upper bound on the amount of times other characters are seen in the loop.

I might be missing a reason that this information of opaque to the compiler though, in which case, this section of the article is indeed lacking, but I'm happy to learn :)

2 more replies

smcin2y ago

Ok the author should have written "this makes no sense... on this particular case"

twoodfin2y ago

This is the right answer:

https://news.ycombinator.com/item?id=36622584

Optimal assembly (forgoing SIMD, at least) for this loop on modern x86 is highly dependent on the entropy of the runtime data.

ftxbro2y ago

OK so they were abusing the benchmark, like the compiler's output would be faster on less contrived test data? Do I have to search what are fdo or pgo or cmov to understand the answer?

1 more reply

zokier2y ago

I wonder what superoptimizers like stoke and souper would do with this code.

bjourne2y ago

Don't get discouraged by the comments and that others made faster variants. I liked both your articles very much and learned a few new things.

jtriangle2y ago

It's a cardinal rule that any time someone utters "XYZ is n faster than C" someone comes along and shows C is actually 2x faster than XYZ.

JohnMakin2y ago

I had an old compilers professor say something like this once. “If you think you can do something better than the C compiler, I promise you you can’t.”

zokier2y ago

That is just dumb and overly reductive. There is just lot of properties about programs that can not be expressed to compiler with portable ISO C but could be exploited for optimization. Of course reducing portability progressively improves situation, like using GCC extensions or machine-specific intrinsics. But even then its ridiculous to claim that C compilers could in general case generate true optimal code.

kstrauser2y ago

To a point. A modern C compiler generates mind boggingly fast assembler. However, some languages make it way easier to write sophisticated algorithms more easily.

For instance, suppose you're writing a program to find the nth Fibonacci number for whatever reason. In Python, the naive version might look like:

  def fib(n):
      if n <= 1:
          return n
      return fib(n - 1) + fib(n - 2)

On my machine, that takes about 12 seconds to find the 40th number. Altering that slightly like:

  from functools import cache
  @cache
  def fib(n): ...

makes the whole program take about 30 milliseconds total. The 400th takes about 32ms and emits an answer that won't fit in a 256-bit int.

Of course you can do the exact same kind of caching in C! I mean, the main Python interpreter's written in C, so by extension any algorithm you can express in Python you can also express in C. It'd probably be a lot faster, too!

But in practice, if I'm writing that in Python, I can use the obvious algorithm, spent 10 seconds slapping a caching decorator on it, verify that the end result is ridiculous fast and efficient, then move on to other problems.

Any reasonable C compiler will emit assembler that's vastly better than anything I could come up with. Conversely, I personally can write far better algorithms in Python than I could in C, because it's easier for me to express cleverness in that language. Those algorithmic improvements tend to have a far better speed payoff than I'd personally get from a more efficient implementation of a crappy method.

hinkley2y ago

How many brains has the Fibonacci example broken...

You'd unroll it to a loop on both C and Python. Fibonacci doesn't need a cache. It needs K previous values, where K=1.

2 more replies

olliej2y ago

adding @cache meaningfully changes the algorithmic complexity from O(1.8^^N) (iirc - it's obviously exponential) to O(N).

1 more reply

saagarjha2y ago

Someone has to teach the compiler how to be clever.

BoppreH2y ago

You can also use math to avoid most of the jumps:

    int run_switches(char *input) {
      int res = 0;
      while (true) {
        char c = *input++;
        if (c == '\0') return res;
        // Here's the trick:
        res += (c == 's') - (c == 'p');
      }
    }

This gives a 3.7x speed compared to loop-1.c. The lower line count is also nice.

svachalek2y ago

Nice. The way I read the cmove version, it's more or less this except the trick line goes

    res += (c == 's') ? 1 : (c == 'p') ? -1 : 0

I haven't done C in decades so I don't trust myself to performance test this but I'm curious how it compares. Pretty disappointed that TFA didn't go back and try that in C.

414owenOP2y ago

So I actually did try that, but and IIRC it didn't produce a CMOV with either gcc or clang. I didn't put it in the repo because it wasn't an improvement (on my machine) and I decided not to write about it.

Maybe you get different results though?

eru2y ago

Compare also https://codegolf.stackexchange.com/a/236630/32575 "High throughput Fizz Buzz" where someone uses assembly to generate Fizz Buzz at around 54-56GiB/s.

gavinray2y ago

Fantastic post, I appreciated that the ASM was displayed in tabs as both "standard" and "visual-arrows"-annotated.

Kept me reading into the follow-up article.

Also, I love the UI of this blog.

414owenOP2y ago

Kind words, much appreciated!

arun-mani-j2y ago

Any guide on how a person who uses Python or JavaScript can learn such things? I mean knowing which assembly code would be better, which algorithm makes better usage of processor etc.? :)

Also, how is such optimization carried out in a large scale software? Like, do you tweak the generated assembly code manually? (Sorry I'm a very very very beginner to low-level code)

sigmoid102y ago

You do this by first learning C (or similar languages) and then compilers and maybe also operating systems. What you're seeing in this blog is the equivalent result of at least one or two years university level education, so it's not like there is a single book or tutorial you could use to get you up to speed, especially if you have no previous experience in that area. And building a better compiler optimisation in general is a PhD thesis level task. But it's also not necessary if you want to design user applications on today's hardware.

dottedmag2y ago

You could try this (in-progress) course: https://www.computerenhance.com/p/table-of-contents

414owenOP2y ago

This is pretty much `assembly language the game`: https://tomorrowcorporation.com/humanresourcemachine

It's not a useful architecture, but it teaches the thought process really well, and you end up discovering a lot of optimization naturally.

For this article, I'm measuring every step to see what the performance implications of the changes are, which, along with some educated guesses and some googling/reading other articles, was enough for me to figure out what was going on.

In part two (https://owen.cafe/posts/the-same-speed-as-c/) especially, I didn't know what was going on with the benchmarks for a long time. Eventually I got lucky and made a change, which led to a hypothesis, which lead to more tests, which led to a conclusion.

secondcoming2y ago

You learn by doing. Compiler Explorer [0] is fantastic for this sort of thing. Typically you would do this sort of optimisation after profiling and then on a per-function level.

[0] godbolt.org

vardump2y ago

I think it's straightforward to optimize to a point it's maybe about 10x faster than the "optimized" version. The answer is of course SIMD vectorization.

red2awn2y ago

I experimented with different optimizations and ended with 128x speedup. The improvement mainly comes from manual SIMD intrinsics, but you can go a long way just by making the code more auto-vectorization friendly as some other comments have mentioned. See:

https://ipthomas.com/blog/2023/07/n-times-faster-than-c-wher...

amm2y ago

Back-of-the-envelope approach that should eliminate most branching:

  int table[256] = {0};                                                           
                                                                                
  void init() {                                                                   
    table['s'] = 1;                                                             
    table['p'] = -1;                                                            
  }                                                                               
                                                                                
  int run_switches(char *input, int size) {                                                 
    int res = 0;                                                                
    while (size-- >= 0) res += table[input[size]];
    return res;                                                                 
  }

414owenOP2y ago

The array lookup approach taken in part two:

https://owen.cafe/posts/the-same-speed-as-c/

But taking the length of the string as a parameter is not, because that changes the problem statement (making the solution vectorizable)

Also note that you'll try to read element -1 of the input. You probably want to change the `>=` to a `>`

lukas0992y ago

Would it be possible to write a code profiler and compiler that work together to optimize code based on real-world data? The profiler would output data that would feed back into the compiler, telling it which branches were selected most often, which would recompile optimizing for the profile. Would this even work? Has it already been done?

garenp2y ago

It's a thing: https://en.wikipedia.org/wiki/Profile-guided_optimization

lukas0992y ago

Cool, thanks.

olliej2y ago

I see other people have done minor rewrites, but the post does mention reordering branches, so the obvious question is whether there was any attempt to use PGO, which is an obvious first step in optimization.

einpoklum2y ago

A very instructional post. I wish more people had such a level of mastery of GPU assembly and its effects, and would post such treatments on outsmarting NVIDIA's (or AMD's) optimizers.

failuser2y ago

Having a full-blown predicate support is so nice to have, but it interferes with compact instruction encoding.

Such bloated ISA like x86 might actually handle predicate support, but who will try such a radical change?

gpderetta2y ago

AVX512?

Also the original ARM 32 bit instruction sent had extensive predication.

sitkack2y ago

This is such a wonderful post! Heavenly.

rajnathani2y ago

Really interesting. The recent HN article on branchless binary search also covered cmov: https://news.ycombinator.com/item?id=35737862

RobotToaster2y ago

Was the C compiled with optimisation enabled?

414owenOP2y ago

Yes, I explained in the `Benchmarking setup` section that I used `march=native`, but I guess I forgot to mention I used -O3.

kristianpaul2y ago

How fast is forth compared to C these days?

stefncb2y ago

Close to nobody works on forth compilers nowadays, and the compilers that are optimising or even fast is very small.

People say that forth isn't very optimisable for our register machines, but I reckon that you can get pretty good results with some clever stack analysis. It's actually possible to determine arity statically if you don't have multiple-arity words, which are very rare. That allows you to pass arguments by register.

Anyway, I'm not even close to an expert so don't take what I said as facts.

throwaway143562y ago

naive q: could one just count one of the letters and subtract it from the total number of letters?

saagarjha2y ago

You’d need to count both as other characters are ignored.

throwaway143562y ago

naïve q2: does that mean most comparisons are no match 3 times? could one do a bitwise operation and fuzzy test for all 3 in one go?

65102y ago

Is it still ASCII? If so p is 01110000 and s is 01110011(?) but I don't know what \0 is, is it 00000000? Is there anything else know about the data? If the rest of the characters are all numbers, those all start with 0011 but that doesn't seem of much use. 4-9 have either the 5th or 6th bit set.

Only if AND with 00001100 yields zero the other 3 tests are needed.

Ofc I have no idea what opcodes the language provides.

1 more reply

orlp2y ago

I made a variant that is (on my Apple m1 machine) 20x faster than the naive C version in the blog by branchlessly processing the string word-by-word:

    int run_switches(const char* input) {
        int res = 0;

        // Align to word boundary.
        while ((uintptr_t) input % sizeof(size_t)) {
            char c = *input++;
            res += c == 's';
            res -= c == 'p';
            if (c == 0) return res;
        }

        // Process word-by-word.
        const size_t ONES = ((size_t) -1) / 255;  // 0x...01010101
        const size_t HIGH_BITS = ONES << 7;       // 0x...80808080
        const size_t SMASK = ONES * (size_t) 's'; // 0x...73737373
        const size_t PMASK = ONES * (size_t) 'p'; // 0x...70707070
        size_t s_accum = 0;
        size_t p_accum = 0;
        int iters = 0;
        while (1) {
            // Load word and check for zero byte.
            // (w - ONES) & ~w has the top bit set in each byte where that byte is zero.
            size_t w;
            memcpy(&w, input, sizeof(size_t));
            if ((w - ONES) & ~w & HIGH_BITS) break;
            input += sizeof(size_t);

            // We reuse the same trick as before, but XORing with SMASK/PMASK first to get
            // exactly the high bits set where a byte is 's' or 'p'.
            size_t s_high_bits = ((w ^ SMASK) - ONES) & ~(w ^ SMASK) & HIGH_BITS;
            size_t p_high_bits = ((w ^ PMASK) - ONES) & ~(w ^ PMASK) & HIGH_BITS;

            // Shift down and accumulate.
            s_accum += s_high_bits >> 7;
            p_accum += p_high_bits >> 7;
            if (++iters >= 255 / sizeof(size_t)) {
                // To prevent overflow in our byte-wise accumulators we must flush
                // them every so often. We use a trick by noting that 2^8 = 1 (mod 255)
                // and thus a + 2^8 b + 2^16 c + ... = a + b + c  (mod 255).
                res += s_accum % 255;
                res -= p_accum % 255;
                iters = s_accum = p_accum = 0;
            }
        }
        res += s_accum % 255;
        res -= p_accum % 255;

        // Process tail.
        while (1) {
            char c = *input++;
            res += c == 's';
            res -= c == 'p';
            if (c == 0) break;
        }

        return res;
    }

Fun fact: the above is still 1.6x slower (on my machine) than the naive two-pass algorithm that gets autovectorized by clang:

    int run_switches(const char* input) {
        size_t len = strlen(input);
        int res = 0;
        for (size_t i = 0; i < len; ++i) {
            char c = input[i];
            res += c == 's';
            res -= c == 'p';
        }
        return res;
    }

fuber20182y ago

I assume the M1's SIMD registers are wider/more numerous than just the couple of size_t registers used for the loading/masking/accumulating inner loop in your run_swtches().

You can speedup the code by unrolling your inner loop a few times (try 4x or 8x) - it does mean that your overflow prevention limit is lowered (to a multiple of the unrolled grouping number) and run a few more times. But the speedup offsets the increased bookkeeping.

A version I played with showed increased speed by saving the in-progress accumulation in an array and then doing the final accumulation after the main loop is done. But that may be due to the CPU arch/compiler I'm using.

fuber20182y ago

If this code only runs on one compiler version/CPU arch, then ASSUMING the compiler will do the RIGHT THING and auto-vectorize the code is okay.

But if your code will be cross-platform/run on different OSes/CPU arch's, then a SWAR version may be more consistently performant - no need to guess if the compiler's optimization heuristics decided to go with the general purpose CPU registers or faster SIMD registers.

Downside is that the devs are exposed to the gnarly optimized code.

fuber20182y ago

Almost the same as my SWAR version - which is what you're doing.

But aren't you reading off the end of the buffer in your memcpy(&w...)? Say with an empty input string whose start address is aligned to sizeof(size_t) bytes?

I just passed in the string length since the caller had that info, otherwise you'd scan the whole string again looking for the zero terminator.

orlp2y ago

> But aren't you reading off the end of the buffer in your memcpy(&w...)?

If we go by the absolute strictest interpretation of the C standard my above implementation is UB.

But in practice, if p is word-aligned and is at least valid for 1 byte, then you will not pagefault for reading a whole word. In fact, this is how GCC/musl implement strlen itself.

> Say with an empty input string whose start address is aligned to sizeof(size_t) bytes?

Then the start address is valid (it must contain the null byte), and aligned to a word boundary, in which case I assume it is ok to also read a whole word there.

gpderetta2y ago

If I read it correctly, your implementation might read beyond the end of the buffer, and if it crosses a page boundary into an unmapped page, it will segfault. That's one of the many evils of null terminated strings.

orlp2y ago

If we go by the absolute strictest interpretation of the C standard, yes, the above is UB.

But in practice no one has page boundaries that cross word boundaries, and I align to a word boundary before doing the word-by-word loop.

gpderetta2y ago

good point of course!

j / k navigate · click thread line to collapse

237 comments

torstenvl2y ago

Rewriting the C function as follows got a 5.5x speedup:

    int run_switches(char *input) {
        int r = 0;
        char c; 
        while (1) {
            c = *input++;
            if (c == 's') r++;
            if (c == 'p') r--;
            if (c == '\0') break;
        }
        return r;
    }

Results:

    [16:50:14 user@boxer ~/looptest] $ gcc -O3 bench.c loop1.c -o lone
    [16:50:37 user@boxer ~/looptest] $ gcc -O3 bench.c loop2.c -o ltwo
    [16:50:47 user@boxer ~/looptest] $ time ./lone 1000 1
    449000
    ./lone 1000 1  3.58s user 0.00s system 99% cpu 3.589 total
    [16:50:57 user@boxer ~/looptest] $ time ./ltwo 1000 1
    449000
    ./ltwo 1000 1  0.65s user 0.00s system 99% cpu 0.658 total

414owenOP2y ago

Nice! There's a part two in which I rewrote the C. I got a 12x speedup :)

https://owen.cafe/posts/the-same-speed-as-c/

And as others have pointed out, you can tweak the input, then vectorize the algo, if you want to go that route.

I considered this a pedagogical exercise and I sincerely hope nobody will start dropping down to assembly without a very good reason to.

sriku2y ago

Wondering how res += (c=='s')-(c=='p') might do. I sure there is some C undefined behaviour relevant there. Curious but too lazy to check it myself!

Tempest19812y ago

While `false` evaluates to 0, not sure `true` always evaluate to 1 in C... maybe compiler dependent. Maybe add `? 1 : 0`

2 more replies

romnon2y ago

ive seen people doing += !!(c=='s')-!!(c=='p') for that

2 more replies

rajnathani2y ago

Great post!

One thought: If the code is rewritten using bit arithmetic, then potentially the result could be even faster as there need not be a pointer look-up.

Update: The bit arithmetic could also be done with an XOR on the mask, and following which the ‘popcnt’ x86 instruction could be used to figure out if all are 0 bits.

JohnMakin2y ago

Thank you for your post and reply but I fear with a post + title like this you may just be chumming the waters.

jdsalaro2y ago

What do you mean by "chumming the waters" in this context?

2 more replies

haberman2y ago

> jumps are a lot slower than conditional arithmetic.

This statement is true if the jumps are unpredictable. If the jumps are predictable, then jumps will be faster.

Linus had a whole rant about this back in the day, arguing that cmov is not useful if branches are predictable: https://yarchive.net/comp/linux/cmov.html

torstenvl2y ago

I haven't run any benchmarks, but jump-if-equal and set-if-equal would seem to have the same level of predictability.

My naive, untested intuition is that there's only one meaningful difference: the former has to dump the entire pipeline on a miss, and the latter only has to nop a single instruction on a miss.

But maybe I'm missing something. I'll re-read his rant.

EDIT:

Linus rants a lot, but makes one concrete claim:

    You can always replace it by
    
      j<negated condition> forward
      mov ..., %reg
     forward:
    
    and assuming the branch is AT ALL predictable (and 95+% of all branches
    are), *the branch-over will actually be a LOT better for a CPU.*

So, I decided to test that.

    [18:50:14 user@boxer ~/src/looptest] $ diff -u loop2.s loop4.s
    --- loop2.s 2023-07-06 18:40:11.000000000 -0400
    +++ loop4.s 2023-07-06 18:46:58.000000000 -0400
    @@ -17,11 +17,15 @@
      incq %rdi
      xorl %edx, %edx
      cmpb $115, %cl
    - sete %dl
    + jne _run_switches_jmptgt1
    + mov $1,   %dl
    +_run_switches_jmptgt1:  
      addl %edx, %eax
      xorl %edx, %edx
      cmpb $112, %cl
    - sete %dl
    + jne _run_switches_jmptgt2
    + mov $1,   %dl
    +_run_switches_jmptgt2:  
      subl %edx, %eax
      testb %cl, %cl
      jne LBB0_1
    [18:50:29 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop2.s -o l2
    [18:50:57 user@boxer ~/src/looptest] $ gcc -O3 bench.c loop4.s -o l4
    [18:51:02 user@boxer ~/src/looptest] $ time ./l2 1000 1
    449000
    ./l2 1000 1  0.69s user 0.00s system 99% cpu 0.697 total
    [18:51:09 user@boxer ~/src/looptest] $ time ./l4 1000 1
    449000
    ./l4 1000 1  4.53s user 0.01s system 99% cpu 4.542 total

I feel pretty confident that Linus has made a poor prediction about poor prediction here. Jumps are indeed slower.

To be fair to Linus, since Clang and I are using sete here, not cmov, I also tested cmov, and the difference was insignificant:

    [19:53:12 user@boxer ~/src/looptest] $ time ./l2 1000 1            
    449000
    ./l2 1000 1  0.69s user 0.00s system 99% cpu 0.700 total
    [19:53:15 user@boxer ~/src/looptest] $ time ./l5 1000 1            
    449000
    ./l5 1000 1  0.68s user 0.00s system 99% cpu 0.683 total

Jumps are slower.

haberman2y ago

> jump-if-equal and set-if-equal would seem to have the same level of predictability.

2 more replies

seventhson2y ago

Linus' post is 15+ years old. Much has changed in Intel hardware since then. He was probably right on the money re the hardware available at the time.

2 more replies

AnotherGoodName2y ago

The inputs here are random which is the problem and why this isn't demonstrating that. Create an input of all 's' and compare it.

1 more reply

MobiusHorizons2y ago

zaxomi2y ago

Did you test this on a Pentium 4, the processor that Linus is talking about?

1 more reply

EVa5I7bHFq9mnYK2y ago

Is this the reason I dont usually see any speed up if I eliminate array boundary checking in C#? The jump condition is almost always false, is this what "predictable" means?

gpderetta2y ago

Indeed.

1 more reply

BoppreH2y ago

What version of GCC are you using? For me both versions perform the same, both on Ubuntu and Windows:

    $ time ./lone 1000 1
        851000

        real    0m3.578s
        user    0m3.574s
        sys     0m0.004s
        
    $ time ./ltwo 1000 1
        851000

        real    0m3.583s
        user    0m3.583s
        sys     0m0.000s

    $ gcc --version
        gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
        Copyright (C) 2019 Free Software Foundation, Inc.
        This is free software; see the source for copying conditions.  There is NO
        warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

torstenvl2y ago

Sorry, I write 'gcc' purely out of force of habit. I'm using Clang/LLVM.

    [17:23:00 user@boxer ~/looptest] $ uname -a
    Darwin boxer.local 21.6.0 Darwin Kernel Version 21.6.0: Thu Jun  8 23:57:12 PDT 2023; root:xnu-8020.240.18.701.6~1/RELEASE_X86_64 x86_64
    [17:23:47 user@boxer ~/looptest] $ cc -v
    Apple clang version 14.0.0 (clang-1400.0.29.202)
    Target: x86_64-apple-darwin21.6.0
    Thread model: posix
    InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Clang generates the sete instruction for me with the above code:

    [17:23:49 user@boxer ~/looptest] $ gcc -c -O3 loop2.c   
    [17:25:00 user@boxer ~/looptest] $ objdump -d --symbolize-operands --x86-asm-syntax=intel --no-show-raw-insn loop2.o
    
    loop2.o: file format mach-o 64-bit x86-64
    
    Disassembly of section __TEXT,__text:
    
    0000000000000000 <_run_switches>:
           0:       push rbp
           1:       mov rbp, rsp
           4:       xor eax, eax
           6:       nop word ptr cs:[rax + rax]
    <L0>:
          10:       movzx ecx, byte ptr [rdi]
          13:       add rdi, 1
          17:       xor edx, edx
          19:       cmp cl, 115
          1c:       sete dl
          1f:       add eax, edx
          21:       xor edx, edx
          23:       cmp cl, 112
          26:       sete dl
          29:       sub eax, edx
          2b:       test cl, cl
          2d:       jne  <L0>
          2f:       pop rbp
          30:       ret

p1necone2y ago

Karellen2y ago

It shouldn't be.

But if the number of `if`s is high and the distribution sufficiently dense, where a jump table is better than a bunch of `if`s, then a `switch` should output that.

kevin_thibedeau2y ago

flohofwoe2y ago

It shouldn't if the control flow is actually identical.

E.g. note how both the switch- and if-based functions generate the same code using a lookup table here:

https://godbolt.org/z/qoT3M7r5G

DeathArrow2y ago

Shouldn't the compiler be able to do that, too?

AnotherGoodName2y ago

It's not true in general.

In general branching code is faster than branchless code and there's many many places that will demonstrate this with a quick Google. You know how many cycles a correctly predicted branch takes? 0.

It's an artificial benchmark that works against the output the compiler produces.

rendaw2y ago

thomasahle2y ago

I tried https://godbolt.org/, and neither Clang nor GCC trunk give the same code for the two programs.

Pretty shocking for such a simple program.

ModernMech2y ago

Yes, there’s always the “sufficiently smart compiler” that can generate this code. Question is, does that compiler exist?

Groxx2y ago

I wholly admit that this implies nothing about all optimizers. But it's a pretty reasonable one to expect.

charcircuit2y ago

>does that compiler exist?

and if so are the compile times worth it

icsa2y ago

These days, I's put my money on zig cc. I.E. zig cc -Os or zig cc -O3

1 more reply

okaleniuk2y ago

Yes, but this will backfire on ARM, where jumps are as roughly fast as conditional arithmetic.

dzaima2y ago

If a compiler can convert jumpy code to the predicated instrs, it should be able to trivially convert conditional arith to such too (even easier & more consistently than branches I'd say).

bruce3434342y ago

Is that because cjumps on ARM are faster, or cmovs on ARM are slower?

BbzzbB2y ago

Is there any good article comparing performance across programming languages? Seems like everytime I see one they're broken because the tested logic is poorly implemented in language(s) XYZ.

MisterTea2y ago

You can shorten the loop to just tree conditionals:

  while (c = *input++) {
      if (c == 's') r++;
      if (c == 'p') r--;
  }

nwallin2y ago

IMHO the original code wasn't written in a way that's particularly friendly to compilers. If you write it like this:

    int run_switches_branchless(const char* s) {
        int result = 0;
        for (; *s; ++s) {
            result += *s == 's';
            result -= *s == 'p';
        }
        return result;
    }

    int run_switches_vectorized(const char* s, size_t size) {
        int result = 0;
        for (; size--; ++s) {
            result += *s == 's';
            result -= *s == 'p';
        }
        return result;
    }

https://godbolt.org/z/rde51zMd8

414owenOP2y ago

The version that's friendly to the compiler is described in part two: https://owen.cafe/posts/the-same-speed-as-c/

It achieves 3.88GiB/s

nwallin2y ago

So I downloaded your code. On my desktop, with loop-9 gcc I got ~4.5GB/s, and with loop-7 I got ~4.4GB/s. With the following code:

    #include <stddef.h>
    
    int run_switches(const char *s, size_t n) {
      int res = 0;
      for (; n--; ++s)
        res += (*s == 's') - (*s == 'p');
      return res;
    }

Don't want to pass the string length? That's fine, we can figure that out for ourselves. This code:

    #include <stddef.h>
    #include <string.h>

    int run_switches(const char *s) {
      int res = 0;
      for (size_t n = strlen(s); n--; ++s)
        res += (*s == 's') - (*s == 'p');

      return res;
    }

Is 27GB/s. With a little bit of blocking:

    #include <stddef.h>
    
    int run_switches(const char *s, size_t n) {
      int res = 0;
      char tmp = 0;
      for (size_t i = n & 63; i--; ++s)
        tmp += (*s == 's') - (*s == 'p');
      res += tmp;
    
      for (n >>= 6; n--;) {
        tmp = 0;
        for (size_t i = 64; i--; ++s)
          tmp += (*s == 's') - (*s == 'p');
        res += tmp;
      }
    
      return res;
    }

That's ~55GB/s.

Anyway, the point is, you're pretty far from the point where you ought to give up on C and dive into assembly.

utopcell2y ago

Indeed. I suppose the two lessons are, stick with C, and don't forget the semantics of your original problem when optimizing.

    int run_switches(const char *s) {
      int res = 0;
      uint8_t tmp = 0;
      size_t n = strlen(s);
      for (size_t i = n & 127; i--; ++s)
        tmp += (*s == 's');
      res += tmp;
    
      for (size_t j = n >> 7; j--;) {
        tmp = 0;
        for (size_t i = 128; i--; ++s)
          tmp += (*s == 's');
        res += tmp;
      }
    
      return 2 * res - n;
    }

3 more replies

magicalhippo2y ago

Another good reason to write optimization-friendly C (or similar) over assembly code, especially in libraries, is that the compiler will evolve with CPUs, while the assembly code will not.

rajnathani2y ago

    cmp dl, 's'    ; Compare the character with 's'
    sete dl        ; If the character is 's', set dl to 1. Otherwise, set dl to 0.
    sub al, dl     ; Subtract the result from res

    cmp dl, 'p'    ; Compare the character with 'p'
    sete dl        ; If the character is 'p', set dl to 1. Otherwise, set dl to 0.
    add al, dl     ; Add the result to res

SleepyMyroslav2y ago

>Anyway, the point is, you're pretty far from the point where you ought to give up on C and dive into assembly.

Thank you. I hope people who post random assembly listings on HN written in some extinct ISA will read your posts.

shusaku2y ago

You forgot an important line of the code:

/* DON’T REFACTOR THIS FOR READABILITY IT WILL SLOW DOWN */

elcritch2y ago

Nice! I tried it in Nim and it appears to trigger it with:

    {.overflowChecks:off.}
    proc run_switches*(input: cstring): int {.exportc.} =
      result = 0
      for c in input:
        result.inc int('s' == c)
        result.dec int('p' == c)

That gives a ~5x speedup on an Apple M1. Keeping overflow checks on only gets it up to ~2x the default C version. Always nice to know good ways to trigger SIMD opts.

jonny_eh2y ago

> But you really really probably shouldn't.

Shouldn't "not" keep track of string length?

nwallin2y ago

Err... yes. You shouldn't not keep track of string/buffer sizes.

Const-me2y ago

I’m probably an optimization expert, and I would solve that problem completely differently.

On my computer, the initial C version runs at 389 MB / second. I haven’t tested the assembly versions, but if they deliver the same 6.2x speedup, would result in 2.4 GB/second here.

xoranth2y ago

Interesting. I think you can vectorize the prologue using movemask + popcnt instead of keeping a counter in the ymm registers (warning: untested code, still need to benchmark it):

    const __m256i zero = _mm256_setzero_si256();
    const __m256i s = _mm256_set1_epi8( 's' );
    const __m256i p = _mm256_set1_epi8( 'p' );

    const size_t a = (size_t)input;
    const size_t rem = a % 32;
    const char* aligned = input - rem;

    const __m256i v = _mm256_load_si256(( const __m256i*) input);
    const __m256i z = _mm256_cmpeq_epi8( v, zero );

    size_t m_plus = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, s));
    size_t m_minus = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, p));
    size_t m_zero = _mm256_movemask_epi8(_mm_cmpeq_epi8(v, z));
    size_t offset_zero = _mm_tzcnt_64(m_zero >> rem);

    m_plus = _bzhi_u64(m_plus >> rem, offset_zero);
    m_minus = _bzhi_u64(m_minus >> rem, offset_zero);

    // Skip loop we already found the end of the string...
    while (m_zero == 0) {
        // ...
    }
    
    // ...
    
    return m_plus + res - m_minus;

Const-me2y ago

Good idea, and it can be used for both prologue and epilogue pieces. Updated that gist.

However, this is only relevant for very small inputs. For longer inputs the vectorized portion of the function gonna dominate the performance.

gavinray2y ago

Do you know if this is possible using "std::experimental::simd" out of curiosity?

https://en.cppreference.com/w/cpp/experimental/simd

Const-me2y ago

I don’t have any experience with that library.

utopcell2y ago

you might want to rewrite this in a form that is compatible with @414owen's repo.

bluedevilzn2y ago

What’s a good source to learn and practice AVX?

Const-me2y ago

For a starting point, I wrote that article couple years ago: http://const.me/articles/simd/simd.pdf

utopcell2y ago

You could study Google's Highway library [1].

[1] https://github.com/google/highway

kierank2y ago

FFmpeg has a lot of assembly language that can be added to:

https://blogs.gnome.org/rbultje/2017/07/14/writing-x86-simd-...

Sesse__2y ago

camel-cdr2y ago

I threw together a quick risc-v vectorized implementation:

    size_t run(char *str) {
            uint8_t *p = (uint8_t*)str;
            long end = 0;
            size_t res = 0, vl;
            while (1) {
             vl = __riscv_vsetvlmax_e8m8();
                    vuint8m8_t v = __riscv_vle8ff_v_u8m8(p, &vl, vl);
                    end = __riscv_vfirst_m_b1(__riscv_vmseq_vx_u8m8_b1(v, '\0', vl), vl);
                    if (end >= 0)
                            break;
                    res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
                    res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
                    p += vl;
            }
            vl = __riscv_vsetvl_e8m8(end);
            vuint8m8_t v = __riscv_vle8_v_u8m8(p, vl);
            res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
            res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
            return res;
    }

Here are the results from the above, the switch and the table c implementation, ran on my mangopi mq pro (C906, in order rv64gc with rvv 0.7.1, and a 128 bit vector length):

    switch: 0.19 Bytes/Cycle
    tbl:    0.17 Bytes/Cycle
    rvv:    1.57 Bytes/Cycle (dips down to 1.35 after ~30 KiB)

Edit: you can go up to 2/1.7 Bytes/Cycle, if you make sure the pointer is page aligned (and vl isn't larger than the page size), see comments

dzaima2y ago

To be fully correct, you'd need the load to be a fault-only-first load (which rvv does have), otherwise that could fail if the null byte was just before the end of allocated memory.

camel-cdr2y ago

I just found your rvv intrinsics-viewer [0], that'll be so helpful.

I tried building one, my self, but my miserable web skills didn't allow me to lazily load the instructions, which made it too slow for actual use.

Can I share your project on lemmy?

[0] https://dzaima.github.io/intrinsics-viewer

dzaima2y ago

camel-cdr2y ago

I'm not sure I fully understand fault-only-first load, but reading the description of vle8ff.v I think I only need to exchange the load inside of the loop?

How does the normal load deal with faults?

I'll update the parent comment, it slowed down the speed from 2/1.7 to 1.57/1.36 Bytes/Cycle.

dzaima2y ago

The normal load should just segfault if any loaded byte is outside of readable memory, same as with a scalar load which is similarly partly outside.

1 more reply

okaleniuk2y ago

I think, it's a particular quirk of x86 architecture. Branching is expensive in comparison because not doing branching is super cheap. https://wordsandbuttons.online/challenge_your_performance_in...

However, on other processors, this might not be the case. https://wordsandbuttons.online/using_logical_operators_for_l...

eklitzke2y ago

nwallin2y ago

> I assume that their test input (which isn't described in the post, and is also not in their GitHub repo) consists of random strings consisting almost entirely of s and p characters.

    int run_switches(const char* s) {
        int result = 0;
        while (*s)
            result += (1 | *s++) - 'r';
        return result;
    }

which compiles to:

    run_switches:
            movzx   eax, BYTE PTR [rdi]
            xor     edx, edx
            test    al, al
            je      .L1
    .L3:
            or      eax, 1
            inc     rdi
            movsx   eax, al
            lea     edx, [rdx-114+rax]
            movzx   eax, BYTE PTR [rdi]
            test    al, al
            jne     .L3
    .L1:
            mov     eax, edx
            ret

This is too clever by half, of course, but it perfectly illustrates your point about exploiting properties of the data.

gpderetta2y ago

You do not even need to subtract 'r'.

    int run_switches(const char* s) {
        int s_count = 0;
        const char *begin = s;
        while(*s) {
            s_count += (1 & *s++);
        }
        int count = s-begin;
        return count - s_count;
    }

which compiles to:

    .L49:
        and     edx, 1
        add     rax, 1
        add     ecx, edx
        movzx   edx, BYTE PTR [rax]
        test    dl, dl
        jne     .L49

edit: other variant

    int run_switches2(const char* s) {
        const char *begin = s;
        int sum = 0;
        while(*s) {
            sum += *s++;
        }
        int count = s-begin;
        int s_count = sum - ('s'*count)/('p'-'s');
        int p_count = count - s_count;
        return p_count - s_count;
   }

which compiles to:

   run_switches2(char const*):
        movsx   eax, BYTE PTR [rdi]
        test    al, al
        je      .L56
        mov     rdx, rdi
        xor     ecx, ecx
   .L55:
        add     rdx, 1
        add     ecx, eax
        movsx   eax, BYTE PTR [rdx]
        test    al, al
        jne     .L55
        sub     rdx, rdi
        imul    esi, edx, 115
        movsx   rax, esi
        sar     esi, 31
        imul    rax, rax, 1431655766
        shr     rax, 32
        sub     eax, esi
        add     ecx, eax
        sub     edx, ecx
        mov     eax, edx
        sub     eax, ecx
        ret
   .L56:
        xor     eax, eax
        ret

None of these will beat the clever blocked SIMD someone showed elsethread.

414owenOP2y ago

> because without FDO (or PGO) the compiler has no idea how likely each branch is to be taken

PGO does help, of course, and on my machine gives me 2.80s, which is better than the code at the end of the `Rearranging blocks` section :)

> I assume that their test input (which isn't described in the post, and is also not in their GitHub repo)

It's described under `Benchmarking setup`, and is in the repository here: https://github.com/414owen/blog-code/blob/master/01-six-time...

Side note: There's a part two to this post (linked at the bottom) where I make the C code as fast as I possibly can, and it beats all the assembly in this post.

I never said writing assembly is (necessarily) a good idea, I just find optimizing it, and deciphering compiler output, an interesting challenge, and a good learning opportunity.

aengvs2y ago

Imagine a scenario where most of the strings being processed contain a single null character, with no other characters. In that case checking for the null character first would be optimal.

Does the compiler know that this isn't true? No, it doesn't. The author of the article is making an assumption about the contents of the data that might seem reasonable but isn't necessarily true.

zokier2y ago

Of course there are still some cases where non-zero strings are extremely rare and as such optimizing for those makes sense.

xoranth2y ago

I think I managed to improve on both this post, and its sequel, at the cost of specializing the function for the case of a string made only of 's' and 'p'.

The benchmark only tests strings made of 's' and 'p', so I think it is fair.

The idea is as follow. We want to increase `res` by one when the next character is `s`. Naively, we might try something like this:

    res += (c - 'r');  // is `res += 1` when c == 's'

This doesn't work, as `'p' - 'r' == -2`, and we'd need it to be -1.

But `'p' - 'r'`, when viewer as an unsigned integer, underflows, setting the carry flag. Turns out x64 has an instruction (adc) that adds two registers _plus_ the carry flag.

Therefore we can replace two `cmp, cmov` with one `sub, adc`:

    run_switches:
            xor    eax, eax            # res = 0
    loop:
            movsx  ecx, byte ptr [rdi]
            test   ecx, ecx
            je     ret
            inc    rdi
            sub    ecx, 'r'
            adc    eax, ecx     # Magic happens here
            jmp    loop
    ret:
            ret

Benchmarks are as follows (`bench-x64-8` is the asm above):

    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.08 ± 0.00 times faster than '02-the-same-speed-as-c/bench-c-4-clang 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'

Of course, one could improve things further using SWAR/SIMD...

414owenOP2y ago

Very interesting approach. I should probably have specified that the somewhat naive assembly in `02-the-same-speed-as-c/loop-5.x64.s` is the fastest version I have.

On my machine I'm getting 0.244s for `loop-5.x64.s` and 0.422s for your implementation above.

xoranth2y ago

I rerun the benchmark vs loop-5 and loop-7 from the second post. Runtime is basically the same on my machine.

I would have expected yours to be faster given that it needs to execute fewer instructions per loop iteration. Though maybe the CPU can run `adc` on more ports compared to a load from memory?

    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.00 ± 0.00 times faster than '02-the-same-speed-as-c/bench-x64-7 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'

    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.01 ± 0.00 times faster than '02-the-same-speed-as-c/bench-x64-5 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'

gpderetta2y ago

Even simpler: just sum all elements of the array. Then at the end subtract 'p'*len from the sum, then divide by ('s'-'p') to get the s count. The 'p' count is len minus the 's' count.

The initial sum is easily vectorized as well.

If I've not made any mistakes it should work. Only issue is possible overflow on the running sum.

Can't be bothered to benchmark it though:).

edit: missed the decrement when you see 's'. So the final result is p_count - s_count.

xoranth2y ago

That's likely the fastest way to do that without vectorization. But you'd need to upcast 's' to an uint64 (or at least an uint32). That means that vectorization would operate on 32/64 bit lanes.

Then in the outer loop you do an horizontal sum of the 8 bit accumulators.

fuber20182y ago

gpderetta2y ago

Indeed, the blocked vectorization with 8 bits accumulators shown elsethread is going to be faster and there reducing the sum to 1 bit per iteration is worth it.

fuber20182y ago

I took the 64-bit SWAR ('S'IMD-'W'ithin-'A'-'R'egister) road and passed in the string length - the calling code has the length "right there"!!!

Using the original run_switches function, app took 3.554s (average of 10 runs).

With the SWAR-version with the string length passed in, app took 0.117s (average of 10 runs).

That's an overall 27.6x speedup.

fuber20182y ago

If I unroll the main while loop to handle 4x as much each time through the loop in the SWAR-version, the runtime drops to 0.0562s (average 10 runs).

That's an overall 57.5x speedup.

1 more reply

sltkr2y ago

How much faster is this:

    int run_switches(const char *buf) {
       size_t len = strlen(buf);
       int res = 0;
       for (size_t i = 0; i < len; ++i) {
         res += (buf[i] == 's') - (buf[i] == 'p');
       }
       return res;
    }

aidenn02y ago

whartung2y ago

aidenn02y ago

Gosh, I'd have to see if I can dig it up this was a few years ago.

I did all of the above, plus profiling (sb-sprof combined with disassemble will show assembly level profiling).

moonchild2y ago

aidenn02y ago

I ran separate benchmarks on all-ASCII, BMP-only, and ascii with non-BMP. ASCII was not slower on the low-branch version.

fefe232y ago

First, before optimizing you should consider correctness and security. input should be const and the return value should be ssize_t (so you don't have numeric overflow on 64-bit).

Second, consider this replacement function:

  ssize_t test(const char \*input) {
    ssize_t res = 0;
    size_t l = strlen(input);
    size_t i;
    for (i=0; i<l; ++i) {
      res += (input[i] == 's') - (input[i] == 'p');
    }
    return res;
  }

If you know the length beforehand, the compiler can vectorize and unroll the loop, which it happens to do here. To great effect, if I may say so.

The art of writing fast code is usually to get out of the way of the compiler, which will do a perfectly fine job if you let it.

torstenvl2y ago

There's an error in the pseudocode.

      cmp     ecx, 's'            #   if (c == 's')
      jne     loop                #     continue
      add     eax, 1              #   res++
      jmp     loop                #   continue

should be

      cmp     ecx, 's'            #   if (c != 's')
      jne     loop                #     continue
      add     eax, 1              #   res++
      jmp     loop                #   continue

agumonkey2y ago

I believe the first `jne` should be `je`, right ?

torstenvl2y ago

agumonkey2y ago

well then there's a magical bit somewhere since both assembly listing are identical

1 more reply

414owenOP2y ago

A clickbait title for an in-depth look at hand-optimizing a very simple loop.

ftxbro2y ago

moonchild2y ago

> are computers just not yet fast enough to do a good job with very simple loops in practical compilers?

The short answer to this question is 'yes', but there are some extenuating factors:

- Although we could do interesting things with unlimited computational resources, the current crop of c compilers is simply not very good, compared with what's possible today.

bruce3434342y ago

> the current crop of c compilers is simply not very good, compared with what's possible today.

That's quite dismissive. What exactly "is possible today" and why aren't these top compilers using them?

1 more reply

cjensen2y ago

The problem is the author of the article is making some huge implicit assumptions that the compiler can't possibly know about.

Good first effort by the author, but has not sufficiently thought through the problem.

414owenOP2y ago

I might be missing a reason that this information of opaque to the compiler though, in which case, this section of the article is indeed lacking, but I'm happy to learn :)

2 more replies

smcin2y ago

Ok the author should have written "this makes no sense... on this particular case"

twoodfin2y ago

This is the right answer:

https://news.ycombinator.com/item?id=36622584

Optimal assembly (forgoing SIMD, at least) for this loop on modern x86 is highly dependent on the entropy of the runtime data.

ftxbro2y ago

OK so they were abusing the benchmark, like the compiler's output would be faster on less contrived test data? Do I have to search what are fdo or pgo or cmov to understand the answer?

1 more reply

zokier2y ago

I wonder what superoptimizers like stoke and souper would do with this code.

bjourne2y ago

Don't get discouraged by the comments and that others made faster variants. I liked both your articles very much and learned a few new things.

jtriangle2y ago

It's a cardinal rule that any time someone utters "XYZ is n faster than C" someone comes along and shows C is actually 2x faster than XYZ.

JohnMakin2y ago

I had an old compilers professor say something like this once. “If you think you can do something better than the C compiler, I promise you you can’t.”

zokier2y ago

kstrauser2y ago

To a point. A modern C compiler generates mind boggingly fast assembler. However, some languages make it way easier to write sophisticated algorithms more easily.

For instance, suppose you're writing a program to find the nth Fibonacci number for whatever reason. In Python, the naive version might look like:

  def fib(n):
      if n <= 1:
          return n
      return fib(n - 1) + fib(n - 2)

On my machine, that takes about 12 seconds to find the 40th number. Altering that slightly like:

  from functools import cache
  @cache
  def fib(n): ...

makes the whole program take about 30 milliseconds total. The 400th takes about 32ms and emits an answer that won't fit in a 256-bit int.

hinkley2y ago

How many brains has the Fibonacci example broken...

You'd unroll it to a loop on both C and Python. Fibonacci doesn't need a cache. It needs K previous values, where K=1.

2 more replies

olliej2y ago

adding @cache meaningfully changes the algorithmic complexity from O(1.8^^N) (iirc - it's obviously exponential) to O(N).

1 more reply

saagarjha2y ago

Someone has to teach the compiler how to be clever.

BoppreH2y ago

You can also use math to avoid most of the jumps:

    int run_switches(char *input) {
      int res = 0;
      while (true) {
        char c = *input++;
        if (c == '\0') return res;
        // Here's the trick:
        res += (c == 's') - (c == 'p');
      }
    }

This gives a 3.7x speed compared to loop-1.c. The lower line count is also nice.

svachalek2y ago

Nice. The way I read the cmove version, it's more or less this except the trick line goes

    res += (c == 's') ? 1 : (c == 'p') ? -1 : 0

I haven't done C in decades so I don't trust myself to performance test this but I'm curious how it compares. Pretty disappointed that TFA didn't go back and try that in C.

414owenOP2y ago

Maybe you get different results though?

eru2y ago

Compare also https://codegolf.stackexchange.com/a/236630/32575 "High throughput Fizz Buzz" where someone uses assembly to generate Fizz Buzz at around 54-56GiB/s.

gavinray2y ago

Fantastic post, I appreciated that the ASM was displayed in tabs as both "standard" and "visual-arrows"-annotated.

Kept me reading into the follow-up article.

Also, I love the UI of this blog.

414owenOP2y ago

Kind words, much appreciated!

arun-mani-j2y ago

Any guide on how a person who uses Python or JavaScript can learn such things? I mean knowing which assembly code would be better, which algorithm makes better usage of processor etc.? :)

Also, how is such optimization carried out in a large scale software? Like, do you tweak the generated assembly code manually? (Sorry I'm a very very very beginner to low-level code)

sigmoid102y ago

dottedmag2y ago

You could try this (in-progress) course: https://www.computerenhance.com/p/table-of-contents

414owenOP2y ago

This is pretty much `assembly language the game`: https://tomorrowcorporation.com/humanresourcemachine

It's not a useful architecture, but it teaches the thought process really well, and you end up discovering a lot of optimization naturally.

secondcoming2y ago

You learn by doing. Compiler Explorer [0] is fantastic for this sort of thing. Typically you would do this sort of optimisation after profiling and then on a per-function level.

[0] godbolt.org

vardump2y ago

I think it's straightforward to optimize to a point it's maybe about 10x faster than the "optimized" version. The answer is of course SIMD vectorization.

red2awn2y ago

https://ipthomas.com/blog/2023/07/n-times-faster-than-c-wher...

amm2y ago

Back-of-the-envelope approach that should eliminate most branching:

  int table[256] = {0};                                                           
                                                                                
  void init() {                                                                   
    table['s'] = 1;                                                             
    table['p'] = -1;                                                            
  }                                                                               
                                                                                
  int run_switches(char *input, int size) {                                                 
    int res = 0;                                                                
    while (size-- >= 0) res += table[input[size]];
    return res;                                                                 
  }

414owenOP2y ago

The array lookup approach taken in part two:

https://owen.cafe/posts/the-same-speed-as-c/

But taking the length of the string as a parameter is not, because that changes the problem statement (making the solution vectorizable)

Also note that you'll try to read element -1 of the input. You probably want to change the `>=` to a `>`

lukas0992y ago

garenp2y ago

It's a thing: https://en.wikipedia.org/wiki/Profile-guided_optimization

lukas0992y ago

Cool, thanks.

olliej2y ago

einpoklum2y ago

A very instructional post. I wish more people had such a level of mastery of GPU assembly and its effects, and would post such treatments on outsmarting NVIDIA's (or AMD's) optimizers.

failuser2y ago

Having a full-blown predicate support is so nice to have, but it interferes with compact instruction encoding.

Such bloated ISA like x86 might actually handle predicate support, but who will try such a radical change?

gpderetta2y ago

AVX512?

Also the original ARM 32 bit instruction sent had extensive predication.

sitkack2y ago

This is such a wonderful post! Heavenly.

rajnathani2y ago

Really interesting. The recent HN article on branchless binary search also covered cmov: https://news.ycombinator.com/item?id=35737862

RobotToaster2y ago

Was the C compiled with optimisation enabled?

414owenOP2y ago

Yes, I explained in the `Benchmarking setup` section that I used `march=native`, but I guess I forgot to mention I used -O3.

kristianpaul2y ago

How fast is forth compared to C these days?

stefncb2y ago

Close to nobody works on forth compilers nowadays, and the compilers that are optimising or even fast is very small.

Anyway, I'm not even close to an expert so don't take what I said as facts.

throwaway143562y ago

naive q: could one just count one of the letters and subtract it from the total number of letters?

saagarjha2y ago

You’d need to count both as other characters are ignored.

throwaway143562y ago

naïve q2: does that mean most comparisons are no match 3 times? could one do a bitwise operation and fuzzy test for all 3 in one go?

65102y ago

Only if AND with 00001100 yields zero the other 3 tests are needed.

Ofc I have no idea what opcodes the language provides.

1 more reply

orlp2y ago

I made a variant that is (on my Apple m1 machine) 20x faster than the naive C version in the blog by branchlessly processing the string word-by-word:

    int run_switches(const char* input) {
        int res = 0;

        // Align to word boundary.
        while ((uintptr_t) input % sizeof(size_t)) {
            char c = *input++;
            res += c == 's';
            res -= c == 'p';
            if (c == 0) return res;
        }

        // Process word-by-word.
        const size_t ONES = ((size_t) -1) / 255;  // 0x...01010101
        const size_t HIGH_BITS = ONES << 7;       // 0x...80808080
        const size_t SMASK = ONES * (size_t) 's'; // 0x...73737373
        const size_t PMASK = ONES * (size_t) 'p'; // 0x...70707070
        size_t s_accum = 0;
        size_t p_accum = 0;
        int iters = 0;
        while (1) {
            // Load word and check for zero byte.
            // (w - ONES) & ~w has the top bit set in each byte where that byte is zero.
            size_t w;
            memcpy(&w, input, sizeof(size_t));
            if ((w - ONES) & ~w & HIGH_BITS) break;
            input += sizeof(size_t);

            // We reuse the same trick as before, but XORing with SMASK/PMASK first to get
            // exactly the high bits set where a byte is 's' or 'p'.
            size_t s_high_bits = ((w ^ SMASK) - ONES) & ~(w ^ SMASK) & HIGH_BITS;
            size_t p_high_bits = ((w ^ PMASK) - ONES) & ~(w ^ PMASK) & HIGH_BITS;

            // Shift down and accumulate.
            s_accum += s_high_bits >> 7;
            p_accum += p_high_bits >> 7;
            if (++iters >= 255 / sizeof(size_t)) {
                // To prevent overflow in our byte-wise accumulators we must flush
                // them every so often. We use a trick by noting that 2^8 = 1 (mod 255)
                // and thus a + 2^8 b + 2^16 c + ... = a + b + c  (mod 255).
                res += s_accum % 255;
                res -= p_accum % 255;
                iters = s_accum = p_accum = 0;
            }
        }
        res += s_accum % 255;
        res -= p_accum % 255;

        // Process tail.
        while (1) {
            char c = *input++;
            res += c == 's';
            res -= c == 'p';
            if (c == 0) break;
        }

        return res;
    }

Fun fact: the above is still 1.6x slower (on my machine) than the naive two-pass algorithm that gets autovectorized by clang:

    int run_switches(const char* input) {
        size_t len = strlen(input);
        int res = 0;
        for (size_t i = 0; i < len; ++i) {
            char c = input[i];
            res += c == 's';
            res -= c == 'p';
        }
        return res;
    }

fuber20182y ago

I assume the M1's SIMD registers are wider/more numerous than just the couple of size_t registers used for the loading/masking/accumulating inner loop in your run_swtches().

fuber20182y ago

If this code only runs on one compiler version/CPU arch, then ASSUMING the compiler will do the RIGHT THING and auto-vectorize the code is okay.

Downside is that the devs are exposed to the gnarly optimized code.

fuber20182y ago

Almost the same as my SWAR version - which is what you're doing.

But aren't you reading off the end of the buffer in your memcpy(&w...)? Say with an empty input string whose start address is aligned to sizeof(size_t) bytes?

I just passed in the string length since the caller had that info, otherwise you'd scan the whole string again looking for the zero terminator.

orlp2y ago

> But aren't you reading off the end of the buffer in your memcpy(&w...)?

If we go by the absolute strictest interpretation of the C standard my above implementation is UB.

But in practice, if p is word-aligned and is at least valid for 1 byte, then you will not pagefault for reading a whole word. In fact, this is how GCC/musl implement strlen itself.

> Say with an empty input string whose start address is aligned to sizeof(size_t) bytes?

Then the start address is valid (it must contain the null byte), and aligned to a word boundary, in which case I assume it is ok to also read a whole word there.

gpderetta2y ago

orlp2y ago

If we go by the absolute strictest interpretation of the C standard, yes, the above is UB.

But in practice no one has page boundaries that cross word boundaries, and I align to a word boundary before doing the word-by-word loop.

gpderetta2y ago

good point of course!

j / k navigate · click thread line to collapse