[19:13:34 user@boxer ~/src/looptest] $ diff -u bench.c bench-alls.c
--- bench.c 2023-07-06 16:04:16.000000000 -0400
+++ bench-alls.c 2023-07-06 19:13:34.000000000 -0400
@@ -17,7 +17,7 @@
int num_rand_calls = number / CHAR_BIT + 1;
unsigned char *buffer = malloc(num_rand_calls * CHAR_BIT);
for (int i = 0; i < num_rand_calls; i++) {
- buffer[i] = rand();
+ buffer[i] = 's'; //rand();
}
return buffer;
}
[19:13:37 user@boxer ~/src/looptest] $ gcc -O3 bench-alls.c loop2.s -o l2
[19:13:42 user@boxer ~/src/looptest] $ gcc -O3 bench-alls.c loop4.s -o l4
[19:13:47 user@boxer ~/src/looptest] $ time ./l2 1000 1
250001000
./l2 1000 1 0.69s user 0.00s system 99% cpu 0.699 total
[19:13:55 user@boxer ~/src/looptest] $ time ./l4 1000 1
250001000
./l4 1000 1 1.28s user 0.00s system 99% cpu 1.290 total
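Even with constant input, where the branches should be easy to predict, the jump version (l4) is still nearly twice as slow as l2 (1.28s vs 0.69s), so the jumps cost something beyond misprediction. Similarly, you might be busting the pipeline by chaining the jumps so close together.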
Not saying your point is wrong, just saying your proof isn't super solid.
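A more controlled version of the experiment could be a 2x2 comparison: a branchy and a branchless kernel, each run on constant and on random input. If the branchy kernel's slowdown relative to the branchless one is about the same on constant data as on random data, misprediction doesn't explain it. The sketch below is hypothetical, in C to match bench.c: fill_constant, fill_random, count_bits_branchy, count_bits_branchless and seconds_for are invented names, not part of bench.c, loop2.s or loop4.s. One caveat: at -O2/-O3 GCC may if-convert the `if` in count_bits_branchy into branchless code, so check the generated assembly (gcc -S) or keep the kernels in assembly as the loop2.s/loop4.s test does.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helpers for illustration, not part of bench.c. */

/* Every byte gets the same value, so the per-bit branch pattern repeats
   every CHAR_BIT iterations and should be easy for the predictor. */
void fill_constant(unsigned char *buf, size_t n, unsigned char value) {
    for (size_t i = 0; i < n; i++)
        buf[i] = value;
}

/* Pseudo-random bytes from rand(), like the unmodified bench.c, so the
   per-bit branches become hard to predict. */
void fill_random(unsigned char *buf, size_t n) {
    for (size_t i = 0; i < n; i++)
        buf[i] = (unsigned char)rand();
}

/* One conditional branch per bit (unless the compiler if-converts it). */
unsigned long count_bits_branchy(const unsigned char *buf, size_t n) {
    unsigned long count = 0;
    for (size_t i = 0; i < n; i++) {
        for (int bit = 0; bit < CHAR_BIT; bit++) {
            if (buf[i] & (1u << bit))
                count++;
        }
    }
    return count;
}

/* Same result with no data-dependent branch: add each bit directly. */
unsigned long count_bits_branchless(const unsigned char *buf, size_t n) {
    unsigned long count = 0;
    for (size_t i = 0; i < n; i++) {
        for (int bit = 0; bit < CHAR_BIT; bit++)
            count += (buf[i] >> bit) & 1u;
    }
    return count;
}

/* Run a kernel `reps` times and return elapsed CPU seconds. The summed
   result is stored in *sink; the caller should print it so the compiler
   cannot discard the work. */
double seconds_for(unsigned long (*kernel)(const unsigned char *, size_t),
                   const unsigned char *buf, size_t n, int reps,
                   unsigned long *sink) {
    assert(kernel != NULL && sink != NULL && reps > 0);
    unsigned long total = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        total += kernel(buf, n);
    clock_t end = clock();
    *sink = total;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

From a main(), allocate one buffer, fill it with fill_constant(buf, n, 's') or fill_random(buf, n), then print seconds_for(...) and the sink value for each kernel. Comparing the branchy/branchless gap across the two inputs separates the cost of misprediction from the cost of the jumps themselves.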