Rust (1.13.0-nightly) 1m32.392s
Nim (0.14.2) 1m53.320s
C 1m59.116s
Julia (0.4.6) 2m01.166s
Crystal (0.18.7) 2m01.735s
C Double Precision 2m26.546s
Java (1.7.0_111) 2m36.949s
Nim Double Precision (0.14.2) 3m19.547s
OCaml 3m59.597s
Go 1.6 6m44.151s
node.js (6.2.1) 7m59.041s
node.js (5.7.1) 8m49.170s
C# 12m18.463s
PyPy 14m02.406s
Lisp 24m43.216s
Haskell 26m34.955s
Elixir 123m59.025s
Elixir MP 138m48.241s
Luajit 225m58.621s
Python 348m35.965s
Lua 611m38.925s

C (gcc -O3): 23.8s
Julia (julia 1.1.0): 32.8s
Go (alt) (go 1.12): 39.3s
Java (java 1.8.0_60): 44.2s
Go (org) (go 1.12): 64.8s
OCaml (ocaml 4.07.1): 79.1s
JS (node 11.14.0): 137.0s
Pypy (pypy 6.0.0): 139.4s
C# (mono 4.2.1): 187.3s
Rust: DOES NOT COMPILE
So Go is only twice as slow as C, not three times as slow. This puts it just ahead of Java and just behind Julia. I'm sure a "mechanical translation of [the] C" version would improve things for the Java version as well. If we removed startup costs (class file validation, etc.), I'd expect it to be on par with C.
Last time I checked, many years back, the spec was changing and the run time did crash. Guess it has gone a long way since.
The other one is Lua. My assumption was that it's one of the lightest and fastest languages around. Looks like "fastest" isn't true in some cases.
Rules like "code should be simple, as in, easy to read and understand" are also hard to judge, especially near the top of the list where there's a lot of pressure to optimize. Is SIMD easy to understand? What if it's in a library? What if the library was written specifically for this benchmark? Etc. I think https://benchmarksgame-team.pages.debian.net/benchmarksgame/ has to deal with every possible permutation of this debate.
I wonder if the performance gap is attributable to some overhead in Go's function calls? I know Go passes parameters on the stack instead of in registers... Maybe it's due to passing struct copies instead of references (it looks like the C version passes references)? Or just generally poor code generation?
Anyone else have ideas or care to profile?
EDIT: From my 2015 MBP, Go (version 1.12) is indeed quite a lot slower than C, but only when the C version is built with optimizations (`-O3`):
tmp $ time ./gorb
real 1m15.128s
user 1m9.366s
sys 0m6.754s
tmp $ clang crb.c
tmp $ time ./a.out
real 1m13.041s
user 1m10.284s
sys 0m0.624s
tmp $ gcc crb.c -o crb -std=c11 -O3 -lm -D_XOPEN_SOURCE=600
tmp $ time ./crb
real 0m22.703s
user 0m22.550s
sys 0m0.073s
tmp $ clang crb.c -o crb -std=c11 -O3 -lm -D_XOPEN_SOURCE=600
tmp $ time ./crb
real 0m22.689s
user 0m22.564s
sys 0m0.060s
EDIT2: I re-modified the Go version (https://gist.github.com/weberc2/2aed4f8d3189d09067d564448367...) to pass references and that seems to put it on par with C (or I mistranslated, which is also likely):

$ time ./gorb
real 0m19.282s
user 0m14.467s
sys 0m7.523s

(Where Go "wants" to play is that same benchmark, except including compilation time.)
A couple of the things below Go I suspect are bad implementations. I would expect a warmed-up C# to beat Go if both have reasonable (not super-crazy optimized) implementations, or at least be at parity, and LuaJIT may also be a slow implementation. In both cases because ray-tracing code is a great place for a JIT to come out and play. EDIT: Oh, I see C# is Mono, and not the Windows implementation. In that case that makes sense.
Oh, and I find it helpful to look at these things logarithmically. I think that matches real-world experience somewhat better, even though we truly pay for performance in the linear world. From that perspective, the OCaml->Go gap is still only the second largest. The largest is Haskell to Elixir, which is substantially bigger. OCaml->Go is large, but not crazily so; several other differences come close.
- initial time: 49s
- replacing the default thread-safe RNG with rand.New shaved 6 seconds off (the default RNG uses a mutex). = 43s
- using float64 instead of float32 and removing many type conversions shaved off another two seconds. = 41s
As others suggested, Go still lacks many compile-time optimizations, and the implementation could be improved.
If so it may be something to do with the creation of new vectors on the heap instead of the stack. The compiler would have to determine the full lifetime of the vector value to be able to bump it to the stack. That's an optimization, and sometimes it's just not possible (but probably is here).
In the C instance no such optimization is necessary. They ask for it on the stack, and if you try to use it after the stack is gone, Bad Things™ happen.
A hypothesis to test would be that languages above the jump are managing to work on the stack and ones below are allocating objects on the heap.
EDIT: Looking at the C code's assembly, it's generating mostly SIMD instructions and using xmm registers. That's why it's faster. The Go compiler still doesn't have autovectorization implemented, which is why it's so much slower in this case.
EDIT2: It seems the Go version also uses SSE here, which is nice. So the unnecessary allocation from my original post was probably the reason.
crb-omp 0m22.949s
crb 0m23.240s
crb_opt 0m31.404s
nimrb_pmap 0m8.828s
nimrb_fn 0m22.988s
nimrb 0m26.556s
The base C code is faster than the base Nim code. The optimised C code is significantly slower than everything else (!?). The Nim program that uses a threadpool is the fastest of these.

I couldn't get the CPP versions to compile. I'll do the Rust programs some other time.
$ time ./crb
That means time spent writing the .ppm file is included. In the implementations I browsed, that is about a million print calls, each of which might flush the output buffer, and whose performance may depend on locale.
To benchmark ray tracing I would, instead, just output the sum of the pixel values, or set the exit code depending on that value.
Even though ray tracing is cpu intensive, it also wouldn’t completely surprise me if some of the implementations in less mature languages spent significant time writing that output because their programmers haven’t come around to optimizing such code.
-data Vector3 = Vector3 {vx::Float, vy::Float, vz::Float} deriving (Show)
+data Vector3 = Vector3 {vx :: !Float, vy :: !Float, vz :: !Float} deriving (Show)

Baseline of the non-multithreaded variant on my machine: 1m56s
Making Vector3 a struct: 1m3s
Making Vector3 a readonly struct: 1m1s
Making Hit and Ray a struct: 1m26s
Will test more tomorrow, I guess, but the most obvious change already yields a 2× speedup. This was also without any profiling, so I don't even know what I did there.
Oftentimes compile-to-C output is "It's C, Jim, but not as we know it."
You can write C as if it is a SSA VM or similar intermediate representation that leaves very little work for the first stages of the compiler.
$ time python pyrb.py
real 348m35.965s
user 345m51.776s
sys 0m22.880s
$ time pypy pyrb.py
real 14m2.406s
user 13m55.292s
sys 0m1.416s

I've certainly seen speedups like that on stuff like Project Euler code.
That's an ancient version of Rust. Interesting that it is faster than C, though.
It would actually be quite interesting to see a comparison with all of the languages using more recent builds to see which ones are developing their performance.
[0]https://gitlab.haskell.org/ghc/ghc/wikis/commentary/compiler...
[1]http://blog.llvm.org/2010/05/glasgow-haskell-compiler-and-ll...
There is a big variation in performance, some of which I find surprising. Do you know what exactly causes some languages to be so slow (e.g., small objects being created and garbage collected frequently)?
EDIT: forgot to mention the obvious things: compiler/interpreter maturity and inherent overhead.
(declaim (optimize (speed 3) (safety 0) (space 0) (debug 0) (compilation-speed 0)))
Never use `(optimize (safety 0))` in SBCL — it throws safety completely out the window. We're talking C-levels of safety at that point. Buffer overruns, the works. It might buy you 10-20% speed, but it's not worth it. Lisp responsibly, use `(safety 1)`.

(defconstant WIDTH 1280)
People generally name constants in CL with +plus-muffs+. Naming them as uppercase doesn't help because the reader uppercases symbol names by default when it reads. So `(defconstant WIDTH ...)` means you can no longer have a variable named `width` (in the same package).

(defstruct (vec
(:conc-name v-)
(:constructor v-new (x y z))
(:type (vector float)))
x y z)
Using `:type (vector float)` here is trying to make things faster, but failing. The type designator `float` covers all kinds of floats, e.g. both `single-float`s and `double-float`s in SBCL. So all SBCL knows is that the struct contains some kind of float, and it can't really do much with that information. This means all the vector math functions below have to fall back to generic arithmetic, which is extremely slow. SBCL even warns you about this when it's compiling, thanks to the `(optimize (speed 3))` declaration, but I guess they ignored or didn't understand those warnings.

(defconstant ZERO (v-new 0.0 0.0 0.0))
This will cause problems because if it's ever evaluated more than once it'll try to redefine the constant to a new `vec` instance, which will not be `eql` to the old one. Use `alexandria:define-constant` or just make it a global variable.

All the vector math functions are slow because they have no useful type information to work with:
(disassemble 'v-add)
; disassembly for V-ADD
; Size: 160 bytes. Origin: #x52D799AF
; 9AF: 488B45F8 MOV RAX, [RBP-8] ; no-arg-parsing entry point
; 9B3: 488B5001 MOV RDX, [RAX+1]
; 9B7: 488B45F0 MOV RAX, [RBP-16]
; 9BB: 488B7801 MOV RDI, [RAX+1]
; 9BF: FF1425A8001052 CALL QWORD PTR [#x521000A8] ; GENERIC-+
; 9C6: 488955E8 MOV [RBP-24], RDX
; 9CA: 488B45F8 MOV RAX, [RBP-8]
; 9CE: 488B5009 MOV RDX, [RAX+9]
; 9D2: 488B45F0 MOV RAX, [RBP-16]
; 9D6: 488B7809 MOV RDI, [RAX+9]
; 9DA: FF1425A8001052 CALL QWORD PTR [#x521000A8] ; GENERIC-+
; 9E1: 488BDA MOV RBX, RDX
; 9E4: 488B45F8 MOV RAX, [RBP-8]
; 9E8: 488B5011 MOV RDX, [RAX+17]
; 9EC: 488B45F0 MOV RAX, [RBP-16]
; 9F0: 488B7811 MOV RDI, [RAX+17]
; 9F4: 48895DE0 MOV [RBP-32], RBX
; 9F8: FF1425A8001052 CALL QWORD PTR [#x521000A8] ; GENERIC-+
; 9FF: 488B5DE0 MOV RBX, [RBP-32]
; A03: 49896D40 MOV [R13+64], RBP ; thread.pseudo-atomic-bits
; A07: 498B4520 MOV RAX, [R13+32] ; thread.alloc-region
; A0B: 4C8D5830 LEA R11, [RAX+48]
; A0F: 4D3B5D28 CMP R11, [R13+40]
; A13: 772E JNBE L2
; A15: 4D895D20 MOV [R13+32], R11 ; thread.alloc-region
; A19: L0: C600D9 MOV BYTE PTR [RAX], -39
; A1C: C6400806 MOV BYTE PTR [RAX+8], 6
; A20: 0C0F OR AL, 15
; A22: 49316D40 XOR [R13+64], RBP ; thread.pseudo-atomic-bits
; A26: 7402 JEQ L1
; A28: CC09 BREAK 9 ; pending interrupt trap
; A2A: L1: 488B4DE8 MOV RCX, [RBP-24]
; A2E: 48894801 MOV [RAX+1], RCX
; A32: 48895809 MOV [RAX+9], RBX
; A36: 48895011 MOV [RAX+17], RDX
; A3A: 488BD0 MOV RDX, RAX
; A3D: 488BE5 MOV RSP, RBP
; A40: F8 CLC
; A41: 5D POP RBP
; A42: C3 RET
; A43: L2: 6A30 PUSH 48
; A45: FF142520001052 CALL QWORD PTR [#x52100020] ; ALLOC-TRAMP
; A4C: 58 POP RAX
; A4D: EBCA JMP L0
If they had done the type declarations correctly, it would look more like this:

; disassembly for V-ADD
; Size: 122 bytes. Origin: #x52C33A78
; 78: F30F104A05 MOVSS XMM1, [RDX+5] ; no-arg-parsing entry point
; 7D: F30F105F05 MOVSS XMM3, [RDI+5]
; 82: F30F58D9 ADDSS XMM3, XMM1
; 86: F30F104A0D MOVSS XMM1, [RDX+13]
; 8B: F30F10670D MOVSS XMM4, [RDI+13]
; 90: F30F58E1 ADDSS XMM4, XMM1
; 94: F30F104A15 MOVSS XMM1, [RDX+21]
; 99: F30F105715 MOVSS XMM2, [RDI+21]
; 9E: F30F58D1 ADDSS XMM2, XMM1
; A2: 49896D40 MOV [R13+64], RBP ; thread.pseudo-atomic-bits
; A6: 498B4520 MOV RAX, [R13+32] ; thread.alloc-region
; AA: 4C8D5820 LEA R11, [RAX+32]
; AE: 4D3B5D28 CMP R11, [R13+40]
; B2: 7734 JNBE L2
; B4: 4D895D20 MOV [R13+32], R11 ; thread.alloc-region
; B8: L0: 66C7005903 MOV WORD PTR [RAX], 857
; BD: 0C03 OR AL, 3
; BF: 49316D40 XOR [R13+64], RBP ; thread.pseudo-atomic-bits
; C3: 7402 JEQ L1
; C5: CC09 BREAK 9 ; pending interrupt trap
; C7: L1: C7400103024F50 MOV DWORD PTR [RAX+1], #x504F0203 ; #<SB-KERNEL:LAYOUT for VEC {504F0203}>
; CE: F30F115805 MOVSS [RAX+5], XMM3
; D3: F30F11600D MOVSS [RAX+13], XMM4
; D8: F30F115015 MOVSS [RAX+21], XMM2
; DD: 488BD0 MOV RDX, RAX
; E0: 488BE5 MOV RSP, RBP
; E3: F8 CLC
; E4: 5D POP RBP
; E5: C3 RET
; E6: CC0F BREAK 15 ; Invalid argument count trap
; E8: L2: 6A20 PUSH 32
; EA: E8F1C64CFF CALL #x521001E0 ; ALLOC-TRAMP
; EF: 58 POP RAX
; F0: EBC6 JMP L0
The weirdness continues:

(defstruct (ray
(:conc-name ray-)
(:constructor ray-new (origin direction))
(:type vector))
origin direction)
The `:conc-name ray-` is useless, that's the default conc-name. And again with the `:type vector`… just make it a normal struct. I was going to guess that they were doing it so they could use vector literals to specify the objects, but then why are they bothering to define a BOA constructor here? And the slots are untyped, which, if you're looking for speed, is not doing you any favors.

I took a few minutes over lunch to add some type declarations to the slots and important functions, inlined the math, and cleaned up the broken indentation and naming issues:
https://gist.github.com/sjl/005f27274adacd12ea2fc7f0b7200b80...
The old version runs in 5m12s on my laptop, the new version runs in 58s. So if we unscientifically extrapolate that to their 24m time, it puts it somewhere around 5m in their list. This matches what I usually see from SBCL: for numeric-heavy code generic arithmetic is very slow, and some judicious use of type declarations can get you to within ~5-10x of C. Getting more improvements beyond that can require really bonkers stuff that often isn't worth it.
EDIT: I'm also curious how much using an optimized vector math library (e.g. sb-cga) would buy you instead of hand-rolling your own vector math. It would certainly be easier.
...and then add SIMD.
Look at Steve Losh's comment here for something a lot better. My own (further) improvements put SBCL performance on the same order of magnitude as Julia.
Good to see a few languages like Nim and Rust actually beating C for raw performance, too.
Now I’m tempted to try speeding it up, there’s no way it should be behind Python...