Rust (1.13.0-nightly) 1m32.392s
Nim (0.14.2) 1m53.320s
C 1m59.116s
Julia (0.4.6) 2m01.166s
Crystal (0.18.7) 2m01.735s
C Double Precision 2m26.546s
Java (1.7.0_111) 2m36.949s
Nim Double Precision (0.14.2) 3m19.547s
OCaml 3m59.597s
Go 1.6 6m44.151s
node.js (6.2.1) 7m59.041s
node.js (5.7.1) 8m49.170s
C# 12m18.463s
PyPy 14m02.406s
Lisp 24m43.216s
Haskell 26m34.955s
Elixir 123m59.025s
Elixir MP 138m48.241s
Luajit 225m58.621s
Python 348m35.965s
Lua 611m38.925s

C (gcc -O3): 23.8s
Julia (julia 1.1.0): 32.8s
Go (alt) (go 1.12): 39.3s
Java (java 1.8.0_60): 44.2s
Go (org) (go 1.12): 64.8s
OCaml (ocaml 4.07.1): 79.1s
JS (node 11.14.0): 137.0s
Pypy (pypy 6.0.0): 139.4s
C# (mono 4.2.1): 187.3s
Rust: DOES NOT COMPILE
So Go is only twice as slow as C, not three times as slow. This puts it just ahead of Java and just behind Julia. I'm sure a "mechanical translation of [the] C" version would improve things for the Java version as well. If we removed startup costs (class file validation, etc.), I'd expect it to be on par with C.
Last time I checked, many years back, the spec was changing and the run time did crash. Guess it has gone a long way since.
The other one is Lua. My assumption was that it's one of the lightest and fastest languages around. Looks like "fastest" isn't true in some cases.
Rules like "code should be simple, as in, easy to read and understand" are also hard to judge, especially near the top of the list where there's a lot of pressure to optimize. Is SIMD easy to understand? What if it's in a library? What if the library was written specifically for this benchmark? Etc. I think https://benchmarksgame-team.pages.debian.net/benchmarksgame/ has to deal with every possible permutation of this debate.
I wonder if the performance gap is attributable to some overhead in Go's function calls? I know Go passes parameters on the stack instead of in registers... Maybe it's due to passing struct copies instead of references (it looks like the C version passes references)? Or just generally poor code generation?
Anyone else have ideas or care to profile?
EDIT: From my 2015 MBP, Go (version 1.12) is indeed quite a lot slower than C, but only when the C version is built with optimizations (`-O3`):
tmp $ time ./gorb
real 1m15.128s
user 1m9.366s
sys 0m6.754s
tmp $ clang crb.c
tmp $ time ./a.out
real 1m13.041s
user 1m10.284s
sys 0m0.624s
tmp $ gcc crb.c -o crb -std=c11 -O3 -lm -D_XOPEN_SOURCE=600
tmp $ time ./crb
real 0m22.703s
user 0m22.550s
sys 0m0.073s
tmp $ clang crb.c -o crb -std=c11 -O3 -lm -D_XOPEN_SOURCE=600
tmp $ time ./crb
real 0m22.689s
user 0m22.564s
sys 0m0.060s
EDIT2: I re-modified the Go version (https://gist.github.com/weberc2/2aed4f8d3189d09067d564448367...) to pass references and that seems to put it on par with C (or I mistranslated, which is also likely):

$ time ./gorb
real 0m19.282s
user 0m14.467s
sys 0m7.523s

(Where Go "wants" to play is that same benchmark, except including compilation time.)
A couple of the things below Go I suspect are bad implementations. I would expect a warmed-up C# to beat Go if both have reasonable (not super-crazy optimized) implementations, or at least be at parity, and LuaJIT may also be a slow implementation. In both cases because ray-tracing code is a great place for a JIT to come out and play. EDIT: Oh, I see C# is Mono, and not the Windows implementation. In that case that makes sense.
Oh, and I find it helpful to look at these things logarithmically. I think that matches real-world experience somewhat better, even though we truly pay for performance in the linear world. From that perspective, the OCaml->Go gap is still only the second largest. The largest is Haskell to Elixir, which is substantially bigger. OCaml->Go is large, but not crazily so; several other differences come close.
- initial time: 49s
- replacing the default thread-safe RNG with rand.New shaved 6 seconds off (the default RNG uses a mutex). = 43s
- using float64 instead of float32 and removing many type conversions shaved off another two seconds. = 41s
As others suggested, Go still lacks many compile-time optimizations, and the implementation could be improved.
If so it may be something to do with the creation of new vectors on the heap instead of the stack. The compiler would have to determine the full lifetime of the vector value to be able to bump it to the stack. That's an optimization, and sometimes it's just not possible (but probably is here).
In the C instance no such optimization is necessary. They ask for it on the stack, and if you try to use it after the stack is gone, Bad Things™ happen.
A hypothesis to test would be that languages above the jump are managing to work on the stack and ones below are allocating objects on the heap.
EDIT: Looking at the C code's assembly, it's generating mostly SIMD instructions and using xmm registers. That's why it's faster. The Go compiler still doesn't have autovectorization implemented, which is why it's so much slower in this case.
EDIT2: It seems the Go version also uses SSE here, which is nice. So the unnecessary allocation from my original post was probably the reason.
crb-omp 0m22.949s
crb 0m23.240s
crb_opt 0m31.404s
nimrb_pmap 0m8.828s
nimrb_fn 0m22.988s
nimrb 0m26.556s
The base C code is faster than the base Nim code. The optimised C code is significantly slower than everything else (!?). The Nim program that uses a threadpool is the fastest of these.

I couldn't get the CPP versions to compile. I'll do the Rust programs some other time.
$ time ./crb
That means time spent writing the .ppm file is included. In the implementations I browsed, that is about a million print calls, each of which might flush the output buffer, and whose performance may depend on locale.
To benchmark ray tracing I would, instead, just output the sum of the pixel values, or set the exit code depending on that value.
Even though ray tracing is cpu intensive, it also wouldn’t completely surprise me if some of the implementations in less mature languages spent significant time writing that output because their programmers haven’t come around to optimizing such code.
-data Vector3 = Vector3 {vx::Float, vy::Float, vz::Float} deriving (Show)
+data Vector3 = Vector3 {vx :: !Float, vy :: !Float, vz :: !Float} deriving (Show)

Baseline of the non-multithreaded variant on my machine: 1m56s
Making Vector3 a struct: 1m3s
Making Vector3 a readonly struct: 1m1s
Making Hit and Ray a struct: 1m26s
Will test more tomorrow, I guess, but the most obvious change already yields a 2× speedup. This was also without any profiling, so I don't even know what I did there.
Oftentimes compile-to-C output is "It's C, Jim, but not as we know it."
You can write C as if it is a SSA VM or similar intermediate representation that leaves very little work for the first stages of the compiler.
$ time python pyrb.py
real 348m35.965s
user 345m51.776s
sys 0m22.880s
$ time pypy pyrb.py
real 14m2.406s
user 13m55.292s
sys 0m1.416s

I've certainly seen speedups like that on stuff like Project Euler code.
That's an ancient version of Rust. Interesting that it is faster than C, though.
It would actually be quite interesting to see a comparison with all of the languages using more recent builds to see which ones are developing their performance.
[0]https://gitlab.haskell.org/ghc/ghc/wikis/commentary/compiler...
[1]http://blog.llvm.org/2010/05/glasgow-haskell-compiler-and-ll...
There is a big variation in performance, some of which I find surprising. Do you know what exactly causes some languages to be so slow (e.g., small objects being created and garbage collected frequently)?
EDIT: forgot to mention the obvious things: compiler/interpreter maturity and inherent overhead.
(declaim (optimize (speed 3) (safety 0) (space 0) (debug 0) (compilation-speed 0)))
Never use `(optimize (safety 0))` in SBCL — it throws safety completely out the window. We're talking C-levels of safety at that point. Buffer overruns, the works. It might buy you 10-20% speed, but it's not worth it. Lisp responsibly, use `(safety 1)`.

(defconstant WIDTH 1280)
People generally name constants in CL with +plus-muffs+. Naming them as uppercase doesn't help because the reader uppercases symbol names by default when it reads. So `(defconstant WIDTH ...)` means you can no longer have a variable named `width` (in the same package).

(defstruct (vec
(:conc-name v-)
(:constructor v-new (x y z))
(:type (vector float)))
x y z)
Using `:type (vector float)` here is trying to make things faster, but failing. The type designator `float` covers all kinds of floats, e.g. both `single-float`s and `double-float`s in SBCL. So all SBCL knows is that the struct contains some kind of float, and it can't really do much with that information. This means all the vector math functions below have to fall back to generic arithmetic, which is extremely slow. SBCL even warns you about this when it's compiling, thanks to the `(optimize (speed 3))` declaration, but I guess they ignored or didn't understand those warnings.

(defconstant ZERO (v-new 0.0 0.0 0.0))
This will cause problems because if it's ever evaluated more than once it'll try to redefine the constant to a new `vec` instance, which will not be `eql` to the old one. Use `alexandria:define-constant` or just make it a global variable.

All the vector math functions are slow because they have no useful type information to work with:
(disassemble 'v-add)
; disassembly for V-ADD
; Size: 160 bytes. Origin: #x52D799AF
; 9AF: 488B45F8 MOV RAX, [RBP-8] ; no-arg-parsing entry point
; 9B3: 488B5001 MOV RDX, [RAX+1]
; 9B7: 488B45F0 MOV RAX, [RBP-16]
; 9BB: 488B7801 MOV RDI, [RAX+1]
; 9BF: FF1425A8001052 CALL QWORD PTR [#x521000A8] ; GENERIC-+
; 9C6: 488955E8 MOV [RBP-24], RDX
; 9CA: 488B45F8 MOV RAX, [RBP-8]
; 9CE: 488B5009 MOV RDX, [RAX+9]
; 9D2: 488B45F0 MOV RAX, [RBP-16]
; 9D6: 488B7809 MOV RDI, [RAX+9]
; 9DA: FF1425A8001052 CALL QWORD PTR [#x521000A8] ; GENERIC-+
; 9E1: 488BDA MOV RBX, RDX
; 9E4: 488B45F8 MOV RAX, [RBP-8]
; 9E8: 488B5011 MOV RDX, [RAX+17]
; 9EC: 488B45F0 MOV RAX, [RBP-16]
; 9F0: 488B7811 MOV RDI, [RAX+17]
; 9F4: 48895DE0 MOV [RBP-32], RBX
; 9F8: FF1425A8001052 CALL QWORD PTR [#x521000A8] ; GENERIC-+
; 9FF: 488B5DE0 MOV RBX, [RBP-32]
; A03: 49896D40 MOV [R13+64], RBP ; thread.pseudo-atomic-bits
; A07: 498B4520 MOV RAX, [R13+32] ; thread.alloc-region
; A0B: 4C8D5830 LEA R11, [RAX+48]
; A0F: 4D3B5D28 CMP R11, [R13+40]
; A13: 772E JNBE L2
; A15: 4D895D20 MOV [R13+32], R11 ; thread.alloc-region
; A19: L0: C600D9 MOV BYTE PTR [RAX], -39
; A1C: C6400806 MOV BYTE PTR [RAX+8], 6
; A20: 0C0F OR AL, 15
; A22: 49316D40 XOR [R13+64], RBP ; thread.pseudo-atomic-bits
; A26: 7402 JEQ L1
; A28: CC09 BREAK 9 ; pending interrupt trap
; A2A: L1: 488B4DE8 MOV RCX, [RBP-24]
; A2E: 48894801 MOV [RAX+1], RCX
; A32: 48895809 MOV [RAX+9], RBX
; A36: 48895011 MOV [RAX+17], RDX
; A3A: 488BD0 MOV RDX, RAX
; A3D: 488BE5 MOV RSP, RBP
; A40: F8 CLC
; A41: 5D POP RBP
; A42: C3 RET
; A43: L2: 6A30 PUSH 48
; A45: FF142520001052 CALL QWORD PTR [#x52100020] ; ALLOC-TRAMP
; A4C: 58 POP RAX
; A4D: EBCA JMP L0
If they had done the type declarations correctly, it would look more like this:

; disassembly for V-ADD
; Size: 122 bytes. Origin: #x52C33A78
; 78: F30F104A05 MOVSS XMM1, [RDX+5] ; no-arg-parsing entry point
; 7D: F30F105F05 MOVSS XMM3, [RDI+5]
; 82: F30F58D9 ADDSS XMM3, XMM1
; 86: F30F104A0D MOVSS XMM1, [RDX+13]
; 8B: F30F10670D MOVSS XMM4, [RDI+13]
; 90: F30F58E1 ADDSS XMM4, XMM1
; 94: F30F104A15 MOVSS XMM1, [RDX+21]
; 99: F30F105715 MOVSS XMM2, [RDI+21]
; 9E: F30F58D1 ADDSS XMM2, XMM1
; A2: 49896D40 MOV [R13+64], RBP ; thread.pseudo-atomic-bits
; A6: 498B4520 MOV RAX, [R13+32] ; thread.alloc-region
; AA: 4C8D5820 LEA R11, [RAX+32]
; AE: 4D3B5D28 CMP R11, [R13+40]
; B2: 7734 JNBE L2
; B4: 4D895D20 MOV [R13+32], R11 ; thread.alloc-region
; B8: L0: 66C7005903 MOV WORD PTR [RAX], 857
; BD: 0C03 OR AL, 3
; BF: 49316D40 XOR [R13+64], RBP ; thread.pseudo-atomic-bits
; C3: 7402 JEQ L1
; C5: CC09 BREAK 9 ; pending interrupt trap
; C7: L1: C7400103024F50 MOV DWORD PTR [RAX+1], #x504F0203 ; #<SB-KERNEL:LAYOUT for VEC {504F0203}>
; CE: F30F115805 MOVSS [RAX+5], XMM3
; D3: F30F11600D MOVSS [RAX+13], XMM4
; D8: F30F115015 MOVSS [RAX+21], XMM2
; DD: 488BD0 MOV RDX, RAX
; E0: 488BE5 MOV RSP, RBP
; E3: F8 CLC
; E4: 5D POP RBP
; E5: C3 RET
; E6: CC0F BREAK 15 ; Invalid argument count trap
; E8: L2: 6A20 PUSH 32
; EA: E8F1C64CFF CALL #x521001E0 ; ALLOC-TRAMP
; EF: 58 POP RAX
; F0: EBC6 JMP L0
The weirdness continues:

(defstruct (ray
(:conc-name ray-)
(:constructor ray-new (origin direction))
(:type vector))
origin direction)
The `:conc-name ray-` is useless, that's the default conc-name. And again with the `:type vector`… just make it a normal struct. I was going to guess that they were doing it so they could use vector literals to specify the objects, but then why are they bothering to define a BOA constructor here? And the slots are untyped, which, if you're looking for speed, is not doing you any favors.

I took a few minutes over lunch to add some type declarations to the slots and important functions, inlined the math, and cleaned up the broken indentation and naming issues:
https://gist.github.com/sjl/005f27274adacd12ea2fc7f0b7200b80...
The old version runs in 5m12s on my laptop, the new version runs in 58s. So if we unscientifically extrapolate that to their 24m time, it puts it somewhere around 5m in their list. This matches what I usually see from SBCL: for numeric-heavy code generic arithmetic is very slow, and some judicious use of type declarations can get you to within ~5-10x of C. Getting more improvements beyond that can require really bonkers stuff that often isn't worth it.
EDIT: I'm also curious how much using an optimized vector math library (e.g. sb-cga) would buy you instead of hand-rolling your own vector math. It would certainly be easier.
...and then add SIMD.
Look at Steve Losh's comment here for something a lot better. My own (further) improvements put SBCL performance on the same order of magnitude as Julia.
Good to see a few languages like Nim and Rust actually beating C for raw performance, too.
Now I’m tempted to try speeding it up, there’s no way it should be behind Python...