High-Performance GPU Computing in the Julia Programming Language (opens in new tab)

(devblogs.nvidia.com)

139 pointsceyhunkazel8y ago28 comments

28 comments

> This is in part because of the work by Google on the NVPTX LLVM back-end.

I'm one of the maintainers at Google of the LLVM NVPTX backend. Happy to answer questions about it.

As background, Nvidia's CUDA ("CUDA C++?") compiler, nvcc, uses a fork of LLVM as its backend. Clang can also compile CUDA code, using regular upstream LLVM as its backend. The relevant backend in LLVM was originally contributed by nvidia, but these days the team I'm on at Google is the main contributor.

I don't know much (okay, anything) about Julia except what I read in this blog post, but the dynamic specialization looks a lot like XLA, a JIT backend for TensorFlow that I work on. So that's cool; I'm happy to see this work.

Full debug information is not supported by the LLVM NVPTX back-end yet, so cuda-gdb will not work yet.

We'd love help with this. :)

Bounds-checked arrays are not supported yet, due to a bug [1] in the NVIDIA PTX compiler. [0]

We ran into what appears to be the same issue [2] about a year and a half ago. nvidia is well aware of the issue, but I don't expect a fix except by upgrading to Volta hardware.

[0] https://julialang.org/blog/2017/03/cudanative [1] https://github.com/JuliaGPU/CUDAnative.jl/issues/4 [2] https://bugs.llvm.org/show_bug.cgi?id=27738

syllogism8y ago

Does this mean we could hook Cython up to NVPTX as the backend?

I've always thought it weird that I'm writing all my code in this language that compiles to C++, with semantics for any type declaration etc...And then I write chunks of code in strings, like an animal.

nicwilson8y ago

IDK about Cython, but I remember a blog post using Python's AST reflection to jit to LLVM ->NVPTX -> PTX. It's relatively simple to do, I've done it for LDC/D/DCompute[1,2,3]. It's a little tricker if you want to be able to express shared memory surfaces & textures, but it should still be doable.

[1] https://github.com/ldc-developers/ldc [2] dlang.org [3] http://github.com/libmir/dcompute

jlebar8y ago

Running code on the GPU usually isn't as easy as "compile and go", but...yeah, if you can emit LLVM IR, you can get PTX which you can run on the GPU.

wallnuss8y ago

I spend a while looking at debug information for NVPTX last year and came to the conclusion that it luckily dwarf, with some weird serialisation for the assembler.

The NVPTX backend would benefit imo to move towards the more general LLVM infrastructure so that emitting the dwarf info is not another special case.

jlebar8y ago

We'd like this too. Unfortunately a lot of the special cases can't be eliminated because we have to interface with ptxas, the closed-source PTX -> SASS (GPU machine code) optimizing assembler.

wallnuss8y ago

Yeah I know and the DWARF info special cases are even worse for ptxas. I never had enough time, but Nvidia has surprisingly a lot information on it out there.

waynecochran8y ago

nvcc installed on a Mac seems tied to the current clang and the latest clang's don't support CUDA development so I have to retrograde my clang to an older version to use CUDA. Why is nvcc tied to clang?

jlebar8y ago

To be clear, there are two ways to compile CUDA (C++) code. You can either use nvcc (which itself may use clang), or you can use regular, vanilla clang, without ever involving nvcc.

Nvidia's closed-source compiler, nvcc, uses your host (i.e. CPU) compiler (gcc or clang) because it transforms your input .cu file into two files, one of which it compiles for the GPU (using a program called cicc), and the other of which it compiles for the CPU using the host compiler.

The other way to do it is to use regular open-source clang without ever involving nvcc. The version of clang that comes with your xcode may not be new enough (I dunno), but the LLVM 5.0 release should be plenty new, unless you want to target CUDA 9, in which case you'll need to build from head.

I don't know the technical reasons why nvcc is so closely tied to the host compiler version -- it annoys me sometimes, too.

CyberDildonics8y ago

Are you guys working with the people on the ISPC team? They also had experimental compiling from ISPC to NVPTX, was it this backend?

jlebar8y ago

I haven't been privy to any work with the ISPC folks. It's totally possible they're using the LLVM NVPTX backend, I dunno.

dragontamer8y ago

In my experience, CUDA / OpenCL are actually rather easy to use.

The hard part is optimization, because the GPU architecture (SIMD / SIMT) is so alien compared to normal CPUs.

Here's a step-by-step example of one guy optimizing a Matrix Multiplication scheme in OpenCL (specifically for NVidia GPUs): https://cnugteren.github.io/tutorial/pages/page1.html

Just like how high-performance CPU computing requires a deep understanding of cache and stuff... high-performance GPU computing requires a deep understanding of the various memory-spaces on the GPU.

------------

Now granted: deep optimization of routines on CPUs is similarly challenging, and actually undergoes a very similar process in how to partition your work problem into L1-sized blocks. But high-performance GPUs not only have to consider their L1 Cache... but also "Shared" (or OpenCL __local) memory and "Register" (or OpenCL __private) memory as well. Furthermore, GPUs in my experience have way less memory than CPUs per thread/shader. IE: Intel "Sandy Bridge" CPU has 64kb L1 cache per core, which can be used as 2-threads if hyperthreading is enabled. A "Pascal" GPU has 64kb of "Shared" memory, which is extremely fast like L1 cache. But this 64kb is shared between 64 FP32 cores!!!.

Furthermore, not all algorithms run faster on GPGPUs either. For example:

https://askeplaat.files.wordpress.com/2013/01/ispa2015.pdf

This paper claims that their GPGPU implementation (Xeon Phi) was slower than the CPU implementation! Apparently, the game of "Hex" is hard to parallelize / vectorize.

---------------

Now don't get me wrong, this is all very cool and stuff. Making various programming tasks easier is always welcome. Just be aware that GPUs are no silver bullet for performance. It takes a lot of work to get "high-performance code", regardless of your platform.

And sometimes, CPUs are faster.

ViralBShah8y ago

Absolutely. The goal with Julia is to make it easy to use whatever hardware is best suited for the problem you are solving. This work, IMO, reduces the barrier to entry for writing code for GPUs and gives Julia users more options.

gravypod8y ago

> Julia has recently gained support for syntactic loop fusion, where chained vector operations are fused into a single broadcast

Wow. That's very impressive.

I hope one day we get this sort of tooling with AMD GPUs.

one-more-minute8y ago

Ask and ye shall receive: https://github.com/JuliaGPU/CLArrays.jl

gravypod8y ago

That's amazing. I'm very excited about the prospect of auto-magically trans-piling code into GPU code. This sort of tech will make GPUs approachable to many more scientists and programmers.

wallnuss8y ago

ROCm will make similar things possible! I prefer native codegen, CLArrays.jl is an excellent solution in the meantime.

jernfrost8y ago

How does the Julia approach compare to the alternatives in performance and ease of use? Can e.g. Python or R do this in any way?

wallnuss8y ago

The big difference is that Julia can handle user defined structs and handle higher-level functions, e.g. you pass a Julia function to you GPU kernel and that function will get compile for the GPU without you having to declare it GPU-compatible.

ChrisRackauckas8y ago

The key difference here is that, while Python and R has a lot of their standard library written in other languages (C), Julia's is mostly written in Julia. Same with Julia's packages. This means that you can throw a lot of library functions and they will GPU compile just fine because the whole stack is Julia all the way down (in many cases. There are of course exceptions).

kxyvr8y ago

I keep hearing this, but each time I look at the links on HN, I see that the high-performance libraries being cited are those still written in C, C++, or some other low level language. For example, even in this link, the code is tying into things like cuBLAS, which is definitely not Julia code. For me, high performance linear algebra routines are important and I just checked here:

https://docs.julialang.org/en/latest/stdlib/linalg/

It looks like Julia uses a combination of LAPACK and SuiteSparse. These are good choices, but it's not Julia code and these routines are callable from all sorts of other languages like Python, MATLAB, and Octave. As such, it still appears as though Julia is operating more like a glue language rather than a write all of your numerical libraries in Julia language, which is fine, but I don't feel like that's what it's being sold as.

3 more replies

j / k navigate · click thread line to collapse

28 comments

jlebar8y ago

> This is in part because of the work by Google on the NVPTX LLVM back-end.

I'm one of the maintainers at Google of the LLVM NVPTX backend. Happy to answer questions about it.

Full debug information is not supported by the LLVM NVPTX back-end yet, so cuda-gdb will not work yet.

We'd love help with this. :)

Bounds-checked arrays are not supported yet, due to a bug [1] in the NVIDIA PTX compiler. [0]

We ran into what appears to be the same issue [2] about a year and a half ago. nvidia is well aware of the issue, but I don't expect a fix except by upgrading to Volta hardware.

[0] https://julialang.org/blog/2017/03/cudanative [1] https://github.com/JuliaGPU/CUDAnative.jl/issues/4 [2] https://bugs.llvm.org/show_bug.cgi?id=27738

syllogism8y ago

Does this mean we could hook Cython up to NVPTX as the backend?

nicwilson8y ago

[1] https://github.com/ldc-developers/ldc [2] dlang.org [3] http://github.com/libmir/dcompute

jlebar8y ago

Running code on the GPU usually isn't as easy as "compile and go", but...yeah, if you can emit LLVM IR, you can get PTX which you can run on the GPU.

wallnuss8y ago

I spend a while looking at debug information for NVPTX last year and came to the conclusion that it luckily dwarf, with some weird serialisation for the assembler.

The NVPTX backend would benefit imo to move towards the more general LLVM infrastructure so that emitting the dwarf info is not another special case.

jlebar8y ago

We'd like this too. Unfortunately a lot of the special cases can't be eliminated because we have to interface with ptxas, the closed-source PTX -> SASS (GPU machine code) optimizing assembler.

wallnuss8y ago

Yeah I know and the DWARF info special cases are even worse for ptxas. I never had enough time, but Nvidia has surprisingly a lot information on it out there.

waynecochran8y ago

jlebar8y ago

To be clear, there are two ways to compile CUDA (C++) code. You can either use nvcc (which itself may use clang), or you can use regular, vanilla clang, without ever involving nvcc.

I don't know the technical reasons why nvcc is so closely tied to the host compiler version -- it annoys me sometimes, too.

CyberDildonics8y ago

Are you guys working with the people on the ISPC team? They also had experimental compiling from ISPC to NVPTX, was it this backend?

jlebar8y ago

I haven't been privy to any work with the ISPC folks. It's totally possible they're using the LLVM NVPTX backend, I dunno.

dragontamer8y ago

In my experience, CUDA / OpenCL are actually rather easy to use.

The hard part is optimization, because the GPU architecture (SIMD / SIMT) is so alien compared to normal CPUs.

Here's a step-by-step example of one guy optimizing a Matrix Multiplication scheme in OpenCL (specifically for NVidia GPUs): https://cnugteren.github.io/tutorial/pages/page1.html

Just like how high-performance CPU computing requires a deep understanding of cache and stuff... high-performance GPU computing requires a deep understanding of the various memory-spaces on the GPU.

------------

Furthermore, not all algorithms run faster on GPGPUs either. For example:

https://askeplaat.files.wordpress.com/2013/01/ispa2015.pdf

This paper claims that their GPGPU implementation (Xeon Phi) was slower than the CPU implementation! Apparently, the game of "Hex" is hard to parallelize / vectorize.

---------------

And sometimes, CPUs are faster.

ViralBShah8y ago

gravypod8y ago

> Julia has recently gained support for syntactic loop fusion, where chained vector operations are fused into a single broadcast

Wow. That's very impressive.

I hope one day we get this sort of tooling with AMD GPUs.

one-more-minute8y ago

Ask and ye shall receive: https://github.com/JuliaGPU/CLArrays.jl

gravypod8y ago

That's amazing. I'm very excited about the prospect of auto-magically trans-piling code into GPU code. This sort of tech will make GPUs approachable to many more scientists and programmers.

wallnuss8y ago

ROCm will make similar things possible! I prefer native codegen, CLArrays.jl is an excellent solution in the meantime.

jernfrost8y ago

How does the Julia approach compare to the alternatives in performance and ease of use? Can e.g. Python or R do this in any way?

wallnuss8y ago

ChrisRackauckas8y ago

kxyvr8y ago

https://docs.julialang.org/en/latest/stdlib/linalg/

3 more replies

j / k navigate · click thread line to collapse