Benchmarks of an early version on NVIDIA's blog show the performance is about the same (the average across the benchmarks actually shows Julia as slightly faster, but it's basically a wash):
https://developer.nvidia.com/blog/gpu-computing-julia-progra....
While much has changed since then, the architecture is effectively the same. Julia's native CUDA support boils down to compiling via LLVM's PTX backend: Julia always generates LLVM IR, and the CUDA infrastructure "simply" retargets that IR to PTX, generates the binary, and wraps it in a function that Julia calls. So it really comes down to the performance difference between code generated by LLVM's PTX backend and code generated by the NVCC compiler.
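To make the pipeline concrete, here is a minimal sketch using CUDA.jl (the package behind Julia's native CUDA support). The `@cuda` macro triggers the IR-to-PTX compilation described above, and `@device_code_ptx` lets you inspect the PTX that LLVM emits for the kernel. This assumes a CUDA-capable GPU and an installed CUDA.jl; the kernel itself is a standard vector-add example, not code from the linked benchmark.

```julia
using CUDA

# An ordinary Julia function: CUDA.jl compiles it through
# Julia's usual LLVM IR path, then retargets to PTX.
function vadd!(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

n = 1024
a = CUDA.fill(1.0f0, n)
b = CUDA.fill(2.0f0, n)
c = CUDA.zeros(Float32, n)

# Launching compiles (and caches) the kernel via the LLVM PTX backend.
@cuda threads=256 blocks=cld(n, 256) vadd!(c, a, b)

# Dump the generated PTX to compare against what NVCC would produce
# for the equivalent CUDA C kernel.
@device_code_ptx @cuda threads=256 blocks=cld(n, 256) vadd!(c, a, b)
```

Since both toolchains ultimately emit PTX for the same hardware, differences come down to how well each compiler's optimization passes handle the kernel, which is why the benchmarks land so close together.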