[1]: https://jobs.amd.com/job/Calgary-GPU-Libraries-Software-Deve... [2]: https://github.com/ROCmSoftwarePlatform/rocFFT
Fluid flow, heat transfer, and other such physical phenomena that you might want to simulate.
Phase correlation in image processing is another example. (https://en.wikipedia.org/wiki/Phase_correlation)
Molecular dynamics (MD) simulations rely on the FFT, but I'm not sure how much of it is typically (or can be) done on the GPU. For example, NAMD employs cuFFT on the GPU in some cases. (https://aip.scitation.org/doi/10.1063/5.0014475)
In general, Vulkan is an API for commanding the GPU; it is not opinionated about the language used to write kernels, as long as that language compiles to SPIR-V. SPIR-V itself is something like a parallel LLVM IR. If you look into the project source, the shaders are written in GLSL and have been pre-compiled into SPIR-V with a cross-compiler. The C file in the project root serves as the loader program for the SPIR-V files.
The Futhark project did some initial benchmarks on translating OpenCL to Vulkan; the results were mainly slowdowns. You can read about it here: https://futhark-lang.org/student-projects/steffen-msc-projec...
https://home.otoy.com/octane2020-rndr-released/
"OTOY | GTC 2020: Real-Time Raytracing, Holographic Displays, Light Field Media and RNDR Network"
There are no error bars on the graphs, so it's very hard to judge whether the minor differences are significant. I work in research, so perhaps I'm particular about this point, but I'd expect better from anyone who has taken basic statistics. From a quick look, though, the performance seems to be pretty much on par.
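For what it's worth, error bars of this kind are cheap to produce: time each FFT size several times and report the mean plus or minus the standard error. A minimal sketch in Python, with made-up run times standing in for real benchmark numbers:

```python
# Sketch: deriving an error bar from repeated benchmark runs.
# The timings below are illustrative, not actual VkFFT/cuFFT results.
from math import sqrt
from statistics import mean, stdev

runs_ms = [4.21, 4.35, 4.18, 4.40, 4.25]  # hypothetical timings of one FFT size

m = mean(runs_ms)
sem = stdev(runs_ms) / sqrt(len(runs_ms))  # standard error of the mean

# The "±" value is exactly what belongs on each bar of the chart.
print(f"{m:.2f} ms ± {sem:.2f} ms")
```

Running each size even five times is usually enough to see whether a few-percent gap is noise or a real difference.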
It would also be nice to know how performance looks on other hardware. I'm assuming it's tuned for Nvidia GPUs (or maybe even for the specific GPU mentioned). But how does this perform on Intel or AMD hardware? How does it compare to `rocFFT` or Intel's own implementation?
I have tested VkFFT on an Intel UHD620 GPU and the performance scaled the same way as in most of the benchmarks. There are a couple of parameters that can be tuned for different GPUs (like the amount of memory coalesced, which is 32 bits on Nvidia GPUs after Pascal and 64 bits on Intel). I have no access to an AMD machine; otherwise I would have refined the launch configuration parameters for it too. I have not tested libraries other than cuFFT yet.
Also, I should have said this in my first post, which in hindsight might come across as too negative: I think this is a cool project and you did a great job! I just thought this might improve the presentation of your results a bit.
Personally, I would have a hard time hiring anyone without a GitHub account, and even less so working in a place where nobody has one.
Seems a bit more feature-complete than my take on the problem: https://github.com/Lichtso/VulkanFFT
Still, a lot is missing before Vulkan can beat CUDA: Scan, Reduce, Sort, Aggregate, Partition, Select, Binning, etc.
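For readers unfamiliar with these primitives: CUDA ships them as ready-made library calls (Scan, for instance, is a prefix sum, available as `thrust::inclusive_scan`), whereas Vulkan has no standard equivalent and you'd write the shader yourself. A toy Python sketch of what the Scan primitive computes:

```python
# Sketch: the result a GPU "Scan" (inclusive prefix sum) primitive produces.
# On the GPU this is computed in parallel; here we just show the output.
from itertools import accumulate

data = [3, 1, 4, 1, 5]
prefix_sums = list(accumulate(data))
print(prefix_sums)  # → [3, 4, 8, 9, 14]
```

Each output element is the sum of all inputs up to and including that position; Reduce, Sort, and the rest are similarly fundamental building blocks for GPU algorithms.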
It is a library.
What about bigger than big, say >2^29 or so? Are these sizes for double precision?
A free GPU FFT implementation will certainly help! Great work.
If your entire stack lived in the GPU, and you're just reading out the result, this is trivial.
If you're constantly copying buffers back and forth because some effects are implemented in the CPU and some in the GPU, not so much!
It's probably the case that a full stack GPU implementation would blow what we have out of the water, but you'd lose your entire ecosystem in the process, so it's probably never going to happen.
But even if that is not the case, machine learning is making its way into music production tools. No doubt a beefy GPU will be useful to a lot of music production professionals in the future, as the tools they use begin to leverage ML more and more.
The time budget to refresh a video frame is about 8 ms at 120 Hz, if everything else came free; in practice it's closer to 4 ms or less. So even under close to worst-case conditions, that's about the delay of sound traveling a meter or so, which should be fine for a lot of real-life applications.
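The arithmetic above is easy to check. A small sketch, assuming a speed of sound of roughly 343 m/s in air:

```python
# Sketch: sanity-checking the frame-budget and sound-travel numbers above.
frame_budget_ms = 1000 / 120           # one frame period at 120 Hz
speed_of_sound_mps = 343.0             # m/s in air at ~20 °C (assumed)
distance_m = 0.004 * speed_of_sound_mps  # how far sound travels in 4 ms

print(f"{frame_budget_ms:.2f} ms per frame")  # → 8.33 ms per frame
print(f"{distance_m:.2f} m in 4 ms")          # → 1.37 m in 4 ms
```

So a 4 ms processing delay is acoustically equivalent to standing about a meter and a half farther from the speaker.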