VkFFT – Vulkan Fast Fourier Transform library

(github.com)

123 points · DTolm · 5y ago · 50 comments

DTolm (OP) · 5y ago

Hello! Since the last post VkFFT has experienced a number of huge improvements and optimizations. Namely:

- It now supports sequences up to 2^32 in all dimensions (algorithmically; in reality it is limited by allocatable memory size, and a switch to a 64-bit addressing scheme is planned for a future release)

- configurations optimized for a wider range of systems and vendors

- benchmarked Radeon VII and RTX 3080, shows that FFT is extremely bandwidth limited on modern GPUs

- VkFFT is able to match and outperform cuFFT on the whole tested range from 2^7 to 2^28 in single precision

- added double and half precision support and precision tests against FFTW on CPU

- improved native zero-padding: up to 3x performance boost

- switched license to MPL 2.0

Thanks for your attention! I am happy to answer any questions.

devit · 5y ago

> VkFFT is able to match and outperform cuFFT on the whole tested range from 2^7 to 2^28 in single precision

What is your explanation for this?

Is the VkFFT algorithm better? Is SPIR-V fundamentally more expressive than PTX? Are nVidia drivers better at compiling SPIR-V than PTX?

Have you compared the generated GPU assembly from both?

DTolm (OP) · 5y ago

FFT is an extremely bandwidth-limited problem, so if most of the time in both libraries is taken by a single upload, the overall time will be similar. A more in-depth analysis of how VkFFT and cuFFT scale with memory clocks and bandwidth can be found here: https://www.reddit.com/r/nvidia/comments/jxlbjs/rtx_3090_ove...

I don't know exactly what cuFFT does differently, but I am fairly certain they use a very similar memory layout and similar algorithms behind their code (judging by execution times only).

The main takeaway is that Vulkan allows low-level memory control with similar performance, while being cross-platform and open source. I don't think that SPIR-V is more expressive; I bet Nvidia wouldn't allow that. But that doesn't stop it from being good.

stagger87 · 5y ago

Do I understand your benchmark plots correctly?

Using the single precision at 1k FFT size as my example.

~165,000 kB/ms performance

Converts to 165,000 MB/s performance

Divide by 8 to convert to complex samples, so 20,625 M complex samples per second.

Divide by 1k to get FFT count of ~20.14M FFT/IFFTs per second?

These benchmarks also include transfer time to and from the GPU?

DTolm (OP) · 5y ago

A 1k FFT in single precision is 1024 x 2 x sizeof(float) = 8KB. If we ignore that a single transform of this size won't utilize the full GPU (not even one compute unit) and assume that it scales like the big systems, then:

1) 165GB/s is the algorithmic bandwidth of the benchmark, which runs a consecutive FFT+iFFT. Each of them takes one upload to and one download from the chip, for a total of 4 memory transfers. The real bandwidth for this value is 4*165=660GB/s.

2) One FFT is 2 transfers, an upload and a download: 16KB in total.

3) 660GB/s / 16KB = 43M iterations per second. This is similar to your number, but yours didn't account for the benchmark having 4 memory transfers instead of 2.
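The arithmetic above can be checked with a short script. This is only a sketch, not part of VkFFT: the variable names are made up for illustration, and it assumes that KB and GB are binary units (2^10 and 2^30 bytes), which is the reading that reproduces the 43M figure. With decimal gigabytes the result comes out closer to 40M.

```python
# Sketch of the bandwidth arithmetic from the comment above.
# Assumption: KB and GB are binary units (KiB, GiB).
KIB = 1024
GIB = 1024 ** 3

fft_size = 1024                          # complex samples in one sequence
bytes_per_sequence = fft_size * 2 * 4    # re + im, 4-byte floats -> 8 KiB

algorithmic_bandwidth = 165 * GIB        # bytes/s, read off the benchmark plot
transfers_per_iteration = 4              # FFT + iFFT, one upload and one download each
real_bandwidth = transfers_per_iteration * algorithmic_bandwidth

bytes_per_fft = 2 * bytes_per_sequence   # one upload + one download -> 16 KiB
ffts_per_second = real_bandwidth // bytes_per_fft

print(f"sequence size:   {bytes_per_sequence // KIB} KiB")
print(f"real bandwidth:  {real_bandwidth // GIB} GiB/s")
print(f"FFTs per second: {ffts_per_second / 1e6:.2f} M")
```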

These benchmarks don't include transfers to and from GPU, as those are done with PCI-E bandwidth (30GB/s) which is really slow compared to VRAM-chip bandwidth (>500GB/s). This is why it is important to have enough VRAM and avoid CPU communications as much as possible.
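To get a feel for the gap, here is a rough estimate of how long it takes to move the largest tested sequence (2^28 single-precision complex samples) over each link. It is a back-of-the-envelope sketch: it uses the 30GB/s and 500GB/s figures quoted above (the latter is only a lower bound), treats GB as 2^30 bytes, and ignores latency and any overlap of transfers with compute.

```python
# Rough transfer-time comparison; bandwidth figures are the ones quoted in the comment.
GIB = 1024 ** 3

def transfer_ms(num_bytes, bandwidth_gib_per_s):
    """Time in milliseconds to move num_bytes at the given bandwidth."""
    return num_bytes / (bandwidth_gib_per_s * GIB) * 1000.0

sequence_bytes = 2 ** 28 * 2 * 4         # 2^28 complex samples, 4-byte floats = 2 GiB

pcie_ms = transfer_ms(sequence_bytes, 30)    # CPU <-> GPU over PCI-E
vram_ms = transfer_ms(sequence_bytes, 500)   # VRAM <-> chip

print(f"PCI-E: {pcie_ms:.1f} ms, VRAM: {vram_ms:.1f} ms, ratio: {pcie_ms / vram_ms:.1f}x")
```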

jiehong · 5y ago

> -benchmarked Radeon VII and RTX 3080, shows that FFT is extremely bandwidth limited on modern GPUs

Great to see that!

I expect huge improvements in that area with AMD's new RX series with SAM activated [0].

[0]: https://www.amd.com/en/technologies/smart-access-memory

DTolm (OP) · 5y ago

Actually, it is still best to aim at zero transfers between GPU and CPU during the execution. The GPU is limited by VRAM-chip bandwidth which is much bigger than the PCI-E bandwidth. And it should not be affected by SAM.

enriquto · 5y ago

Any plans for arbitrary-size transforms? (i.e., not restricted to vectors whose dimension is a power of two)

DTolm (OP) · 5y ago

Yes, this is indeed something I would like to add in the future. While adding support for different radix kernels for small prime factors is not that hard, writing an efficient scheduler is a much more challenging task (each sequence, even a power of 2, is currently split differently for different architectures to optimize performance).

Bluestein's algorithm, typically used for arbitrary prime sizes, requires both zero-padding and convolution support, which are already efficiently implemented, so it is also not completely out of reach.
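For readers unfamiliar with it, the following is a minimal CPU sketch of Bluestein's algorithm, showing why zero-padding and convolution are the two building blocks it needs. It is not VkFFT code and says nothing about how VkFFT would implement it; the function names are made up for this example, and the recursive radix-2 FFT is there only to keep the sketch self-contained.

```python
import cmath

def fft_pow2(x, inverse=False):
    """Unnormalized recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft_pow2(x[0::2], inverse)
    odd = fft_pow2(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def bluestein_dft(x):
    """DFT of arbitrary length n, rewritten as a circular convolution of length m."""
    n = len(x)
    m = 1
    while m < 2 * n - 1:          # smallest power of two that avoids wrap-around
        m *= 2
    # Chirp w[k] = exp(-i*pi*k^2/n); reducing k^2 mod 2n keeps the angle small.
    w = [cmath.exp(-1j * cmath.pi * ((k * k) % (2 * n)) / n) for k in range(n)]
    # Zero-padded, chirp-modulated input.
    a = [x[k] * w[k] for k in range(n)] + [0j] * (m - n)
    # Conjugate chirp, laid out symmetrically for circular convolution.
    b = [0j] * m
    b[0] = w[0].conjugate()
    for k in range(1, n):
        b[k] = b[m - k] = w[k].conjugate()
    # Convolution theorem: multiply the spectra, then transform back.
    fa, fb = fft_pow2(a), fft_pow2(b)
    conv = fft_pow2([p * q for p, q in zip(fa, fb)], inverse=True)
    return [w[k] * conv[k] / m for k in range(n)]

def naive_dft(x):
    """O(n^2) reference DFT used only to check the result."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7]          # prime length, not a power of two
    err = max(abs(p - q) for p, q in zip(bluestein_dft(data), naive_dft(data)))
    print("max abs difference vs naive DFT:", err)
```

The length-7 transform is carried out with three power-of-two FFTs of length 16, which is the trade-off: arbitrary sizes become possible at the cost of a larger padded transform.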

dcgudeman · 5y ago

why did you choose MPL 2.0?

DTolm (OP) · 5y ago

It is a great open-source license for library projects. For example, Eigen uses it: https://eigen.tuxfamily.org/index.php?title=News:Relicensing...!

p0sixlang · 5y ago

Can someone ELI5 what this library is useful for?

Lichtso · 5y ago

So far there have been two ways to do heavy compute tasks on GPUs: CUDA (Nvidia only) and OpenCL (all vendors). Nvidia invested a lot in software and toolchains to make CUDA the go-to option for many projects (especially in the machine learning community). Meanwhile OpenCL is falling apart and sees less and less support and updates.

However, the Vulkan API, which is also supported by most vendors (except Apple, where you have to use a compatibility layer called MoltenVK), is gaining traction in the compute sector. If you trust the benchmarks, then this library shows that you can get performance out of Vulkan compute similar to what you would expect from CUDA. It is just that this library provides only a very small fraction of the features of the CUDA ecosystem, so the Vulkan compute ecosystem still has a lot of catching up to do.

Edit: In case it is not obvious from the title, the library is used to calculate the https://en.wikipedia.org/wiki/Fast_Fourier_transform

matthiasv · 5y ago

> Meanwhile OpenCL is falling apart and sees less and less support and updates.

I think this view is too pessimistic. In fact, support either gets better (Intel oneAPI, Microsoft CLonD3D12, AMD ROCm, Mesa NIR-clover, …) or is unchanged but still maintained (NVIDIA). Moreover, Khronos noticed that OpenCL 2.x was a dead end and had to start over from a point that all vendors could agree on.

enriquto · 5y ago

I'm fascinated, and at the same time slightly troubled, by your usage of the word "compute".

mrweasel · 5y ago

I don’t write C++, but isn’t the code extremely messy? Also it appears to be C++ and not C like the “Read me” says.

DTolm (OP) · 5y ago

The library only includes vkFFT.h file (in C) and a set of shaders (C-like language compiled to SPIR-V). Vulkan_FFT.cpp is only an example that shows how VkFFT can be used. It also contains the benchmark in it, but it is not a part of the library.

mrweasel · 5y ago

Aah, okay, I was a little confused about the Vulkan_FFT.cpp. It seemed a little weird to have everything in the .h file, and not just the functions you want to expose in the library.

Again, I know no C++ and a very limited amount of C, so don’t put too much value in my comment. You seem to be very fond of switch statements; consider not stuffing too much code into each case. It makes the flow hard to follow. Break the case code into functions and call those.

You have a switch with 40 cases, to load the SPIR-V. I feel like there’s a better way to deal with that. Maybe just a strict naming convention, so having the ID is enough to locate the file.

Impressive work in any case.
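As an illustration of the naming-convention idea (shown in Python for brevity rather than the library's C), the lookup could reduce to formatting the ID into a path. The directory layout, file-name pattern and function name below are hypothetical; they are not how VkFFT actually names its shaders.

```python
NUM_SHADERS = 40  # matches the 40 cases mentioned above

def shader_path(shader_id, root="shaders"):
    """Map a numeric shader ID to its SPIR-V file using a fixed naming scheme.

    The 'vkfft_NN.spv' pattern is a made-up convention for this sketch.
    """
    if not 0 <= shader_id < NUM_SHADERS:
        raise ValueError(f"unknown shader id: {shader_id}")
    return f"{root}/vkfft_{shader_id:02d}.spv"

print(shader_path(7))
```

With a scheme like this, adding a shader means adding a file and bumping the count, rather than adding another case to the switch.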

29athrowaway · 5y ago

Rendering a triangle in Vulkan will make you cry.

Narann · 5y ago

With OpenGL you draw a triangle, and eventually write a pipeline.

With Vulkan you write a pipeline, and eventually draw a triangle.

exDM69 · 5y ago

Best explanation of the two APIs I've ever heard.

OpenGL "hello triangle" is short only if you cut corners. If you do it the way you'd do in a production app, you're not that far off from the lines of code it takes to do it in Vulkan. It's still less, but on the same order of magnitude.

0-_-0 · 5y ago

OpenGL gives you a Toyota, Vulkan gives you the parts to a Ferrari

NL807 · 5y ago

That's because Vulkan is designed to render millions of triangles, not one.

You wouldn't use Vulkan to render a single triangle for the same reason you wouldn't use a helicopter to get a bottle of milk from your local shop.

Reelin · 5y ago

What's your point? So will writing your own custom UI toolkit, or manually doing your own font rendering, or implementing a custom equivalent to TensorFlow, or writing your own implementation of the C standard library, or ...

If you don't need low-level control, you should be using middleware or a full-blown engine (Godot, Unity, Unreal, etc).

meekrohprocess · 5y ago

Eh, it's an investment.

Debugging your first segfault will also make you cry, but it's good for you. It builds character, and prepares you for the more insidious segfaults that are lurking out in the tall grasses.

OneGuy123 · 5y ago

This is why you use a game/rendering framework/engine if you want to render a single triangle to play around.

This kind of "hardcore requirements to render a single triangle" is what is needed if you wish to have AAA game titles with realistic graphics at 60fps as it gives developers the freedom to do anything they wish.

More freedom -> better graphics at higher performance but more difficult to draw a single triangle.

Less freedom -> easier for beginners to draw a triangle but that is irrelevant in practice because that is not what those APIs are for.

slater · 5y ago

wasn't that same joke made about OpenGL? plus ça change :)

29athrowaway · 5y ago

OpenGL 1 was very straightforward, and if you want, you can still use it.

Recent versions of OpenGL are more difficult. But compared to OpenGL, Vulkan is in an entirely different class of difficulty.

sudosysgen · 5y ago

Plus ça reste pareil! Plus, it's not even that hard, and if you really find it that hard you can always use a wrapper library.
