-It now supports sequences up to 2^32 in all dimensions (algorithmically; in practice this is limited by allocatable memory size, and a switch to a 64-bit addressing scheme is planned for a future release)
-configurations optimized for a bigger range of systems and vendors
-benchmarked on Radeon VII and RTX 3080; the results show that FFT is extremely bandwidth limited on modern GPUs
-VkFFT is able to match and outperform cuFFT on the whole tested range from 2^7 to 2^28 in single precision
-added double and half precision support and precision tests against FFTW on CPU
-improved native zeropadding - up to 3x performance boost
-switched license to MPL 2.0
Thanks for your attention! I am happy to answer any questions.
What is your explanation for this?
Is the VkFFT algorithm better? Is SPIR-V fundamentally more expressive than PTX? Are nVidia drivers better at compiling SPIR-V than PTX?
Have you compared the generated GPU assembly from both?
I don't know exactly what cuFFT does differently, but I am fairly certain they use a very similar memory layout and very similar algorithms in their code (judging by execution times only).
The main takeaway should be that Vulkan allows low-level memory control with similar performance, while being cross-platform and open source. I don't think that SPIR-V is more expressive; I bet Nvidia wouldn't allow this. But that doesn't stop it from still being good.
Using the single precision at 1k FFT size as my example.
~165,000 kB/ms performance
Converts to 165,000 MB/s performance
Divide by 8 to convert to complex samples, so 20,625 M complex samples per second.
Divide by 1k to get FFT count of ~20.14M FFT/IFFTs per second?
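The conversion above can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes "1k" means 1024 points and that each single-precision complex sample takes 8 bytes. Whether the benchmark's bandwidth figure maps onto sample counts this way is an assumption, not something confirmed in this thread.

```python
# Figures taken from the comment above; fft_size = 1024 is an assumption for "1k".
bandwidth_kb_per_ms = 165_000      # reported throughput in kB/ms
bytes_per_complex_sample = 8       # single precision: 4-byte real + 4-byte imaginary
fft_size = 1024

# kB/ms and MB/s are numerically equal (both units scale by 1000).
bandwidth_mb_per_s = bandwidth_kb_per_ms
bytes_per_s = bandwidth_mb_per_s * 1_000_000

samples_per_s = bytes_per_s / bytes_per_complex_sample
ffts_per_s = samples_per_s / fft_size

print(f"{samples_per_s / 1e6:,.0f} M complex samples/s")  # 20,625 M complex samples/s
print(f"{ffts_per_s / 1e6:.2f} M FFTs/s")                 # 20.14 M FFTs/s
```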
These benchmarks also include transfer time to and from the GPU?
These benchmarks don't include transfers to and from the GPU, as those go over PCI-E (about 30GB/s), which is really slow compared to VRAM-chip bandwidth (>500GB/s). This is why it is important to have enough VRAM and avoid CPU communications as much as possible.
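To put those two bandwidth figures side by side, here is a rough back-of-the-envelope sketch. The 1 GB buffer size is a made-up example value, and the calculation ignores latency and any overlap of transfers with compute.

```python
pcie_gb_per_s = 30     # PCI-E figure quoted above
vram_gb_per_s = 500    # lower bound for VRAM bandwidth quoted above
buffer_gb = 1          # hypothetical buffer size, chosen only for illustration

pcie_ms = buffer_gb / pcie_gb_per_s * 1000   # time to move the buffer over PCI-E
vram_ms = buffer_gb / vram_gb_per_s * 1000   # time for one pass over it in VRAM
ratio = pcie_ms / vram_ms

print(f"PCI-E: {pcie_ms:.1f} ms, VRAM: {vram_ms:.1f} ms, ratio: {ratio:.1f}x")
```

With these numbers a single PCI-E copy costs about as much as 16 to 17 passes over the same data in VRAM, which is why keeping the data on the GPU matters.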
Great to see that!
I expect huge improvements in that area with AMD's new RX series with SAM activated [0].
[0]: https://www.amd.com/en/technologies/smart-access-memory
Bluestein's algorithm, typically used for arbitrary prime sizes, requires both zero-padding and convolution support, which are already efficiently implemented, so it is also not completely out of reach.
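For readers who have not seen it, here is a minimal pure-Python sketch of how Bluestein's algorithm turns an arbitrary-length DFT into a zero-padded convolution computed with power-of-two FFTs. It is not VkFFT code; the function names `fft_pow2` and `bluestein_dft` are made up for this illustration, and the recursive FFT is written for clarity rather than speed.

```python
import cmath

def fft_pow2(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_pow2(x[0::2], inverse)
    odd = fft_pow2(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def bluestein_dft(x):
    """Forward DFT of any length n, using j*k = (j^2 + k^2 - (k-j)^2) / 2."""
    n = len(x)
    # Chirp w[k] = exp(-i*pi*k^2/n); k*k is reduced mod 2n to keep angles small.
    w = [cmath.exp(-1j * cmath.pi * ((k * k) % (2 * n)) / n) for k in range(n)]
    # Smallest power of two that avoids wrap-around in the circular convolution.
    m = 1
    while m < 2 * n - 1:
        m *= 2
    # Zero-pad the chirp-multiplied input to length m.
    a = [x[k] * w[k] for k in range(n)] + [0j] * (m - n)
    # Kernel conj(w[|d|]) for d in -(n-1)..(n-1); negative d wraps to the end.
    b = [0j] * m
    for k in range(n):
        b[k] = w[k].conjugate()
        if k:
            b[m - k] = w[k].conjugate()
    # Circular convolution via three power-of-two FFTs.
    fa, fb = fft_pow2(a), fft_pow2(b)
    conv = fft_pow2([p * q for p, q in zip(fa, fb)], inverse=True)
    # Divide by m to normalize the inverse FFT, then apply the output chirp.
    return [w[k] * conv[k] / m for k in range(n)]
```

Calling `bluestein_dft` on a length-7 list, for example, should agree with a direct O(n^2) DFT up to floating-point error, even though every FFT inside runs at length 16.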
However, the Vulkan API, which is also supported by most vendors (except Apple, where you have to use a compatibility layer called MoltenVK), is gaining traction in the compute sector. If you trust the benchmarks, then this library is showing that you can get similar performance out of Vulkan compute to what you would expect from CUDA. It is just that this library only provides a very small fraction of the features of the CUDA ecosystem, so the Vulkan compute ecosystem still has a lot of catching up to do.
Edit: In case it is not obvious from the title, the library is used to calculate the Fast Fourier transform: https://en.wikipedia.org/wiki/Fast_Fourier_transform
I think this view is too pessimistic. In fact, support either gets better (Intel oneAPI, Microsoft CLonD3D12, AMD ROCm, Mesa NIR-clover, …) or is unchanged but still maintained (NVIDIA). Moreover, Khronos noticed that OpenCL 2.x was a dead end and had to start over from a point that all vendors could agree on.
Again, I know no C++ and a very limited amount of C, so don’t put too much value in my comment. You seem to be very fond of switch statements; consider not stuffing too much code into each case. It makes the flow hard to follow. Break the case code into functions and call those.
You have a switch with 40 cases to load the SPIR-V. I feel like there’s a better way to deal with that. Maybe just a strict naming convention, so that having the ID is enough to locate the file.
Impressive work in any case.
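To illustrate the naming-convention idea, here is a small sketch. It is written in Python for brevity even though the library itself is C, and everything in it is hypothetical: the `shaders` directory, the `vkfft_NN.spv` file pattern, and the helper names are invented for the example and do not reflect the project's actual layout.

```python
from pathlib import Path

SHADER_DIR = Path("shaders")   # hypothetical directory holding compiled SPIR-V
SHADER_COUNT = 40              # number of switch cases mentioned above

def shader_path(shader_id: int) -> Path:
    """Derive the SPIR-V file path from the numeric ID alone."""
    if not 0 <= shader_id < SHADER_COUNT:
        raise ValueError(f"unknown shader id: {shader_id}")
    # Hypothetical naming scheme: zero-padded ID embedded in the file name.
    return SHADER_DIR / f"vkfft_{shader_id:02d}.spv"

def load_spirv(shader_id: int) -> bytes:
    """Read the SPIR-V binary for the given ID (raises if the file is missing)."""
    return shader_path(shader_id).read_bytes()
```

The same pattern carries over to C by formatting the file name with `snprintf` instead of enumerating every case by hand.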
With Vulkan you write a pipeline, and eventually draw a triangle.
OpenGL "hello triangle" is short only if you cut corners. If you do it the way you'd do in a production app, you're not that far off from the lines of code it takes to do it in Vulkan. It's still less, but on the same order of magnitude.
You wouldn't use Vulkan to render a single triangle for the same reasons why you wouldn't use a helicopter to get a bottle of milk from your local shop.
If you don't need low level control, you should be using middleware or a full blown engine (Godot, Unity, Unreal, etc).
Debugging your first segfault will also make you cry, but it's good for you. It builds character, and prepares you for the more insidious segfaults that are lurking out in the tall grasses.
This kind of "hardcore requirements to render a single triangle" is what is needed if you wish to have AAA game titles with realistic graphics at 60fps as it gives developers the freedom to do anything they wish.
More freedom -> better graphics at higher performance but more difficult to draw a single triangle.
Less freedom -> easier for beginners to draw a triangle but that is irrelevant in practice because that is not what those APIs are for.
Recent versions of OpenGL are more difficult. But compared to OpenGL, Vulkan is in an entirely different class of difficulty.