I don’t think it has much to do with the case of an algorithm that offers a faster solution to a problem that is rarely a bottleneck (not sure if that’s true in this case anyway).
And this algorithm has low constant costs, and does not take dramatically more icache than the simple versions. There is no reason not to use this if your compile target can handle avx-512.
I doubt CPU cycles are problem in that case.
That number looked familiar, and yep it's the taxicab number. Coincidence? Neither of the two references seems to mention it
However, 1728 isn't the minimum possible value for K. With more precise estimates, sketched in the paper, we can bring it down to 8 + ε. So assuming that this finer estimate works, using a 9 dimensional Fourier transform would also make the algorithm be O(n log(n)).
Nonetheless, there are plenty of people who advocate the use of JSON, XML and similar formats, in which case I assume that number conversions can take a non-negligible time, which might be decreased by such fast algorithms.