http://www.adapteva.com/white-papers/using-a-scalable-parall...
Corner turns for 2D FFTs are usually quite challenging for GPUs and CPUs.[ref] Yaniv, our DSP guru, completed the corner-turn part of the algorithm with ease in a couple of days, and the on-chip data movement makes up a very small portion of the total application wall time. (The source code is published as well, if you really want to dig.)
It's hard to market FFT cycle counts to a general audience :-)
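
For anyone unfamiliar with the term: a corner turn is the matrix transpose that sits between the row pass and the column pass of a 2D FFT. Here is a minimal sketch in plain Python of where it fits. It is not the published Epiphany code; the function names are made up for illustration, and a naive O(n^2) DFT stands in for a real FFT kernel.

```python
import cmath

def dft(xs):
    """Naive O(n^2) 1D DFT; a stand-in for a real FFT kernel."""
    n = len(xs)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(xs))
        for k in range(n)
    ]

def corner_turn(m):
    """Transpose a list-of-rows matrix so that columns become rows."""
    return [list(col) for col in zip(*m)]

def dft2d(m):
    """2D DFT as: row transforms, corner turn, row transforms, corner turn."""
    rows = [dft(r) for r in m]        # pass 1: transform every row
    turned = corner_turn(rows)        # corner turn: columns are now rows
    cols = [dft(r) for r in turned]   # pass 2: transform the former columns
    return corner_turn(cols)          # turn back to the original orientation

if __name__ == "__main__":
    result = dft2d([[1, 2, 3, 4], [5, 6, 7, 8]])
    print(result[0][0])  # DC term: the sum of all input samples
```

The transforms themselves only ever touch contiguous rows. All of the strided, cache-unfriendly access is concentrated in `corner_turn`, which is why that step tends to dominate on machines where data movement is expensive.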