This was our thought process:
We have received a lot of negative feedback regarding this number, so we want to explain the meaning and motivation. A single number can never characterize the performance of an architecture. The only thing that really matters is how many seconds and how many joules YOUR application consumes on a specific platform.
Still, we think multiplying the core frequency (700 MHz) by the number of cores (64) is as good a metric as any. As a comparison point, the theoretical peak GFLOPS number often quoted for GPUs is really only reachable if you have an application with significant data parallelism and limited branching. Other numbers used in the past by processor vendors include: peak GFLOPS, MIPS, Dhrystone scores, CoreMark scores, SPEC scores, Linpack scores, etc. Taken by themselves, datasheet specs mean very little. We have published all of our data and manuals, and we hope it's clear what our architecture can do. If not, let us know how we can convince you.
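For what it's worth, the arithmetic behind both figures is trivial; here is a minimal sketch of the disputed metric next to a conventional peak-GFLOPS estimate (the flops-per-cycle value is an assumption for illustration, not a figure from the post):

```python
# The "frequency x cores" figure debated above, next to a peak-GFLOPS
# estimate of the kind usually quoted for GPUs.
core_freq_ghz = 0.7   # 700 MHz, from the post
num_cores = 64        # from the post

# The disputed marketing metric: clock rate times core count.
freq_times_cores = core_freq_ghz * num_cores
print(freq_times_cores)   # 44.8 "GHz"

# ASSUMPTION: one fused multiply-add (2 flops) per core per cycle.
# The real per-core figure depends on the architecture's datapath.
flops_per_cycle = 2
peak_gflops = core_freq_ghz * num_cores * flops_per_cycle
print(peak_gflops)        # 89.6 GFLOPS, reachable only with ideal parallelism
```

Either number is a ceiling set by the datapath, not a prediction of application performance.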
That said, I still think that the GHz stat is just about as BAD a metric as any (I suppose "pin count times # of cores" would be worse :-). About the only positive inference I can draw from this is that you have the thermal situation in your system under control.
But piling up cores and cooling them is, IMHO, one of the easiest parts of designing a massively parallel system. The interesting part of the design is the interconnections between the cores, and any metric that multiplies single core performance by number of cores tells me nothing about that.
So not only am I not learning a key part of the performance characteristics of your system, but by omitting it, you make me wonder whether the ENGINEERING of the system might be similarly misguided on this aspect as the MARKETING seems to be (i.e. does marketing omit this aspect of the system because it was not important to the engineers either?).
Linpack at least has benchmarks both for showing off the cores in nearly independent scenarios, and for showing the system when actual communication has to take place. Obviously, each parallel application is different, but it would at least give ONE indication of performance in situations that are not embarrassingly parallel (http://en.wikipedia.org/wiki/Embarrassingly_parallel).
http://www.adapteva.com/white-papers/using-a-scalable-parall...
Corner turns for 2D FFTs are usually quite challenging for GPUs and CPUs.[ref] Yaniv, our DSP guru, completed the corner turn part of the algorithm with ease in a couple of days, and the on-chip data movement constitutes a very small portion of the total application wall time. (Complete source code is published as well, if you really want to dig.)
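For readers unfamiliar with the term: a 2D FFT is done as 1D FFTs along rows, a "corner turn" (distributed transpose), then 1D FFTs along the former columns. The turn is the communication-heavy step, because every core must exchange a tile with every other core. A minimal pure-Python sketch of that exchange, with a hypothetical core count (this is NOT the Epiphany implementation linked above):

```python
# Sketch of a corner turn: each "core" owns a band of rows; the turn is an
# all-to-all tile exchange that makes columns locally contiguous.
NUM_CORES = 4            # hypothetical, for illustration only
N = 8                    # N x N matrix, N divisible by NUM_CORES
R = N // NUM_CORES       # rows per core

# Global matrix, distributed by row bands: core i holds rows i*R .. i*R+R-1.
matrix = [[r * N + c for c in range(N)] for r in range(N)]
cores = [matrix[i * R:(i + 1) * R] for i in range(NUM_CORES)]

def corner_turn(cores):
    """All-to-all exchange: core i sends core j the tile of its rows that
    falls in core j's column band, receiving the matching tile in return."""
    turned = [[[0] * N for _ in range(R)] for _ in range(NUM_CORES)]
    for i in range(NUM_CORES):        # sending core
        for j in range(NUM_CORES):    # receiving core
            for r in range(R):
                for c in range(R):
                    # global element (i*R + r, j*R + c) lands at
                    # (j*R + c, i*R + r) after the transpose
                    turned[j][c][i * R + r] = cores[i][r][j * R + c]
    return turned

turned = corner_turn(cores)
# Core 0 now holds the first R *columns* of the matrix as its rows:
assert turned[0][0] == [r * N for r in range(N)]   # original column 0
```

Every core touches every other core in that double loop, which is why interconnect design, not raw core count, dominates this step.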
It's hard to market FFT cycle counts to the general audience. :-)