Sony gave you 6 of the 8 SPE cores to use (I think one was disabled for yield and one was reserved for the OS, but it's been ages). They are indeed very fast; however, they have no cache-coherent access to main RAM and only 256k of local memory each. So you have to meticulously write DMA scheduling code to keep them fed. If you're a simpleton like me, you double buffer your SPE memory, cutting it in half: 128k to work with, 128k for paging into, and you hope the paging is done before the data is needed. Latency to memory is on the order of 2,000 cycles to the first byte, but after that the bytes arrive fast.
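For anyone who hasn't done this, here is a rough sketch of the double-buffering loop in plain C. It is not real SPE code: `dma_get` and `dma_wait` are hypothetical stand-ins (a `memcpy` and a no-op) for an asynchronous DMA request and its completion wait, and `stream_sum` is just a toy workload. The sizes follow the 128k/128k split, although on real hardware your code and stack also have to fit in the same 256k.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed sizes: a 256k local store split into two 128k halves. */
#define LOCAL_STORE_BYTES (256 * 1024)
#define HALF_BYTES (LOCAL_STORE_BYTES / 2)
#define CHUNK_FLOATS (HALF_BYTES / sizeof(float))

static float local_store[2][CHUNK_FLOATS];

/* Hypothetical stand-in for an asynchronous DMA get. Real hardware would
 * return immediately and finish later; memcpy finishes before returning. */
static void dma_get(float *dst, const float *src, size_t count) {
    assert(count <= CHUNK_FLOATS);
    memcpy(dst, src, count * sizeof(float));
}

/* Hypothetical stand-in for blocking until a buffer's transfer is done. */
static void dma_wait(int buffer_index) {
    (void)buffer_index; /* nothing to wait for in this simulation */
}

/* Number of floats in the chunk that starts at offset. */
static size_t chunk_len(size_t total, size_t offset) {
    size_t left = total - offset;
    return left < CHUNK_FLOATS ? left : CHUNK_FLOATS;
}

/* Sum an array in "main RAM" by streaming it through the two buffers:
 * start the fetch of the next chunk, then crunch the current one. */
double stream_sum(const float *main_ram, size_t total) {
    double sum = 0.0;
    size_t offset = 0;
    int cur = 0;

    if (total == 0) {
        return 0.0;
    }

    size_t cur_len = chunk_len(total, 0);
    dma_get(local_store[cur], main_ram, cur_len); /* prime the first buffer */

    while (cur_len > 0) {
        size_t next_off = offset + cur_len;
        size_t next_len = next_off < total ? chunk_len(total, next_off) : 0;
        int next = cur ^ 1;

        if (next_len > 0) {
            /* Kick off paging into the idle half before doing any work. */
            dma_get(local_store[next], main_ram + next_off, next_len);
        }

        dma_wait(cur); /* make sure the half we are about to read has landed */
        for (size_t i = 0; i < cur_len; i++) {
            sum += local_store[cur][i];
        }

        offset = next_off;
        cur_len = next_len;
        cur = next;
    }
    return sum;
}
```

The whole game is that the fetch for the next chunk is issued before the work on the current chunk starts, so with a truly asynchronous transfer the 2,000-cycle latency hides behind the compute.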
So, what you do is decompose your problem into data streams that can be crunched through, in such a way that you minimize the need for random access to memory. It's often cheaper to recompute things locally than to fetch them from RAM. Random access into main RAM is pointless at that latency, so you have to marshal all your input into DMA buffers, do some work, marshal all your output into other DMA buffers, and send those back to the host CPU.
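Roughly, the marshal / work / marshal shape looks like the sketch below. Everything in it is made up for illustration: `Vec3`, `marshal_input`, `kernel_transform` and `marshal_output` are hypothetical names, and the scale-and-offset kernel is only a placeholder for real skinning math. The point is that the scattered, index-driven accesses happen once on the host side, and the kernel only ever walks contiguous buffers.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float x, y, z;
} Vec3;

/* Host side: gather randomly indexed vertices into one contiguous
 * input stream that could be handed over as a DMA buffer. */
void marshal_input(const Vec3 *pool, const unsigned *indices, size_t n,
                   Vec3 *stream) {
    assert(pool && indices && stream);
    for (size_t i = 0; i < n; i++) {
        stream[i] = pool[indices[i]];
    }
}

/* Kernel: reads and writes strictly in order, no random access.
 * Scale-then-offset is a placeholder for the real per-vertex work. */
void kernel_transform(const Vec3 *in, Vec3 *out, size_t n, float scale,
                      Vec3 offset) {
    for (size_t i = 0; i < n; i++) {
        out[i].x = in[i].x * scale + offset.x;
        out[i].y = in[i].y * scale + offset.y;
        out[i].z = in[i].z * scale + offset.z;
    }
}

/* Host side: scatter the contiguous output stream back to where the
 * vertices originally lived. */
void marshal_output(const Vec3 *stream, const unsigned *indices, size_t n,
                    Vec3 *pool) {
    assert(pool && indices && stream);
    for (size_t i = 0; i < n; i++) {
        pool[indices[i]] = stream[i];
    }
}
```

Passing `scale` and `offset` by value instead of looking them up per vertex is a small instance of keeping everything the kernel needs local rather than fetching it.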
Anyhow, I got this working. Meshes were being skinned at a very high rate, but it was very frustrating. The PPE was really slow, so you had to offload as much as you could to those SPEs. But hey, I may be complaining, but it sure beats dealing with the "Emotion Engine" on the PS2. I can tell you which emotion that engine brings up.