APL (and J by extension) is trickier to parallelise than you might expect. The frequent reliance on boxing leads to irregular pointer structures, and the absence of compile-time type information makes it hard to generate code at all. APL implementations usually get their speed from efficient implementations of individual primitives, but that granularity is too fine for bandwidth-starved devices such as GPUs, since each primitive makes its own pass over memory. I contributed to an APL-to-GPU compiler[0], and it was hard to make it work on more than a small (well-behaved) subset.
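To make the granularity point concrete, here is a rough sketch in Python (not APL, and not how apltail works internally). It counts element loads and stores for a dot product, `+/ x × y`, evaluated one primitive at a time and as a single fused loop. The function names and the one-unit-per-access cost model are made up for illustration.

```python
# Hypothetical cost model: every array-element load or store counts as one
# unit of memory traffic. Scalars such as the running total are assumed to
# live in a register and cost nothing.

def primitive_at_a_time(x, y):
    """Evaluate +/ x × y as two separate primitives, each its own pass."""
    traffic = 0
    # Primitive 1: elementwise multiply, materialising a temporary array.
    tmp = []
    for a, b in zip(x, y):
        tmp.append(a * b)
        traffic += 3  # load a, load b, store the product
    # Primitive 2: plus-reduction over the temporary.
    total = 0
    for t in tmp:
        total += t
        traffic += 1  # load the product back
    return total, traffic


def fused(x, y):
    """Evaluate the same expression in one pass with no temporary array."""
    total = 0
    traffic = 0
    for a, b in zip(x, y):
        total += a * b
        traffic += 2  # load a, load b
    return total, traffic
```

Under this model the unfused version touches 4n elements and the fused one 2n, so on a device limited by memory bandwidth the primitive-at-a-time version does about twice the traffic for the same arithmetic. Longer chains of primitives widen the gap, which is why a compiler has to fuse across primitives instead of just calling fast ones.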
[0]: https://github.com/melsman/apltail