Yes, roughly single-threaded async, with each core running a disjoint subset of the workload on data private to that core. You can’t beat the operation throughput. The software architecture challenge is shedding load between cores, since a naive design will hotspot under real workloads. Fortunately, smoothly and dynamically shedding load across cores with minimal overhead and latency is a solved design problem.
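To make the load-shedding idea concrete, here is a minimal sketch in Python. It is a toy model and not necessarily the design referred to above. It assumes the workload is split into many small shards, each owned by exactly one core, and that a hot core sheds load by handing whole shards to the coolest core. The names `Core`, `rebalance`, `skewed_example`, and the `tolerance` parameter are all hypothetical, and the greedy policy is just one simple choice.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Toy model of one core that privately owns a disjoint set of shards."""
    core_id: int
    shards: dict[int, int] = field(default_factory=dict)  # shard_id -> observed load

    @property
    def load(self) -> int:
        return sum(self.shards.values())


def rebalance(cores: list[Core], tolerance: float = 1.1) -> list[tuple[int, int, int]]:
    """Greedily move whole shards from the hottest core to the coolest one.

    Returns the (shard_id, from_core, to_core) moves that were made. Stops when
    the hottest core is within `tolerance` of the mean load, or when no single
    shard move from the hottest core would reduce the imbalance.
    """
    moves = []
    mean = sum(c.load for c in cores) / len(cores)
    while True:
        hot = max(cores, key=lambda c: c.load)
        cold = min(cores, key=lambda c: c.load)
        if hot.load <= mean * tolerance:
            break
        gap = hot.load - cold.load
        # Only a shard smaller than the gap narrows the imbalance when moved.
        candidates = [(load, sid) for sid, load in hot.shards.items() if 0 < load < gap]
        if not candidates:
            break
        _, sid = max(candidates)
        # In a real system this would be an ownership handoff message between
        # cores; here it is a plain dictionary move (assumption of the toy model).
        cold.shards[sid] = hot.shards.pop(sid)
        moves.append((sid, hot.core_id, cold.core_id))
    return moves


def skewed_example() -> list[Core]:
    """Four cores where core 0 starts with most of the load (made-up numbers)."""
    cores = [Core(i) for i in range(4)]
    cores[0].shards = {sid: 10 for sid in range(8)}
    next_id = 8
    for core in cores[1:]:
        core.shards = {next_id: 5, next_id + 1: 5}
        next_id += 2
    return cores
```

Calling `rebalance(skewed_example())` on this made-up input starts from loads of 80/10/10/10 and ends at 30/30/30/20 after five shard moves. The point of the sketch is that the finer the shards, the closer a simple policy like this can get to an even spread, and data never needs to be shared between cores, only handed over.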
It works pretty well for ordinary heavy crunch code too. I originally designed code like this on supercomputers. You do need a practical mechanism for efficiently decomposing loads at a fine granularity, but it is rare to see a compute-bound application that can’t be decomposed that way.