LLMs are worse at many things than human programmers, so you have to try to compensate by leveraging the things they're better at. Don't give up with "they're bad at such and such" until you've tried using their strengths.
If you could run N tests in parallel, then you could probably also run the components of one test in parallel and keep it from taking 2 hours in the first place.
To me this all sounds like snake oil to convince people to do something they were already doing, but by also spinning up N times as many compute instances and run a burn endless tokens along the way. And by the time it's demonstrated that it doesn't really offer anything more than doing it yourself, well you've already given them all of your money so their job is done.
In one of the systems (supply chain SaaS) we invested so much effort in having good tests in a simulated environment, that we could run full-stack tests at kHz. Roughly ~5k tests per second or so on a laptop.
Say you're Human A, working on a feature. Running the full testing suite takes 2 hours from start to finish. Every change you do to existing code needs to be confirmed to not break existing stuff with the full testing suite, so some changes it takes 2 hours before you have 100% understanding that it doesn't break other things. How quickly do you lose interest, and at what point do you give up to either improve the testing suite, or just skip that feature/implement it some other way?
Now say you're Robot A working on the same task. The robot doesn't care if each change takes 2 hours to appear on their screen, the context is exactly the same, and they're still "a helpful assistant" 48 hours later when they still try to get the feature put together without breaking anything.
If you're feeling brave, you start Robot B and C at the same time.
No one really cares about improving test times. Everyone either suffers in private or gets convinced it's all normal and look at you weird when you suggest something needs to be done.
But AI will do a pretty decent job of telling you which tests are most likely to fail on a given PR. Just run those ones, then commit. Cuts your test time from hours down to seconds.
Then run the full test suite only periodically and automatically bisect to find out the cause of any regressions.
Dramatically cuts the compute costs of tests too, which in big codebase can easily become whole-engineers worth of costs.