OK, so each instance takes a "reasonable" amount of time, but the whole structure optimization is taking hours? If you can get a setup with MPI and/or CUDA you may be able to cut the per-instance runtime, but if the optimization is simply slow to converge (it needs a lot of steps), that won't help much. If per-instance runtime is the bottleneck, then a higher-end workstation or a small cluster could get you there. Cloud could be an option here too, if you can aggregate the compute resources.
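To make that distinction concrete, here's a rough back-of-the-envelope sketch. The function name and all the numbers are made up for illustration, not taken from your run; it just assumes total wall time is roughly (number of optimization steps) x (time per step), and that parallel hardware only shrinks the second factor.

```python
def estimate_total_hours(n_steps, seconds_per_step, speedup=1.0):
    """Rough wall time in hours: n_steps * seconds_per_step / speedup.

    `speedup` is the assumed per-step speedup from MPI/CUDA (hypothetical).
    """
    return n_steps * seconds_per_step / speedup / 3600.0


# Hypothetical numbers, purely for illustration.
baseline = estimate_total_hours(n_steps=200, seconds_per_step=90)
faster_steps = estimate_total_hours(n_steps=200, seconds_per_step=90, speedup=4)
fewer_steps = estimate_total_hours(n_steps=50, seconds_per_step=90)

print(f"baseline:             {baseline:.2f} h")
print(f"4x faster per step:   {faster_steps:.2f} h")
print(f"4x fewer steps:       {fewer_steps:.2f} h")
```

If you plug in your own step count and per-step time, it should give you a feel for whether more hardware or a better starting geometry/optimizer setup is the bigger lever.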
I'm not as familiar with GPAW, but if you're able to share a minimal example I could try to take a look.