How do we scale this up when these audio models have their "stable diffusion moment" (thanks to simonw for the phrase)?