Phil's homepage [1] links to a form [2] where you can suggest a paper for him to implement.
He is also the creator of ThisPersonDoesNotExist.com.
Of course, it's good work, and knowing lucidrains' trajectory it's probably going to be implemented in the coming days/weeks. But I wonder how many people have at least opened the link before upvoting it.
I'm pretty sure soon enough we'll start seeing the same kind of dynamics in music that have already played out in the visual-arts community, not that the dust has settled there yet. I hope there isn't much negative financial impact on people's livelihoods, but maybe some will be unavoidable. And of course, AI is also coming for programmers' jobs, which will hit even closer to home. The next decade will be "interesting", so to speak.
However, there are many models which do output midi. That's actually much simpler, and was already done a few years ago.
I thought OpenAI did this. But then, I might misremember, because their Jukebox actually also seems to produce raw audio (https://openai.com/blog/jukebox/).
Edit: Ah, it was even earlier, OpenAI MuseNet, this: https://openai.com/blog/musenet/
In fact, midi generation is simple enough that you even find it in tutorials: https://www.tensorflow.org/tutorials/audio/music_generation
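Part of why midi is so much easier than raw audio is that a model only has to emit a short list of symbolic note events, and writing those out as a .mid file is trivial. A minimal sketch using only the Python stdlib (the notes and timings here are arbitrary illustration, not from any model):

```python
import struct

def varlen(n):
    """Encode an integer as a MIDI variable-length quantity."""
    out = bytearray([n & 0x7F])
    n >>= 7
    while n:
        out.insert(0, 0x80 | (n & 0x7F))
        n >>= 7
    return bytes(out)

def midi_bytes(notes, ticks_per_beat=480):
    """Serialize (pitch, duration_in_ticks) pairs into a format-0 MIDI file."""
    track = bytearray()
    for pitch, dur in notes:
        track += varlen(0) + bytes([0x90, pitch, 64])   # note-on, channel 0
        track += varlen(dur) + bytes([0x80, pitch, 0])  # note-off after dur ticks
    track += varlen(0) + bytes([0xFF, 0x2F, 0x00])      # end-of-track meta event
    header = b"MThd" + struct.pack(">IHHH", 6, 0, 1, ticks_per_beat)
    return header + b"MTrk" + struct.pack(">I", len(track)) + bytes(track)

# C-E-G, one beat each; write the result to disk and any player will open it.
data = midi_bytes([(60, 480), (64, 480), (67, 480)])
```

A whole "song" here is a few dozen bytes, versus megabytes of waveform for the same few seconds of raw audio, which is the scale difference the tutorials exploit.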
You could train a model that could, but these models can’t.
Paper: https://google-research.github.io/seanet/musiclm/examples/
Quote: “By relying on pretrained and frozen MuLan, we need audio-only data for training the other components of MusicLM. We train SoundStream and w2v-BERT on the Free Music Archive (FMA) dataset (Defferrard et al., 2017), whereas the tokenizers and the autoregressive models for the semantic and acoustic modeling stages are trained on a dataset containing five million audio clips, amounting to 280k hours of music at 24 kHz.”
Tldr: you can only get out of these models what you put in, and these ones are trained on raw audio.
If you want midi output, you need to train a model on midi data.
Won't training the model be a lot of cost to bear, though?
https://github.com/lucidrains/musiclm-pytorch/blob/main/musi...
i assume there's only a superficial description of the architecture, and no weights to load in, so you'll have to train everything from scratch? do we even have their dataset?
[1]: https://github.com/lucidrains/denoising-diffusion-pytorch