Phil's homepage [1] links to a form [2] where you can suggest a paper for him to implement.
He is also the creator of ThisPersonDoesNotExist.com.
Of course, it's good work, and knowing lucidrains' trajectory it's probably going to be implemented in the coming days/weeks. But I wonder how many people have at least opened the link before upvoting it.
I'm pretty sure soon enough we'll start seeing the same kind of dynamics in music that have already played out in the visual-arts community, not that the dust has settled there yet. I hope there isn't much negative financial impact on people's livelihoods, but maybe some will be unavoidable. And of course, AI is also coming for programmers' jobs, which will hit even closer to home. The next decade will be "interesting", so to speak.
However, there are many models which do output midi. That's actually much simpler, and was already done a few years ago.
I thought OpenAI did this. But then, I might misremember, because their Jukebox actually also seems to produce raw audio (https://openai.com/blog/jukebox/).
Edit: Ah, it was even earlier, OpenAI MuseNet, this: https://openai.com/blog/musenet/
In fact, midi generation is simple enough that you even find it in tutorials: https://www.tensorflow.org/tutorials/audio/music_generation
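Part of why midi is so much easier than raw audio is that a model only has to emit a short list of symbolic note events, and writing those out as a .mid file is trivial. A minimal sketch using only the Python stdlib (the notes and timings here are arbitrary illustration, not from any model):

```python
import struct

def varlen(n):
    """Encode an integer as a MIDI variable-length quantity."""
    out = bytearray([n & 0x7F])
    n >>= 7
    while n:
        out.insert(0, 0x80 | (n & 0x7F))
        n >>= 7
    return bytes(out)

def midi_bytes(notes, ticks_per_beat=480):
    """Serialize (pitch, duration_in_ticks) pairs into a format-0 MIDI file."""
    track = bytearray()
    for pitch, dur in notes:
        track += varlen(0) + bytes([0x90, pitch, 64])   # note-on, channel 0
        track += varlen(dur) + bytes([0x80, pitch, 0])  # note-off after dur ticks
    track += varlen(0) + bytes([0xFF, 0x2F, 0x00])      # end-of-track meta event
    header = b"MThd" + struct.pack(">IHHH", 6, 0, 1, ticks_per_beat)
    return header + b"MTrk" + struct.pack(">I", len(track)) + bytes(track)

# C-E-G, one beat each; write the result to disk and any player will open it.
data = midi_bytes([(60, 480), (64, 480), (67, 480)])
```

A whole "song" here is a few dozen bytes, versus megabytes of waveform for the same few seconds of raw audio, which is the scale difference the tutorials exploit.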
You could train a model that could, but these models can’t.
Paper: https://google-research.github.io/seanet/musiclm/examples/
Quote: “By relying on pretrained and frozen MuLan, we need audio-only data for training the other components of MusicLM. We train SoundStream and w2v-BERT on the Free Music Archive (FMA) dataset (Defferrard et al., 2017), whereas the tokenizers and the autoregressive models for the semantic and acoustic modeling stages are trained on a dataset containing five million audio clips, amounting to 280k hours of music at 24 kHz.”
Tldr: you can only get out of these models what you put in, and these ones are trained on raw audio.
If you want midi output, you need to train a model on midi data.
Won't training the model be a lot of cost to bear, though?
https://github.com/lucidrains/musiclm-pytorch/blob/main/musi...
i assume there's only a superficial description of the architecture, and no weights to load in, so you'll have to train everything from scratch? do we even have their dataset?
[1]: https://github.com/lucidrains/denoising-diffusion-pytorch