I definitely was not aware Spotify DRM had been cracked to enable downloading at scale like this.
The thing is, this doesn't even seem particularly useful for average consumers/listeners: Spotify itself is so convenient, and trying to locate individual tracks in massive torrent files of presumably tens of thousands of tracks each sounds horrible.
But this does seem like it will be a godsend for researchers working on things like music classification and generation. The only thing is, you can't really publicly admit exactly what dataset you trained/tested on...?
Definitely wondering if this was in response to desire from AI researchers/companies who wanted this stuff. Or if the major record labels already license their entire catalogs for training purposes cheaply enough, so this really is just solely intended as a preservation effort?
I wouldn’t be so sure. There are already tools to automatically locate and stream pirated TV and movie content on demand. They’re so common that I had non-technical family members bragging at Thanksgiving about how they bought a box at their local Best Buy that has an app which plays any movie or TV show they want on demand without paying anything. They didn’t understand what was happening, but they said it worked great.
> Definitely wondering if this was in response to desire from AI researchers/companies who wanted this stuff.
The Anna’s Archive group is ideologically motivated. They’re definitely not doing this for AI companies.
More serious response: research is explicitly included in fair use protections in US copyright law. News organizations regularly use leaked / stolen copyrighted material in investigative journalism.
Are you aware Anna’s Archive already solved the exact same problem with books?
I can imagine this making it wayyy easier to build something like Lidarr but for individual tracks instead of albums.
It's probably going to make the AI music generation problem worse anyway...
Can you imagine your favorite playlist needing to swap among 10 apps, each requiring a $10/month subscription?
it's an archive to defend against Spotify going away. Remember when Netflix had everything, and then that eroded and now you can only rely on stuff that Netflix produced itself?
the average consumer will flock when Spotify ultimately enshittifies
Didn't Meta already publicly admit they trained their current models on pirated content? They're too big to fail. I look forward to my music Slop.
Largest example: a lot of Russian music is not available on Spotify because of the Russia-Ukraine war and Spotify pulling out of Russia. So they don't have the licenses to a lot of stuff, because that belongs to companies operating within Russia.
What's stopping someone from sticking a microphone next to their speaker?
Slow, but effective.
Do they have DRM at all? Youtube and Pandora don't.
Download the lot to a big NAS and get Claude to write a little frontend with song search and auto playlist recommendations?
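A minimal sketch of the search half of that frontend, assuming an invented `tracks` table and sample rows (the real Anna's Archive metadata dump ships its own schema):

```python
import sqlite3

# Sketch of the "little frontend" idea: song search over a local metadata
# database. Table, columns, and rows here are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (title TEXT, artist TEXT)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?)",
    [("Paranoid Android", "Radiohead"),
     ("Karma Police", "Radiohead"),
     ("Android Porn", "Kraddy")],
)

def search(query: str):
    # SQLite's LIKE is case-insensitive for ASCII, so this is a cheap
    # case-insensitive substring search on the title.
    rows = conn.execute(
        "SELECT title, artist FROM tracks WHERE title LIKE ? ORDER BY title",
        (f"%{query}%",),
    )
    return rows.fetchall()
```

`search("android")` would return both Android tracks; the auto playlist recommendations would sit on top of queries like this one.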
Curious why not? Assuming you only used the metadata. I think they would be considered raw facts and not copyrightable.
For them, 300TB is just cheap
This is not to defend Spotify (death to it), but to state that opening all of this data for even MORE garbage generation is a step in the wrong direction. The right direction would be to heavily legislate around / regulate companies like Spotify to more fairly compensate the musicians who create the works they train their slop generators with.
As a society, we should do our best to preserve this trove.
Yeah. To me it is not really relevant. I actually wasn't using Spotify, and if I need songs I use yt-dlp with YouTube, but even that is becoming increasingly rare. Today's music just doesn't interest me as much, and I have the songs I listen to regularly. I do, however, also listen to music on YouTube in the background; in fact, that is now my primary use case for YouTube, even surpassing watching movies or anything else. (I do use YouTube for getting some news too, though; it is so sad that Google controls this.)
Additionally there was a lot of discourse about music and a lot of curated discovery mechanisms I sorely miss to this day. An algorithm is no replacement for the amount of time and care people put into the web of similar artists, playlists of recommendations and reviews. Despite it being piracy, music consumption through it felt more purposeful. It's introduced me to some of my all time favourite artists, which I've seen live and own records and merchandise of.
There was quality in the technical sense of the audio files, but also in the organization and sourcing of the material, the QA process of the encoding - down to the specific release the audio file was from.
There was quantity, sure, but that was secondary to the quality. The quantity was just a side-effect of the place being known for quality, making it an attractive arena to participate in.
And it also had all the "weird"/non-standard things you don't find on mainstream streaming-services precisely because that is what independent curators are good at and often driven by.
This Anna's release... While in itself impressive in many ways, it does not compare to the things What.CD represented. It's almost the exact opposite:
- focus on most popular content - niche content (even by mainstream Spotify-standards) is not included
- quality is 160kbps OGG files, which is far from lossless; it's not tightly coupled to a release, and as far as audio grading goes, there's no transparent QA process for the content, nor is it available in audiophile fidelity.
This is definitely Apples vs Oranges.
So there’s some way to go for a comprehensive music archive.
while one can compare in terms of number of tracks, the quality used to be on another level altogether. from the article:
> The quality is the original OGG Vorbis at 160kbit/s.
meanwhile the tracker had 16/24-bit flac rips of vinyl, with decent quality control where the track's metadata was verified for any artifacts. for the given quality, one could rip youtube music (maybe not as easily anymore) and achieve a larger scale in a similar quality level.
now if hypothetically tidal had all the music of the world and was accessible this way, then it would be a comparable resource. insane regardless.
I didn't know German providers do this.
- https://de.wikipedia.org/wiki/Clearingstelle_Urheberrecht_im...
- https://netzpolitik.org/2024/cuii-liste-diese-websites-sperr...
It's a DNS-based block, so overriding your default DNS server is enough to circumvent it. I think DNS over HTTPS also works.
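For example, on Linux the override can be as small as pointing /etc/resolv.conf at a public resolver (1.1.1.1 is Cloudflare, 8.8.8.8 is Google); setups managed by systemd-resolved or NetworkManager need the equivalent setting in their own config instead:

```
# /etc/resolv.conf - replace the ISP's (blocking) resolver with public ones
nameserver 1.1.1.1
nameserver 8.8.8.8
```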
Alternative: https://archive.ph/2025.12.21-050644/https://annas-archive.l...
alextud popcorntime
which should trivially yield http://github.com/alextud/PopcornTimeTV, results in anything but that one particular URL in every search engine: Google, Kagi, DuckDuckGo, Bing. They even find a fork of that particular repo, which in turn links back to it, but refuse to show the result I want. Haven't found any DMCA notices. What is going on?
Read an article that was published just 10 years ago, and witness the bit rot as most external links will 404, gone forever.
I think it's worth questioning the value of preserving -everything-, but it seems like if we can, we should.
HN crowd is, of course, biased in the technocratic sense, but you see - everyone seems to actually rejoice at the move.
The closest to remorse is `linhns` and `locusofself` expressing concern about artists getting hurt (not Spotify itself), but locusofself prefaces with "I hate spotify as a company but..."
(disclaimer: this text is NOT LLM generated, I wrote myself a summary of the summary. here's the Claude thread should anyone care https://claude.ai/share/cfc4ca63-2b9e-47ac-a360-202025d1a134)
There is contemporary lost media being created every day because of how we distribute things now. I think in some cases, the intent of the publisher was to literally destroy every copy of the information. I understand the legal arguments for this, but from a spiritual perspective, this is one of the most offensive things I can imagine. Intentionally destroying all copies of a creative work is simply evil. I don't care how you frame it.
Making media effectively lost is not much different in my mind. Is it available if it's sitting on a tape in an Iron Mountain bunker that no one will ever look at again?
> A while ago, we discovered a way to scrape Spotify at scale.
They won’t and shouldn’t divulge the details, but I imagine that would be a fun read!
https://codeberg.org/raphson/music-server/src/branch/main/sp...
If you like the goal and you have even a few hundred GB available on your server, consider "donating" some of that space to seeding the data (music or books). It's absolutely how we can fight the system, even if just a tiny bit. https://annas-archive.org/torrents
If AA goes down, it's not the end of it all, a new one comes back up and the seeders are still there.
They are based in Russia, and Russia currently doesn't cooperate well with the West.
So it is imaginable that, if some people gave Trump quite some money, Anna's takedown could become part of some deal to lift sanctions after a ceasefire in Ukraine, but.. it does not seem like it. I rather suspect more effort in the West to block access to unwanted sites like this. My ISP in Germany is already blocking it.
It may be only ~30 years since webpages emerged, but there are also many young people who are simply too young to have experienced that. There is always generational change; our generation has the opportunity to store more things.
That's why I divide my music into the kind I want to have forever - which I buy on CDs - and dance music that I can live without one day.
https://www.scribd.com/document/56651812/kreitz-spotify-kth1...
That being said it’s no secret Spotify and other streaming services barely pay even popular artists. Artists make money from live shows and merch. The fact that their music is behind a paywall at all could mean they make less money from some lack of exposure.
I do hope one day self-hosting music with an extremely easy setup with torrenting for sourcing is set up again. What I’m talking about exists to some extent, but it’s not trivial for most people.
I've always found it interesting how streaming services have become the de facto music library of record, yet they can and do remove content at will. When Spotify pulled out of Russia, entire catalogs became inaccessible. Physical media and personal archives suddenly matter again in ways we thought were obsolete.
The copyright discussion is complex, but from a pure preservation standpoint, I'm glad someone is doing this work.

> This is by far the largest music metadata database that is publicly available. For comparison, we have 256 million tracks, while others have 50-150 million. Our data is well-annotated: MusicBrainz has 5 million unique ISRCs, while our database has 186 million.
If they truly are on a mission to protect the world's information from disappearing, they should work with MusicBrainz to get this data into it.
Alternatively, it would be amazing if they built a MusicBrainz-like service around it.
In either case, to make the data truly useful, they'd need to solve the problem of how to match the metadata to a fingerprint used to identify the music tracks, assuming that data is not part of the metadata they collected.
The value that MusicBrainz adds is the community editor who spent a few hours going through YouTube videos and wayback machine social links to figure out that Fog (Wellington, NZ, punk/post-punk) and Fog (Auckland, NZ, Post-Punk) are different bands - even if they share a Spotify profile. The editor that hunted down and listened to 5 compilations that have mixed up a radio edit and an original mix of a track, to find out which is which, and separate them in MB and make notes. [these are made up examples]
That's not to imply that these two projects are 'competing', or that the ISRC figure comparison isn't useful and correct. But community database + scraped data is apples and oranges. And a mixed fruit bowl is wonderful.
How is that a problem?
for each track in collection do extract_fingerprint

Even perceived involvement in music piracy puts a much bigger target on their back from far more aggressive actors (RIAA, major labels)
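The fingerprint loop sketched a few lines up is short in practice. This sketch substitutes a plain content hash for the fingerprint; a real pipeline would use an acoustic fingerprint (e.g. Chromaprint/AcoustID, not shown here) so that different encodes of the same recording still match:

```python
import hashlib
from pathlib import Path

def extract_fingerprint(data: bytes) -> str:
    # Stand-in fingerprint: a plain content hash. Acoustic fingerprinting
    # (Chromaprint/AcoustID) would be the real choice, since it survives
    # re-encoding; that dependency is assumed, not shown.
    return hashlib.sha256(data).hexdigest()

def fingerprint_collection(root: str) -> dict:
    # "for each track in collection do extract_fingerprint"
    return {
        str(path): extract_fingerprint(path.read_bytes())
        for path in sorted(Path(root).rglob("*.ogg"))
    }
```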
The data will be released in different stages on our Torrents page:
[X] Metadata (Dec 2025)
[ ] Music files (releasing in order of popularity)
[ ] Additional file metadata (torrent paths and checksums)
[ ] Album art
[ ] .zstdpatch files (to reconstruct original files before we added embedded metadata)
> We're curious about the peaks at whole minutes (particularly 2:00, 3:00, 4:00). If you know why this is, please let us know!
As a hobby video/audio editor: people will start with their track taking up a preset amount of time and fill it up - even if it means having some dead space at the end.
The other alternative is algorithmically created music.
So you might see a lot of anchoring just like YouTube videos kept stretching to almost exactly ten minutes?
The best metadata I've found, though, is the MySpace Dragon Hoard: https://archive.org/details/myspace_dragon_hoard_2010
That included the artist location, allowing me to tag songs based on their country. I then created playlists such as "NERAS" (Non-English Rock Artist Sample), where the single most popular song for each artist was chosen, but only when the country of origin was not English-speaking and the genre was Rock. I like listening to music while working, but English lyrics distract me because I understand what they're saying.
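That selection logic fits in a few lines. In this sketch the track data, play counts, and the country list are all made-up samples, not the real MySpace metadata:

```python
# Sketch of the "NERAS" selection described above: keep each artist's single
# most popular track, but only for Rock artists from non-English-speaking
# countries. All data below is invented for illustration.
ENGLISH_SPEAKING = {"US", "GB", "AU", "NZ", "CA", "IE"}

tracks = [
    {"artist": "Rammstein", "country": "DE", "genre": "Rock", "title": "Du Hast", "plays": 900},
    {"artist": "Rammstein", "country": "DE", "genre": "Rock", "title": "Sonne", "plays": 700},
    {"artist": "Oasis", "country": "GB", "genre": "Rock", "title": "Wonderwall", "plays": 999},
    {"artist": "Mana", "country": "MX", "genre": "Rock", "title": "Rayando el Sol", "plays": 500},
]

def neras(tracks):
    best = {}
    for t in tracks:
        # Skip non-Rock tracks and artists from English-speaking countries.
        if t["genre"] != "Rock" or t["country"] in ENGLISH_SPEAKING:
            continue
        current = best.get(t["artist"])
        # Keep only the most-played track per artist.
        if current is None or t["plays"] > current["plays"]:
            best[t["artist"]] = t
    return sorted(best.values(), key=lambda t: t["artist"])
```

With the sample data above, Oasis is filtered out (GB) and only the top Rammstein track survives.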
After discovering music via the MySpace archive, I've since purchased 73 songs from 35 artists that I'd never heard of before digging into the data. I tried rebuilding my playlist on Spotify, but got greyed-out tracks, and on YouTube Music, but got "unavailable video" errors. So I still prefer purchasing tracks via the iTunes Music Store, Qobuz, Bandcamp, and 7digital.
Other data sources such as the MP3.com rescue barge, PureVolume archive, and Anna's Spotify archive lack the country-of-origin metadata, so are of less interest to me. It may be possible to use an LLM to guess the language of each track title, but someone else will have to do that.
Meanwhile, if you're interested in the genre-by-country MySpace data, or have questions about the iTunes EPF, feel free to reach out and we can discuss your research.
I would guess that combining these sources, along with info from MusicBrainz, would help quite a bit? Still, I'm rather surprised Spotify doesn't provide more information about artists.
Also sort and classify the articles by binary size, vs page count, plot count, raster image count etc, in order to compress the outliers and detect when a raster image should have been a plot and convert it to vectorized images etc.
How compact can we get the collective human scientific corpus?
They also removed a lot of discovery features - Playlist Radio, for example. They still have some version of it on the backend, but you have to go through some weird mechanisms to trigger it - like playing the last song in a playlist and waiting till it ends (or rewinding), and then you get the playlist radio. But it's also a crippled version of it - it prefers playing the exact same popular songs for some reason.
Then they released this DJ thing, which is laughably bad. No Spotify, I don't want someone talking to me with useless information in between songs. Who thought that was a good idea? Who actually uses that?
There hasn't been a change in Spotify in the last 7 years or so that wasn't negative.
Another extremely annoying effect is, being 40+, they only suggest music for my age. In “New” and “Trending”, I see Muse and Coldplay! I should make myself a fake ID just to discover new music, but that gets creepy very fast.
Magnet link found here: https://annas-archive.li/torrents/spotify
Are magnet links allowed on HN?
If I were to do it today, I could get so much farther with hyperscaler products and this dataset.
Anecdotally, I know a few vocalists that sound great in these keys and use them as a starting point
Increasing or decreasing? IMHO increasing would make more sense, as the most popular music is already mirrored in countless other places. It's the rare stuff that is most in need of preservation.
I wonder how much of the content there is AI-generated. Honestly, even as someone who was initially skeptical, I've found some of it to be rather good --- not knowing that it was AI-generated at first. Now if they could only reverse-engineer the prompt and only store the model, that would be an extremely efficient form of "compression".
I'm a music archivist & preservationist, I've archived and found several formerly lost or on the verge of becoming lost albums, EPs, and Singles, and I've been wondering if the backup of Spotify so far, even with the available info, contain any taken down, region limited, or no longer available songs?
any response is appreciated!
https://developer.spotify.com/documentation/web-api/referenc...
I bet you can whip up a super simple script with an LLM to do this!
But they're not that good. They look for the songs on youtube, and the versions uploaded there are often modified (or just very low quality). And I've had some issues with metadata. I'd say about 5% of my songs had some issues, and 1% were completely off.
Once they release the actual torrents and not just the metadata, I'm assuming that new playlist export tools will soon show up, and they'll use these new torrents as source instead of youtube. They'll be a lot more reliable. I'd wait for that to happen. In fact I may end up re-exporting my old spotify playlist.
I've used ChatGPT to write a whole bunch of playlist logic scripts (e.g. create a playlist that takes tracks from playlists A, B and C, but exclude tracks in playlist D.)
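A sketch of one such script: union of playlists A, B and C minus everything in playlist D, keeping first-seen order. The track IDs are placeholders, and a real script would go through a Spotify client library to fetch and write playlists:

```python
# Playlist set logic: (A ∪ B ∪ C) \ D, preserving first-seen order and
# dropping duplicate tracks.
def combine(a, b, c, d):
    exclude = set(d)
    seen = set()
    result = []
    for track in [*a, *b, *c]:
        if track not in exclude and track not in seen:
            seen.add(track)
            result.append(track)
    return result
```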
In spotify_clean_track_files.sqlite3:
SELECT count(*), sum(filesize_bytes) FROM track_files;
255966403|15970064861274
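The TiB figure can be checked directly from the query result above:

```python
# Sanity-check the query result: convert the summed filesize_bytes to TiB.
total_bytes = 15_970_064_861_274
tib = total_bytes / 2**40  # 1 TiB = 2^40 bytes
print(round(tib, 1))  # → 14.5
```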
That's only 14.5 TiB, nowhere near 300 TiB. What makes up the other 285 TiB of content?

I'm a bit sad that they chose to focus on music rather than audiobooks. Creating an archive of audiobooks seems like it would be more aligned with their mission.
https://open.spotify.com/album/07IyzOA9jJWPZcLDysQwpo?si=KZO...
This is not an issue in my view. I like the fact that I can download 100 MiB ultra-high resolution TIFF files of scans of photographs from the original negative from the Library of Congress and 24-bit/96kHz FLAC files of captures of 78 RPM records from the Internet Archive. In addition to maintaining completeness and quality of information, one of the main goals of preservation is to guard against further degradation and information loss. You should try to preserve the highest quality copies available (because they contain more information) and re-encoding (deliberate degradation) should only be used to create convenient access copies.
Inferior copies, in addition to being less informative, have the potential to misinform. Only the archivist will enjoy space savings. All the readers who might consult your library in the infinite future will bear the cost.
> ...(e.g. lossless FLAC). This inflates the file size...
This is entirely the wrong view. The file size of a raw capture compressed to FLAC should be thought of as the “true” or “correct” size. It is roughly the most efficient (balancing various trade-offs) representation of sampled audio data that we can presently achieve. In preservation we seek to preserve the item or signal itself and not simply what we might perceive thereof. This human-centric perception view is just wrong. There is data in film photographs which cannot be perceived visually yet can be of interest to researchers and be revealed with digital image analysis tools.
As an example of how much information celluloid can contain see: https://vimeo.com/89784677 (context: he is comparing a Blu-ray and a scan of a 35mm print)
That would also be a good fit for [the new delta-encoded posting lists I am working on](https://github.com/meilisearch/meilisearch/pull/5985). Let's see how good it can get. My early benchmarks showed a 50% reduction in disk usage.
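For readers unfamiliar with the technique: delta encoding stores the gaps between sorted doc IDs instead of the IDs themselves, so the numbers stay small and compress well. A toy round-trip (Meilisearch's actual on-disk format is more involved than this):

```python
# Delta-encode a posting list (sorted doc IDs) into gaps, and decode back.
def delta_encode(postings):
    deltas, prev = [], 0
    for doc_id in postings:
        deltas.append(doc_id - prev)  # gap from the previous ID
        prev = doc_id
    return deltas

def delta_decode(deltas):
    ids, total = [], 0
    for d in deltas:
        total += d  # running sum restores the absolute IDs
        ids.append(total)
    return ids
```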
This is literally all you need to back up Spotify.
Error HTTP 451 - Unavailable For Legal Reasons
Jokes aside, I always thought the best way to deal with piracy was to understand the demand side, or convince it not to pirate, rather than going after the supply side.
But, more importantly, I cannot even say "good for you", because I don't actually think it is good for Anna's Archive. I wouldn't touch that thing if I were them. Do we even have any solid alternatives for books, if Anna's Archive gets shut down, by the way? Don't recommend Amazon, please.
Now imagine a dedicated music client that will download and stream (and share, because we are polite) only the needed files :)
a client can selectively list and then stream individual files from a huge torrent. if you've ever watched illegal movies/shows on those random domain websites, you're likely streaming it from a torrent on the backend somewhere.
it wouldn't surprise me if we start to see some docker images pop up in a few days to do exactly this as a sort of "quasi-self-hosted Jellyfin", where a person hosts a thin client on a machine that fetches the data from the torrent and allows the user to "select" their library. A user can just select "Top hits from the 80s" and it'll grab those files from the torrent, then stream or back them up.
I don't really see why it wouldn't, from an end user perspective, be any different than a self hosted jellyfin or plexamp.
Is there any way to search this spotify database without downloading the currently available metadata torrent?
Releasing indie music, like really low-level indie music, for free in the name of "preservation" is so misguided.
Don't do this. You will only end up hurting the artists who rely on paid downloads.
There is a ton of good bands with under 10k or even 1k monthly listeners.
Relying on an external hosted service would never cross my mind, and surely wouldn’t be something I go to on a daily basis.
I envision an army of lawyers and cybersecurity companies being prepared to unleash a scorched-earth campaign that book publishers might want to be part of as well.
In the end it may take down more than just this publication, but most others as well.
Yeah, the original quality is either a 320kbps OGG or lossless. Not 160.
While this is _a_ backup, it's a pretty lossy one.
A distributed ripping project to do that would be a fine thing.
Until we have reasonable copyright terms, Pirate On !
If you could identify a track supposedly by artist X was actually AI slop not created by artist X, you could use that information to skip tracks on (web) music players, for example.
So much interesting but undiscovered music is out there!
Psy-trance... I thought it was the same as any other electronic genres, but do people get high and just start shoveling psy-trance tracks out or something?
Opera I thought was a very strict discipline, needing rigorous somewhat esoteric training in order to produce the right sounds. How could there be so many opera artists?
I mean, I'm sure there's some misclassification, but chamber music is basically a couple people with any sort of music training on classical instruments so that doesn't surprise me nearly as much... I can easily imagine there being _lots_ of those, and you might come up with a different artist name for each unique set of people you collaborate with.
> /audio-features/{id} "Get audio feature information for a single track identified by its unique Spotify ID."
this combined with track metadata can finally allow those motivated enough to create their own personalized shuffle. potentially better than the slop we get nowadays. no generative ai required*.
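A sketch of what such a personalized shuffle could look like: weight each track by how close its audio features sit to a taste profile, then draw without replacement. The feature names mirror Spotify's audio-features fields, but the values, the taste profile, and the weighting scheme are all made up for illustration:

```python
import random

# Invented sample data: per-track audio features and a taste profile.
tracks = {
    "track_a": {"energy": 0.9, "danceability": 0.8},
    "track_b": {"energy": 0.2, "danceability": 0.3},
    "track_c": {"energy": 0.7, "danceability": 0.9},
}
taste = {"energy": 0.8, "danceability": 0.9}

def weight(features):
    # Smaller L1 distance to the taste profile gives a larger weight;
    # the 0.1 floor avoids division by zero on an exact match.
    distance = sum(abs(features[k] - taste[k]) for k in taste)
    return 1.0 / (0.1 + distance)

def personalized_shuffle(tracks, rng=random):
    ids = list(tracks)
    weights = [weight(tracks[t]) for t in ids]
    order = []
    # Repeated weighted choice = weighted draw without replacement.
    while ids:
        i = rng.choices(range(len(ids)), weights=weights)[0]
        order.append(ids.pop(i))
        weights.pop(i)
    return order
```

Tracks closest to the profile tend to surface first, but every track still shows up eventually, unlike a plain most-popular-first sort.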
The point is human connection. Art is a living reflection and record of human experience. Art will persevere- the kinds of folks who prioritize what they like based on popularity were never the supporters artists (contrast with craftspeople trying to make a buck) counted on in the first place. Enjoy your derivative slop - we’ll continue on our imperfect, messy, individual, human artistic lives.