Do you know if there is a place to host other sqlite dumps? I mean dumps from other websites. Recently I dumped the whole Hacker News API and it got me thinking about it.
I keep meaning to do the same thing with Wikipedia, although the Wikipedia dumps are so inscrutably named and seemingly undocumented that it feels like the organization does not want me to pursue the idea.
Pulling useful content out of the dumps has been an exercise in frustration. I'm sure I could figure something out if I had a bunch of time to dedicate to the effort.
If I just had sqlite dumps they'd be trivial to work with and I'd be much happier with them.
Presumably they have a script that does something similar to that process, and then writes the resulting data into a predefined table structure.
Yep, my process is similar. It goes...
- decompress (users|posts)
- split into batches of 10,000
- xsltproc the batch into SQL statements
- pipe the batches of statements into sqlite in parallel, using flock for coordination
On my M1 Max it takes about 40 minutes for the whole network. Then I compress each database with brotli, which takes about 5 hours.
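
For anyone who wants to try something similar, here is a rough sketch of the batching and locked-insert steps in Python. It is not the actual pipeline: it swaps xsltproc for Python's ElementTree, and it assumes the decompressed dump is XML made of `<row>` elements with `Id` and `DisplayName` attributes. The `users` table, the file paths, and the function names are all hypothetical. It runs the batches one after another, but since each write holds an exclusive flock, several worker processes could call `write_batch` against the same database.

```python
import fcntl  # Unix-only; provides flock()
import itertools
import sqlite3
import xml.etree.ElementTree as ET

BATCH_SIZE = 10_000  # batch size mentioned in the steps above


def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


def iter_rows(xml_path):
    """Stream the attributes of each <row> element (assumed dump format)."""
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)  # copy before clearing the element
            elem.clear()


def write_batch(db_path, lock_path, batch):
    """Insert one batch while holding an exclusive flock on `lock_path`."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks while another writer holds it
        try:
            conn = sqlite3.connect(db_path)
            try:
                with conn:  # one transaction per batch
                    # Hypothetical schema, for illustration only.
                    conn.execute(
                        "CREATE TABLE IF NOT EXISTS users "
                        "(id INTEGER PRIMARY KEY, name TEXT)"
                    )
                    conn.executemany(
                        "INSERT INTO users (id, name) VALUES (?, ?)",
                        [(int(r["Id"]), r.get("DisplayName")) for r in batch],
                    )
            finally:
                conn.close()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def load(xml_path, db_path, lock_path):
    """Split the dump into batches and write them one at a time."""
    for batch in batched(iter_rows(xml_path), BATCH_SIZE):
        write_batch(db_path, lock_path, batch)
```

Calling `load("Users.xml", "users.db", "users.lock")` (placeholder paths) would fill the table. Wrapping each batch in a single transaction is what keeps bulk inserts into sqlite fast; the lock file just keeps parallel writers from colliding.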