If anyone wants to know the specifics of how I used wget, I wrote them down here: https://github.com/SpeedcubeDE/speedcube.de-forum-archive
Also, if anyone has experience archiving similar websites with HTTrack and maybe knows how it compares to wget for my use case, I'd love to hear about it!
Is most of that data because of there being like a zillion different views and sortings of the same posts? That's been the main difficulty for me when wanting to crawl some sites. There's like an infinite number of permutations of URLs with different parameters, because every page has a bunch of different links with auto-generated URL parameters for various things, which results in retrieving the same data over and over and over again throughout an attempted crawl. And sometimes URL parameters are needed and sometimes they aren't, so it's not like you can just strip all URL parameters either.
So then you start adding things to your crawler, like starting with the shortest URLs first, and then maybe you make it so that whenever you pick the next URL to visit, it takes the one most different from what you've seen so far. And after that you start adding super specific rules for different paths of a specific site.
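For what it's worth, here's a rough sketch of what those two heuristics could look like as a crawl frontier (Python; `fetch_and_extract_links` is a hypothetical stand-in for your own fetching/parsing, and "most different" is approximated here by comparing query-parameter names, so treat it as an illustration rather than a drop-in crawler):

```python
# Sketch: a priority-based crawl frontier that prefers short URLs and,
# among those, URLs least similar to what has already been crawled
# (similarity measured over query-parameter names only).
import heapq
from urllib.parse import urlsplit, parse_qsl

seen_param_sets = []  # parameter-name sets of URLs already visited

def novelty(url):
    """Higher = more different from everything crawled so far."""
    params = frozenset(name for name, _ in parse_qsl(urlsplit(url).query))
    if not seen_param_sets:
        return 1.0
    overlaps = [len(params & s) / max(len(params | s), 1) for s in seen_param_sets]
    return 1.0 - max(overlaps)

def priority(url):
    # Shorter URLs first; novelty breaks ties (negated because heapq is a min-heap).
    return (len(url), -novelty(url))

def crawl(start_urls, fetch_and_extract_links, limit=1000):
    """fetch_and_extract_links(url) -> iterable of absolute URLs (hypothetical)."""
    frontier = [(priority(u), u) for u in set(start_urls)]
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        seen_param_sets.append(frozenset(
            name for name, _ in parse_qsl(urlsplit(url).query)))
        for link in fetch_and_extract_links(url):
            if link not in visited:
                heapq.heappush(frontier, (priority(link), link))
    return visited
```

(Priorities are computed at push time, so they go slightly stale as the crawl progresses; good enough for a heuristic like this, but worth knowing.)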
In terms of possible permutations, MyBB is pretty tame, thankfully. Only the forums are sortable, and posts only have the regular mode and the aforementioned threaded mode to view them. Even the calendar widget only goes from 1901 to 2030, otherwise wget might have crawled forever.
I originally considered excluding threaded mode using wget's `--reject-regex` and then just adding an nginx rule later to redirect any such incoming links to the normal view mode. Basically just saying "fuck it, you only get this version". That might be worth a try in your case.
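In case it helps, here's a minimal sketch of that combination, assuming MyBB-style URLs where threaded view is selected via a `mode=threaded` query parameter (check your own forum's links before relying on either of these):

```sh
# Crawl side: skip threaded-mode views entirely during the mirror.
# Assumes threaded view shows up as mode=threaded in the URL.
wget --mirror --convert-links --adjust-extension --page-requisites \
     --reject-regex 'mode=threaded' \
     https://forum.example.com/
```

```nginx
# Serve side: bounce any old threaded-mode links to the normal thread view.
# Assumes MyBB-style /showthread.php?tid=123&mode=threaded URLs.
location = /showthread.php {
    if ($arg_mode = "threaded") {
        return 301 /showthread.php?tid=$arg_tid;
    }
    # ... normal handling of the archived pages goes here ...
}
```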
I've read py4e, ostep, and PG's essays using this.
I am who I am because of HTTrack. Thank you.
A real copy of the netlify.com website, for demonstration: https://crawler.siteone.io/examples-exports/netlify.com/
Sample analysis of the netlify.com website, which this tool can also provide: https://crawler.siteone.io/html/2024-08-23/forever/x2-vuvb0o...
It's a sad point to be at. Fortunately, the SingleFile extension still works really well for single pages, even when they are built dynamically by JavaScript on the client side. There isn't a solution for cloning an entire site though, at least not one that I know of.
Good to know they're still around. However, now that the web is much more dynamic, I guess it's not as useful as it was back then.
It's also less useful because the web is so easy to access now. I remember using it back then to pull things down over the university link for reference in my room (first year, no network access at all in the rooms) or at the house (which only had modem access billed by the minute).
Of course sites can still vanish easily these days, so having a local copy can be a bonus, but they're just as likely to go out of date or get replaced, and if not, they're usually archived elsewhere already.
So, did the developer of the GitHub repo take over and keep updating/upgrading it? Very good!