Show HN: Tesoro – Personal internet archive

(tesoro.io)

174 points | agamble | 8y ago | 100 comments

JackC8y ago

For personal web archiving, I highly recommend http://webrecorder.io. The site lets you download archives in standard WARC format and play them back in an offline (Electron) player. It's also open source and has a quick local setup via Docker - https://github.com/webrecorder/webrecorder .

Webrecorder is by a former Internet Archive engineer, Ilya Kreymer, who now captures online performance art for an art museum. What he's doing with capture and playback of JavaScript, web video, streaming content, etc. is state of the art as far as I know.

(Disclaimer - I use bits of Webrecorder for my own archive, perma.cc.)

For OP, I would say consider building on and contributing back to Webrecorder -- or alternatively figure out what Webrecorder is good at and make sure you're good at something different. It's a crazy hard problem to do well and it's great to have more ideas in the mix.

motdiem8y ago

Seconding Webrecorder (and the newly updated WAIL) - I had the chance to meet Ilya Kreymer at a conference a few weeks ago, and I can confirm what he's doing is top notch - I'm hoping to see more work around WARC viewing and sharing in the future.

(Disclaimer: I also do personal archiving stuff with getkumbu)

johnaberlin8y ago

Hi motdiem,

Thank you for seconding the newly updated WAIL. I am the maintainer/creator of the newly updated WAIL (the Electron version): https://github.com/N0taN3rd/wail

I was unable to attend the IIPC Web Archiving Conference (WAC), but the original creator of WAIL (Python), Mat Kelly, did attend (we are both a part of the same research group, WSDL).

If you or anyone else have any questions about WAIL I am more than happy to answer them.

amrrs8y ago

Is offline playback still relevant in the age of ubiquitous, always-connected Internet?

shasheene8y ago

Last I played with it, the latency on webrecorder was uncomfortably high for always-on recording of personal web usage (the pages only display once fully rendered). I wish webpages would render as normal and get asynchronously archived once loading is complete.

That would allow constant archival of every webpage a user ever visits -- an immutable record over the user's years of crawling the web.

unicornporn8y ago

> That would allow constant archival of every webpage a user ever visits -- an immutable record over the user's years of crawling the web.

This is usually solved by using a proxy: http://netpreserve.org/projects/live-archiving-http-proxy/

ikreymer8y ago

Thanks Jack for mentioning Webrecorder! This is a project I started and it is now part of rhizome.org, a non-profit dedicated to promoting internet-based art and digital culture.

I thought I’d add a few notes here, as there’s a few ways you can use Webrecorder and related tools.

First, Webrecorder supports two distinct modes:

- Native recording mode: HTTP/S traffic goes through the browser and is rewritten to point to the Webrecorder server. (This is the default.)

- Remote browser mode: Webrecorder launches a browser in a Docker container and streams the screen to your browser (using noVNC). The traffic is either recorded or replayed depending on the mode, but the operation is the same (we call this ‘symmetrical archiving’). This gives you a recording proxy without having to configure your browser or install any plugins.

You can choose this mode by clicking the dropdown to choose a browser (currently Chrome and FF). This is essentially a remote browser configured via HTTP/S proxy, and allows us to record things like Flash and even Java applets, and other technologies that may become obsolete.

- We also have a desktop player app, Webrecorder Player, available for download from: https://github.com/webrecorder/webrecorderplayer-electron/re...

This is an app that plays back WARC files (created by Webrecorder and elsewhere), and allows browsing any WARC file offline.

Another way to create a web archive (for developers): You can use the devtools in the browser to export HAR files, and Webrecorder and Webrecorder Player will convert them on the fly and play them back. Unfortunately, this option is sort of limited to developers, but you can actually create a fairly good archive locally using HAR export (available in Chrome and Firefox at least). The conversion is done using this tool: https://github.com/webrecorder/har2warc

- If you use webrecorder.io, you can register for an account or use it anonymously. If you register for an account, we provide 5GB of storage and you have a permanent URL for your archive. You can also upload existing WARCs (or HARs).

- You can also run Webrecorder on your own! The main codebase is at: https://github.com/webrecorder/webrecorder and the remote browser system is actually a separate component and was first used for oldweb.today and lives at https://github.com/oldweb-today

Finally, the core replay/recording tech is actually a separate component, an advanced ‘wayback machine’ being developed in https://github.com/ikreymer/pywb

There are a lot of different components here, and we would definitely appreciate help with any and all parts of the stack if anyone is interested! All our work is open-source and we are a non-profit, so any help is appreciated.
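For anyone curious what a devtools HAR export actually contains before handing it to a converter: a HAR file is plain JSON with a `log.entries` list, one entry per request/response pair. The sketch below only inspects such a file; it does not use or imitate har2warc, and the sample entries and the `summarize_har` helper are made up for illustration.

```python
import json

def summarize_har(har_text):
    """Return (url, status, mime_type) for every entry in a HAR document."""
    har = json.loads(har_text)
    rows = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        response = entry["response"]
        # "content" and "mimeType" can be missing in sparse exports, so read defensively.
        mime = response.get("content", {}).get("mimeType", "")
        rows.append((request["url"], response["status"], mime))
    return rows

# Hypothetical two-entry export, trimmed to the fields used above.
SAMPLE_HAR = json.dumps({
    "log": {
        "entries": [
            {
                "request": {"method": "GET", "url": "https://example.com/"},
                "response": {"status": 200, "content": {"mimeType": "text/html"}},
            },
            {
                "request": {"method": "GET", "url": "https://example.com/site.css"},
                "response": {"status": 200, "content": {"mimeType": "text/css"}},
            },
        ]
    }
})

for url, status, mime in summarize_har(SAMPLE_HAR):
    print(status, mime, url)
```

A listing like this is a quick way to check that the export captured the stylesheets, scripts and images you expected before converting it.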

owenversteeg8y ago

Wow, webrecorder seems very cool, especially since it's OSS. Is there any way to set it up to record all incoming traffic? In these days of cheap storage that'd be very cool. I know I personally only use about 30GB a month, so a $70 2TB hard drive would last me five and a half years of browsing.
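The storage estimate above checks out as a quick calculation, assuming decimal units (2 TB = 2000 GB) and a steady 30 GB per month with no compression or deduplication:

```python
def months_of_browsing(disk_gb, monthly_gb):
    """How many months of traffic fit on the disk at a constant monthly rate."""
    return disk_gb / monthly_gb

months = months_of_browsing(2000, 30)
years = months / 12
print(f"{months:.1f} months, about {years:.1f} years")  # 66.7 months, about 5.6 years
```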

JackC8y ago

If you're a coder I bet you could hack it to do that. It has an amazing containerized browser mode where you can browse in a remote browser via VNC, with the remote browser set up to use a WARC-writing proxy. So the general outline would be to run it locally in Docker; expose the proxy port used by the containerized browsers; and configure your own browsers to use the same proxy.

I'm not sure how much this would interfere with normal browsing -- it's not a typical use case.

psteinweber8y ago

This is great and helps me a ton, thanks for mentioning it here.

agambleOP8y ago

Thanks Jack, I hadn't heard of webrecorder before, but I'll check it out. :)

smoyer8y ago

It's not mine unless it's running on my own servers or computer - I created a really rough version of this several years ago that is saved to my computer (and from there into box).

pbhjpbhj8y ago

I adapted a bash script someone posted here; it uses Firefox bookmarks (places.sqlite). Cron runs the script and downloads every page I've bookmarked that month (after some filtering). I don't use it often but sometimes I'll awk-grep it; I'm a hoarder in real life too!
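The script itself was not posted, so this is not it. As a rough sketch of the bookmark-selection step only: Firefox keeps bookmarks in the `moz_bookmarks` and `moz_places` tables of its `places.sqlite` database, and the query below assumes that schema (type 1 for bookmarks, `dateAdded` in microseconds since the epoch). The demo runs against a tiny in-memory stand-in with made-up rows; `bookmarks_since` and `to_us` are hypothetical helper names.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed schema: moz_bookmarks.fk points at moz_places.id, type 1 marks a
# bookmark (not a folder or separator), dateAdded is microseconds since the epoch.
QUERY = """
SELECT p.url, b.title
FROM moz_bookmarks AS b
JOIN moz_places AS p ON p.id = b.fk
WHERE b.type = 1 AND b.dateAdded >= ?
ORDER BY b.dateAdded
"""

def bookmarks_since(conn, since):
    """Return (url, title) for http(s) bookmarks added on or after `since`."""
    since_us = int(since.timestamp() * 1_000_000)
    rows = conn.execute(QUERY, (since_us,))
    # Filter out internal entries such as "place:" queries.
    return [(url, title) for url, title in rows
            if url.startswith(("http://", "https://"))]

def to_us(year, month, day):
    """Microsecond timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1_000_000)

# In-memory stand-in for a copy of places.sqlite, with made-up rows.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE moz_bookmarks (id INTEGER PRIMARY KEY, type INTEGER,
                                fk INTEGER, title TEXT, dateAdded INTEGER);
""")
demo.executemany("INSERT INTO moz_places VALUES (?, ?)", [
    (1, "https://example.com/a"),
    (2, "place:sort=8"),
    (3, "https://example.org/old"),
])
demo.executemany("INSERT INTO moz_bookmarks VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, "A", to_us(2017, 7, 10)),
    (2, 1, 2, "Most visited", to_us(2017, 7, 11)),
    (3, 1, 3, "Old", to_us(2017, 5, 1)),
    (4, 2, None, "Folder", to_us(2017, 7, 12)),
])

recent = bookmarks_since(demo, datetime(2017, 7, 1, tzinfo=timezone.utc))
print(recent)
```

In practice you would point `sqlite3.connect` at a copy of the profile's database rather than the live file, since Firefox usually holds a lock on it while running, and then feed the URLs to wget or a similar downloader from cron.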

gkya8y ago

Would you mind posting it?

kchr8y ago

Please share!

dddw8y ago

indeed interested in this too

flippant8y ago

I wrote a similar tool which uses Electron to create PDFs of webpages and bookmark them in a SQLite database.

https://github.com/marvelm/erised

agumonkey8y ago

Could be a browser plugin

ps: nice project btw (thanks in advance)

Piskvorrr8y ago

That's just as much "my own" as The Internet Archive: a website Out There somewhere. Worse, it's much more likely to rot and disappear than archive.org. Now, if I could run this locally...

(Yes, yes, `wget --convert-links`, I know. Not quite as convenient, though.)

agambleOP8y ago

OP here. The internet archive is great, but it's not so awesome if there's some ephemeral content you need to save right away, like Tweets or social media posts. Being able to trigger an archive immediately lets you save temporary content like that, which is more prone to deletion. I'm going to build a Chrome extension to click and make a cloud copy of the page you're on; hopefully that will make it seem more personally controllable.

Do you think being able to download the archive locally would be useful?

ia_user8y ago

This exists for (& from) the Internet Archive!

Firefox: https://addons.mozilla.org/en-US/firefox/addon/wayback-machi...

Chrome: https://chrome.google.com/webstore/detail/wayback-machine/fp...

Safari: https://safari-extensions.apple.com/details/?id=archive.org....

Android: https://play.google.com/store/apps/details?id=com.archive.wa...

iOS: https://itunes.apple.com/us/app/wayback-machine/id1201888313

dsacco8y ago

This might sound insane, but if you modified this into a browser extension that runs locally (with options for one-off or continuous saving for entire browsing sessions) I would probably download it. Personally, I have well over 100TB of personal hard drive space in my home, and I would love to just download entire portions of my browsing history locally for archival reasons (and to truly defeat link rot).

As it is now, I personally wouldn't use it (but it's a cool project, definitely please keep working on this idea!).

dschep8y ago

So, like another top-level commenter asked: why build this or use this instead of archive.is? And there are already multiple extensions available for Chrome for it ;)

I agree with GP here that anything billed as "My own internet archive" should be run on my computer, not someone else's.

johnaberlin8y ago

Hi agamble,

You can do just that via https://chrome.google.com/webstore/detail/warcreate/kenncghf... http://warcreate.com.

I am a core contributor to this project on GitHub (https://github.com/machawk1/warcreate) and the maintainer/creator of the latest version of WAIL. So I am not biased in any way ;)

detaro8y ago

You can trigger the Internet Archive manually as well.

jtrip8y ago

A local download only increases the redundancy. Tesoro keeps a copy, and the user keeps a local copy that they can also use however they want. A bit like keeping newspaper clippings, that are found decades later by some relative to be then posted on social websites as something interesting.

rathish_g8y ago

Good work. For research and citation purposes, a permalink is needed outside the source domain, one that can be trusted and will stay around for decades.

unicornporn8y ago

https://github.com/webrecorder/webrecorder can be run using Docker. There are also plenty of proxies that can save your browsing. See: http://netpreserve.org/projects/live-archiving-http-proxy/

WhiteOwlLion8y ago

Have you looked at WorldBrain? It is a fork of Falcon, but it keeps a cache and lets you perform keyword searches against the cached content.

j_s8y ago

I would be interested in an attestation service that can provide court-admissible evidence that a particular piece of content was publicly accessible on the web at a particular point in time via a particular URL.

I believe the only way to incentivise participation in such a system is by paying for timestamped signatures, e.g. "some subset of downloaded [content] from [url] at [time] hashed to [hash]", all tucked into a Bitcoin transaction or something. There are services that will do this with user-provided content[1]; I am looking for something that will pull a URL and timestamp the content.

This would also be a way to detect when different users are being served different content at the same url, thus the need for a global network of validators.

[1] https://proofofexistence.com/
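A minimal sketch of what one validator's "[content] from [url] at [time] hashed to [hash]" record could look like, assuming SHA-256 and a JSON record layout invented here for illustration. Signing the record and anchoring it in a Bitcoin transaction are left out; only the final digest would need to be published.

```python
import hashlib
import json
from datetime import datetime, timezone

def attestation_record(url, content, fetched_at):
    """Describe what one validator saw at `url`; the field names are hypothetical."""
    return {
        "url": url,
        "fetched_at": fetched_at.isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
    }

def record_digest(record):
    """Hash a canonical JSON encoding of the record; this is the value to timestamp."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two validators fetching the same URL; the second was served different bytes.
seen_by_a = attestation_record(
    "https://example.com/post", b"<p>original</p>",
    datetime(2017, 7, 20, 12, 0, tzinfo=timezone.utc))
seen_by_b = attestation_record(
    "https://example.com/post", b"<p>edited</p>",
    datetime(2017, 7, 20, 12, 0, tzinfo=timezone.utc))

print(record_digest(seen_by_a))
print("validators agree:", seen_by_a["sha256"] == seen_by_b["sha256"])
```

Comparing the `sha256` fields across independent validators is what would surface different users being served different content at the same URL.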

rjeli8y ago

Interesting - it is trivial to prove something was done today rather than yesterday, by hashing with the most recent bitcoin block or some new info.

Is it possible to prove something was done in the past? All I can think of is some sort of scheme involving destroyed information.

j_s8y ago

> trivial to prove something was done today

My focus is on the something much more so than the when. I can do my own doctoring of any data, or use some service to make something that looks real[1]. Getting some proof that this fake data existed is not what I'm after.

Instead, I want multiple, completely separate (and ideally as independent and diverse as possible) attestations that something was out there online, as proof that some person or organization intended for it to be seen by everyone as their content. Being able to prove that irrefutably seems nearly impossible today even for the present time, particularly against insider threats.

Your question regarding proving something in the past is going far beyond what I'm hoping for; it will take me quite a while to come up with anything that might be helpful for such a situation. I assume most would hit up the various archive sites, but my gut feeling is that it winds up being a probability based on how well forensics holds up / are not falsifiable.

[1] simitator.com - not linking because ads felt a bit extra-sketch!

naiveattack8y ago

This is interesting.

The only way to do it should be to sync your observation to other observers as soon as the observation is made. The other observers can confirm the time then by knowing when they received the information.

Block chain with comments.

an278y ago

Isn't it the other way around?

I can prove I had today's papers today, but once I've seen it I can prove it any day. So you can say "this information existed at day X or earlier".

There's also the issue of proving the content you hashed actually came from the place you say it came from. To me it seems that would require proofs of authenticity, by the source itself; not something that's easy to come by.

unicornporn8y ago

In what way could this be considered “your own internet archive”? I see no way to register a user and save pages to a collection.

If you really want to create your own archive, set up a Live Archiving HTTP Proxy[1], run SquidMan [2] or check out WWWOFFLE[3].

If you want something simpler, have a look at Webrecorder[4] or a paid Pinboard account with the “Bookmark Archive”[5].

[1] http://netpreserve.org/projects/live-archiving-http-proxy/

[2] http://squidman.net/squidman/index.html

[3] http://www.gedanken.org.uk/software/wwwoffle/

[4] https://webrecorder.io/

[5] https://pinboard.in/upgrade/

agambleOP8y ago

Great points.

You're right, for now it's a single rate-limited HTML form and you'll have to manually collate the links to the archives you create. I'll be adding specialty features (with accounts) next. :)

falcolas8y ago

Another pair of even simpler solutions:

Print and store pages as PDFs.

Download and save entire pages as webarchives (Safari, wget)

rahiel8y ago

An internet archive can only provide value if it's there for the long-term. What's your plan to keep this service running if it gets popular? For example, archive.is cost about $2000/month at the start of 2014 [1]. I expect it to cost even more now.

[1]: http://blog.archive.is/post/72136308644/how-much-does-it-cos...

venning8y ago

Thoughts:

I like the look. Very clean. I like how fast it's responding; better than archive.org (though, obviously, they have different scaling problems).

"Your own internet archive" might be overselling it, as other commenters have pointed out; the "Your" feels a bit misleading. I think "Save a copy of any webpage." gives a better impression, which you use on the site itself.

The "Archive!" link probably shouldn't work if there's nothing in the URL box. It just gives me an archive link that errors. Example: [1]

Using it on news.YC as a test gave me errors with the CSS & JS [2]. This might be due to the fact that HN uses query parameters in their CSS and JS, which repeat in the tesoro URL, which you may not be parsing correctly.

Maybe have something in addition to an email link for submitting error reports like the above, just cause I'd be more likely to file a GitHub issue (even if the repo is empty) than send a stranger an email.

As other commenters have pointed out, archive.is also does this, and their longevity helps me feel confident that they'll still be around. Perhaps, if you wish to differentiate, offer some way for me to "own" the copy of the page, like downloading it or emailing it to myself or sharing it with another site (like Google Docs or Imgur) to leverage redundancy, or something like that. Just a thought.

All in all, nice Show HN.

EDIT: You also may want to adjust the header to work properly on mobile devices. Still though, nice job. Sorry if I'm sounding critical.

[1] https://archive.tesoro.io/320b55cc9b78e271c94716ee23554da8

[2] https://archive.tesoro.io/a7bf03e247224bc3b4e5a7c1f2ad42b1

agambleOP8y ago

Thanks! These are great comments - I'll look into the issue with saving Hacker News CSS + JS.

bfirsh8y ago

What's the best way to automatically archive all of the data I produce on websites? Facebook, Twitter, Instagram, blogs, and so on. At some point these services will disappear, and I want to preserve them.

I know a lot of these sites have archiving features, but I want something centralised and automatic.

donpdonp8y ago

The IndieWeb(.org) group recommends 'Publish (on your) Own Site, Syndicate Elsewhere' (https://indieweb.org/POSSE) that you might find interesting.

aethertron8y ago

The hypothetical system that makes most sense to me for this: a process that runs 24/7 on a server, watching your feeds on those services. Grabbing and saving everything via APIs or screen-scraping.

fiatjaf8y ago

Is that creepy resource-eater bug-prone service what makes most sense to you?

akerro8y ago

Nice, post it on https://www.reddit.com/r/DataHoarder/

They will love it!

zippoxer8y ago

Cool tool, but by using it, you depend on it staying alive for longer than any page you archive on it.

This got me thinking about how a decentralized p2p internet archive could solve the trust problem that exists in centralized internet archives. Such solution could also increase the capacity of archived pages and the frequency at which archived pages are updated.

It is true that keeping the entire history of the internet on your local drive is likely impossible, but a solution similar to what Sia is doing could solve this problem: split each page into 20 pieces and distribute one piece to each of 20 peers such that any 10 pieces can recover the original page. So, you only have to trust that 10 peers out of the 20 that store a page are still alive to get the complete page.
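To make the "any 10 of 20 pieces" idea concrete, here is a toy erasure code. It treats each block of 10 bytes as points on a polynomial over the integers modulo 257 and hands out 20 evaluations of it, which is the same idea behind Reed-Solomon coding. This is a sketch for illustration, not Sia's implementation; `split`, `recover` and the parameters are assumptions, and a real system would use an optimized library.

```python
PRIME = 257  # smallest prime above 255, so every byte value is a field element

def _interpolate(points, x):
    """Evaluate at x the unique polynomial through `points`, modulo PRIME."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (x - xm) % PRIME
                den = den * (xj - xm) % PRIME
        # Fermat's little theorem gives the modular inverse of den.
        total = (total + yj * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return total

def split(data, k=10, n=20):
    """Return {piece_number: [symbols]}; any k pieces are enough to rebuild data."""
    padded = data + bytes(-len(data) % k)  # zero-pad to a multiple of k
    pieces = {x: [] for x in range(1, n + 1)}
    for start in range(0, len(padded), k):
        block = padded[start:start + k]
        # The data bytes are the polynomial's values at x = 1..k.
        points = [(i + 1, byte) for i, byte in enumerate(block)]
        for x in pieces:
            pieces[x].append(_interpolate(points, x))
    return pieces

def recover(pieces, length, k=10):
    """Rebuild the first `length` bytes from any k surviving pieces."""
    chosen = sorted(pieces.items())[:k]
    if len(chosen) < k:
        raise ValueError(f"need at least {k} pieces, got {len(chosen)}")
    out = bytearray()
    for b in range(len(chosen[0][1])):
        points = [(x, symbols[b]) for x, symbols in chosen]
        out.extend(_interpolate(points, i) for i in range(1, k + 1))
    return bytes(out[:length])

page = b"<html><body>archived page</body></html>"
pieces = split(page)
# Pretend ten of the twenty peers have gone offline.
survivors = {x: pieces[x] for x in (2, 3, 5, 7, 11, 13, 15, 17, 19, 20)}
print(recover(survivors, len(page)) == page)
```

Each piece is one tenth the size of the page, so the 20 pieces together cost twice the original storage, compared with ten times for keeping ten full copies.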

The main problem I can see right now would be lack of motivation to contribute to the system -- why would people run nodes? Just because it would feature yet another cryptocurrency? Sure, this could hold now, but when the cryptocurrency craze quiets down and people stop buying random cryptocurrencies just for the sake of trading them, what then? Who would run the nodes and why?

burkemw38y ago

IPFS [0] and its sibling Filecoin [1] are dealing in a very similar space to your wonderings.

[0]: https://ipfs.io/

[1]: https://filecoin.io/

j_s8y ago

The discussion 3 months ago on bookmarks mentioned several options for archiving pages (some locally): Ask HN: Do you still use browser bookmarks? | https://news.ycombinator.com/item?id=14064096

extensions: Firefox "Print Edit" Addon / Firefox Scrapbook X / Chrome Falcon / Firefox Recoll

open source: Zotero / WorldBrain / Wallabag

commercial: Pinboard / InstaPaper / Pocket / Evernote / Mochimarks / Diigo / PageDash / URL Manager Pro / Save to Google / OneNote / Stash / Fetching

public: http://web.archive.org / https://archive.is/

idlewords8y ago

You're going to get this service shut down if you let anonymous people republish arbitrary content while running everything on Google.

I (obviously) think personal archives are a great idea, but republishing is a hornets' nest.

Retr0spectrum8y ago

Is this any different to archive.is?

If I want my own archive, Ctrl+S in Firefox usually works fine for me.

crispytx8y ago

You know your site actually does a better job reproducing webpages than archive.org. I've noticed that if you use a CDN to serve up CSS & JS for a webpage that you're trying to archive on archive.org, it won't render correctly. On your site, there doesn't seem to be a problem including CSS & JS from an external domain. Thumbs up :)

agambleOP8y ago

OP here. Thanks! Could you point me to the pages where it worked well for you vs archive.org?

zichy8y ago

So this is like archive.is, but I can't search through archived sites?

CM308y ago

When you said 'own internet archive' I thought you meant some sort of program you could download that'd save your browsing history (or whatever full website you wanted) to your hard drive. I think that would have been significantly more useful here.

As it is, while it's a nice service, it's still got all the issues of other archive ones:

1. It's online only, so one failed domain renewal or hosting payment takes everything offline.

2. It being online also means I can't access any saved pages if my connection goes down or has issues.

3. The whole thing is wide open to having content taken down by websites wanting to cover their tracks. I mean, what do you do if someone tells you to remove a page? What about with a DMCA notice?

It's a nice alternative to archive.is, but still doesn't really do what the title suggests if you ask me.

jpalomaki8y ago

This might be a good use case for distributed storage (IPFS?).

Instead of hosting this directly on my computer, it would be interesting to have a setup where the archiving is done via the service and I would just provide somewhere a storage space where the content would end up being mirrored (just to guarantee that my valuable things are saved at least somewhere, should the other nodes decide to remove the content).

I would prefer this setup, because it would be easily accessible for me from any device and I would not need to worry about running some always available system. With some suitable P2P setup my storage node would have less strict uptime requirements.

johnaberlin8y ago

Hi jpalomaki,

Have you heard of InterPlanetary Wayback (ipwb)? https://github.com/oduwsdl/ipwb

InterPlanetary Wayback (ipwb) facilitates permanence and collaboration in web archives by disseminating the contents of WARC files into the IPFS network.

dbz8y ago

This is pretty cool. I have a Chrome extension that lets you view the cached version of a web page [1]. Would I be able to use this through an API? I currently support Google Cache, WayBack Machine, and CoralCDN, but Coral doesn't work well and I'd like to replace it with something else.

[1] https://chrome.google.com/webstore/detail/cmmlgikpahieigpccl...

agambleOP8y ago

OP here.

Yup, API and chrome extension are next on the feature list. :)

prirun8y ago

I think you should explain why you're paying Google to archive web pages for others, i.e., how do you plan on benefiting from this? If you have some business model in mind, let people know now. It's the first question that comes to my mind when someone offers a service that is free yet costs the provider real money. You obviously can't pay Google to archive everyone's web pages just for the fun of it.

agambleOP8y ago

OP here.

Great point. Right now this is just a single rate-limited HTML form to gauge interest. Next is to build specialty features that are worth paying for and make this sustainable. :)

crispytx8y ago

Haters gonna hate.

gorbachev8y ago

You should try and rewrite relative links in websites that get archived. I tested your app with a news site, and all the links go to archive.tesoro.io/sites/internal/url/structure/article.html
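One way to handle the relative links described above is to resolve each `href` against the original page's URL before writing the archived copy. A small sketch using the standard library; the news-site URLs are made up, and whether the result should then point at the live site or at another archived copy is a separate design decision.

```python
from urllib.parse import urljoin

def absolutize(page_url, href):
    """Resolve a link found on `page_url` to an absolute URL on the original site."""
    return urljoin(page_url, href)

page = "https://news.example.com/world/index.html"
for href in ("/sites/internal/url/structure/article.html",
             "../sport/story.html",
             "https://other.example.org/x"):
    print(absolutize(page, href))
```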

I also second the need for user accounts. If I am to use your site as my personal archive, then I would need to log in and create a collection of my own archived sites.

arkenflame8y ago

I made a simple Chrome extension to automatically save local copies of pages you bookmark, if you prefer that instead: https://chrome.google.com/webstore/detail/backmark-back-up-t...

lozzo8y ago

it would be nice to have a bit of explanation on how it works and why we can be confident that we can rely upon it

agambleOP8y ago

OP here. Definitely, great idea :)

Briefly: Sites are archived using a system written in Golang and uploaded to a Google Cloud bucket.

More: The system downloads the remote HTML, parses it to extract the relevant dependencies (<script>, <link>, <img> etc) and then downloads these as well. Tesoro is even parsing CSS files to extract the url('...') file dependencies from here as well, meaning most background images and fonts should continue to work. All dependencies (even those hosted at remote domains) are downloaded and hosted with the archive, meaning the src attributes on the original page tags are wrangled to support the new location.
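Tesoro's code is Go and has not been published here, so the following is only a Python sketch of the general technique described above: collect the `src`/`href` targets of `<script>`, `<link>` and `<img>` tags, then scan CSS text for `url(...)` references. The tag list, the regex and the sample page are assumptions, and real pages need more cases (`srcset`, `@import`, inline styles, and so on).

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Which attribute carries the dependency for each tag we care about.
ASSET_ATTRS = {"script": "src", "img": "src", "link": "href"}
# Matches url(x), url('x') and url("x") in CSS text.
CSS_URL = re.compile(r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)""")

class AssetCollector(HTMLParser):
    """Collect absolute URLs of script, link and img dependencies."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        wanted = ASSET_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.assets.append(urljoin(self.base_url, value))

def html_assets(page_url, html):
    collector = AssetCollector(page_url)
    collector.feed(html)
    return collector.assets

def css_assets(css_url, css_text):
    """Resolve url(...) references relative to the stylesheet, skipping data: URIs."""
    return [urljoin(css_url, ref.strip())
            for ref in CSS_URL.findall(css_text)
            if not ref.startswith("data:")]

PAGE = """<html><head><link rel="stylesheet" href="/css/site.css?v=3">
<script src="https://cdn.example.net/app.js"></script></head>
<body><img src="images/logo.png"></body></html>"""
CSS = """body { background: url('../img/bg.png'); }
@font-face { src: url(/fonts/a.woff2); }
.icon { background: url(data:image/png;base64,AAAA); }"""

print(html_assets("https://example.com/news/story.html", PAGE))
print(css_assets("https://example.com/css/site.css", CSS))
```

Note that CSS references resolve against the stylesheet's URL, not the page's, which is why `css_assets` takes the stylesheet location as its base.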

The whole thing is hosted on GCP Container Engine and I deploy with Kubernetes.

I'll write up a more comprehensive blog post in some time. Which portion of this would you like to hear more about?

19eightyfour8y ago

The issue is cost. Your costs are disk space for people's archives, instances for people's use, and bandwidth for the fetches and crawls and access.

How can you pay for this if it's free? It's unreliable unless it's financially viable.

Faaak8y ago

How about avoiding redundancies? Are identical CSS files cached twice or referenced by their hash?
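Referencing files by hash could look roughly like the sketch below: each asset's bytes are stored once under their SHA-256 digest, and every archive keeps only a pointer to that digest. `AssetStore` and its in-memory dictionaries are hypothetical; nothing here reflects how Tesoro actually stores files.

```python
import hashlib

class AssetStore:
    """Content-addressed store: identical bytes are kept once, however often archived."""

    def __init__(self):
        self.blobs = {}       # sha256 hex digest -> bytes
        self.references = {}  # (archive_id, original_url) -> digest

    def put(self, archive_id, url, content):
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # no-op if already stored
        self.references[(archive_id, url)] = digest
        return digest

    def get(self, archive_id, url):
        return self.blobs[self.references[(archive_id, url)]]

store = AssetStore()
css = b"body { margin: 0 }"
store.put("archive-1", "https://example.com/site.css", css)
store.put("archive-2", "https://example.com/site.css", css)
print(len(store.blobs), "blob for", len(store.references), "references")
```

This only deduplicates byte-identical files; near-duplicates with small differences would still be stored in full.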

The page URI is a bit obscure though. I think a tesoro.io/example.tld/page/foobar/timestamp would look good.

What about big media content and/or small differences between them?

jdc05898y ago

> Tesoro saves linked assets, such as images, Javascript and CSS files.

I'm confused. It looks like image sources in "archived" pages on Tesoro still point back to the origin domain.

Edit: it works as expected. I just didn't notice the relative paths.

agambleOP8y ago

OP here.

The site will rewrite absolute image URLs as relative ones pointing to Tesoro. For example, in the Chicken Teriyaki example on the homepage, the main image is sourced from the relative location "static01.nyt.com/.../28COOKING-CHICKEN-TERIYAKI1-articleLarge.jpg", which looks like it's coming from nytimes.com, but you can check in the Chrome dev console that it isn't.
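A sketch of that kind of rewrite: drop the scheme and keep host plus path as a relative location inside the archive. The handling of query strings (percent-encoding them into the file name) is an assumption made here so that URLs differing only in their query do not collide; `archive_relative_path` and the example URLs are hypothetical.

```python
from urllib.parse import quote, urlsplit

def archive_relative_path(asset_url):
    """Map an absolute asset URL to a relative path inside the archive."""
    parts = urlsplit(asset_url)
    path = parts.netloc + parts.path
    if parts.query:
        # Fold the query into the file name so "?v=1" and "?v=2" stay distinct.
        path += quote("?" + parts.query, safe="")
    return path

print(archive_relative_path("https://static.example.com/images/2017/photo.jpg"))
print(archive_relative_path("https://news.example.com/news.css?abc123"))
```

The first result has no protocol in front of it, which is why such a `src` can look at a glance as if it still points at the origin domain.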

Have you found an example where it isn't working correctly? If so would you mind posting it here and I'll fix it :).

ikreymer8y ago

Unfortunately, this approach alone will only work for sites that are mostly static, e.g. do not use JS to load dynamic content. That is a small (and shrinking) percent of the web. Once JS is involved, all bets are off -- JS will attempt to load content via ajax, or generate new HTML, load iframes, etc. and you will have 'live leaks' where the content seems to be coming from the archive but is actually coming from the live web.

Here is an example from archiving nytimes home page:

https://archive.tesoro.io/665dbeab57a4d57d8140f89cfedc69b5

If you look at network traffic (domain in devtools), you'll see that only a small % is coming from archive.tesoro.io -- the rest of the content is loaded from the live web. This can be misleading and possibly a security risk as well.
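The same check can be scripted once you have the list of request URLs for an archived page (for example from a devtools HAR export). A minimal sketch; the request list below is invented for illustration and is not a measurement of the page linked above.

```python
from urllib.parse import urlsplit

def live_leaks(request_urls, archive_host):
    """Return the requests that were not served by the archive host."""
    return [url for url in request_urls
            if urlsplit(url).hostname != archive_host]

def archived_fraction(request_urls, archive_host):
    """Share of requests served by the archive itself (1.0 means no leaks)."""
    if not request_urls:
        return 1.0
    leaked = len(live_leaks(request_urls, archive_host))
    return 1 - leaked / len(request_urls)

# Hypothetical requests observed while loading one archived page.
requests_seen = [
    "https://archive.tesoro.io/some-archive-id",
    "https://cdn.example.com/app.js",
    "https://api.example.com/v1/headlines",
    "https://ads.example.net/pixel.gif",
]
print(archived_fraction(requests_seen, "archive.tesoro.io"))  # 0.25
for url in live_leaks(requests_seen, "archive.tesoro.io"):
    print("leak:", url)
```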

Not to discourage you, but this is a hard problem and I've been working on it for years now. This area is a moving target, but we think live leaks are mostly eliminated in Webrecorder and pywb, although there are lots of areas to work on to maintain high-fidelity preservation.

If you want to chat about possible solutions or want to collaborate (we're always looking for contributors!), feel free to reach out to us at support [at] webrecorder.io or find my contact on GH.

jdc05898y ago

Nope, you are right. I just missed that there wasn't a protocol on the src I was looking at.

salmonfamine8y ago

Worth noting that Tesoro is the name of a major oil/fuel company in Texas.

NicoJuicy8y ago

When a company went down, I downloaded every one of their clients' sites with httrack and wget. Just to be sure their clients wouldn't lose their site (and some other things).

I wonder what this site uses

pbhjpbhj8y ago

How are you handling copyright infringement? Outside the USA's Fair Use terms this looks like pretty blatant infringement.

iso-8859-18y ago

What is there to handle? You take down stuff when you get an email? Most users of this will be so small that they'll never get noticed. Maybe they won't even be online, how are you going to know you were infringed? Maybe the crawler allows for spoofing the user-agent.

pbhjpbhj8y ago

So, ignoring it basically.

If a person in the UK uses your service you're committing contributory infringement for commercial purposes, AFAICT.

Moreover, the ECD has different protections than DMCA. Particularly getting a takedown notice isn't required.

> Maybe the crawler allows for spoofing the user-agent.

As a tort you only need to get a preponderance of evidence. IP of the crawler that made the copy puts the owner of that IP in court for contributory infringement, no?

If you make copies of parts of BBC sites and serve those copies from your server how is that not copyright infringement by you??

FWIW I like the service and do not like the copyright regime as it stands, particularly how UK law lacks the breadth of liberties of Fair Use.

brianobush8y ago

Probably in the same manner as archive.org

skdotdan8y ago

Nice. How are you planning to pay for the servers? Your service seems quite storage-intensive.

WhiteOwlLion8y ago

I got a dedicated server in France that cost me less than $20 USD/month. 16GB RAM, 1TB storage: https://www.online.net/en/dedicated-server/dedibox-xc

mkroman8y ago

With no redundancy, no backup and no way to extend storage. I'm not sure how you'd archive the internet with low-range dedicated server deals.

j / k navigate · click thread line to collapse

100 comments

JackC8y ago

(Disclaimer - I use bits of Webrecorder for my own archive, perma.cc.)

motdiem8y ago

(Disclaimer: I also do personal archiving stuff with getkumbu)

johnaberlin8y ago

Hi motdiem,

Thank you for seconding the newly updated WAIL. I am the maintainer/creator of the newly update WAIL (the Electron version) https://github.com/N0taN3rd/wail

I was unable to attend IIPC Web Archiving Conference (WAC) but the original creator of WAIL(Python) Mat Kelly did attend (we both are apart of the same research group WSDL).

If you or anyone else have any questions about WAIL I am more than happy to answer them.

amrrs8y ago

Is offline playback still relevant in the age of ubiquitous always connected Internet?

3 more replies

shasheene8y ago

That would allow constant archival of every webpage a user ever visits -- an immutable record over the user's years of crawling the web.

unicornporn8y ago

> That would allow constant archival of every webpage a user ever visits -- an immutable record over the user's years of crawling the web.

This is usually solved by using a proxy: http://netpreserve.org/projects/live-archiving-http-proxy/

1 more reply

ikreymer8y ago

Thanks Jack for mentioning Webrecorder! This is a project I started and it is now part of rhizome.org, a non-profit dedicated to promoting internet-based art and digital culture.

I thought I’d add a few notes here, as there’s a few ways you can use Webrecorder and related tools.

First, Webrecorder supports two distinct modes:

- Native recording mode — http/s traffic goes to through the browser and is rewritten to point to the Webrecorder server (This is the default).

- We also have a desktop player app, Webrecorder Player, available for download from: https://github.com/webrecorder/webrecorderplayer-electron/re...

This is an app that plays back WARCs files (created by Webrecorder and elsewhere), and allows browsing any WARC file offline.

Finally, the core replay/recording tech is actually a separate component, an advanced ‘wayback machine’ being developed in https://github.com/ikreymer/pywb

owenversteeg8y ago

JackC8y ago

I'm not sure how much this would interfere with normal browsing -- it's not a typical usecase.

psteinweber8y ago

This is great and helps me a ton, thanks for mentioning it here.

agambleOP8y ago

Thanks Jack, I hadn't heard of webrecorder before, but I'll check it out. :)

smoyer8y ago

It's not mine unless it's running on my own servers or computer - I created a really rough version of this several years ago that is saved to my computer (and from there into box).

pbhjpbhj8y ago

gkya8y ago

Would you mind posting it?

kchr8y ago

Please share!

dddw8y ago

indeed interested in this too

flippant8y ago

I wrote a similar tool which uses Electron to create PDFs of webpages and bookmark them in a SQLite database.

https://github.com/marvelm/erised
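
For readers curious what the bookmarking side of such a tool might look like, here is a minimal sketch of a SQLite table that maps each URL to a saved PDF. The table layout and the `open_db` / `add_bookmark` names are assumptions made for illustration, not erised's actual schema, and the PDF rendering step itself is left out.

```python
import sqlite3
from datetime import datetime, timezone


def open_db(path=":memory:"):
    """Open (or create) the bookmark database. Hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS bookmarks (
               url      TEXT PRIMARY KEY,
               title    TEXT,
               pdf_path TEXT NOT NULL,
               saved_at TEXT NOT NULL
           )"""
    )
    return conn


def add_bookmark(conn, url, title, pdf_path):
    """Record where the PDF snapshot of `url` was written."""
    saved_at = datetime.now(timezone.utc).isoformat()
    # INSERT OR REPLACE keeps one row per URL, pointing at the newest PDF.
    conn.execute(
        "INSERT OR REPLACE INTO bookmarks (url, title, pdf_path, saved_at) "
        "VALUES (?, ?, ?, ?)",
        (url, title, pdf_path, saved_at),
    )
    conn.commit()


conn = open_db()
add_bookmark(conn, "https://example.com/", "Example", "pdfs/example.pdf")
rows = conn.execute("SELECT url, pdf_path FROM bookmarks").fetchall()
print(rows)
```

Keying on the URL means re-saving a page overwrites the earlier row; a tool that wants a history of snapshots would key on (url, saved_at) instead.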

agumonkey8y ago

Could be a browser plugin

ps: nice project btw (thanks in advance)

Piskvorrr8y ago

That's just as much "my own" as The Internet Archive: a website Out There somewhere. Worse, it's much more likely to rot and disappear than archive.org. Now, if I could run this locally...

(Yes, yes, `wget --convert-links`, I know. Not quite as convenient, though.)
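
For anyone who hasn't used it that way, a fuller version of that wget invocation might look like the sketch below. It only builds the command; the `wget_mirror_command` helper and the `archive` output directory are assumptions for illustration, while the flags themselves are standard GNU wget options.

```python
import shlex


def wget_mirror_command(url, out_dir="archive"):
    """Build a GNU wget command that saves one page for offline viewing."""
    return [
        "wget",
        "--page-requisites",   # also fetch the images, CSS and JS the page needs
        "--convert-links",     # rewrite links so the local copy works offline
        "--adjust-extension",  # add .html/.css suffixes where they are missing
        "--span-hosts",        # allow requisites hosted on other domains (CDNs)
        "--directory-prefix", out_dir,  # hypothetical output directory
        url,
    ]


cmd = wget_mirror_command("https://example.com/article")
print(shlex.join(cmd))
# To actually run it (requires wget on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```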

agambleOP8y ago

Do you think being able to download the archive locally would be useful?

ia_user8y ago

This exists for (& from) the Internet Archive!

Firefox: https://addons.mozilla.org/en-US/firefox/addon/wayback-machi...

Chrome: https://chrome.google.com/webstore/detail/wayback-machine/fp...

Safari: https://safari-extensions.apple.com/details/?id=archive.org....

Android: https://play.google.com/store/apps/details?id=com.archive.wa...

iOS: https://itunes.apple.com/us/app/wayback-machine/id1201888313

dsacco8y ago

As it is now, I personally wouldn't use it (but it's a cool project, definitely please keep working on this idea!).

dschep8y ago

As another top-level commenter asked: why build or use this instead of archive.is? There are already multiple Chrome extensions available for it ;)

I agree with GP here that anything billed as "My own internet archive" should run on my computer, not someone else's.

johnaberlin8y ago

Hi agamble,

You can do just that via WARCreate: https://chrome.google.com/webstore/detail/warcreate/kenncghf... (http://warcreate.com).

I am a core contributor to this project on GitHub (https://github.com/machawk1/warcreate) and the maintainer/creator of the latest version of WAIL. So I am not biased in any way ;)

detaro8y ago

You can trigger the Internet Archive manually as well.

jtrip8y ago

rathish_g8y ago

Good work. For research and citation purposes, a permalink is needed outside the source domain, one that can be trusted and will stay up for decades.

unicornporn8y ago

https://github.com/webrecorder/webrecorder can be run using Docker. There are also plenty of proxies that can save your browsing. See: http://netpreserve.org/projects/live-archiving-http-proxy/

WhiteOwlLion8y ago

Have you looked at WorldBrain? It is a fork of Falcon, but it keeps a cache and lets you perform keyword searches against the cached content.

j_s8y ago

This would also be a way to detect when different users are being served different content at the same url, thus the need for a global network of validators.

[1] https://proofofexistence.com/

rjeli8y ago

Interesting - it is trivial to prove something was done today rather than yesterday, by hashing with the most recent bitcoin block or some new info.

Is it possible to prove something was done in the past? All I can think of is some sort of scheme involving destroyed information.

j_s8y ago

> trivial to prove something was done today

[1] simitator.com - not linking because ads felt a bit extra-sketch!

naiveattack8y ago

This is interesting.

Block chain with comments.

an278y ago

Isn't it the other way around?

I can prove I had today's papers today, but once I've seen it I can prove it any day. So you can say "this information existed at day X or earlier".
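
Both directions discussed in this subthread can be sketched with plain hashing. This is a minimal sketch, not how proofofexistence.com is implemented: the `digest` and `bind_to_beacon` names are made up for illustration, and the beacon value below is a placeholder rather than a real block hash.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 digest of the data. Publishing this somewhere public and
    append-only on day X shows the data existed on day X or earlier."""
    return hashlib.sha256(data).hexdigest()


def bind_to_beacon(data: bytes, beacon: bytes) -> str:
    """Mix in a value nobody could have known before it appeared (for
    example the latest block hash). The resulting commitment cannot have
    been computed before the beacon existed."""
    return hashlib.sha256(beacon + hashlib.sha256(data).digest()).hexdigest()


page = b"<html>archived page</html>"
beacon = bytes.fromhex("00" * 32)  # placeholder, NOT a real block hash

not_after = digest(page)                   # "existed at day X or earlier"
not_before = bind_to_beacon(page, beacon)  # "created after the beacon"
print(not_after)
print(not_before)
```

Note the asymmetry: the first digest only becomes evidence once it is published somewhere others can date it, while the second is only as convincing as the unpredictability of the beacon.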

unicornporn8y ago

In what way could this be considered “your own internet archive”? I see no way to register a user and save pages to a collection.

If you really want to create your own archive, set up a Live Archiving HTTP Proxy[1], run SquidMan [2] or check out WWWOFFLE[3].

If you want something simpler, have a look at Webrecorder[4] or a paid Pinboard account with the “Bookmark Archive”[5].

[1] http://netpreserve.org/projects/live-archiving-http-proxy/

[2] http://squidman.net/squidman/index.html

[3] http://www.gedanken.org.uk/software/wwwoffle/

[4] https://webrecorder.io/

[5] https://pinboard.in/upgrade/

agambleOP8y ago

Great points.

You're right, for now it's a single rate-limited HTML form and you'll have to manually collate the links to the archives you create. I'll be adding specialty features (with accounts) next. :)

falcolas8y ago

Another pair of even simpler solutions:

Print and store pages as PDFs.

Download and save entire pages as webarchives (Safari, wget)

rahiel8y ago

[1]: http://blog.archive.is/post/72136308644/how-much-does-it-cos...

venning8y ago

Thoughts:

I like the look. Very clean. I like how fast it's responding; better than archive.org (though, obviously, they have different scaling problems).

The "Archive!" link probably shouldn't work if there's nothing in the URL box. It just gives me an archive link that errors. Example: [1]
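
A server-side check along these lines would reject the empty case before an archive link is ever created. This is only a sketch of the idea: `is_archivable_url` is a hypothetical name, and the same logic would need to be written in whatever language the backend actually uses.

```python
from urllib.parse import urlsplit


def is_archivable_url(raw: str) -> bool:
    """Return True only for non-empty http(s) URLs that name a host."""
    candidate = raw.strip()
    if not candidate:
        return False  # empty URL box: nothing to archive
    try:
        parts = urlsplit(candidate)
    except ValueError:
        return False  # malformed input such as an unclosed IPv6 bracket
    return parts.scheme in ("http", "https") and bool(parts.netloc)


for value in ["", "   ", "example.com", "ftp://example.com/x", "https://example.com/"]:
    print(repr(value), is_archivable_url(value))
```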

All in all, nice Show HN.

EDIT: You also may want to adjust the header to work properly on mobile devices. Still though, nice job. Sorry if I'm sounding critical.

[1] https://archive.tesoro.io/320b55cc9b78e271c94716ee23554da8

[2] https://archive.tesoro.io/a7bf03e247224bc3b4e5a7c1f2ad42b1

agambleOP8y ago

Thanks! These are great comments - I'll look into the issue with saving Hacker News CSS + JS.

bfirsh8y ago

I know a lot of these sites have archiving features, but want something centralised and automatic.

donpdonp8y ago

The IndieWeb(.org) group recommends 'Publish (on your) Own Site, Syndicate Elsewhere' (https://indieweb.org/POSSE) that you might find interesting.

aethertron8y ago

The hypothetical system that makes most sense to me for this: a process that runs 24/7 on a server, watching your feeds on those services. Grabbing and saving everything via APIs or screen-scraping.

fiatjaf8y ago

Is that creepy, resource-eating, bug-prone service really what makes the most sense to you?

akerro8y ago

Nice, post it on https://www.reddit.com/r/DataHoarder/

They will love it!

zippoxer8y ago

Cool tool, but by using it, you depend on it staying alive for longer than any page you archive on it.

burkemw38y ago

IPFS [0] and its sibling Filecoin [1] are working in a space very similar to your wonderings.

[0]: https://ipfs.io/ [1]: https://filecoin.io/

j_s8y ago

The discussion 3 months ago on bookmarks mentioned several options for archiving pages (some locally): Ask HN: Do you still use browser bookmarks? | https://news.ycombinator.com/item?id=14064096

extensions: Firefox "Print Edit" Addon / Firefox Scrapbook X / Chrome Falcon / Firefox Recoll

open source: Zotero / WorldBrain / Wallabag

commercial: Pinboard / InstaPaper / Pocket / Evernote / Mochimarks / Diigo / PageDash / URL Manager Pro / Save to Google / OneNote / Stash / Fetching

public: http://web.archive.org / https://archive.is/

idlewords8y ago

You're going to get this service shut down if you let anonymous people republish arbitrary content while running everything on Google.

I (obviously) think personal archives are a great idea, but republishing is a hornets' nest.

Retr0spectrum8y ago

Is this any different to archive.is?

If I want my own archive, Ctrl+S in Firefox usually works fine for me.

crispytx8y ago

agambleOP8y ago

OP here. Thanks! Could you point me to the pages where it worked well for you vs archive.org?

zichy8y ago

So this is like archive.is, but I can't search through archived sites?

CM308y ago

As it is, while it's a nice service, it still has all the issues of the other archive services:

1. It's online only, so one failed domain renewal or hosting payment takes everything offline.

2. It being online also means I can't access any saved pages if my connection goes down or has issues.

3. The whole thing is wide open to having content taken down by websites wanting to cover their tracks. I mean, what do you do if someone tells you to remove a page? What about with a DMCA notice?

It's a nice alternative to archive.is, but still doesn't really do what the title suggests if you ask me.

jpalomaki8y ago

This might be a good use case for distributed storage (IPFS?).

johnaberlin8y ago

Hi jpalomaki,

Have you heard of InterPlanetary Wayback (ipwb)? https://github.com/oduwsdl/ipwb

InterPlanetary Wayback (ipwb) facilitates permanence and collaboration in web archives by disseminating the contents of WARC files into the IPFS network.

dbz8y ago

[1] https://chrome.google.com/webstore/detail/cmmlgikpahieigpccl...

agambleOP8y ago

OP here.

Yup, API and chrome extension are next on the feature list. :)

prirun8y ago

agambleOP8y ago

OP here.

Great point. Right now this is just a single rate-limited HTML form to gauge interest. Next is to build specialty features that are worth paying for and make this sustainable. :)

crispytx8y ago

Haters gonna hate.

gorbachev8y ago

You should try and rewrite relative links in websites that get archived. I tested your app with a news site, and all the links go to archive.tesoro.io/sites/internal/url/structure/article.html

I also second the need for user accounts. If I am to use your site as my personal archive, then I would need to log in and create a collection of my own archived sites.
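
On the relative-link point, one possible fix is to resolve every `href` and `src` against the original page URL at archive time. The sketch below assumes that approach and uses a regular expression for brevity; the `absolutize_links` name is made up, and a real rewriter would more likely use an HTML parser and might point links at archived copies instead of the live site.

```python
import re
from urllib.parse import urljoin

# Simplified matcher for quoted href/src attributes (an assumption for this
# sketch; it ignores unquoted attributes, srcset, and URLs inside CSS or JS).
ATTR_RE = re.compile(
    r'''(?P<attr>\b(?:href|src))=(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)


def absolutize_links(html: str, page_url: str) -> str:
    """Rewrite relative href/src values to absolute URLs on the origin site."""

    def fix(match):
        # urljoin handles "/path", "../path", and protocol-relative "//host/path".
        absolute = urljoin(page_url, match.group("url"))
        return f'{match.group("attr")}={match.group("q")}{absolute}{match.group("q")}'

    return ATTR_RE.sub(fix, html)


snippet = (
    '<a href="/news/article.html">x</a>'
    '<img src="//cdn.example.com/a.png">'
    '<a href="https://other.org/">y</a>'
)
rewritten = absolutize_links(snippet, "https://news.example.com/section/index.html")
print(rewritten)
```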

arkenflame8y ago

I made a simple Chrome extension to automatically save local copies of pages you bookmark, if you prefer that instead: https://chrome.google.com/webstore/detail/backmark-back-up-t...

lozzo8y ago

it would be nice to have a bit of explanation on how it works and why we can be confident that we can rely upon it

agambleOP8y ago

OP here. Definitely, great idea :)

Briefly: Sites are archived using a system written in Golang and uploaded to a Google Cloud bucket.

The whole thing is hosted on GCP Container Engine and I deploy with Kubernetes.

I'll write up a more comprehensive blog post in some time, which portion of this would you like to hear more about?

19eightyfour8y ago

The issue is cost. Your costs are disk space for people's archives, instances for people's use, and bandwidth for the fetches and crawls and access.

How can you pay for this if it's free? It's unreliable unless it's financially viable.

Faaak8y ago

How about avoiding redundancies? Are identical CSS files cached twice or referenced by their hash?

The page URI is a bit obscure, though. I think tesoro.io/example.tld/page/foobar/timestamp would look good.

What about big media content and/or small differences between them?
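
To make the "referenced by their hash" idea concrete, here is a minimal sketch of a content-addressed store in which identical bytes are kept once, however many archived pages use them. Whether Tesoro does anything like this is unknown; the `AssetStore` class and its fields are hypothetical.

```python
import hashlib


class AssetStore:
    """Content-addressed store: identical bytes are kept once, keyed by hash."""

    def __init__(self):
        self.blobs = {}  # sha256 hex digest -> asset bytes
        self.pages = {}  # page URL -> {asset URL: digest}

    def put(self, page_url, asset_url, data):
        """Store an asset for a page and return the digest that references it."""
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)  # no-op if these bytes are already stored
        self.pages.setdefault(page_url, {})[asset_url] = key
        return key


store = AssetStore()
css = b"body { margin: 0 }"
k1 = store.put("https://example.com/a", "https://example.com/site.css", css)
k2 = store.put("https://example.com/b", "https://example.com/site.css", css)
print(k1 == k2, len(store.blobs))
```

Whole-file hashing only deduplicates exact copies; files that differ by a few bytes would still be stored in full unless the store also chunks or diffs them.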

jdc05898y ago

> Tesoro saves linked assets, such as images, Javascript and CSS files.

I'm confused. It looks like image sources in "archived" pages on Tesoro still point back to the origin domain.

Edit: it works as expected. I just didn't notice the relative paths.

agambleOP8y ago

OP here.

Have you found an example where it isn't working correctly? If so would you mind posting it here and I'll fix it :).

ikreymer8y ago

Here is an example from archiving nytimes home page:

https://archive.tesoro.io/665dbeab57a4d57d8140f89cfedc69b5

If you want to chat about possible solutions or want to collaborate (we're always looking for contributors!), feel free to reach out to us at support [at] webrecorder.io or find my contact on GH.

jdc05898y ago

Nope, you are right. I just missed that there wasn't a protocol on the src I was looking at.

salmonfamine8y ago

Worth noting that Tesoro is the name of a major oil/fuel company in Texas.

NicoJuicy8y ago

When a company went down, I downloaded every one of their clients' sites with httrack and wget, just to be sure their clients wouldn't lose their site (and some other things).

I wonder what this site uses

pbhjpbhj8y ago

How are you handling copyright infringement? Outside the USA's Fair Use terms, this looks like pretty blatant infringement.

iso-8859-18y ago

pbhjpbhj8y ago

So, ignoring it basically.

If a person in the UK uses your service you're committing contributory infringement for commercial purposes, AFAICT.

Moreover, the ECD has different protections than the DMCA. In particular, receiving a takedown notice isn't required.

>Maybe the crawler allows for spoofing the user-agent. //

As a tort you only need to get a preponderance of evidence. IP of the crawler that made the copy puts the owner of that IP in court for contributory infringement, no?

If you make copies of parts of BBC sites and serve those copies from your server how is that not copyright infringement by you??

FWIW I like the service and do not like the copyright regime as it stands, particularly how UK law lacks the breadth of liberties of Fair Use.

brianobush8y ago

Probably in the same manner as archive.org

skdotdan8y ago

Nice. How are you planning to pay for the servers? Your service seems quite storage-intensive.

WhiteOwlLion8y ago

I got a dedicated server in France that cost me less than $20 USD/month. 16GB RAM, 1TB storage: https://www.online.net/en/dedicated-server/dedibox-xc

mkroman8y ago

With no redundancy, no backup and no way to extend storage. I'm not sure how you'd archive the internet with low-range dedicated server deals.
