@@ 0,0 1,98 @@
++++
+title = "How to Recover a Deleted Website's Content"
++++
+
+**Disclaimer: You should carefully consult the terms of service of the Internet
+Archive and the TOS and copyright of any site you try this on. I make no
+promises as to the legality of this for any particular website.**
+
+**This article is intended for people pretty comfortable with the command line.
+Should you need help, please <a target="_blank" href="mailto:chris@vittal.dev">email me</a> and I
+will do what I can to assist.**
+
+Data on the internet is often ephemeral. Often you can't find what you were
+looking for because it has been moved or deleted. Most of the time people
+should be able to be forgotten, but sometimes there are important reasons to
+try and save information that people have deleted.
+
+[The Internet Archive] and [the Wayback Machine] are invaluable resources that
+work to save our digital history. By searching the Wayback Machine you can find
+all kinds of information that may have been lost for any number of reasons,
+like hosting lapsing or a company going under. But the Wayback Machine itself
+doesn't let you download an entire site. I found a pretty neat tool, the
+[Wayback Machine Downloader], that can fetch an entire site, so even if the
+site disappears from the Internet Archive, we will still have a record of it.
+
+[The Internet Archive]: https://archive.org
+[the Wayback Machine]: https://archive.org/web/
+[Wayback Machine Downloader]: https://github.com/hartator/wayback-machine-downloader
+
+To install the downloader, we can use `gem install` (shown after the help
+output below), or we can run the script directly from a clone of the
+repository:
+
+```txt
+$ git clone https://github.com/hartator/wayback-machine-downloader
+$ cd wayback-machine-downloader
+$ ruby bin/wayback_machine_downloader
+Usage: wayback_machine_downloader http://example.com
+
+Download an entire website from the Wayback Machine.
+
+Optional options:
+ -d, --directory PATH Directory to save the downloaded files into
+ Default is ./websites/ plus the domain name
+ -s, --all-timestamps Download all snapshots/timestamps for a given
+ website
+ -f, --from TIMESTAMP Only files on or after timestamp supplied
+ (ie. 20060716231334)
+ -t, --to TIMESTAMP Only files on or before timestamp supplied
+ (ie. 20100916231334)
+ -e, --exact-url Download only the url provied and not the full site
+ -o, --only ONLY_FILTER Restrict downloading to urls that match this filter
+ (use // notation for the filter to be treated as
+ a regex)
+ -x, --exclude EXCLUDE_FILTER Skip downloading of urls that match this filter
+ (use // notation for the filter to be treated
+ as a regex)
+ -a, --all Expand downloading to error files (40x and 50x)
+ and redirections (30x)
+ -c, --concurrency NUMBER Number of multiple files to dowload at a time
+ Default is one file at a time (ie. 20)
+ -p, --maximum-snapshot NUMBER Maximum snapshot pages to consider (Default is 100)
+ Count an average of 150,000 snapshots per page
+ -l, --list Only list file urls in a JSON format with the
+ archived timestamps, won't download anything
+ -v, --version Display version
+```
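+
+If you would rather install the gem than run from a clone of the repository,
+something like this should work (assuming a working Ruby and RubyGems setup):
+
+```txt
+$ gem install wayback_machine_downloader
+```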
+
+Then it is simple enough to run `wayback_machine_downloader [site]`. If a site
+has been deleted, we can use the `-t` option to fetch an older version of it,
+for example `-t 20200601` to fetch the latest version of each page from on or
+before June 1, 2020.
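+
+A full, hypothetical invocation, using the `-t` and `-d` options documented in
+the help output above, might look like:
+
+```txt
+$ wayback_machine_downloader http://example.com -t 20200601 -d example-com-backup
+```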
+
+Sites can be very, very large. They can be compressed with any standard tool:
+tar and _insert compression utility here_, zip, 7zip, the list goes on.
+Another way to cut the size down is to convert the HTML into other forms, like
+plain text. Utilities that can do this include [`html2text`], or
+[this](https://git.sr.ht/~sircmpwn/aerc/blob/master/filters/html) little script
+that's part of the [aerc] email program, which I also replicate below.
+
+```txt
+#!/bin/sh
+# aerc filter which runs w3m using socksify (from the dante package) to prevent
+# any phoning home by rendered emails
+export SOCKS_SERVER="127.0.0.1:1"
+exec socksify w3m \
+ -T text/html \
+ -cols $(tput cols) \
+ -dump \
+ -o display_image=false \
+ -o display_link_number=true
+```
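+
+Putting the pieces together, here is a minimal sketch of the compression and
+text-conversion steps described above, assuming the site was downloaded into
+the default `./websites/example.com` directory and that the `html2text`
+package is installed:
+
+```txt
+$ tar -czf example.com.tar.gz websites/example.com
+$ html2text websites/example.com/index.html > index.txt
+```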
+
+I hope this helps us better preserve our collective memory of the online
+spaces we share, so that we do not forget what is important to us, even when
+that information tries to hide from us.
+
+[`html2text`]: https://pypi.org/project/html2text/
+[aerc]: https://aerc-mail.org/
@@ 1,6 1,6 @@
<footer id=copyright>
<hr>
-© 2019 Christopher Vittal.
+© 2019-2020 Christopher Vittal.
The content on this site (unless otherwise specified) is licensed under
<a rel="license" target="_blank" href="https://creativecommons.org/licenses/by-sa/4.0/">
CC-BY-SA 4.0</a>.