~cdv/chris.vittal.dev

6930bd029e9320b5e93ea109365b911170f3b476 — Chris Vittal 4 years ago 09f0c96 master
Post on downloading a deleted website.
2 files changed, 99 insertions(+), 1 deletions(-)

A content/posts/2020-06-23_howto-get-deleted-website.md
M templates/footer.html
A content/posts/2020-06-23_howto-get-deleted-website.md => content/posts/2020-06-23_howto-get-deleted-website.md +98 -0
@@ 0,0 1,98 @@
+++
title = "How to Recover a Deleted Website's Content"
+++

**Disclaimer: You should carefully consult the terms of service of the Internet
Archive and the TOS and copyright of any site you try this on. I make no
promises as to the legality of this for any particular website.**

**This article is intended for people pretty comfortable with the command line.
Should you need help, please <a target="_blank" href="mailto:chris@vittal.dev">email me</a> and I
will do what I can to assist.**

Data on the internet is often ephemeral. That's just how it is. Often you can't
find what you were looking for because it has been moved or deleted. Most of the
time people should be able to be forgotten, but sometimes there are important
reasons why we need to try and save information that people have deleted.

[The Internet Archive] and [the Wayback Machine] are invaluable resources that
work to save our digital history. By searching the Wayback Machine you can find
all kinds of information that might otherwise have been lost for any number of
reasons, like hosting lapsing or a company going under. But the Wayback Machine
alone still doesn't let you download an entire site. I found a pretty neat tool,
the [Wayback Machine Downloader], that lets you fetch an entire site, so even if
the site disappears from the Internet Archive, we will still have a record of it.

[The Internet Archive]: https://archive.org
[the Wayback Machine]: https://archive.org/web/
[Wayback Machine Downloader]: https://github.com/hartator/wayback-machine-downloader

To install the downloader, we can just use `gem install`, or the script can be
run directly.
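If you go the `gem` route, installation is a one-liner (the gem is published
under the same name as the script):

```txt
$ gem install wayback_machine_downloader
```

Running it straight from a checkout instead looks like this, and also prints the
full usage: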

```txt
$ git clone https://github.com/hartator/wayback-machine-downloader
$ cd wayback-machine-downloader
$ ruby bin/wayback_machine_downloader
Usage: wayback_machine_downloader http://example.com

Download an entire website from the Wayback Machine.

Optional options:
    -d, --directory PATH             Directory to save the downloaded files into
                                     Default is ./websites/ plus the domain name
    -s, --all-timestamps             Download all snapshots/timestamps for a given
                                       website
    -f, --from TIMESTAMP             Only files on or after timestamp supplied
                                       (ie. 20060716231334)
    -t, --to TIMESTAMP               Only files on or before timestamp supplied
                                       (ie. 20100916231334)
    -e, --exact-url                  Download only the url provied and not the full site
    -o, --only ONLY_FILTER           Restrict downloading to urls that match this filter
                                       (use // notation for the filter to be treated as
                                        a regex)
    -x, --exclude EXCLUDE_FILTER     Skip downloading of urls that match this filter
                                       (use // notation for the filter to be treated
                                        as a regex)
    -a, --all                        Expand downloading to error files (40x and 50x)
                                       and redirections (30x)
    -c, --concurrency NUMBER         Number of multiple files to dowload at a time
                                     Default is one file at a time (ie. 20)
    -p, --maximum-snapshot NUMBER    Maximum snapshot pages to consider (Default is 100)
                                     Count an average of 150,000 snapshots per page
    -l, --list                       Only list file urls in a JSON format with the
                                       archived timestamps, won't download anything
    -v, --version                    Display version
```

Then it is simple enough to run with `wayback_machine_downloader [site]`. If a
site has been deleted, we can use the `-t` option to fetch an older version of
it; for example, `-t 20200601` fetches the latest version on or before June 1,
2020.
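
Putting that together, a typical run might look something like this. Here
`example.com` stands in for whatever site you are recovering; per the usage
above, the files land under `./websites/` plus the domain name.

```txt
$ wayback_machine_downloader http://example.com -t 20200601
$ ls websites/example.com
```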

Sites can be very, very large. The download can be compressed using any standard
tool like tar and _insert compression utility here_, or zip, or 7zip, the list
goes on. Another way to deal with this is to filter the HTML into other forms,
like plain text. Utilities that can do this include [`html2text`], or
[this](https://git.sr.ht/~sircmpwn/aerc/blob/master/filters/html) little script
that's part of the [aerc] email program, which I also replicate below, followed
by a quick sketch of both approaches.

```txt
#!/bin/sh
# aerc filter which runs w3m using socksify (from the dante package) to prevent
# any phoning home by rendered emails
export SOCKS_SERVER="127.0.0.1:1"
exec socksify w3m \
	-T text/html \
	-cols $(tput cols) \
	-dump \
	-o display_image=false \
	-o display_link_number=true
```
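
As promised, a quick sketch of both approaches, assuming the download landed in
`websites/example.com` and that `html2text` is installed (for example via
`pip install html2text`):

```txt
$ # archive the raw download; tar + xz here, but any compressor will do
$ tar -cJf example.com.tar.xz websites/example.com
$ # or flatten every page to plain text with html2text
$ find websites/example.com -name '*.html' \
      -exec sh -c 'html2text "$1" > "${1%.html}.txt"' _ {} \;
```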

I hope this helps us better preserve our collective memory of the online spaces
we share, so that we do not forget what is important to us, even when
information tries to hide from us.

[`html2text`]: https://pypi.org/project/html2text/
[aerc]: https://aerc-mail.org/

M templates/footer.html => templates/footer.html +1 -1
@@ 1,6 1,6 @@
<footer id=copyright>
    <hr>
-  &copy; 2019 Christopher Vittal.
+  &copy; 2019-2020 Christopher Vittal.
  The content on this site (unless otherwise specified) is licensed under
  <a rel="license" target="_blank" href="https://creativecommons.org/licenses/by-sa/4.0/">
    CC-BY-SA 4.0</a>.