~gpanders/garchive

0e6ed22a553a3e831a9255b8ae3cb52f47298584 — Greg Anders 1 year, 7 months ago 9e136c9
Wait for wget jobs to complete

Have the `fetch` script wait for all wget jobs to complete instead of
forking them all immediately. This provides a mechanism to do some kind
of post-processing once the archive is generated.

Note that in order to generate the list of PIDs we have to redirect the
output of awk to a temp file and then read that file in the while loop,
instead of simply piping awk into the while loop directly. Each command
in a pipeline runs in a subshell, so variables modified within a
pipeline do not retain their values after the pipeline exits. See
SC2031 [1] for more information.

[1]: https://github.com/koalaman/shellcheck/wiki/SC2031
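The subshell behavior described above is easy to demonstrate on its own. A minimal standalone sketch (not part of `fetch`; the `count` variable is hypothetical), assuming a shell such as dash or bash where every pipeline segment runs in a subshell:

```shell
#!/bin/sh
# Broken: the while loop runs in a subshell, so the increment is lost
# once the pipeline ends (ShellCheck SC2031).
count=0
printf 'a\nb\nc\n' | while IFS= read -r line; do
    count=$((count + 1))
done
echo "after pipeline: $count"   # prints 0, not 3

# Working: write to a temp file and redirect it into the loop, which
# now runs in the current shell -- the same fix `fetch` uses.
tmp="$(mktemp)"
printf 'a\nb\nc\n' > "$tmp"
count=0
while IFS= read -r line; do
    count=$((count + 1))
done < "$tmp"
rm -f "$tmp"
echo "after redirect: $count"   # prints 3
```

(Some shells, e.g. zsh and ksh, run the last pipeline segment in the current shell, in which case the first loop would also count correctly.)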
1 file changed, 17 insertions(+), 3 deletions(-)

M bin/fetch
M bin/fetch => bin/fetch +17 -3
@@ -5,21 +5,35 @@ if [ $# -lt 2 ]; then
     exit 1
 fi
 
-awk -F'\t' '{print $2}' "$1" | while IFS= read -r url; do
+links="$(realpath "$1")"
+cd "$2" || exit 1
+
+tmp="$(mktemp)"
+awk -F'\t' '{print $2}' "$links" > "$tmp"
+
+pids=""
+while IFS= read -r url; do
     wget \
         --adjust-extension \
         --timestamping \
         --span-hosts \
         --background \
         --convert-links \
         --page-requisites \
         --directory-prefix="$2" \
         --continue \
         --quiet \
         --append-output=wget.log \
         --wait=1 \
         --random-wait \
         --user-agent="" \
         --execute robots=off \
         "$url"
-done
+    pids="$! $pids"
+done < "$tmp"
+
+for pid in $pids; do
+    wait "$pid"
+done
+
+rm -f "$tmp"
 
 echo "Done!"
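The fork-then-wait mechanism this commit introduces can be sketched in isolation. In this standalone sketch `sleep` stands in for the download command, and jobs are started with the shell's `&` operator so that `$!` refers to the job just launched (note that wget's `--background` flag makes wget daemonize itself rather than run as a shell background job, in which case `$!` would not refer to the download process):

```shell
#!/bin/sh
# Start each job in the background, record its PID from $!, then wait
# for every recorded PID before doing any post-processing.
pids=""
for delay in 1 1 1; do
    sleep "$delay" &          # stand-in for a backgrounded download
    pids="$! $pids"
done

# Waiting on each PID individually (rather than a bare `wait`) also
# surfaces each job's exit status, should the script want to check it.
for pid in $pids; do
    wait "$pid"
done

echo "all jobs finished"      # safe point for post-processing
```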