~marnold128/web2text

Makes the web into text

# Snarfbot

This will eventually be a web crawler that saves websites as plain-text files. For now, please enjoy a few CLI tools, written as a proof of concept. Comments, compliments, complaints, and pull requests accepted.

# web2text

A command-line tool that does exactly what it says on the tin: extract the content of a web document to plain text, with a choice of two scraping engines.

The scrape command attempts extraction with Newspaper3k, which produces a clean text file and tries to filter out things like comment sections, page navigation links, and so forth. However, it may truncate long pages, has trouble with some JavaScript navigation elements, and uses a fairly obvious user agent that may be blocked or rate-limited by some sites.
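
For illustration, here is a minimal sketch of the kind of Newspaper3k extraction the scrape command relies on; the URL is a placeholder, and this is not the tool's actual source:

```python
from newspaper import Article

# Placeholder URL; substitute the page you want to scrape.
url = "https://example.com/some-article"

article = Article(url)
article.download()   # fetch the page (Newspaper3k sends its own, recognizable user agent)
article.parse()      # extract title and body text, dropping nav and comment cruft

print(article.title)
print(article.text)  # long pages may come back truncated
```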

The uglydump command dumps the contents of a page with minimal filtering, using a spoofed user agent by default. You may get JavaScript source and style information in your output, but minimal filtering was chosen in order not to lose potentially important data. The default user agent is a reasonably current version of Firefox on Ubuntu (X11).
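
A rough sketch of the uglydump approach, assuming requests and BeautifulSoup; the user-agent string below is illustrative, not the tool's actual default:

```python
import requests
from bs4 import BeautifulSoup

# Illustrative spoofed user agent: Firefox on Ubuntu under X11.
UA = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

resp = requests.get("https://example.com", headers={"User-Agent": UA}, timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# Minimal filtering: get_text() keeps everything in the markup, so the
# contents of <script> and <style> tags can show up in the dump.
print(soup.get_text(separator="\n"))
```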