Bumping version to 3.0.0
Adding licence so that sourcehut stops screaming at me
Migrating from Github to sourcehut
This section is about the motivation behind the project. See the tech stack section for implementation details.
Some stories on HN are frustrating and time-consuming for dubious value. I believe there are other people who would also like to see less of certain types of content, hence Suckless HN.
A filter is given the story data and flags the story if it passes the filter.

Each filter has two landing pages: one with only the stories which were flagged, and one with everything but. This is decided by two modifiers: + and -. For example, to only see stories from large newspapers visit https://sucklesshn.porkbrain.com/+bignews. To get HN without large newspapers visit https://sucklesshn.porkbrain.com/-bignews.
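To illustrate, here is a minimal sketch of what a filter could look like in Rust. The names (Story, Filter, BigTech) are hypothetical and not necessarily what the binary actually uses:

```rust
/// Hypothetical story data passed to a filter; the real struct holds more
/// fields (url, HN id, etc.).
struct Story {
    title: String,
}

/// A filter inspects a story and flags it if it matches.
trait Filter {
    /// Name used in the page path, e.g. "bignews" or "amfg".
    fn name(&self) -> &'static str;
    fn flags(&self, story: &Story) -> bool;
}

/// Example filter: flag titles mentioning big tech.
struct BigTech;

impl Filter for BigTech {
    fn name(&self) -> &'static str {
        "amfg"
    }

    fn flags(&self, story: &Story) -> bool {
        ["Google", "Facebook", "Apple", "Microsoft"]
            .into_iter()
            .any(|company| story.title.contains(company))
    }
}

fn main() {
    let story = Story {
        title: "Google announces yet another chat app".to_string(),
    };
    println!("+{} flags it: {}", BigTech.name(), BigTech.flags(&story));
}
```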
There are also groups of filters. For example https://sucklesshn.porkbrain.com/-amfg-bignews filters out large newspapers and all mentions of big tech. This also happens to be the default view on the homepage. The - modifier in a group is conjunctive, i.e. only stories which didn't pass any of the filters are shown. The + modifier is disjunctive, i.e. stories which passed any of the filters are shown. For example sucklesshn.porkbrain.com/+askhn+showhn shows "Show HN" or "Ask HN" stories.
List of implemented filters:
- +bignews/-bignews flags urls from large news sites: Bloomberg, VICE, The Guardian, WSJ, CNBC, BBC, Forbes, Spectator, LA Times, The Hill and NY Times. More large news sites may be added later. Any general news website which has ~60 submissions (2 pages) in the past year falls into this category. HN search query: https://hn.algolia.com/?dateRange=pastYear&page=2&prefix=true&sort=byPopularity&type=story&query=${DOMAIN}
- +amfg/-amfg flags titles which mention "Google", "Facebook", "Apple" or "Microsoft". No more endless Google-bashing comment binging at 3 AM. Most of the time the submissions are scandalous and the comment sections low entropy but addictive.
- special +all front page which includes all HN top stories
List of filter groups:
Filters in a group are sorted alphabetically in ascending order.
The binary is supposed to be executed periodically (roughly every 30 minutes). Each generated page is an S3 object, so we don't need to provision a web server.
An sqlite database stores the ids of top HN posts that have already been downloaded, plus some other data (timestamp of insertion, submission title, url, and which filters the story passed).
The endpoint to query top stories on HN is https://hacker-news.firebaseio.com/v0/topstories.json. We download stories which we haven't checked before. The data about a story is available via the item endpoint.
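As a sketch of that flow, assuming the reqwest crate (blocking and json features) plus serde_json; the binary may well use a different HTTP client:

```rust
use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Ids of the current top stories, most popular first.
    let ids: Vec<u64> =
        reqwest::blocking::get("https://hacker-news.firebaseio.com/v0/topstories.json")?
            .json()?;

    // Only the first ~50 entries are interesting to us.
    for id in ids.into_iter().take(50) {
        // The item endpoint returns the story data (title, url, score, ...).
        let item: Value = reqwest::blocking::get(format!(
            "https://hacker-news.firebaseio.com/v0/item/{}.json",
            id
        ))?
        .json()?;

        println!("{}: {:?}", id, item.get("title"));
    }

    Ok(())
}
```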
We check each new story against the Suckless filters before inserting it into the stories database table. The flags for each filter are persisted in the story_filters table.
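A rough sketch of what those two tables could look like, using the rusqlite crate. The column names here are illustrative guesses rather than the project's actual schema:

```rust
use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open("sucklesshn.db")?;

    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS stories (
             id         INTEGER PRIMARY KEY,  -- HN item id
             title      TEXT NOT NULL,
             url        TEXT,                 -- NULL for e.g. Ask HN posts
             created_at INTEGER NOT NULL      -- timestamp of insertion
         );
         CREATE TABLE IF NOT EXISTS story_filters (
             story_id   INTEGER NOT NULL REFERENCES stories(id),
             filter     TEXT NOT NULL,        -- e.g. 'bignews', 'amfg'
             flagged    INTEGER NOT NULL      -- 0 or 1
         );",
    )?;

    Ok(())
}
```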
The final step is generating new html for the sucklesshn.porkbrain.com front pages and uploading it into an S3 bucket. The S3 bucket is behind a Cloudfront distribution to which the sucklesshn.porkbrain.com DNS zone records point. We set up different combinations of filters and upload those combinations as different S3 objects. The objects are all of Content-Type: text/html, however they don't have an .html extension.
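For illustration, uploading one generated page could look roughly like this with the aws-config/aws-sdk-s3 crates and tokio; the project may use a different S3 client, and the bucket and key names here are just examples:

```rust
use aws_sdk_s3::{primitives::ByteStream, Client};

/// Uploads one rendered front page. The key has no ".html" extension,
/// so the Content-Type must be set explicitly.
async fn upload_page(
    client: &Client,
    bucket: &str,
    key: &str,
    html: String,
) -> Result<(), Box<dyn std::error::Error>> {
    client
        .put_object()
        .bucket(bucket)
        .key(key) // e.g. "-amfg-bignews" or "+askhn+showhn"
        .content_type("text/html")
        .body(ByteStream::from(html.into_bytes()))
        .send()
        .await?;
    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = aws_config::load_from_env().await;
    let client = Client::new(&config);
    upload_page(
        &client,
        "sucklesshn.porkbrain.com",
        "-amfg-bignews",
        "<!DOCTYPE html><html>...</html>".to_string(),
    )
    .await
}
```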
We handle rate limiting by simply skipping the submission. Since we poll for missing stories periodically, they will be fetched eventually. We also don't need to check all top stories: we can slice the top stories response and only download the first ~50 entries.
The Wayback Machine has some kind of rate limiting which fails concurrent requests, so we run Wayback Machine GET requests sequentially.
We leverage the Wayback Machine APIs to provide users with a link to the latest archived snapshot at the time of the submission. Please donate to keep the Wayback Machine awesome. To skip fetching snapshots, set the env var SKIP_WAYBACK_MACHINE=true.
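For reference, checking that flag in Rust is as simple as the following (the helper name is made up, the variable name comes straight from above):

```rust
/// Returns true when snapshot fetching should be skipped.
fn skip_wayback_machine() -> bool {
    std::env::var("SKIP_WAYBACK_MACHINE")
        .map(|value| value == "true")
        .unwrap_or(false)
}

fn main() {
    println!("skip wayback machine: {}", skip_wayback_machine());
}
```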
Generating a url for a submission is potentially a lengthy process, so the requests should be done in parallel. First we check whether a snapshot already exists. If it does, we take the easy way out and return its url. However, if it doesn't, we must first submit the url, wait for the Wayback Machine to index the page, and then query the APIs again.
POST https://web.archive.org/save/${URL}
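A simplified sketch of that lookup, using the public Wayback Machine availability API and blocking reqwest; the real code handles waiting for indexing and error cases more carefully:

```rust
use serde_json::Value;

/// Returns the url of the latest snapshot if one exists, otherwise asks
/// the Wayback Machine to archive the page and returns None.
fn latest_snapshot(url: &str) -> Result<Option<String>, Box<dyn std::error::Error>> {
    // NB: a real implementation should percent-encode `url`.
    let api = format!("https://archive.org/wayback/available?url={}", url);
    let body: Value = reqwest::blocking::get(api)?.json()?;

    // Easy way out: a snapshot already exists.
    if let Some(snapshot) = body.pointer("/archived_snapshots/closest/url") {
        return Ok(snapshot.as_str().map(str::to_owned));
    }

    // Otherwise submit the url for archiving; the binary then waits for the
    // Wayback Machine to index the page and queries the API again.
    reqwest::blocking::Client::new()
        .post(format!("https://web.archive.org/save/{}", url))
        .send()?;

    Ok(None)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("{:?}", latest_snapshot("https://example.com")?);
    Ok(())
}
```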
Originally, this ran as a cron job on my Raspberry Pi 4. To build the Docker image that runs this on ARM, you'll need to install cross and then build the binary with:
cross build --release
See the .env.example file for the environment variables the binary expects.
On Hacker News, the domain is shown next to each submission. Very useful to see what we are clicking on.