~porkbrain/suckless.hn

Polls Hacker News top stories, applies Suckless Filters, and publishes HTML to a public S3 bucket.

#sucklesshn.porkbrain.com

This section covers the motivation behind the project. For the tech stack, see the Design section below.

Some stories on HN are frustrating and time-consuming for dubious value. I believe there are other people who would also like to see less of certain types of content, hence Suckless HN.

#Suckless filters

A filter is given story data and flags the story if it passes the filter.

Each filter has two landing pages: one with only the stories which were flagged, and one with everything but. This is decided by two modifiers: + and -. For example, to only see stories from large newspapers visit https://sucklesshn.porkbrain.com/+bignews. To get HN without large newspapers visit https://sucklesshn.porkbrain.com/-bignews.

There are also groups of filters. For example, https://sucklesshn.porkbrain.com/-amfg-bignews filters out large newspapers and all mentions of big tech; this also happens to be the default view on the homepage. The - modifier in a group is conjunctive, i.e. only stories which didn't pass any of the filters are shown. The + modifier is disjunctive, i.e. stories which passed any of the filters are shown. For example, sucklesshn.porkbrain.com/+askhn+showhn shows "Show HN" or "Ask HN" stories.
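To illustrate how the two modifiers combine flags, here is a minimal hypothetical sketch (not the project's actual code):

```rust
/// Hypothetical sketch of how a group modifier combines per-filter flags.
/// `flags[i]` is true if the story passed the i-th filter in the group.
fn story_visible(modifier: char, flags: &[bool]) -> bool {
    match modifier {
        // '+' is disjunctive: show the story if it passed any filter.
        '+' => flags.iter().any(|&f| f),
        // '-' is conjunctive: show the story only if it passed no filter.
        '-' => flags.iter().all(|&f| !f),
        _ => false,
    }
}

fn main() {
    // A story flagged by `bignews` but not by `amfg`:
    let flags = [true, false];
    assert!(story_visible('+', &flags));  // shown on "+amfg+bignews"
    assert!(!story_visible('-', &flags)); // hidden on "-amfg-bignews"
}
```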

#List

List of implemented filters (a rough sketch of two of them follows the list):

  • +askhn/-askhn flags "Ask HN" titles

  • +showhn/-showhn flags "Show HN" titles

  • +bignews/-bignews flags URLs from large news sites: Bloomberg, VICE, The Guardian, WSJ, CNBC, BBC, Forbes, Spectator, LA Times, The Hill and NY Times. More large news sites may be added later. Any general news website which has ~60 submissions (2 pages) in the past year falls into this category. HN search query: https://hn.algolia.com/?dateRange=pastYear&page=2&prefix=true&sort=byPopularity&type=story&query=${DOMAIN}.

  • +amfg/-amfg flags titles which mention "Google", "Facebook", "Apple" or "Microsoft". No more endless Google-bashing comment binging at 3 AM. Most of the time the submissions are scandalous and the comment sections are low-entropy but addictive.

  • special +all front page which includes all HN top stories
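As promised above, here is a rough, hypothetical sketch of how two of these filters could be implemented; the domain list is an assumed, incomplete stand-in for illustration only:

```rust
/// Hypothetical sketch of the "Ask HN" title filter.
fn flag_askhn(title: &str) -> bool {
    title.trim_start().to_lowercase().starts_with("ask hn")
}

/// Hypothetical sketch of the big-news domain filter.
fn flag_bignews(url: &str) -> bool {
    // Assumed, incomplete list of domains; the README names more above.
    const DOMAINS: &[&str] = &["bloomberg.com", "theguardian.com", "wsj.com"];
    DOMAINS.iter().any(|domain| url.contains(domain))
}
```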

List of filter groups:

Filters in a group are sorted alphabetically in ascending order.

#Design

The binary is meant to be executed periodically (roughly every 30 minutes). Each generated page is an S3 object, so we don't need to provision a web server.

An SQLite database stores the IDs of top HN posts that have already been downloaded, plus some other data (insertion timestamp, submission title, URL, and which filters it passed).
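A minimal sketch of opening that database with the rusqlite crate; only the table names stories and story_filters come from this README, the column names are assumptions:

```rust
use rusqlite::Connection;

fn open_db(path: &str) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    // Assumed schema for illustration; the real columns may differ.
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS stories (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            url TEXT,
            created_at INTEGER NOT NULL
        );
        CREATE TABLE IF NOT EXISTS story_filters (
            story_id INTEGER NOT NULL REFERENCES stories(id),
            filter TEXT NOT NULL,
            flagged INTEGER NOT NULL
        );",
    )?;
    Ok(conn)
}
```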

The endpoint to query top stories on HN is https://hacker-news.firebaseio.com/v0/topstories.json. We download the stories which we haven't checked before. The data about a story is available via the item endpoint.
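A hedged sketch of one polling round; the endpoints are the official HN ones named above, but the reqwest/serde crates and the blocking style are assumptions about how the binary might be written:

```rust
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct Item {
    id: u64,
    title: Option<String>,
    url: Option<String>,
}

// Blocking sketch of one polling round; the real binary may be async.
fn poll() -> Result<(), Box<dyn std::error::Error>> {
    let ids: Vec<u64> = reqwest::blocking::get(
        "https://hacker-news.firebaseio.com/v0/topstories.json",
    )?
    .json()?;

    // Only look at the first ~50 entries (see "Rate limiting" below).
    for id in ids.into_iter().take(50) {
        let url = format!("https://hacker-news.firebaseio.com/v0/item/{}.json", id);
        let item: Item = reqwest::blocking::get(&url)?.json()?;
        println!("{:?}", item);
    }
    Ok(())
}
```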

We check each new story against the Suckless filters before inserting it into the stories database table. The flags for each filter are persisted in the story_filters table.

The final step is generating new HTML for the sucklesshn.porkbrain.com front pages and uploading it to an S3 bucket. The bucket sits behind a CloudFront distribution to which the sucklesshn.porkbrain.com DNS zone records point. We set up different combinations of filters and upload those combinations as different S3 objects. The objects are all of Content-Type: text/html, however they don't have an .html extension.
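A sketch of that upload step using the aws-sdk-s3 crate (which may not be what this project actually uses); the bucket and key values are placeholders:

```rust
use aws_sdk_s3::{primitives::ByteStream, Client, Error};

// Uploads one rendered front page as an extensionless text/html object.
async fn upload_page(client: &Client, bucket: &str, key: &str, html: String) -> Result<(), Error> {
    client
        .put_object()
        .bucket(bucket)
        .key(key) // e.g. "-amfg-bignews", no .html extension
        .content_type("text/html")
        .body(ByteStream::from(html.into_bytes()))
        .send()
        .await?;
    Ok(())
}
```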

#Rate limiting

We handle rate limiting by simply skipping the submission. Since we poll for missing stories periodically, they will be fetched eventually.

We don't need to check all top stories. We can slice the top stories endpoint response and only download the first ~50 entries.

The Wayback Machine has some kind of rate limiting which fails concurrent requests, so we run Wayback Machine GET requests sequentially.
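A hypothetical sketch of both points, sequential lookups and skip-on-failure; the function names are made up and the lookup body is only a stand-in for the real HTTP call:

```rust
// Hypothetical sketch: Wayback Machine lookups run one after another instead
// of concurrently, and any failure (e.g. a rate-limited request) is skipped
// so the story can be picked up again on a later run.
fn archive_links(urls: &[String]) -> Vec<Option<String>> {
    urls.iter()
        .map(|url| match lookup_snapshot(url) {
            Ok(snapshot) => Some(snapshot),
            Err(err) => {
                eprintln!("skipping {}: {}", url, err);
                None
            }
        })
        .collect()
}

// Stand-in for the real HTTP call; see the Wayback machine section below.
fn lookup_snapshot(url: &str) -> Result<String, Box<dyn std::error::Error>> {
    Ok(format!("https://web.archive.org/web/{}", url))
}
```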

#Wayback machine

We leverage the Wayback Machine APIs to provide users with a link to the latest archived snapshot at the time of the submission.

Please donate to keep Wayback machine awesome.

To skip fetching snapshots, set the env var SKIP_WAYBACK_MACHINE=true.
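A sketch of the snapshot lookup, assuming the public Wayback Machine availability API at https://archive.org/wayback/available (the README doesn't say which API the binary actually calls); it also honours the SKIP_WAYBACK_MACHINE opt-out documented above:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct Availability {
    archived_snapshots: Snapshots,
}

#[derive(Deserialize)]
struct Snapshots {
    closest: Option<Closest>,
}

#[derive(Deserialize)]
struct Closest {
    url: String,
}

// Returns the latest archived snapshot url for `url`, if any.
fn snapshot_url(url: &str) -> Result<Option<String>, Box<dyn std::error::Error>> {
    // Honour the opt-out documented above.
    let skip = std::env::var("SKIP_WAYBACK_MACHINE")
        .map(|v| v == "true")
        .unwrap_or(false);
    if skip {
        return Ok(None);
    }
    let api = format!("https://archive.org/wayback/available?url={}", url);
    let availability: Availability = reqwest::blocking::get(&api)?.json()?;
    Ok(availability.archived_snapshots.closest.map(|c| c.url))
}
```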

#TODO: Submit snapshots to Wayback machine

Generating a url for a submission is potentially a lengthy process, so the requests should be done in parallel.

First we check whether a snapshot already exists. If it does, we take the easy way out and return the URL to it. However, if it doesn't, we must first submit the URL, wait for the Wayback Machine to index the page, and then query the APIs again.

POST https://web.archive.org/save/${URL}
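Since this is still a TODO, the following is only a hedged sketch of what that flow might look like, reusing the snapshot_url sketch from the Wayback machine section; the retry count and sleep interval are made-up numbers:

```rust
use std::{thread, time::Duration};

// Hypothetical sketch of the TODO flow: submit the url for archiving, then
// poll for a snapshot until one shows up (or we give up).
fn ensure_snapshot(url: &str) -> Result<Option<String>, Box<dyn std::error::Error>> {
    if let Some(existing) = snapshot_url(url)? {
        return Ok(Some(existing)); // easy way out: snapshot already exists
    }

    // Ask the Wayback Machine to index the page (the endpoint quoted above).
    let client = reqwest::blocking::Client::new();
    client
        .post(format!("https://web.archive.org/save/{}", url))
        .send()?;

    // Made-up retry schedule: poll a few times while indexing happens.
    for _ in 0..5 {
        thread::sleep(Duration::from_secs(30));
        if let Some(snapshot) = snapshot_url(url)? {
            return Ok(Some(snapshot));
        }
    }
    Ok(None)
}
```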

#Build

Originally, this ran as a cron job on my Raspberry Pi 4. To build the Docker image that runs this on ARM, you'll need to install cross and then build the binary with:

cross build --release

#Env

See the .env.example file for the environment variables the binary expects.

#TODO: Show domain

On Hacker News, the domain is shown next to each submission. It's very useful to see what we are clicking on.
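Also just a TODO for now, but extracting the domain is straightforward with the url crate; this is a hypothetical sketch, not the project's code:

```rust
use url::Url;

/// Hypothetical sketch: extract the domain to show next to a submission.
fn domain(submission_url: &str) -> Option<String> {
    Url::parse(submission_url)
        .ok()?
        .host_str()
        .map(|host| host.trim_start_matches("www.").to_string())
}

fn main() {
    assert_eq!(
        domain("https://www.example.com/some/article"),
        Some("example.com".to_string())
    );
}
```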