Little tool to scrape release info from Rate Your Music and output it in JSON format to stdout. The CURL environment variable can be used to change the curl binary path (curl-impersonate is strongly recommended if you plan on scraping repeatedly, to circumvent blocking).
$ rymscrap.tcl Pixies Doolittle album
$ rymscrap.tcl https://rateyourmusic.com/release/album/pixies/doolittle/
$ rymscrap.tcl index.html
The first form (artist, title, release type) can be a little unreliable if you want the "primary" release (you do), since RYM has a very "lolrandom" URL naming algorithm.
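To illustrate why the artist/title/type form involves guesswork, here is a minimal shell sketch of how such arguments could be turned into a candidate URL. It is not the tool's actual code: `slugify` and `guess_url` are hypothetical helpers, the URL layout is only inferred from the example URL above, and the real rules (including any handling of non-ASCII names) are more involved.

```shell
# Hypothetical helpers, for illustration only; not taken from rymscrap.tcl.

# slugify TEXT [SEP]: lowercase TEXT and collapse every run of characters
# other than a-z and 0-9 into SEP ("-" by default, "_" as the alternative),
# then drop a leading or trailing SEP.
slugify() {
    sep="${2:--}"
    printf '%s' "$1" \
        | LC_ALL=C tr 'A-Z' 'a-z' \
        | LC_ALL=C sed -e "s/[^a-z0-9]\{1,\}/$sep/g" \
                       -e "s/^$sep//" -e "s/$sep\$//"
}

# guess_url ARTIST TITLE TYPE [SEP]: build a candidate release URL using the
# layout of the example URL (release/TYPE/ARTIST/TITLE/). This layout is an
# assumption based on that single example.
guess_url() {
    printf 'https://rateyourmusic.com/release/%s/%s/%s/\n' \
        "$(slugify "$3" "${4:--}")" \
        "$(slugify "$1" "${4:--}")" \
        "$(slugify "$2" "${4:--}")"
}

guess_url Pixies Doolittle album
# prints https://rateyourmusic.com/release/album/pixies/doolittle/

slugify "Surfer Rosa" -   # prints surfer-rosa
slugify "Surfer Rosa" _   # prints surfer_rosa
```

Both separators yield plausible paths, so a guessed URL may point at the wrong page or at nothing; passing the exact URL or a saved HTML file avoids the guess entirely.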
The jq/ and cljq/ directories each contain a script to make album queries of the sort:
$ jqalbum 'has_genre("Black Metal") and year == 1999'
$ cljqalbum '(and (has-genre "Black Metal") (= year 1999))'
The htmltidy dependency should give you an idea, but RYM's generated HTML can be quite janky, so failures are to be expected at times; though it worked for ~1600 releases here.

The url_strip procedure contains both a "dash" and an "underscore" (commented) version of the punctuation removal section; this is due to RYM's aforementioned URL generation algorithm having changed at least once (with small variations here and there).

Dependencies:
iconv (not needed when using URL or HTML arguments)
curl >= 7.52.0
htmltidy
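
Before a long scraping run, the dependencies can be checked with a small shell sketch like the one below. It rests on a few assumptions: that the tool falls back to plain curl when CURL is unset, that htmltidy is installed as the `tidy` executable, and that the first line of `curl --version` starts with "curl" followed by the version number. `version_ge` and `check_deps` are hypothetical helper names, not part of the repository.

```shell
# Hypothetical pre-flight check; helper names are not part of the repository.

# version_ge A B: succeed when dotted version A is greater than or equal to B.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            xi = (i <= n) ? x[i] + 0 : 0   # missing components count as 0
            yi = (i <= m) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# check_deps: report missing tools and a too-old curl on stderr.
check_deps() {
    status=0
    curl_bin="${CURL:-curl}"   # assumption: plain curl is the default
    # assumption: htmltidy provides "tidy"; iconv only matters for the
    # artist/title/type form, not for URL or HTML arguments
    for cmd in "$curl_bin" iconv tidy; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            status=1
        fi
    done
    if command -v "$curl_bin" >/dev/null 2>&1; then
        ver=$("$curl_bin" --version | awk 'NR == 1 { print $2 }')
        if ! version_ge "$ver" 7.52.0; then
            echo "curl $ver is older than 7.52.0" >&2
            status=1
        fi
    fi
    return "$status"
}
```

A typical use would be `check_deps && rymscrap.tcl index.html`, optionally with CURL pointing at a curl-impersonate binary.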