#books

Tools for producing electronic books.

#Development Setup

  1. Install make
  2. Install pyenv and pyenv-virtualenv
  3. Create a books-venv virtual environment using Python 3.11.3 and activate it
  4. Run tests: make check
  5. Build books: make
  6. Remove generated files: make clean
  7. Remove fetched content: make clean-all
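
Steps 3 and 4 might look like the following with pyenv-virtualenv. This is a sketch rather than something taken from the repository: the function name setup_books_venv is hypothetical, and it assumes pyenv and pyenv-virtualenv are already installed and initialized in the current shell.

```shell
# setup_books_venv: create and activate the books-venv environment.
# Hypothetical helper; assumes pyenv and pyenv-virtualenv are initialized
# in the current shell (otherwise "pyenv activate" will not work).
setup_books_venv() {
    # Install the interpreter unless that version is already present.
    pyenv install --skip-existing 3.11.3 &&
    # Create the virtual environment on top of that interpreter.
    pyenv virtualenv 3.11.3 books-venv &&
    # Activate it for the current shell session.
    pyenv activate books-venv
}
```

After defining the function in an interactive shell, `setup_books_venv && make check` would cover steps 3 and 4 in one go.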

#Workflow

  1. Fetch HTML source files and scans of original publications
  2. Apply repair patches to some HTML source files
  3. Pretty print HTML sources so document structure is visible (prettify)
  4. Strip unneeded HTML elements (strip)
  5. Parse stripped HTML and convert it to JSON (parse)
  6. Join JSON files for individual chapters of a story into a single file (join)
  7. Normalize spacing, capitalization and punctuation (normalize)
  8. TODO: Correct typos in sources (correct)
  9. Generate fixed-layout ASCII text files (plaintext)
  10. Generate minimal semantic HTML files (plainhtml)
  11. TODO: Generate EPUBs (epub)

#Fetching Content

HTML files are fetched from Wikisource and Project Gutenberg Australia. Scans of the original publications are fetched from the Internet Archive and serve as a reference when making corrections. Because the list of content files is long, the content targets are listed in conan/fetch.mk, which the Makefile includes. The ./bin/fetch script contains the curl command used to fetch the content.

The fetch target fetches all HTML files. To get the original publications from the Internet Archive, run make fetch-all, but note that the Internet Archive may throttle downloads, which can cause make fetch-all to fail.
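
For illustration, a curl invocation for fetching one file could be wrapped as below. The contents of ./bin/fetch are not reproduced here, so this is only a sketch: the function name fetch_one and the particular flags are assumptions, not what the script actually uses.

```shell
# fetch_one URL DEST: download URL to DEST, creating parent directories as needed.
# Hypothetical helper; the flags in the real ./bin/fetch may differ.
fetch_one() {
    url=$1
    dest=$2
    # --fail makes curl exit non-zero on an HTTP error instead of saving the error page.
    # --location follows redirects.
    # --retry and --retry-delay retry transient failures such as timeouts,
    # which may help when a server is throttling downloads.
    # --create-dirs creates any missing directories in the --output path.
    curl --fail --location --silent --show-error \
        --retry 3 --retry-delay 10 \
        --create-dirs --output "$dest" "$url"
}
```

A call would take the form `fetch_one <url> ./fetched/<content path>`. Because the function returns curl's exit status, a make recipe that uses it stops when a download fails.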

#Repairing Fetched Content Files

Sometimes fetched content is not correctly encoded as UTF-8, which causes the strip.py script to fail. The process for fixing this is:

  1. Copy ./fetched/<content path> to ./tmp/repaired/<content path>
  2. Manually edit ./tmp/repaired/<content path> and fix any problems
  3. Create a patch file using the diff command below
  4. Add ./fetched/<content path> to the needs_repair makefile variable
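
One possible way to find files that need this treatment before strip.py trips over them is to ask iconv whether each file decodes as UTF-8. This is a sketch under assumptions: the helper names is_valid_utf8 and report_invalid_utf8 are hypothetical and are not part of the repository.

```shell
# is_valid_utf8 FILE: succeeds only if FILE decodes cleanly as UTF-8.
# iconv exits non-zero when it meets an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# report_invalid_utf8 DIR: print the path of every regular file under DIR
# that is not valid UTF-8. Paths containing newlines are not handled.
report_invalid_utf8() {
    find "$1" -type f | while IFS= read -r path; do
        is_valid_utf8 "$path" || printf '%s\n' "$path"
    done
}
```

Running `report_invalid_utf8 ./fetched` would list candidates for step 1.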

The diff command looks like:

diff -u \
    ./fetched/<content path> \
    ./tmp/repaired/<content path> \
    > ./<content path>.patch 

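As a sanity check, a patch created this way can be applied to a fresh copy of the fetched file to confirm that it reproduces the manual repair. The sketch below runs in a throwaway directory with a hypothetical content file, story.html. It uses the patch command as an assumption; how the Makefile applies repair patches is not shown here.

```shell
# Work in a throwaway directory so nothing in the real tree is touched.
workdir=$(mktemp -d)
mkdir -p "$workdir/fetched" "$workdir/tmp/repaired"
cd "$workdir"

# Hypothetical fetched file: octal \351 is a Latin-1 e-acute,
# which is not a valid UTF-8 sequence on its own.
printf 'caf\351 au lait\n' > ./fetched/story.html

# Steps 1 and 2: the repaired copy carries the same text as valid UTF-8.
printf 'caf\303\251 au lait\n' > ./tmp/repaired/story.html

# Step 3: create the patch. diff exits with status 1 when the files differ,
# so 1 is the expected result here; 2 or higher would signal trouble.
diff -u ./fetched/story.html ./tmp/repaired/story.html > ./story.html.patch
diff_status=$?

# Apply the patch to a fresh copy and compare it with the manual repair.
cp ./fetched/story.html ./patched.html
patch ./patched.html < ./story.html.patch
cmp ./patched.html ./tmp/repaired/story.html && echo "patch reproduces the repair"
```

Note the exit status of diff: a make recipe or a script running under `set -e` that calls diff directly would treat the expected status 1 as a failure unless the status is handled explicitly.
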
#License

The code in books is made available under a BSD license. See the LICENSE file for details. Content for books is fetched from Wikisource and Project Gutenberg Australia, where it is attested to be in the public domain in the United States and Australia respectively.