~turminal/ocraas

OCR as a Service
55beb6b5 — Bor Grošelj Simić 4 years ago
move some things around because python packaging is weird
ed63dd33 — Bor Grošelj Simić 4 years ago
fix setup
1ea8d3fb — Bor Grošelj Simić 4 years ago
rename ...

clone

read-only
https://git.sr.ht/~turminal/ocraas
read/write
git@git.sr.ht:~turminal/ocraas


# OCR as a Service

A simple Bottle web app that lets you upload PDF or PNG documents, runs OCR on them in the background, and lists the processed documents for download.

It also includes readpdf, a simple command-line script for PDF to text conversion.

# Installation

Python versions older than 3.7 are not supported. Your local Python installation has to include the sqlite3 module (it does by default on most distros). See requirements.txt for the other required Python libraries.
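The two requirements above can be checked before installing anything else. This is a minimal sketch, not part of OCRaaS; the function name `check_environment` is made up for illustration.

```python
import sys


def check_environment():
    """Return a list of problems found; an empty list means the basics are in place."""
    problems = []
    if sys.version_info < (3, 7):
        problems.append("Python 3.7 or newer is required")
    try:
        import sqlite3  # noqa: F401  (only checking that the module exists)
    except ImportError:
        problems.append("this Python build lacks the sqlite3 module")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

The script prints nothing when both requirements are met.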

The OCR program used is Tesseract. By default it can only read English text; for other languages a separate data package has to be installed. Consult your distribution's package search. Tesseract relies on the Leptonica graphics library for image manipulation, and OCRaaS will only work with the formats your Leptonica installation supports (usually not a problem when installing Leptonica from your distro repository).
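To see which language data packages are already installed, you can ask Tesseract itself with `tesseract --list-langs`. The helper below is a rough sketch and not part of OCRaaS: the function names are hypothetical, and the parser assumes the usual output shape of a "List of available languages" header followed by one language code per line.

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The header line starts with "List of available languages"; skip it.
    return [line for line in lines if not line.startswith("List of")]


def installed_langs():
    """Return installed language codes, or None if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some tesseract versions print the list to stderr, so read both streams.
    return parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    print(installed_langs())
```

If Tesseract prints warnings alongside the list, they will show up in the result as well, so treat the output as a quick check rather than an exact inventory.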

OCRaaS uses Ghostscript for PDF to PNG conversion. Ghostscript has a long history of vulnerabilities triggered by maliciously crafted input files. This means that running OCRaaS publicly, with anyone able to upload files and PDF support still enabled, is a VERY BAD idea at the moment.
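For reference, a PDF to PNG conversion with Ghostscript typically looks like the sketch below. This is an illustration under assumptions, not necessarily the command OCRaaS runs: the function name, the `png16m` device, the 300 dpi default and the timeout are all choices made for this example.

```python
import subprocess


def gs_png_command(pdf_path, out_pattern, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a PNG file."""
    return [
        "gs",
        "-dSAFER",    # restrict file system access from within the document
        "-dBATCH",    # exit after processing instead of waiting for input
        "-dNOPAUSE",  # do not pause between pages
        "-sDEVICE=png16m",
        f"-r{dpi}",
        f"-sOutputFile={out_pattern}",  # e.g. "page-%03d.png", one file per page
        pdf_path,
    ]


def convert(pdf_path, out_pattern, dpi=300, timeout=120):
    """Run the conversion; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(gs_png_command(pdf_path, out_pattern, dpi), check=True, timeout=timeout)
```

Note that `-dSAFER` only limits what a document can do to the file system. It does not protect against bugs in Ghostscript's own parsing code, so the warning above still applies.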

After installing the dependencies, copy config.example.ini to config.ini and adjust it to your needs (especially the section about supported file formats), then run ocraas-initdb. OCRaaS should now be ready to run with ocraas-run.

OCRaaS does not lose any queued jobs when the process is terminated and later restarted.
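One common way to get this kind of restart safety is to keep the queue in an SQLite file rather than in memory. The sketch below illustrates that idea only; it is an assumption about the general technique, not a description of OCRaaS internals, and the table, column and function names are hypothetical. It also assumes a single worker process.

```python
import sqlite3


def open_queue(path):
    """Open (or create) the queue database stored at the given path."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY, "
        "filename TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'queued')"
    )
    conn.commit()
    return conn


def enqueue(conn, filename):
    """Add a job; it is on disk once the transaction commits."""
    with conn:
        cur = conn.execute("INSERT INTO jobs (filename) VALUES (?)", (filename,))
    return cur.lastrowid


def recover(conn):
    """Put jobs that were interrupted mid-run back in the queue; call at startup."""
    with conn:
        conn.execute("UPDATE jobs SET status = 'queued' WHERE status = 'running'")


def claim_next(conn):
    """Mark the oldest queued job as running and return (id, filename), or None."""
    with conn:
        row = conn.execute(
            "SELECT id, filename FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    return row


def finish(conn, job_id):
    """Mark a job as done."""
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
```

Because every state change is committed to the file, a job that was queued or running when the process died is still there after a restart, and `recover` hands interrupted jobs back to the worker.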

The application occasionally issues a ResourceWarning because of a bug in the bottle library.
