My requirements:
- image layer over text layer (i.e. a searchable PDF)
- correct character encoding for the Hungarian ű and ő characters
- good placement of words and lines
- reasonably good recognition
- handling of multi-column layouts
Tesseract:
Set up Tesseract:
sudo apt-get install tesseract-ocr
- download the language data you need from https://github.com/tesseract-ocr/tessdata and put it in the tessdata directory (/usr/share/tesseract-ocr/tessdata). E.g. for Hungarian:
cd /usr/share/tesseract-ocr/tessdata
sudo wget https://github.com/tesseract-ocr/tessdata/raw/master/hun.traineddata
- if you get an error that the language data cannot be found, set the TESSDATA_PREFIX environment variable to the directory containing the tessdata directory.
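A minimal sketch of the environment variable step, assuming the default Debian/Ubuntu path used above. Which directory TESSDATA_PREFIX has to name seems to depend on the Tesseract version: older releases expect the parent of tessdata (as described above), while newer ones appear to expect the tessdata directory itself, so try the other one if the first does not help.

```shell
# Assumption: language files are in /usr/share/tesseract-ocr/tessdata.
# Point TESSDATA_PREFIX at the directory that contains "tessdata".
# If the language is still not found, try the tessdata directory itself:
#   export TESSDATA_PREFIX=/usr/share/tesseract-ocr/tessdata
export TESSDATA_PREFIX=/usr/share/tesseract-ocr

# List the languages Tesseract can see; "hun" should be among them.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --list-langs || echo "could not list languages" >&2
else
    echo "tesseract is not installed or not on PATH" >&2
fi
```

To make the setting permanent, the export line can go into ~/.bashrc or a similar shell startup file.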
Tesseract supported input image formats:
...are the ones supported by Leptonica:
JPEG, PNG, TIFF, BMP, PNM, GIF, WEBP, JP2
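As a rough pre-check before running OCR on a batch of files, the list above can be turned into a small shell test. This is only a sketch: is_supported_image is a hypothetical helper name, it looks at the file extension only (not at the file contents), and the jpg and tif aliases are my assumption about common spellings.

```shell
# Hypothetical helper: succeed if the file name has an extension that matches
# one of the formats listed above. Only the name is checked, not the contents.
is_supported_image() {
    case "$1" in
        *.*) ;;            # there is an extension, keep going
        *) return 1 ;;     # no extension at all
    esac
    # Lowercase the extension so that "SCAN.PNG" also matches.
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        jpg|jpeg|png|tif|tiff|bmp|pnm|gif|webp|jp2) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: is_supported_image scan.PNG && echo "looks like a supported format"
```

Whether a given build of Tesseract can really read a format still depends on how Leptonica was compiled, so a file passing this check may still be rejected.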
Tesseract searchable PDF output
Example usage with specified language (-l):
tesseract input.png outbase -l hun pdf
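The trailing pdf argument selects the searchable PDF output, and Tesseract appends the .pdf extension to the output base itself, so the command above writes outbase.pdf. For a whole directory of scans the same call can be wrapped in a loop. The following is a sketch only: output_base and ocr_all_png are hypothetical helper names, and it assumes PNG input with Hungarian text.

```shell
# Hypothetical helper: print the output base for an input image by stripping
# the last extension, e.g. "scan01.png" -> "scan01". Tesseract adds ".pdf".
output_base() {
    printf '%s\n' "${1%.*}"
}

# Convert every PNG in the current directory to a searchable PDF.
# Assumptions: PNG input and Hungarian text (-l hun); adjust both as needed.
ocr_all_png() {
    for img in *.png; do
        [ -e "$img" ] || continue   # the glob matched nothing
        tesseract "$img" "$(output_base "$img")" -l hun pdf
    done
}
```

Running ocr_all_png in a directory with scan01.png and scan02.png should leave scan01.pdf and scan02.pdf next to the images.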
Results:
- image layer over text layer - YES
- correct character encoding for the Hungarian ű and ő characters - YES
- good placement of words and lines - YES
- reasonably good recognition - YES, but it depends a lot on input quality
- handling of multi-column layouts - SO-SO: it sometimes works only partly (e.g. half of the page is OK, the other half is treated as a single column)
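One way to double-check the text layer and the ű/ő encoding without opening a PDF viewer is to extract the text and search it. This is a sketch under assumptions: it uses pdftotext from the poppler-utils package (not part of Tesseract), expects the outbase.pdf produced by the command above, and count_hungarian_lines is a hypothetical helper name.

```shell
# Hypothetical helper: count the lines on standard input that contain ő or ű
# (either case). Fixed-string matching avoids locale-dependent bracket ranges.
count_hungarian_lines() {
    grep -c -F -e 'ő' -e 'ű' -e 'Ő' -e 'Ű' || true
}

# Assumptions: outbase.pdf exists in the current directory and pdftotext
# (sudo apt-get install poppler-utils) is installed.
if [ -f outbase.pdf ] && command -v pdftotext >/dev/null 2>&1; then
    pdftotext outbase.pdf - | count_hungarian_lines
fi
```

A count of 0 on a page that visibly contains these letters would suggest that the text layer is missing or that the characters were recognized as something else, for example as ü or õ.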
This is good enough for me now, so I'm not investigating further.