Alice@Ubuntu: Creating searchable PDFs on Ubuntu 3rd try

June 11, 2017

Following up on the second try:

My requirements:

image layer over text layer
good character encoding for Hungarian ű and ő chars
good placement of words and lines
reasonably good recognition
handling multi-column layouts

Tesseract:

Set up Tesseract:

sudo apt-get install tesseract-ocr
download the language data you need from https://github.com/tesseract-ocr/tessdata and put it in the tessdata directory (/usr/share/tesseract-ocr/tessdata). E.g. for Hungarian:
cd /usr/share/tesseract-ocr/tessdata
sudo wget https://github.com/tesseract-ocr/tessdata/raw/master/hun.traineddata
if you get an error that the language data cannot be found, set the environment variable TESSDATA_PREFIX to the directory containing the tessdata directory.
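
To check that the language data was really picked up, something like the following could help. It is only a sketch: lang_in_list is a helper name I made up (it is not part of Tesseract), and it assumes that tesseract --list-langs prints one language code per line.

```shell
#!/bin/sh
# Sketch: check that a language code appears in a `tesseract --list-langs` listing.
# lang_in_list is a hypothetical helper, not part of Tesseract.
# It reads the listing on stdin and matches whole lines only,
# so "hun" does not match a header line or a longer code.
lang_in_list() {
    grep -qxF -- "$1"
}

# Usage (run manually; requires tesseract to be installed).
# 2>&1 is there in case the listing is printed on stderr:
#   tesseract --list-langs 2>&1 | lang_in_list hun && echo "hun is installed"
#
# If the language data is still not found, point TESSDATA_PREFIX at the
# directory that contains tessdata, as described above:
#   export TESSDATA_PREFIX=/usr/share/tesseract-ocr
```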

Tesseract supported input image formats:
...are the ones supported by Leptonica:
JPEG, PNG, TIFF, BMP, PNM, GIF, WEBP, JP2
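
When processing a whole directory of scans it can be handy to skip files that are not in the list above. A minimal sketch, assuming the file extension tells the truth about the format: is_supported_image is a made-up helper name, and the jpg/jpeg and tif/tiff spellings are my assumption about how these formats are usually named.

```shell
#!/bin/sh
# is_supported_image is a hypothetical helper: it succeeds when the file
# name ends in an extension matching one of the formats listed above.
# It only looks at the name, not at the file contents.
is_supported_image() {
    case "$1" in
        *.*) ;;            # has an extension, continue
        *) return 1 ;;     # no extension at all
    esac
    # Lowercase the extension so that "SCAN.PNG" is accepted too.
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        jpg|jpeg|png|tif|tiff|bmp|pnm|gif|webp|jp2) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_supported_image page01.tif succeeds, while is_supported_image notes.txt fails.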

Tesseract searchable PDF output
Example usage with specified language (-l):

tesseract input.png outbase -l hun pdf
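
For more than one page, the same command can be wrapped in a small loop. This is a sketch under my own assumptions: outbase_for and ocr_to_pdf are made-up helper names, every input file is assumed to have an extension, and each PDF is written next to its input image.

```shell
#!/bin/sh
# outbase_for is a hypothetical helper: it strips the last extension,
# giving the output base that tesseract appends ".pdf" to.
outbase_for() {
    printf '%s\n' "${1%.*}"
}

# ocr_to_pdf is a hypothetical helper: the first argument is the language
# code, the remaining arguments are image files. It runs the same command
# as above once per image and stops at the first failure.
ocr_to_pdf() {
    lang=$1
    shift
    for img in "$@"; do
        tesseract "$img" "$(outbase_for "$img")" -l "$lang" pdf || return 1
    done
}

# Usage (run manually; requires tesseract and the hun language data):
#   ocr_to_pdf hun scans/page01.png scans/page02.png
```

With the usage line above, the results would be scans/page01.pdf and scans/page02.pdf.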

Results:

image layer over text layer - YES
good character encoding for Hungarian ű and ő chars - YES
good placement of words and lines - YES
reasonably good recognition - YES, but it depends a lot on input quality.
handling multi-column layouts - SO-SO: it works only sometimes (e.g. half of the page is OK, the other half is treated as a single column)

This is good enough for me now, so I'm not investigating further.
