June 11, 2017

Creating searchable PDFs on Ubuntu, 3rd try

Following up on the second try.


My requirements:

  • image layer over text layer
  • correct character encoding for the Hungarian ű and ő characters
  • good placement of words and lines
  • reasonably good recognition
  • handling multi-column layouts

Tesseract:


Set up Tesseract:
  • sudo apt-get install tesseract-ocr
  • download the language data you need from https://github.com/tesseract-ocr/tessdata and put it in the tessdata directory (/usr/share/tesseract-ocr/tessdata). E.g. for Hungarian:
    cd /usr/share/tesseract-ocr/tessdata
    sudo wget https://github.com/tesseract-ocr/tessdata/raw/master/hun.traineddata
  • if you get an error that the language data cannot be found, set the environment variable TESSDATA_PREFIX to the directory containing the tessdata directory.
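
To check whether the language data was picked up, something like the following sketch may help. It relies on tesseract's --list-langs option; has_tess_lang is a made-up helper name for illustration, not part of Tesseract. The TESSDATA_PREFIX value in the usage comment follows the description above (the parent of the tessdata directory); newer Tesseract releases may expect the tessdata directory itself, so check the documentation for your version.

```shell
# has_tess_lang is a hypothetical helper, not part of Tesseract.
# It succeeds if the given language code (e.g. "hun") is listed by
# `tesseract --list-langs`.
has_tess_lang() {
    # Some versions print the list on stderr, so merge both streams.
    # grep -x matches whole lines only, so "hun" does not match a
    # header line or a longer language code.
    tesseract --list-langs 2>&1 | grep -qx "$1"
}

# Usage (assumption: the data lives in /usr/share/tesseract-ocr/tessdata):
#   if ! has_tess_lang hun; then
#       export TESSDATA_PREFIX=/usr/share/tesseract-ocr
#   fi
```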

Tesseract supported input image formats:
...are the ones supported by Leptonica:
JPEG, PNG, TIFF, BMP, PNM, GIF, WEBP, JP2

Tesseract searchable pdf output
Example usage with specified language (-l):
tesseract input.png outbase -l hun pdf
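
For more than one page, the same command can be wrapped in a loop. This is only a sketch: ocr_to_pdf is a hypothetical helper name, and it assumes each output PDF should be named after its input image (page1.png becomes page1.pdf), which is my choice and not something Tesseract requires.

```shell
# ocr_to_pdf is a hypothetical helper, not part of Tesseract.
# First argument: language code; remaining arguments: image files.
ocr_to_pdf() {
    lang=$1
    shift
    for img in "$@"; do
        # Strip the last extension to get the output base name;
        # tesseract appends ".pdf" to it itself.
        base=${img%.*}
        tesseract "$img" "$base" -l "$lang" pdf || return 1
    done
}

# Usage:
#   ocr_to_pdf hun page1.png page2.png
```

Each input produces its own PDF; the loop stops at the first file Tesseract fails on.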

Results:
  • image layer over text layer - YES
  • correct character encoding for the Hungarian ű and ő characters - YES
  • good placement of words and lines - YES
  • reasonably good recognition - YES, but it depends a lot on input quality.
  • handling multi-column layouts - SO-SO: it sometimes works (e.g. half of the page is OK, while the other half is treated as a single column)

This is good enough for me now, so I'm not investigating further.
