January 26, 2014

Creating searchable PDFs on Ubuntu: first try

The process is as follows:

  1. start with a good-resolution image in a format that leptonica (and therefore tesseract) can read: JPEG, PNG, TIFF, BMP, PNM, GIF or WebP.
  2. produce hOCR output, e.g.
    tesseract image.pbm textfile -l hun hocr
  3. merge the image and the hOCR into a searchable PDF with hocr2pdf (image layer on top of the text layer), e.g.
    hocr2pdf -i input.pbm -o output.pdf < textfile.html
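
The steps above can be chained in a small script. This is only a sketch: it assumes tesseract (with the "hun" language data) and hocr2pdf are installed, and the function names hocr_base and make_searchable_pdf are made up for illustration.

```shell
#!/bin/sh
# Sketch of steps 2 and 3 as one function (not a polished tool).

# hocr_base IMAGE: hypothetical helper that strips directory and extension,
# e.g. scans/page1.pbm -> page1
hocr_base() {
    name=${1##*/}
    printf '%s\n' "${name%.*}"
}

# make_searchable_pdf IMAGE [LANG]
make_searchable_pdf() {
    image=$1
    lang=${2:-hun}
    base=$(hocr_base "$image")

    # Step 2: OCR to hOCR. The tesseract used in this post writes
    # "$base.html"; newer releases write "$base.hocr" instead.
    tesseract "$image" "$base" -l "$lang" hocr || return 1
    if [ -f "$base.hocr" ]; then
        hocr=$base.hocr
    else
        hocr=$base.html
    fi

    # Step 3: image layer on top of the text layer.
    hocr2pdf -i "$image" -o "$base.pdf" < "$hocr"
}
```

After sourcing the file, `make_searchable_pdf image.pbm` should leave image.pdf next to the hOCR file in the current directory.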

First impressions:

  • character encoding is wrong after hocr2pdf
    setting the character encoding in the HTML file header to ISO-8859-2 or Windows-1250 does not help.
    "ő" turns into "Q" and "ű" turns into "q" :-(
  • the 2 columns were not recognized automatically by tesseract
    the psm (page segmentation mode) option does not solve this
    this is probably impossible in the current version.
    I have to try another way to produce hOCR, with OCR software that handles layout.
  • font sizes are chaotic
    I think this depends on the bbox sizes and therefore on the OCR software.
  • output file size is 159 kB from an 818 kB PBM, which is way too big.
    this cannot be helped unless I generate the PDF with my own methods...
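
For the two-column problem, the psm experiment can be repeated in a loop so the hOCR outputs can be compared side by side. This is a sketch under assumptions: the flag is spelled -psm in tesseract 3.x (which I assume is the version here) and --psm in newer releases, mode 1 needs the "osd" data file installed, and psm_out and try_psm are made-up names.

```shell
#!/bin/sh
# psm_out BASE MODE: hypothetical naming helper, e.g. page 4 -> page-psm4
psm_out() {
    printf '%s-psm%s\n' "$1" "$2"
}

# try_psm IMAGE [FLAG]
# FLAG defaults to "-psm" (tesseract 3.x); pass "--psm" for newer releases.
try_psm() {
    image=$1
    flag=${2:--psm}
    name=${image##*/}
    base=${name%.*}
    # 1 = automatic segmentation with OSD, 3 = fully automatic (default),
    # 4 = single column of variable sizes, 6 = single uniform block
    for mode in 1 3 4 6; do
        tesseract "$image" "$(psm_out "$base" "$mode")" -l hun "$flag" "$mode" hocr
    done
}
```

Running `try_psm image.pbm` writes one hOCR file per mode (image-psm1, image-psm3, ...), and the bbox values in each show whether the columns were split.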

Anyhow, this looks like a disaster :-( The character encoding has to work.
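
One way to narrow down where the encoding breaks is to check the hOCR file before it reaches hocr2pdf. If "ő" and "ű" are still intact there, the damage happens in the PDF step and not in tesseract. A minimal sketch, assuming the characters are stored either as raw UTF-8 or as numeric HTML entities; count_hu_lines is a made-up name.

```shell
#!/bin/sh
# count_hu_lines FILE: print the number of lines that contain ő or ű,
# either as raw UTF-8 or as the numeric entities &#337; / &#369;.
count_hu_lines() {
    grep -c -e 'ő' -e 'ű' -e '&#337;' -e '&#369;' "$1"
}
```

For example, `count_hu_lines textfile.html` printing a non-zero number for a page that is known to contain these letters suggests the OCR output itself is fine.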
