Alice@Ubuntu: Creating searchable PDFs on Ubuntu 1st try

January 26, 2014

The process is as follows:

have a good-resolution image in a format that leptonica accepts, for use with tesseract: JPEG, PNG, TIFF, BMP, PNM, GIF or WebP.
produce hOCR output, for example:
tesseract image.pbm textfile -l hun hocr
merge the image and the hOCR into a searchable PDF with hocr2pdf (image layer on top of the text layer), for example:
hocr2pdf -i input.pbm -o output.pdf < textfile.html
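
The two commands above can be chained in a small script. This is only a minimal sketch, assuming tesseract and hocr2pdf are on the PATH and that tesseract writes its hOCR output to a file ending in .html, as it did here. The function names and the hocr_suffix parameter are made up for illustration; the command lines themselves are copied from the steps above.

```python
import subprocess
from pathlib import Path


def tesseract_cmd(image, out_base, lang="hun"):
    """Build the tesseract call that writes hOCR output to <out_base>.html."""
    return ["tesseract", str(image), str(out_base), "-l", lang, "hocr"]


def hocr2pdf_cmd(image, pdf):
    """Build the hocr2pdf call; the hOCR file itself is fed on stdin."""
    return ["hocr2pdf", "-i", str(image), "-o", str(pdf)]


def make_searchable_pdf(image, pdf, lang="hun", hocr_suffix=".html"):
    """Run both steps in order.

    hocr_suffix is an assumption: the tesseract used in this post wrote
    textfile.html; other versions may use a different extension.
    """
    image = Path(image)
    out_base = image.with_suffix("")  # image.pbm -> image
    subprocess.run(tesseract_cmd(image, out_base, lang), check=True)
    hocr_file = out_base.with_name(out_base.name + hocr_suffix)
    with open(hocr_file, "rb") as hocr:
        subprocess.run(hocr2pdf_cmd(image, pdf), stdin=hocr, check=True)
```

Calling make_searchable_pdf("image.pbm", "output.pdf") would then run tesseract first and pipe the resulting hOCR file into hocr2pdf.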

First impressions:

character encoding is off after hocr2pdf
setting the character encoding in the HTML file header to ISO-8859-2 or Windows-1250 does not help.
"ő" turns into "Q" and "ű" turns into "q" :-(
the 2 columns were not recognized automatically by tesseract
the psm (page segmentation mode) option does not solve this
this is probably impossible in the current version.
I have to try another way to produce hOCR, with OCR software that handles layout.
font sizes are chaotic
I think this probably depends on the bbox size and therefore on the OCR software.
output file size is 159 kB from an 818 kB PBM, which is way too big.
this cannot be helped unless the PDF is generated with my own methods...
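
To narrow down the encoding and font-size observations, the hOCR file can be inspected before it goes into hocr2pdf. The following is a rough sketch, not a full HTML parser: it assumes the file is UTF-8 and that line boxes are stored in the usual hOCR way, as `bbox x0 y0 x1 y1` inside the title attribute of `ocr_line` elements, with the class attribute coming before the title. The function names are hypothetical.

```python
import re
from collections import Counter

# Captures the four bbox coordinates from an ocr_line element's opening tag.
# Assumes the class attribute appears before the title attribute.
LINE_BBOX = re.compile(
    r"class=['\"]ocr_line['\"][^>]*?title=['\"][^'\"]*?bbox (\d+) (\d+) (\d+) (\d+)"
)


def line_heights(hocr_text):
    """Return the pixel height (y1 - y0) of every ocr_line bbox found."""
    return [int(y1) - int(y0) for _x0, y0, _x1, y1 in LINE_BBOX.findall(hocr_text)]


def count_chars(hocr_text, chars="őűŐŰ"):
    """Count how often each of the given characters occurs in the hOCR text."""
    counts = Counter(hocr_text)
    return {c: counts[c] for c in chars if counts[c]}
```

If count_chars finds "ő" and "ű" in the hOCR file, the characters are most likely lost in the PDF step rather than in tesseract. If line_heights varies a lot for text that is printed at one size, that would point at the boxes as the source of the chaotic font sizes.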

Anyhow, this looks like a disaster :-( Character encoding has to work.