Showing posts with the label: ocr. Show all posts

June 11, 2017

Creating searchable PDFs on Ubuntu 3rd try

Following up on the second try:


My requirements:

  • image layer over text layer
  • good character encoding for Hungarian ű and ő chars
  • good placement of words and lines
  • reasonably good recognition
  • handling of multi-column layouts

Tesseract:


Set up Tesseract:
  • sudo apt-get install tesseract-ocr
  • download the language data you need from https://github.com/tesseract-ocr/tessdata and put it in the tessdata directory (/usr/share/tesseract-ocr/tessdata). E.g. for Hungarian:
    cd /usr/share/tesseract-ocr/tessdata
    sudo wget https://github.com/tesseract-ocr/tessdata/raw/master/hun.traineddata
  • if you get the error that the language data cannot be found, set the environment variable TESSDATA_PREFIX to the directory containing the tessdata directory.
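
Before touching TESSDATA_PREFIX it can help to confirm that the language file really is where Tesseract is expected to look. A minimal sketch, assuming a POSIX shell; the helper name has_traineddata is made up for illustration and is not part of Tesseract:

```shell
# Hypothetical helper: succeed if <lang>.traineddata exists in the given
# tessdata directory.
has_traineddata() {
    lang=$1
    dir=$2
    [ -f "$dir/$lang.traineddata" ]
}

# Example, using the paths from the steps above (adjust to your system):
#   if has_traineddata hun /usr/share/tesseract-ocr/tessdata; then
#       export TESSDATA_PREFIX=/usr/share/tesseract-ocr
#   fi
#   tesseract --list-langs    # should now list "hun"
```

The example follows the rule above and points TESSDATA_PREFIX at the parent of tessdata. As far as I know, newer Tesseract releases (4.x) expect the tessdata directory itself, so if the first form does not work, try the other.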

Input image formats supported by Tesseract:
...are the ones supported by Leptonica:
JPEG, PNG, TIFF, BMP, PNM, GIF, WEBP, JP2

Tesseract searchable pdf output
Example usage with specified language (-l):
tesseract input.png outbase -l hun pdf
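
For more than one page, the same command can be wrapped in a loop. This is only a sketch: the function names outbase_for and ocr_all_to_pdf and the OCR_LANG variable are made up for illustration, and the only Tesseract call used is the one shown above.

```shell
# Derive the output base name from an image path by stripping the
# extension: scans/page1.png -> scans/page1
outbase_for() {
    printf '%s\n' "${1%.*}"
}

# OCR every image given as an argument into a searchable PDF next to it.
# The language defaults to Hungarian; override with e.g. OCR_LANG=eng.
ocr_all_to_pdf() {
    for img in "$@"; do
        tesseract "$img" "$(outbase_for "$img")" -l "${OCR_LANG:-hun}" pdf || return 1
    done
}

# Example:
#   ocr_all_to_pdf scans/*.png
```

Each input produces its own PDF (page1.png gives page1.pdf); merging the pages into one document is a separate step not covered here.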

Results:
  • image layer over text layer - YES
  • good character encoding for Hungarian ű and ő chars - YES
  • good placement of words and lines - YES 
  • reasonably good recognition - YES, though it depends a lot on input quality.
  • handling of multi-column layouts - SO-SO: it sometimes works (e.g. one half of the page is OK, the other half is read as a single column)

This is good enough for me now, so I'm not investigating further.

January 28, 2014

Creating searchable PDFs on Ubuntu 2nd try

Requirements:

  • image layer over text layer
  • good character encoding for Hungarian ű and ő chars
  • good placement of words and lines
  • reasonably good recognition
  • handling of multi-column layouts
The 1st try was Tesseract hOCR output merged with a PNM image by hocr2pdf. Results:
  • image layer over text layer - YES
  • good character encoding for Hungarian ű and ő chars - NO
  • good placement of words and lines - NO
  • reasonably good recognition - YES
  • handling of multi-column layouts - NO
The strangest thing is that hOCR editor does not handle Tesseract's hOCR output well. Actually, it does not handle it at all: it shows HTML tags and everything else where the editable text should be...

2nd try is: OCRopus
I had no success installing and using OCRopus.
  • "recognize" does not handle languages, and/or I could not find a Hungarian data file for it.
    it works like: ocroscript recognize input.pnm > output.html
  • rec-tess-complete should recognize through tesseract, and import language files with the --tesslanguage=hun option, but instead I got this error:
    Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/hun.unicharset
  • so I unpacked the hun.traineddata like this:
    combine_tessdata -u hun.traineddata hun.
  • and put the files to /usr/share/tesseract-ocr/tessdata/
  • however I got this error:
    Error: Illegal malloc request size!
    Fatal error: No error trap defined!
    Signal_termination_handler called with signal 2001
  • then I tried with --tesslanguage=eng and it gave me:
    ocroscript: /usr/share/ocropus/scripts//rec-tess-complete.lua:52: attempt to call global 'hardcoded_version_string' (a nil value)
  • so I searched and found a patch, and installed it like this:
    patch /usr/share/ocropus/scripts/rec-tess-complete.lua rec-tess-complete3_r1308.patch
  • and now it gives me (with "eng"):
    ocroscript: /usr/share/ocropus/scripts//rec-tess-complete.lua:61: Leptonica is disabled, please compile with it or don't use it!
I already have the newest Tesseract installed, but I could not manage to install the newest OCRopus. It had too many unknowns for me, with Python and all...

Results with OCRopus 0.3.1-2 "recognize", merged with hocr2pdf:
  • image layer over text layer - YES
  • good character encoding for Hungarian ű and ő chars - NO
  • good placement of words and lines - NO (it produces huge characters, I cannot even tell which line a word belongs to)
  • reasonably good recognition - NO (because of the English training data)
  • handling of multi-column layouts - DON'T KNOW (the text was too big, it was impossible to tell)
Maybe the big text was caused by the DPI of the image... I should check this to at least be able to judge the layout handling... Nope, it did not help... at all.


...to be continued with:

3rd try is: Cuneiform
4th try: Adobe Acrobat XI on Windows

January 26, 2014

Creating searchable PDFs on Ubuntu 1st try

The process is the following:

  1. have a good-resolution image in a format Leptonica accepts, for use with Tesseract: JPEG, PNG, TIFF, BMP, PNM, GIF or WEBP.
  2. produce hOCR output like
    tesseract image.pbm textfile -l hun hocr
  3. merge the image and the hOCR into a searchable PDF with hocr2pdf (image layer on top of text layer), like
    hocr2pdf -i input.pbm -o output.pdf < textfile.html
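
The two commands can be chained in a small shell function. This is a sketch under assumptions: the name make_searchable_pdf is made up, and the check for a .hocr file is there because, as far as I know, Tesseract releases after 3.02 write <base>.hocr instead of <base>.html.

```shell
# Run Tesseract to produce hOCR, then merge it with the image via hocr2pdf.
# Usage: make_searchable_pdf input.pbm output.pdf [lang]
make_searchable_pdf() {
    image=$1
    pdf=$2
    lang=${3:-hun}          # default to Hungarian, as in the steps above
    base=${image%.*}        # input.pbm -> input

    tesseract "$image" "$base" -l "$lang" hocr || return 1

    # Pick whichever hOCR file this Tesseract version produced.
    if [ -f "$base.hocr" ]; then
        hocr=$base.hocr
    else
        hocr=$base.html
    fi

    # hocr2pdf reads the hOCR from standard input.
    hocr2pdf -i "$image" -o "$pdf" < "$hocr"
}

# Example:
#   make_searchable_pdf input.pbm output.pdf hun
```

A stale .hocr file left over from an earlier run with the same base name would be picked up first, so it is safest to work in a clean directory.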


First impressions:

  • character encoding after hocr2pdf is off
    setting the character encoding in the HTML file header to ISO-8859-2 or Windows-1250 does not help.
    "ő" turns into "Q" and "ű" turns into "q" :-(
  • the 2 columns were not recognized automatically by Tesseract
    the psm option does not solve this
    this is probably impossible in the current version.
    I have to try another way to produce hOCR, with OCR software that handles layout.
  • font sizes are chaotic
    I think this probably depends on the bbox size and therefore on the OCR software.
  • the output file size is 159 kB from an 818 kB PBM, which is way too big.
    this cannot be helped unless the PDF is generated with my own methods...

Anyhow, this looks like a disaster :-( Character encoding has to work.

Installing Tesseract

Thanks to this post I was able to install Tesseract 3 on Ubuntu 10.04. This is how:

Install Tesseract
Get the required packages available in the repositories:

sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
("sudo apt-get install zlibg-dev" is suggested in the Tesseract readme but isn't available. I found I didn't need this.)

I picked this up from a comment: you need to be able to compile and build the software, and Ubuntu needs some packages for that. For many of you these may already be installed, but it doesn't hurt to check.

sudo apt-get install gcc
sudo apt-get install g++
sudo apt-get install automake


Download Leptonica, which cannot be installed with apt-get:
http://www.leptonica.org/download.html version 1.70
Unpack it, navigate to the folder in a terminal, and run:

./configure
make
sudo make install
sudo ldconfig


Now we can actually get and install Tesseract!

Download Tesseract: https://code.google.com/p/tesseract-ocr/downloads/list version 3.02.02
Unpack it, navigate to the folder in a terminal, and run:

./configure
make
sudo make install
sudo ldconfig   (<-- this is very important)

Now, for whatever reason, the training data isn't installed with this.

Download whatever language you need and unzip it to the /usr/local/share/tessdata folder (requires root permissions).
Also download the osd traineddata, for example from here.

One way to copy the files is with a file manager running as root:
sudo nautilus
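
The copy can also be done from the terminal. A minimal sketch, assuming the language files have already been unpacked into some directory; the function name install_traineddata and the example source path are made up for illustration:

```shell
# Hypothetical helper: copy every *.traineddata file from a source
# directory into a tessdata directory, creating the target if needed.
install_traineddata() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for f in "$src"/*.traineddata; do
        [ -f "$f" ] || continue     # no matching files: nothing to copy
        cp "$f" "$dest/" || return 1
    done
}

# Example (writing to the system directory needs root, so run the
# script itself with sudo):
#   install_traineddata ~/Downloads/tessdata /usr/local/share/tessdata
```

For a single file, a plain sudo cp of the .traineddata file into /usr/local/share/tessdata does the same job.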


OCR Hungarian

After six long years I gathered the courage to face the task of OCR-ing texts again.

Goals:

1.) searchable PDF and/or DJVU from an image-only PDF or DJVU.
One guide for PDFs uses Cuneiform for OCR and hocr2pdf to embed the text in the PDF.
Another option is pdfsandwich.

2.) formatted text file, preferably HTML, from an image.


Utilities:

Tesseract:
homepage: http://code.google.com/p/tesseract-ocr/
wikipedia: http://en.wikipedia.org/wiki/Tesseract_(software)
Hungarian training data: http://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.hun.tar.gz&can=2&q=hun (which I lacked six years ago)
Other projects using Tesseract engine: http://code.google.com/p/tesseract-ocr/wiki/3rdParty

Cuneiform:
homepage: http://cognitiveforms.ru/products/cuneiform/
wikipedia: http://en.wikipedia.org/wiki/CuneiForm_(software)

hOcr2Pdf:
homepage: http://hocrtopdf.codeplex.com/

November 26, 2008

Tesseract

Since language files are not separated in Tesseract versions below 2.0, which means training is not supported, I tried to install a newer version from source.
The winner was 2.01. With ./configure, make and sudo make install it went well. I only had to install the language files manually: I downloaded the 2.00 language files and overwrote the 0-byte files in /usr/local/share/tessdata.

Now it works fine. I only have to train it for Hungarian.
I also want to find an easier way to prepare the images...