January 26, 2014

OCR Hungarian

After six long years, I have gathered the courage to face the task of OCR-ing texts again.

Goals:

1.) Searchable PDF and/or DjVu from an image-only PDF or DjVu.
One guide for PDFs uses Cuneiform for the OCR step and hocr2pdf to embed the text in the PDF.
Another option is pdfsandwich.

2.) Formatted text file, preferably HTML, from an image.
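
Both goals can be sketched as command lines assembled from Python. This is only a rough sketch under assumptions, not a tested recipe: the flags (`cuneiform -l/-f/-o`, `hocr2pdf -i/-o` reading hOCR on standard input, `tesseract ... -l hun hocr`) are as I understand them from the tools' usage text and should be checked against the installed versions. The `hocr2pdf` here is assumed to be the command-line tool of that name from ExactImage, which may not be the same project as the one linked under Utilities. The file name `scan.png` and the function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def searchable_pdf_commands(image, lang="hun"):
    """Goal 1: build the Cuneiform and hocr2pdf commands for one page image."""
    image = Path(image)
    hocr = image.with_suffix(".hocr")  # intermediate hOCR file (assumed name)
    pdf = image.with_suffix(".pdf")    # resulting searchable PDF (assumed name)
    # Assumed flags: -l language, -f output format, -o output file.
    ocr_cmd = ["cuneiform", "-l", lang, "-f", "hocr", "-o", str(hocr), str(image)]
    # Assumed flags: -i page image, -o output PDF; the hOCR is fed on stdin.
    embed_cmd = ["hocr2pdf", "-i", str(image), "-o", str(pdf)]
    return ocr_cmd, embed_cmd, hocr, pdf


def html_command(image, lang="hun"):
    """Goal 2: build a Tesseract command that emits hOCR (an HTML-based format)."""
    image = Path(image)
    out_base = image.with_suffix("")  # Tesseract appends the extension itself
    return ["tesseract", str(image), str(out_base), "-l", lang, "hocr"]


def make_searchable_pdf(image, lang="hun"):
    """Run both steps of goal 1; requires cuneiform and hocr2pdf on PATH."""
    ocr_cmd, embed_cmd, hocr, pdf = searchable_pdf_commands(image, lang)
    subprocess.run(ocr_cmd, check=True)
    with open(hocr, "rb") as hocr_file:
        subprocess.run(embed_cmd, stdin=hocr_file, check=True)
    return pdf


if __name__ == "__main__":
    # Only print the commands; nothing is executed here.
    ocr_cmd, embed_cmd, _, _ = searchable_pdf_commands("scan.png")
    print(" ".join(ocr_cmd))
    print(" ".join(embed_cmd) + " < scan.hocr")
    print(" ".join(html_command("scan.png")))
```

Building the command lists separately from running them makes it easy to print and check each command by hand before letting it loose on a whole book. For multi-page PDFs, pdfsandwich would replace the first two steps with a single call.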


Utilities:

Tesseract:
homepage: http://code.google.com/p/tesseract-ocr/
Wikipedia: http://en.wikipedia.org/wiki/Tesseract_(software)
Hungarian training data: http://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.hun.tar.gz&can=2&q=hun (which I lacked six years ago)
Other projects using the Tesseract engine: http://code.google.com/p/tesseract-ocr/wiki/3rdParty

Cuneiform:
homepage: http://cognitiveforms.ru/products/cuneiform/
Wikipedia: http://en.wikipedia.org/wiki/CuneiForm_(software)

hOcr2Pdf:
homepage: http://hocrtopdf.codeplex.com/
