January 28, 2014

Creating searchable PDFs on Ubuntu 2nd try

What I need:

  • image layer over text layer
  • good character encoding for Hungarian ű and ő chars
  • good placement of words and lines
  • reasonably good recognition
  • handling multi-column layouts
1st try was Tesseract's hOCR output embedded with hocr2pdf over a PNM image.
  • image layer over text layer - YES
  • good character encoding for Hungarian ű and ő chars - NO
  • good placement of words and lines - NO
  • reasonably good recognition - YES
  • handling multi-column layouts - NO
The strangest thing is that the hOCR editor does not handle the Tesseract hOCR output well. Actually, it does not handle it at all, showing HTML tags and everything where the editable text should be...

2nd try is: OCRopus 
I had no success installing and using OCRopus.
  • "recognize" does not handle languages, and/or I could not find a Hungarian data file for it.
    it works like: ocroscript recognize input.pnm > output.html
  • rec-tess-complete should recognize through tesseract, and import language files with the --tesslanguage=hun option, but instead I got this error:
    Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/hun.unicharset
  • so I unpacked the hun.traineddata like this:
    combine_tessdata -u hun.traineddata hun.
  • and put the files to /usr/share/tesseract-ocr/tessdata/
  • however I got this error:
    Error: Illegal malloc request size!
    Fatal error: No error trap defined!
    Signal_termination_handler called with signal 2001
  • then I tried with --tesslanguage=eng and it gave me:
    ocroscript: /usr/share/ocropus/scripts//rec-tess-complete.lua:52: attempt to call global 'hardcoded_version_string' (a nil value)
  • so I searched, found a patch, and applied it like this:
    patch /usr/share/ocropus/scripts/rec-tess-complete.lua rec-tess-complete3_r1308.patch
  • and now it gives me (with "eng")
    ocroscript: /usr/share/ocropus/scripts//rec-tess-complete.lua:61: Leptonica is disabled, please compile with it or don't use it!
I already have the newest Tesseract on board, but I failed to manage an installation of the newest OCRopus; it had too many unknown aspects with Python and all...

Results with OCRopus 0.3.1-2 "recognize", merged with hocr2pdf:
  • image layer over text layer - YES
  • good character encoding for Hungarian ű and ő chars - NO
  • good placement of words and lines - NO (it makes huge characters; I cannot even tell which line they are supposed to be on)
  • reasonably good recognition - NO (because of the English training data)
  • handling multi-column layouts - DON'T KNOW (the text was too big to tell)
Maybe the big text was because of the DPI of the image... I should check on this to at least be able to judge the layout handling... nope, it did not help... at all.
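
For reference, the pipeline behind these results was essentially the two commands below (file names are placeholders; the hocr2pdf call is the same one used in the 1st try further down):

ocroscript recognize input.pnm > output.html
hocr2pdf -i input.pnm -o output.pdf < output.html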


...to be continued with:

3rd try is: Cuneiform
4th try: Adobe Acrobat XI on Windows

January 26, 2014

Creating searchable PDFs on Ubuntu 1st try

The process is the following (a small wrapper sketch of steps 2 and 3 follows the list):

  1. have a good-resolution image in a Leptonica-supported format to use with Tesseract: JPEG, PNG, TIFF, BMP, PNM, GIF or WEBP.
  2. produce hOCR output like
    tesseract image.pbm textfile -l hun hocr
  3. merge the image and the hOCR into a searchable PDF with hocr2pdf (image layer on top of the text layer) like
    hocr2pdf -i input.pbm -o output.pdf < textfile.html
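
To save retyping, steps 2 and 3 can go into a tiny script. A minimal sketch, assuming tesseract and hocr2pdf are on the PATH and the input is a single image file (the script name and argument handling are my own additions):

#!/bin/sh
# usage: sh ocr2pdf.sh scan.pbm
img="$1"
base="${img%.*}"

# step 2: hOCR output from Tesseract (3.02 writes ${base}.html)
tesseract "$img" "$base" -l hun hocr

# step 3: merge the image and the hOCR into a searchable PDF
hocr2pdf -i "$img" -o "$base.pdf" < "$base.html"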


First impressions:

  • character encoding after hocr2pdf is off
    setting the character encoding in the HTML file header to ISO-8859-2 or Windows-1250 does not help.
    "ő" turns into "Q" and "ű" turns into "q" :-(
  • the 2-column layout was not recognized automatically by Tesseract
    the psm option does not solve this
    this is probably impossible in the current version.
    I have to try another way to produce hOCR with an OCR program that handles layout.
  • font sizes are chaotic
    I think this probably depends on the bbox size and therefore on the OCR software.
  • the output file size is 159 kB from an 818 kB PBM, which is way too big.
    this cannot be helped unless I generate the PDF with my own methods...

Anyhow, this looks like a disaster :-( The character encoding has to work.

Installing Tesseract

Thanks to THIS post I was able to install Tesseract 3 on Ubuntu 10.04. This is how:

Install Tesseract
Get the required packages available in the repositories:

sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
("sudo apt-get install zlibg-dev" is suggested in the Tesseract readme but isn't available. I found I didn't need this.)

I picked this up from a comment: you need to be able to compile and build the software, and Ubuntu needs some packages to do this. For many of you these may already be installed, but it doesn't hurt.

sudo apt-get install gcc
sudo apt-get install g++
sudo apt-get install automake


Download this program, which can't be obtained with apt-get:
http://www.leptonica.org/download.html version 1.70
unpack, navigate to the folder in terminal, and run:

./configure
make
sudo make install
sudo ldconfig
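
A quick sanity check that the library actually landed in the linker cache (optional):

ldconfig -p | grep lept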


Now we can actually get and install Tesseract!

download tesseract: https://code.google.com/p/tesseract-ocr/downloads/list version 3.02.02
unpack, navigate to the folder in terminal, and run:

./configure
make
sudo make install
sudo ldconfig   (<-- this is very important)

Now for whatever reason the training data isn't installed with this.

download whatever language you need and unzip it into the /usr/local/share/tessdata folder (this requires root permissions)
also download the osd traineddata, for example from here

to get a file manager with root permissions, try:
sudo nautilus
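
Alternatively, the copying can be done from the terminal. A sketch, assuming the 3.02 Hungarian pack was downloaded to the current directory (the layout inside the archive is an assumption):

# unpack the language pack and copy the training data with root permissions
tar xzf tesseract-ocr-3.02.hun.tar.gz
sudo mkdir -p /usr/local/share/tessdata
sudo cp tesseract-ocr/tessdata/hun.traineddata /usr/local/share/tessdata/
sudo cp osd.traineddata /usr/local/share/tessdata/

# quick smoke test (the image name is just a placeholder)
tesseract image.pbm out -l hun && head out.txt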


OCR Hungarian

After six long years I gathered my courage to face the task of OCR-ing texts again.

Goals:

1.) searchable PDF and/or DJVU from image PDF or DJVU.
one guide for PDFs uses Cuneiform to OCR and hocr2pdf to embed the text in the PDF.
another option is pdfsandwich (a rough invocation sketch follows after the goals)

2.) a formatted text file, preferably HTML, from an image.
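
I have not tried pdfsandwich yet; based on its documentation the invocation is presumably something like this (the -lang option and the output name are my reading of the docs, not verified):

pdfsandwich -lang hun input.pdf
# should produce input_ocr.pdf with a searchable text layer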


Utilities:

Tesseract:
homepage: http://code.google.com/p/tesseract-ocr/
wikipedia: http://en.wikipedia.org/wiki/Tesseract_(software)
Hungarian training data: http://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.hun.tar.gz&can=2&q=hun (which I lacked six years ago)
Other projects using Tesseract engine: http://code.google.com/p/tesseract-ocr/wiki/3rdParty

Cuneiform:
homepage: http://cognitiveforms.ru/products/cuneiform/
wikipedia: http://en.wikipedia.org/wiki/CuneiForm_(software)

hOcr2Pdf:
homepage: http://hocrtopdf.codeplex.com/

January 15, 2014

Burning DVDs and failing.

With my Samsung SE-208DB/TSBS Black.
It is connected to 2 USB ports, 1 directly and 1 via an extension cable.


GnomeBaker 
DVD+R, Speed: 8X, Mode: auto. 
1. Time: 71 min.
Output:

Writing an image:
I: -input-charset not specified, using utf-8 (detected in locale settings)
Using 321030_2052795755951_117000.JPG;1 for  Photos_2012_2/jul/321030_2052795755951_1177642115_n.jpg (321030_2052795755951_1177642115_n (1).jpg)
Total translation table size: 0
Total rockridge attributes bytes: 121887
Total directory bytes: 243712
Path table size(bytes): 1104
Max brk space used 108000
2174849 extents written (4247 MB)
Executing 'builtin_dd if=/home/user/Data25.iso of=/dev/sr0 obs=32k seek=0'
/dev/sr0: "Current Write Speed" is 6.1x1352KBps.
:-( unable to WRITE@LBA=16f290h: Input/output error
:-( write failed: Input/output error
/dev/sr0: flushing cache
:-( unable to FLUSH CACHE: Input/output error
:-( unable to SYNCHRONOUS FLUSH CACHE: Input/output error
Also failed.
This is simply nonsense. I have a brand new DVDRW, burned 3 DVDs successfully so far, and now it fails. This is unbelievable.




2. Time: 79 min.

Executing 'genisoimage -gui -V DiscLabel -A GnomeBaker -p à -iso-level 3 -l -r -hide-rr-moved -J -joliet-long -graft-points --path-list /tmp/GnomeBaker/gnomebaker-C0YY9W | builtin_dd of=/dev/sr0 obs=32k seek=0'
I: -input-charset not specified, using utf-8 (detected in locale settings)
/dev/sr0: "Current Write Speed" is 6.1x1352KBps.
Total translation table size: 0
Total rockridge attributes bytes: 224932
Total directory bytes: 428032
Path table size(bytes): 1850
Max brk space used 1e6000
2000215 extents written (3906 MB)
/dev/sr0: flushing cache
/dev/sr0: updating RMA

/dev/sr0: closing session





Here's a failure, which happened 2 times:
Executing 'genisoimage -gui -V DataAngel_25 -A GnomeBaker -p A -iso-level 3 -l -r -hide-rr-moved -J -joliet-long -graft-points --path-list /tmp/GnomeBaker/gnomebaker-CC7X9W | builtin_dd of=/dev/sr0 obs=32k seek=0'
I: -input-charset not specified, using utf-8 (detected in locale settings)
Using 321030_2052795755951_117000.JPG;1 for  Photos_2012_2/képek facebookról Julinak/321030_2052795755951_1177642115_n.jpg (321030_2052795755951_1177642115_n (1).jpg)
/dev/sr0: "Current Write Speed" is 6.1x1352KBps.
:-( unable to WRITE@LBA=7c9d0h: Input/output error
:-( write failed: Input/output error
/dev/sr0: flushing cache
:-( unable to FLUSH CACHE: Input/output error
:-( unable to SYNCHRONOUS FLUSH CACHE: Input/output error

January 5, 2014

MobiPocket Creator usage

Download Mobipocket Creator

Follow these instructions to install the Publisher Version through WINE

To create e-books:
Follow instructions

Follow User Manual

Be prepared to continuously ignore software errors while creating e-books.

What Works?

This simple process works okay; the e-book gets built:
  • Create new publication
  • Add Content:
    • Insert HTML file
    • Insert Image file(s)
  • Add Cover Image
  • Add Metadata
  • (Save publication)
  • Build e-book

What does not work?
Build fails with "error(htmlparser) no BODY tag found in content file"


KindleGen Usage

Download Kindlegen for Linux
Read publishing guidelines

Extract package anywhere

docs/english/Readme.txt content (relevant):

Creating Kindle ebooks - Advanced users:
-------------------------------------------
Advanced users can use the command line tool to convert EPUB/HTML to Kindle ebooks. This interface is available in Windows, Mac and Linux platform. This tool can be used for automated bulk conversions.

KindleGen for Linux 2.6 i386 :
1. Download the KindleGen tar.gz from www.amazon.com/kindleformat/kindlegen to a folder such as Kindlegen in home directory (~/KindleGen).
2. Extract the contents of the file to '~/KindleGen'. Open the terminal, move to folder containing the downloaded file using command "cd ~/KindleGen" and then use command "tar xvfz kindlegen_linux_2.6_i386_v2.tar.gz" to extract the contents.
3. Open the Terminal application and type ~/KindleGen/kindlegen. Instructions on how to run KindleGen are displayed.
4. Conversion Example: To convert a file called book.html, go to the directory where the book is located, such as cd desktop, and type ~/KindleGen/kindlegen book.html. If the conversion was successful, a new file called book.mobi displays on the desktop.
5. Please note: It is recommended to follow these steps to run KindleGen. Double-clicking the KindleGen icon does not launch this program. Run the above commands without quotes

Instructions on how to run KindleGen:
Navigate to the folder in a terminal and type ./kindlegen for usage information:
*************************************************************
 Amazon kindlegen(Linux) V2.9 build 0730-890adc2
 A command line e-book compiler
 Copyright Amazon.com and its Affiliates 2013
*************************************************************
Usage : kindlegen [filename.opf/.htm/.html/.epub/.zip or directory] [-c0 or -c1 or c2] [-verbose] [-western] [-o <file name>]
Note:
   zip formats are supported for XMDF and FB2 sources
   directory formats are supported for XMDF sources
Options:
   -c0: no compression
   -c1: standard DOC compression
   -c2: Kindle huffdic compression
   -o <file name>: Specifies the output file name. Output file will be created in the same directory as that of input file. <file name> should not contain directory path.
   -verbose: provides more information during ebook conversion
   -western: force build of Windows-1252 book
   -releasenotes: display release notes
   -gif: images are converted to GIF format (no JPEG in the book)
   -locale <locale option> : To display messages in selected language
      en: English
      de: German
      fr: French
      it: Italian
      es: Spanish
      zh: Chinese
      ja: Japanese
      pt: Portuguese
      ru: Russian
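
For example, a conversion run from a folder containing the book might look like this (file names are placeholders; the options come from the usage text above):

cd ~/books
~/KindleGen/kindlegen book.epub -c2 -verbose -o book.mobi
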
First impressions:
This program should be able to convert .html/.htm and .epub files...
it converts them to .mobi with a file size at least double the original (depending on images and compression)
Uploaded to the Kindle, all the files seem to work fine. Text formatting is kept in some way - not perfect, but readable. Kindle shows the title and author for the epub, and the title set for the html (not the filename!)

Seems okay, but I don't have a real chance to easily generate a beautiful book this way... maybe converting from epub might be a way to keep the book beautiful...

...or I should really read the publishing guidelines to learn the proper formatting.