27 February 2015

PDF to HTML for eBook creation

My general goal is to process PDF files through an HTML/XML stage before making eBooks (mobi/epub) from them.
So the goal is to somehow generate a clean HTML file out of a PDF.
Some of my general needs are:
  • keep paragraphs together (not mixing up <br /> with <p></p>)
  • get images and text together
  • keep character formatting
  • handle multiple columns (convert to single column)
  • skip page numbers
First of all, here's what I found important from the Google search results:

Thomas Levine's Parsing PDF files walk-through
Tools to use:
  • Basic file analysis tools (ls or another language’s equivalent)
  • PDF metadata tools (pdfinfo or an equivalent)
  • pdftotext
  • pdftohtml -xml
  • Inkscape via pdf2svg
  • PDFMiner
My own experiments:
With this PDF file, and another one that I made for this purpose.

$ pdftohtml example.pdf
  • character encoding is displayed incorrectly
    • <meta http-equiv="CONTENT-TYPE" content="text/html; charset=utf-8"> has to be entered in the HTML HEAD for proper character encoding
    • the -enc UTF-8 option (case sensitive!) can be passed if needed, but the meta header still has to be entered manually.
    • the meta gets entered in the document if the -noframes option is used.
  • no paragraphs, only breaks
  • a navigation html file was generated with two frames: one for a page index html, and one for the actual text.
    • this can be avoided using the -noframes option
    • the -s option generates a single output file (however, this only concatenates the page HTML files, and links are not corrected)
  • pages are separated with horizontal rulers
  • images are processed all right
$ pdftohtml -c example.pdf
  • formatting is mostly strictly preserved
    • styles by css
    • absolute positions of paragraphs
      • some paragraphs are kept together, some are not recognized properly
    • left text alignment is the only alignment that is kept; everything else is shown with an absolute left position.
    • bold and italics are preserved, but underline shows up wrong
    • columns are preserved okay
    • font face changes do not show.
  • images are embedded in page background
  • every page is a separate html file
$ pdftohtml -xml example.pdf
  • stores formatting info about absolute positions, font size and line height
  • no info about paragraphs, images
Altogether, my experience with pdftohtml: the output is almost good for nothing unless it is processed properly afterwards.

Postprocessing a file generated with the pdftohtml -enc UTF-8 -noframes -p -q example.pdf command:
  • the style element and some meta elements in the html head are unnecessary.
  • <a name=[pagenumber]></a> marks the beginning of every page
  • <br/> in the middle of a line marks line break
  • <br/> at the end of the line marks paragraph break
  • <hr/> marks the end of every page
  • the last line before the end of page is probably a page number
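Based on these markers, a first cleaning pass can be sketched with sed. This is only a sketch with placeholder file names: the <br/>-to-paragraph rule is a rough guess, and stripping the page-number line just before each <hr/> would need a two-line window (sed's N command) rather than this line-at-a-time pass.

```shell
#!/bin/sh
# sketch of a cleanup pass over pdftohtml -noframes output,
# based on the markers listed above
clean_pdftohtml() {
  sed -e 's|<a name=[0-9]*></a>||g' \
      -e '/^<hr\/>$/d' \
      -e 's|<br/>$|</p><p>|'
}

# a real run would be: clean_pdftohtml < example.html > example_clean.html
printf '%s\n' '<a name=1></a>Some text<br/>' '<hr/>' | clean_pdftohtml
# prints: Some text</p><p>
```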
Download and Install PDFMiner.
Review the command line tools and their capabilities.

$ pdf2txt.py -o example.html  example.pdf
$ pdf2txt.py -Y normal -o example.html  example.pdf 
  • no paragraphs, only breaks
    (paragraphs are not collected -- no easy way to restore them, unlike in pdftohtml)
  • formatting is not css but html span tags
    • HTML Tidy can collect the formatting to the front of the html file as css, making it easier to review and modify:
      tidy -utf8 -c -o example_tidy.html  example.html
  • bold and italics are kept as a font-family style in the span tag; underline is taken as an image (?)
  • images not processed
  • display is messy, text displayed on top of each other
  • code is quite all right.
$ pdf2txt.py -o example.xml  example.pdf
$ pdf2txt.py -Y exact -o example.html  example.pdf 
  • stores exact position of every single character
  • holds space for images
$ pdf2txt.py -t tag -o example.txt  example.pdf
  • stores pdf page data, with unformatted text content
$ pdf2txt.py -Y loose -o example.html  example.pdf
  • does not keep the line breaks; only the span style is present to indicate text changes. Paragraphs are not recognized properly
  • messy code
Altogether, my experience with PDFMiner: this is not what I'm looking for. It either stores too much or too little information for my purposes, so in my case it's actually good for nothing unless it is processed properly afterwards.

$ pdf2htmlEX example.pdf
  • omg wow amazing pretty output view!
    • everything looks exactly like the pdf file
  • (in exchange for) extremely messy code :)
    • one page is one line, identified by a (div) id.
    • formatting is kept in classes (div, span, img, ...) by CSS:
      • @font-face {}
      • @media {}
      • .ff: font-family
      • (t) m: transformation matrix
      • v: vertical-align
      • ls: letter-spacing
      • sc: text-shadow
      • ws: word-spacing
      • _: display and width or margin-left
      • fc: color
      • fs: font-size
      • y: bottom
      • h: height
      • w: width
      • x: left
  • all file embedding can be turned off with --embed cfijo (will generate separate output files)
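For illustration, one line of pdf2htmlEX output looks roughly like this hypothetical fragment (the class names follow the abbreviations above; the actual class numbers and values are generated per document):

```html
<div id="pf1" class="pf w0 h0">
  <div class="t m0 x1 h2 y3 ff1 fs0 fc0 sc0 ls0 ws0">Some text</div>
</div>
```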
Altogether useless for my purposes. However, it is the best choice if your purpose is to display a PDF file as an HTML page on the web.

PdfMasher (GUI)
Does not keep formatting or images, but is specialized to keep proper order of the text.
  • as said, does not keep font formatting or image placeholders
  • with a little manual adjustment:
    • can be set to ignore page numbers (amazing!)
    • can be set to collect and link footnotes to be endnotes (amazing!)
  • html code is quite clear
  • paragraphs are kept together where possible
Altogether, so far this is the best tool to prepare a simple text PDF for eBook creation.
There are five types of elements that can be set in Edit mode:
  • Normal: will be default text
  • Title: becomes Header tag H1 when pressed once, H2 when pressed twice, etc.
    • best to be filtered with sorting by Font Size
  • Footnote: will search for the reference and link it as endnote
    • best to be filtered with sorting by Font Size (or X or Y, respectively)
  • Ignore: will be ignored
    • Page numbers, footers and headers are best to be filtered with sorting by Y (or X, respectively)
  • To Fix: puts a FIXME sign in front of the paragraph. In HTML this becomes italics-formatted text.
These types can be set on the Table or on the Page tab.
Build options are:
  • Generate Markdown: generates a plain text file in the pdf directory with marks specifying the Title and To Fix parts, Ignored elements already ignored, and Footnotes already linked.
  • Edit Markdown: opens the markdown text file
  • Reveal Markdown: opens the directory in the default file browser containing the markdown
  • View HTML: generates the html file out of the markdown file, and opens it in the default web browser
Markdown signs:
  • # for H1, ## for H2, etc.
  • *FIXME* for italics
  • *** for horizontal ruler  
  • numbers for lists (quite annoying)
  • more on markdown usage
  • I accidentally found eBook creation software that works from text like this markdown text, so I'll just leave a link here for reference.
Formatting of the text can be fixed manually in the markdown form or in the html form.
When saving as MOBI or EPUB there will be Table of Contents and navigation generated from the headings. The book Start will be set to the first heading.

  • To export all images from your file for further usage,
    use pdf2htmlEX --embed cfijo example.pdf.
    • can be opened in LibreOffice
  • To get a very simple html with proper paragraphs, endnotes, and headings, but without font formatting,
    use PDFMasher (GUI).
  • To get a fair html code with most of the images and font formatting but messed up paragraphing,
    use pdftohtml -enc UTF-8 -noframes -p -q example.pdf
    • cannot be opened in LibreOffice
That's it so far.
Probably the best way would be to learn SED and create an HTML cleaning script for myself, but that's likely in the distant future.

25 February 2015

PDF smart rename and split based on content

I had been planning to write these scripts for some time to ease my work with PDF files.

The original idea was to grep some text from the pdf file and do a file manipulation based on the text.

This StackExchange post suggests the following methods to find a piece of text in a pdf:
  • pdftotext + grep
  • pdfgrep
  • strings + grep
  • pdftohtml + grep
I used pdftotext in the following examples.
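For the pdftotext route, here is a minimal sketch; pdf_contains is my own helper name, and it assumes pdftotext (from poppler-utils) is installed:

```shell
#!/bin/sh
# does the pdf contain the given string? the exit status says yes/no
pdf_contains() {
  # "-" sends the extracted text to stdout; grep -q only sets the exit status
  pdftotext "$1" - 2>/dev/null | grep -q "$2"
}

# usage: pdf_contains example.pdf "Employee Name" && echo "found"
```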

I also needed to learn some more Bash scripting:
1st Quest
I have a monthly employee data sheet (payroll), which I get in one big multipage pdf file, having a single page for each employee. The content for each page is: 1st line: title, 2nd line: month, 3rd line: employee name.
I wanted to burst this pdf file into single pages and rename each page to title, month, and employee name.
My solution is:
# automatically splits a pdf file into single pages while renaming the output files using the first three lines of the pdf

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS


# burst and rename pages to separate files
for file in "${filelist[@]}"; do
 pagecount=`pdfinfo "$file" | grep "Pages" | awk '{ print $2 }'`
 for (( pageindex=1; pageindex<=$pagecount; pageindex+=1 )); do
  newname=`pdftotext -f $pageindex -l $pageindex "$file" - | head -n 3 | tr '\n' ' ' | cut -f1 -d"("`
  datestamp=`date +%s%N` # to avoid overwriting with same new name
  pdftk "$file" cat $pageindex output "$newname$datestamp.pdf"
 done
done

2nd Quest
I have files I would like to rename based on the content of the file. Each file is some kind of employee data file, and has the employee name in the content, but not necessarily in the file name.
I wanted to rename the file based on the employee name and the page title (first line of the file).
My solution is:
# automatically renames a pdf file based on content matching a list of names

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

names=( "Aaa Aaa" "Bbb Bbb" "Ccc Ccc" ) # names to match in the files

# process files
for file in "${filelist[@]}"; do
 # find name from names in the file
 foundname=""
 for name in "${names[@]}"; do
  testname=`pdftotext "$file" - | grep "$name"`
  if [[ $testname != "" ]]; then
   foundname=$name
   break
  fi
 done

 # rename file based on found name
 title=`pdftotext -f 1 -l 1 "$file" - | head -n 1`
 datestamp=`date +%s%N` # to avoid overwriting with same new name
 mv "$file" "${file%/*}/$foundname $title $datestamp.pdf"
done

3rd Quest
I have a yearly employee data sheet (personal tax data), which I get in one big multipage pdf file, with multiple pages for each employee. This year it is three pages per employee, and the tax ID number is the only common point on the three pages (it is a 10-digit number beginning with 8). The first line of each page is a header, and it is different on each page for a single employee.
I wanted to split the pdf to employees and rename it with the header of the pages.
My solution is:
# automatically splits a pdf file into multi-page sets based on a search criterion, renaming the output files using the search criterion and some of the pdf text

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

# process files
for file in "${filelist[@]}"; do
 pagecount=`pdfinfo "$file" | grep "Pages" | awk '{ print $2 }'`
 # MY SEARCH CRITERIA is a 10 digit long ID number that begins with number 8:
 storedid=`pdftotext -f 1 -l 1 "$file" - | egrep -o '8[0-9]{9}'`
 pagetitle=`pdftotext -f 1 -l 1 "$file" - | head -n 1`
 pattern=""

 for (( pageindex=1; pageindex<=$pagecount; pageindex+=1 )); do

  header=`pdftotext -f $pageindex -l $pageindex "$file" - | head -n 1`
  pageid=`pdftotext -f $pageindex -l $pageindex "$file" - | egrep -o '8[0-9]{9}'`
  datestamp=`date +%s%N` # to avoid overwriting with same new name

  # match ID found on the page to the stored ID
  if [[ $pageid == $storedid ]]; then
   pattern+="$pageindex " # adds the page number as text to the variable, separated by spaces

   if [[ $pageindex == $pagecount ]]; then # process last output of the file
    pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
   fi
  else
   # process previous set of pages to output, then start a new set
   pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
   pattern="$pageindex "
   storedid=$pageid
   pagetitle=$header
  fi
 done
done

4th Quest
I wanted to have an alternate version of the 3rd Quest in case I have to split any other type of file into equal sets of pages, without the tax ID as a common point in them.
So I wanted to split this multipage pdf based on a manually set block size, i.e. 3 pages / set.
My solution is:
# automatically splits a pdf file into sets of pages based on a given block size, and renames the output files using the original filename plus a datestamp

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

blocksize=3 # pages per set, set manually

# process files
for file in "${filelist[@]}"; do

 # variables
 filename=${file%.pdf} # remove extension from filename
 blocks=()

 # calculate page range blocks
 pagecount=`pdfinfo "$file" | grep "Pages" | awk '{ print $2 }'`
 let setcount=$pagecount/$blocksize
 for (( setindex=1; setindex<=$setcount; setindex+=1 )); do

  if [[ $setindex -lt $setcount ]]; then
   let lastpage=$setindex*$blocksize
   let firstpage=$lastpage-$blocksize+1

  elif [[ $setindex -eq $setcount ]]; then # handle last set (it takes any remainder pages too)
   let lastpage=$pagecount
   let firstpage=$setindex*$blocksize-$blocksize+1
  fi

  blocks+=("$firstpage-$lastpage")
 done

 # process ranges to output
 for block in "${blocks[@]}"; do
  datestamp=`date +%s%N` # to avoid overwriting with same new name
  pdftk "$file" cat $block output "$filename $datestamp.pdf"
 done
done

That's it. It could be better, but it works all right.

19 February 2015

Using NCX-generator on Wine

NCX-generator is a Windows Command Line Interface utility to prepare e-books for processing with Kindlegen.

ncx-generator can be used through Wine. It requires .NET Framework 4 to be installed.

ncx-generator has to be run through Wine's Console User Interface (wineconsole - installed with the wine package).

This is the help file:

ncxgen [options] filename

  -h, -?, --help             Display this help.
      --toc                  Generate the html Table of Contents.
      --ncx                  Generate the NCX Global Navigation.
      --opf                  Create the opf file package.
  -a, --all                  Create both html ToC, ncx and opf files.
  -q, --query=VALUE          The XPath query to find the ToC items. Use
                               multiple times to add levels in the ToC.
  -l, --level=VALUE          Number of levels to collapse to generate the NCX
                               file - used with -ncx or -all.
  -e                         Place the generated TOC at the end of the book
      --toc-title=VALUE      Name of the Table of Contents
      --author=VALUE         Author name.
      --title=VALUE          Book title.
  -v, --verbose              Turn on verbose output
  -i                         Convert <PRE class='image'> tages to PNG images
      --overwrite            Overwrite files without any prompt

         "ngen.exe -all -q "//h1" -q "//h2[@class='toc']" source.xhtml"
This expression will parse the xhtml file source.xhtml looking for the tag h1 and the tag h2 with an attribute class set to 'toc'. It will then create the html Table of Contents, the NCX Global Navigation file and the OPF file using the items found.

The difference between the Ubuntu Terminal and the WineConsole is that you cannot use TAB to complete filenames, and you have to refer to directories in your current directory as ./directory_name/ instead of simply directory_name/ as in the Ubuntu Terminal.

To open the Windows Command Line Interface, run wineconsole cmd
To do this without opening the Windows Command Line Interface, just run the command through WineConsole as if you were using the Windows CLI. This way it is possible to use TAB to complete filenames, which is a pleasure. The Windows CLI will be opened by WineConsole to run the command and closed when the run ends. If the user is prompted for keyboard input, the Windows CLI will stay open and wait for the input.
~$ wineconsole ncxGen.0.2.6.exe -a -q "//h1" --author="Book Author Name" --title="Book Title" --toc-title="TOC Title" ./Test_Book/Book.html

This way NCX-generator can be integrated into bash scripts :)

Running a command like the above (with the -a option), starting from a Book.html and a Cover.jpg, will create the html ToC, NCX and OPF files in the book's directory.

NOTE: if you name the cover image Cover.jpg, and put it in the same directory as the book HTML, it will be automatically recognized and included properly in the OPF file, when you run ncxGen. By specifying the --author and --title options too, you will not have to edit the OPF manually at all before building your book with KindleGen.
However, some UTF-8 characters are not properly recognized by ncxGen, so if you use e.g. Hungarian characters like ő or ű in the title, toc-title or author, you'd better check the OPF and correct it.

15 February 2015

Installing MobiPocket Creator on Ubuntu

First of all, install the latest version of wine.
Second, set it to use a 32-bit Wine prefix:
remove the current Wine settings directory:
sudo rm -r ~/.wine
and then create 32 bit prefix with this command:
WINEARCH=win32 WINEPREFIX=~/.wine winecfg 
Now for Mobipocket Creator to run properly, run this in terminal and follow any instructions:
winetricks ie6 vcrun2005
Apparently creating new mobi books works, but an error pops up every time you do something. Just ignore these errors and build the book anyway.

Do this to get a clean GUI for Mobipocket Creator:
winetricks ie8
This solves the pop-up errors and the incomplete user interface!

Install MobiPocket Creator:
Download from official page,
double click to install with Wine.


Installing the latest version of Wine

I installed Wine from the repository, but it did not work well with the program I installed on it, so I tried out what happens if I update Wine.

1. add the wine source to your repository:
sudo add-apt-repository ppa:ubuntu-wine/ppa
2. update repository list:
sudo apt-get update
3. install latest version of wine
sudo apt-get install wine

The program I wanted to run that was fixed by this update:
  • Amazon Kindle Previewer