Alice@Ubuntu: PDF to HTML for eBook creation

My general goal is to pass PDF files through an HTML/XML stage before making eBooks (mobi/epub) from them.
So the task is to somehow generate a clean HTML file out of a PDF.
Some of my general needs are:

keep paragraphs together (not mixing up with )
get images and text together
keep character formatting
handle multiple columns (convert to single column)
skip page numbers

First of all, here's what I found important from the Google search results:

Thomas Levine's Parsing PDF files walk-through
Tools to use:

Basic file analysis tools (ls or another language’s equivalent)
PDF metadata tools (pdfinfo or an equivalent)
pdftotext
pdftohtml -xml
Inkscape via pdf2svg
PDFMiner

My own experiments:
With this PDF file, and another one that I made for this purpose.

PDFtoHTML
$ pdftohtml example.pdf

incorrectly displayed character encoding

<meta http-equiv="CONTENT-TYPE" content="text/html; charset=utf-8"> has to be entered in the HTML HEAD for proper character encoding
-enc UTF-8 (case sensitive!) can be added as an option if needed, but the meta tag still has to be entered manually.
the meta tag is written into the document automatically if the -noframes option is used.
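
Since the meta tag has to be added by hand when the frames output is used, a small script could do it instead. This is only a minimal sketch in Python, not something pdftohtml provides: the function name ensure_utf8_meta is made up for illustration, and it assumes the file has an ordinary <head> tag.

```python
import re

# The meta tag quoted above, inserted verbatim.
META = '<meta http-equiv="CONTENT-TYPE" content="text/html; charset=utf-8">'

def ensure_utf8_meta(html):
    """Insert the charset meta tag right after <head>, unless a charset is already declared."""
    if re.search(r'<meta[^>]+charset', html, flags=re.IGNORECASE):
        return html  # some charset is already declared; leave the file alone
    # Match <head> or <head ...>, but not <header>.
    return re.sub(
        r'(<head(?:\s[^>]*)?>)',
        lambda m: m.group(1) + "\n" + META,
        html,
        count=1,
        flags=re.IGNORECASE,
    )

if __name__ == "__main__":
    page = "<html><head><title>x</title></head><body>text</body></html>"
    print(ensure_utf8_meta(page))
```

If no <head> tag is found, the text comes back unchanged, so running it twice on the same file is harmless.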

no paragraphs, only breaks
a navigation html file was generated with two frames: one for a page index html, and one for the actual text.

this can be avoided using the -noframes option
or with the -s option, which generates a single output file (however, this only concatenates the HTML files, and links are not corrected)

pages are separated with horizontal rules
images are processed all right

$ pdftohtml -c example.pdf

formatting is mostly strictly preserved

styles by css
absolute positions of paragraphs

some paragraphs are kept together, some are not recognized properly

left text alignment is the only one that is kept, everything else is shown with absolute left position.
bold and italics are preserved, but underline is rendered incorrectly
columns are preserved okay
font face changes do not show.

images are embedded in page background
every page is a separate html file

$ pdftohtml -xml example.pdf

stores formatting info about absolute positions, font size and line height
no info about paragraphs, images
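
Even without paragraph information, the positions in the XML can be used to guess paragraphs. The following is a rough sketch under assumptions: it relies on the <text> elements carrying top and height attributes (as I recall the pdftohtml -xml output doing), the sample is hand-written rather than real tool output, and the function name paragraphs_from_xml and the gap_factor threshold are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written imitation of pdftohtml -xml output (attribute names assumed).
SAMPLE = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="100" left="50" width="300" height="14" font="0">First line of a</text>
<text top="116" left="50" width="280" height="14" font="0">paragraph.</text>
<text top="160" left="50" width="290" height="14" font="0">A <b>second</b> paragraph.</text>
</page>
</pdf2xml>"""

def paragraphs_from_xml(xml_text, gap_factor=1.5):
    """Join consecutive text lines; start a new paragraph on a large vertical gap."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for page in root.iter("page"):
        current, prev_top = [], None
        for el in page.iter("text"):
            top, height = int(el.get("top")), int(el.get("height"))
            text = "".join(el.itertext()).strip()  # drops inline <b>/<i> markup
            if not text:
                continue
            # A jump larger than gap_factor line heights is treated as a paragraph break.
            if prev_top is not None and top - prev_top > gap_factor * height:
                paragraphs.append(" ".join(current))
                current = []
            current.append(text)
            prev_top = top
        if current:
            paragraphs.append(" ".join(current))
    return paragraphs

if __name__ == "__main__":
    for p in paragraphs_from_xml(SAMPLE):
        print(p)
```

This sketch never joins text across pages and ignores columns entirely, so a paragraph that continues on the next page ends up split in two.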

Overall experience with PDFtoHTML: it is almost good for nothing unless the output is properly post-processed afterwards.

Postprocessing a file generated with the pdftohtml -enc UTF-8 -noframes -p -q example.pdf command:
Structure:

the style element and some meta elements in the HTML head are unnecessary.
<a name=[pagenumber]></a> marks the beginning of every page
in the middle of a line marks line break
at the end of the line marks paragraph break
<hr/> marks the end of every page
the last line before the end of page is probably a page number
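
The page markers described above are regular enough to strip with a script. Here is a minimal sketch in Python, assuming the anchors look like <a name=1></a> and that a last line containing only digits is a page number. The sample is hand-written, not real pdftohtml output, and the function name split_pages is made up for illustration.

```python
import re

# Hand-written imitation of the body of a pdftohtml -noframes file.
SAMPLE = """<a name=1></a>Chapter One<br/>
It was a dark night.<br/>
1<br/>
<hr/>
<a name=2></a>The storm went on.<br/>
2<br/>
<hr/>
"""

PAGE_ANCHOR = re.compile(r'<a name="?\d+"?></a>')
PAGE_BREAK = re.compile(r'<hr\s*/?>')
# A line holding nothing but digits, optionally wrapped in tags or &#160; entities.
NUMBER_ONLY = re.compile(r'(?:<[^>]+>|&#160;|\s)*\d+(?:<[^>]+>|&#160;|\s)*')

def split_pages(body_html):
    """Return one list of lines per page, without page anchors and trailing page numbers."""
    pages = []
    for chunk in PAGE_BREAK.split(body_html):
        chunk = PAGE_ANCHOR.sub("", chunk)
        lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
        # Heuristic: the last line of a page is dropped if it is only a number.
        if lines and NUMBER_ONLY.fullmatch(lines[-1]):
            lines.pop()
        if lines:
            pages.append(lines)
    return pages

if __name__ == "__main__":
    for number, lines in enumerate(split_pages(SAMPLE), start=1):
        print(number, lines)
```

The page-number guess is only a heuristic: a page whose real last line happens to be a bare number (a year, for example) would lose that line too.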

PDFMiner
Download and Install PDFMiner.
Review the command line tools and their capabilities.

$ pdf2txt.py -o example.html example.pdf
or
$ pdf2txt.py -Y normal -o example.html example.pdf

no paragraphs, only breaks
(paragraphs are not collected -- no easy way to restore them, unlike in pdftohtml)
formatting is not css but html span tags

Html Tidy can collect the formatting to the front of the html file as css, making it easier to review and modify:
$ tidy -utf8 -c -o example_tidy.html example.html

bold and italics are kept as font family style in span tag, underline is taken as an image (?)
images not processed
display is messy, text displayed on top of each other
code is quite all right.
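
Because bold and italics only survive as font names inside the span tags, they could be turned back into plain tags with a script. A rough sketch follows, assuming spans of the form <span style="font-family: Name; font-size:12px"> and font names that contain "Bold", "Italic" or "Oblique". The sample is hand-written and the function name spans_to_tags is made up for illustration.

```python
import re

# Hand-written imitation of the span tags in pdf2txt.py HTML output.
SAMPLE = (
    '<span style="font-family: Times-Bold; font-size:12px">Title</span>'
    '<span style="font-family: Times-Roman; font-size:12px">plain </span>'
    '<span style="font-family: Times-BoldItalic; font-size:12px">both</span>'
)

# Captures the font name and the span content; nested spans are not handled.
SPAN = re.compile(r'<span style="font-family:\s*([^;"]+)[^"]*">(.*?)</span>', re.DOTALL)

def spans_to_tags(html):
    """Replace font-family spans with <b>/<i> tags guessed from the font name."""
    def repl(match):
        font, text = match.group(1).lower(), match.group(2)
        if "bold" in font:
            text = f"<b>{text}</b>"
        if "italic" in font or "oblique" in font:
            text = f"<i>{text}</i>"
        return text  # spans with other fonts are simply unwrapped
    return SPAN.sub(repl, html)

if __name__ == "__main__":
    print(spans_to_tags(SAMPLE))
```

Fonts whose names do not spell out the style (abbreviated or subset names, for instance) would slip through as plain text, so the keyword list would need adjusting per document.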

$ pdf2txt.py -o example.xml example.pdf
or
$ pdf2txt.py -Y exact -o example.html example.pdf

stores exact position of every single character
holds space for images

$ pdf2txt.py -t tag -o example.txt example.pdf

stores pdf page data, with unformatted text content

$ pdf2txt.py -Y loose -o example.html example.pdf

does not keep the line breaks, only the span style is present to indicate text changes. paragraphs not recognized properly
messy code

Overall experience with PDFMiner: this is not what I'm looking for. It stores either too much or too little information for my purposes, so in my case it is good for nothing unless the output is properly post-processed afterwards.

pdf2htmlEX
$ pdf2htmlEX example.pdf

omg wow amazing pretty output view!

everything looks exactly like the pdf file

(in exchange for an) extremely messy code :)

one page is one line, identified by a (div) id.
formatting is kept in classes (div, span, img, ...) by CSS:

@font-face {}
@media {}
.ff: font-family
(t) m: transformation matrix
v: vertical-align
ls: letter-spacing
sc: text-shadow
ws: word-spacing
_: display and width or margin-left
fc: color
fs: font-size
y: bottom
h: height
w: width
x: left

all file embedding can be turned off with --embed cfijo (will generate separate output files)

Altogether useless for my purposes. However, it is the best choice if your purpose is to display a PDF file as an HTML page on the web.

PdfMasher (GUI)
Does not keep formatting or images, but is specialized to keep proper order of the text.

as said, does not keep font formatting or image placeholders
with a little manual adjustment:

can be set to ignore page numbers (amazing!)
can be set to collect and link footnotes to be endnotes (amazing!)

html code is quite clear
paragraphs are kept together well wherever possible

Altogether, this is so far the best tool to prepare a simple text PDF for eBook creation.
Usage:
There are five types of elements that can be set in Edit mode:

Normal: will be default text
Title: becomes a heading tag, H1 if pressed once, H2 if pressed twice, etc.

best to be filtered with sorting by Font Size

Footnote: will search for the reference and link it as endnote

best to be filtered with sorting by Font Size (or X or Y, respectively)

Ignore: will be ignored

Page numbers, footers and headers are best to be filtered with sorting by Y (or X, respectively)

To Fix: puts a FIXME mark in front of the paragraph. In HTML this becomes italic text.

These types can be set on the Table or on the Page tab.
Build options are:

Generate Markdown: generates a plain text file in the pdf directory with marks specifying the Title and To Fix parts, Ignored elements already ignored, and Footnotes already linked.
Edit Markdown: opens the markdown text file
Reveal Markdown: opens the directory in the default file browser containing the markdown
View HTML: generates the html file out of the markdown file, and opens it in the default web browser

Markdown signs:

# for H1, ## for H2, etc.
*FIXME* for italics
*** for horizontal rule
numbers for lists (quite annoying)
more on markdown usage
I accidentally found eBook creation software that works from Markdown text like this, so I'll just leave a link here for reference.

Formatting of the text can be fixed manually in the Markdown form or in the HTML form.
When saving as MOBI or EPUB, a Table of Contents and navigation are generated from the headings. The book Start is set to the first heading.
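
Since the Table of Contents comes from the headings, it can help to check the heading outline and any leftover FIXME marks in the Markdown file before building. A minimal sketch in Python, assuming the Markdown signs listed above. The function names heading_outline and fixme_lines are made up for illustration, and this is not how PdfMasher itself works.

```python
import re

# Hand-written sample using the Markdown signs listed above.
SAMPLE = """# Book Title

## Chapter 1

Some text.

*FIXME* A paragraph that still needs attention.

***

## Chapter 2
"""

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def heading_outline(markdown_text):
    """Return (level, title) pairs for every # heading, in document order."""
    outline = []
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline

def fixme_lines(markdown_text):
    """Return the 1-based line numbers that still contain a *FIXME* mark."""
    return [
        number
        for number, line in enumerate(markdown_text.splitlines(), start=1)
        if "*FIXME*" in line
    ]

if __name__ == "__main__":
    for level, title in heading_outline(SAMPLE):
        print("  " * (level - 1) + title)
    print("FIXME on lines:", fixme_lines(SAMPLE))
```

The indented printout gives a quick preview of what the generated Table of Contents would roughly contain, and the line numbers point at paragraphs still waiting for a manual fix.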

Summary:

To export all images from your file for further usage,
use pdf2htmlEX --embed cfijo example.pdf.

can be opened in LibreOffice

To get a very simple html with proper paragraphs, endnotes, and headings, but without font formatting,
use PDFMasher (GUI).
To get a fair html code with most of the images and font formatting but messed up paragraphing,
use pdftohtml -enc UTF-8 -noframes -p -q example.pdf

cannot be opened in LibreOffice

That's it so far.
Probably the best way would be to learn sed and write an HTML cleaning script for myself, but that's likely in the distant future.

Alice@Ubuntu

February 27, 2015
