So the goal is to somehow generate a clean HTML file from a PDF.
Some of my general needs are:
- keep paragraphs together (not mixing up <br /> with <p></p>)
- get images and text together
- keep character formatting
- handle multiple columns (convert to single column)
- skip page numbers
Thomas Levine's Parsing PDF files walk-through
Tools to use:
- Basic file analysis tools (ls or another language’s equivalent)
- PDF metadata tools (pdfinfo or an equivalent)
- pdftotext
- pdftohtml -xml
- Inkscape via pdf2svg
- PDFMiner
With this PDF file, and another one that I made for this purpose.
PDFtoHTML
$ pdftohtml example.pdf
- incorrectly displayed character encoding
- <meta http-equiv="CONTENT-TYPE" content="text/html; charset=utf-8"> has to be entered in the HTML HEAD for proper character encoding
- could put a -enc UTF-8 (Case Sensitive!) as an option if needed, but the header meta still has to be entered manually.
- the meta gets entered in the document if the -noframes option is used.
- no paragraphs, only breaks
- a navigation html file was generated with two frames: one for a page index html, and one for the actual text.
- this can be avoided using the -noframes option
- the -s option to generate single output (however this will only concatenate the whole html files, and links are not corrected)
- pages are separated with horizontal rules
- images are processed all right
- formatting is mostly strictly preserved
- styles by css
- absolute positions of paragraphs
- some paragraphs are kept together, some are not recognized properly
- left text alignment is the only one that is kept, everything else is shown with absolute left position.
- bold and italics are preserved, but underline is displayed incorrectly
- columns are preserved okay
- font face changes do not show.
- images are embedded in page background
- every page is a separate html file
- stores formatting info about absolute positions, font size and line height
- no info about paragraphs, images
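The missing charset declaration noted in the list above can be added by a small script instead of by hand. This is only a minimal sketch: the function name ensure_utf8_meta is made up for illustration, and it assumes the file has already been read into a string as UTF-8.

```python
import re

# The meta tag from the notes above (http-equiv names are case-insensitive).
META = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'


def ensure_utf8_meta(html: str) -> str:
    """Insert a UTF-8 charset meta tag after <head> unless one is declared.

    Hypothetical helper, not part of pdftohtml.
    """
    # If any meta tag already mentions a charset, leave the file alone.
    if re.search(r"<meta[^>]*charset", html, flags=re.IGNORECASE):
        return html
    # Match <head> or <head ...>, but not <header>.
    head = re.compile(r"(<head(?:\s[^>]*)?>)", flags=re.IGNORECASE)
    new_html, count = head.subn(lambda m: m.group(1) + "\n" + META, html, count=1)
    if count == 0:
        # No <head> element found: prepend the tag as a fallback.
        return META + "\n" + html
    return new_html
```

Running it twice on the same text should change nothing the second time, because the charset check comes first.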
Postprocessing a file generated with the pdftohtml -enc UTF-8 -noframes -p -q example.pdf command:
Structure:
- the style element and some meta elements in the html head are unnecessary.
- <a name=[pagenumber]></a> marks the beginning of every page
- <br/> in the middle of a line marks line break
- <br/> at the end of the line marks paragraph break
- <hr/> marks the end of every page
- the last line before the end of page is probably a page number
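The structure rules above are regular enough to sketch a paragraph rebuilder. The following is a rough Python sketch, not a tested converter: the name rebuild_paragraphs is hypothetical, it expects only the body content as a string, and it treats the last block of a page as a page number only when it consists purely of digits (a stricter reading of "probably a page number").

```python
import re


def rebuild_paragraphs(body: str) -> list:
    """Turn pdftohtml -noframes body text into a list of paragraph strings.

    Assumed rules (taken from the notes above):
      <hr/>              ends a page
      <a name=N></a>     starts a page (dropped here)
      <br/> mid-line     is a soft line break (becomes a space)
      <br/> at line end  closes the paragraph
    """
    paragraphs = []
    for page in re.split(r"<hr\s*/?>", body):
        # Drop the page anchors.
        page = re.sub(r'<a name="?[^">]*"?>\s*</a>', "", page)
        page_paras, current = [], []
        for line in page.splitlines():
            line = line.strip()
            if not line:
                continue
            ends_paragraph = re.search(r"<br\s*/?>$", line) is not None
            text = re.sub(r"<br\s*/?>$", "", line)            # trailing break
            text = re.sub(r"\s*<br\s*/?>\s*", " ", text).strip()  # soft breaks
            if text:
                current.append(text)
            if ends_paragraph and current:
                page_paras.append(" ".join(current))
                current = []
        if current:
            page_paras.append(" ".join(current))
        # Assumption: a digits-only final block is the page number.
        if page_paras and page_paras[-1].isdigit():
            page_paras.pop()
        paragraphs.extend(page_paras)
    return paragraphs
```

The result can be wrapped as "\n".join(f"<p>{p}</p>" for p in rebuild_paragraphs(body)). A paragraph that continues across a page break would still come out as two paragraphs with this approach.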
Download and Install PDFMiner.
Review the command line tools and their capacity.
$ pdf2txt.py -o example.html example.pdf
or
$ pdf2txt.py -Y normal -o example.html example.pdf
- no paragraphs, only breaks
(paragraphs are not collected -- no easy way to restore them, unlike in pdftohtml)
- formatting is not css but html span tags
- Html Tidy can collect the formatting to the front of the html file as css, making it easier to review and modify:
tidy -utf8 -c -o example_tidy.html example.html
- bold and italics are kept as font family style in span tag, underline is taken as an image (?)
- images not processed
- display is messy, text displayed on top of each other
- code is quite all right.
or
$ pdf2txt.py -Y exact -o example.html example.pdf
- stores exact position of every single character
- holds space for images
- stores pdf page data, with unformatted text content
- does not keep the line breaks, only the span style is present to indicate text changes. paragraphs not recognized properly
- messy code
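For the -Y normal output, where bold and italics only survive as font names inside span styles, the spans could be reduced to plain b and i tags. A minimal sketch follows; the exact shape of the style attribute (font-family first, then a semicolon) and the font names in the usage example are assumptions, so check them against a real output file first. The name spans_to_tags is made up for illustration.

```python
import re

# Assumed span shape: <span style="font-family: NAME; font-size:12px">text</span>
SPAN = re.compile(
    r'<span style="font-family:\s*([^;"]+);[^"]*">(.*?)</span>',
    re.DOTALL,
)


def spans_to_tags(html: str) -> str:
    """Replace font-styled spans with <b>/<i> tags based on the font name."""

    def repl(match):
        font = match.group(1).lower()
        text = match.group(2)
        if "bold" in font:
            text = f"<b>{text}</b>"
        if "italic" in font or "oblique" in font:
            text = f"<i>{text}</i>"
        return text  # spans with a regular font are simply unwrapped

    return SPAN.sub(repl, html)
```

All other formatting (font size, font face) is thrown away by this, which is fine if the goal is clean HTML rather than a faithful copy.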
pdf2htmlEX
$ pdf2htmlEX example.pdf
- omg wow amazing pretty output view!
- everything looks exactly like the pdf file
- (in exchange for an) extremely messy code :)
- one page is one line, identified by a (div) id.
- formatting is kept in classes (div, span, img, ...) by CSS:
- @font-face {}
- @media {}
- .ff: font-family
- (t) m: transformation matrix
- v: vertical-align
- ls: letter-spacing
- sc: text-shadow
- ws: word-spacing
- _: display and width or margin-left
- fc: color
- fs: font-size
- y: bottom
- h: height
- w: width
- x: left
- all file embedding can be turned off with --embed cfijo (will generate separate output files)
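To make the generated markup easier to read, the class prefixes listed above can be turned into a lookup table. This is just a sketch: describe_classes is a hypothetical helper, and it assumes every class is a prefix followed by a hexadecimal id (for example ff1 or fs0), which should be verified against the actual output.

```python
import re

# Prefix meanings as listed in the notes above.
CLASS_PREFIXES = {
    "ff": "font-family",
    "fs": "font-size",
    "fc": "color",
    "sc": "text-shadow",
    "ls": "letter-spacing",
    "ws": "word-spacing",
    "m": "transformation matrix",
    "v": "vertical-align",
    "_": "display and width or margin-left",
    "x": "left",
    "y": "bottom",
    "h": "height",
    "w": "width",
}


def describe_classes(class_attr: str) -> dict:
    """Map each recognised class name in a class attribute to its CSS role."""
    result = {}
    # Try two-letter prefixes before one-letter ones (ws before w, etc.).
    prefixes = sorted(CLASS_PREFIXES, key=len, reverse=True)
    for name in class_attr.split():
        for prefix in prefixes:
            suffix = name[len(prefix):]
            # Assumption: the id after the prefix is hexadecimal.
            if name.startswith(prefix) and re.fullmatch(r"[0-9a-f]+", suffix):
                result[name] = CLASS_PREFIXES[prefix]
                break
    return result
```

For example, describe_classes("t m0 x1 h2 y3 ff1 fs0") reports m0 as the transformation matrix and ff1 as font-family, and skips the bare t class because it has no id.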
PdfMasher (GUI)
Does not keep formatting or images, but is specialized to keep proper order of the text.
- as said, does not keep font formatting or image placeholders
- with a little manual adjustment:
- can be set to ignore page numbers (amazing!)
- can be set to collect and link footnotes to be endnotes (amazing!)
- html code is quite clear
- paragraphs are kept intact wherever possible
Usage:
There are five types of elements that can be set in Edit mode:
- Normal: will be default text
- Title: becomes a heading tag, H1 when pressed once, H2 when pressed twice, etc.
- best to be filtered with sorting by Font Size
- Footnote: will search for the reference and link it as endnote
- best to be filtered with sorting by Font Size (or X or Y, respectively)
- Ignore: will be ignored
- Page numbers, footers and headers are best to be filtered with sorting by Y (or X, respectively)
- To Fix: puts a FIXME sign in front of the paragraph. In HTML this becomes italic text.
Build options are:
- Generate Markdown: generates a plain text file in the pdf directory with marks specifying the Title and To Fix parts, Ignored elements already ignored, and Footnotes already linked.
- Edit Markdown: opens the markdown text file
- Reveal Markdown: opens the directory in the default file browser containing the markdown
- View HTML: generates the html file out of the markdown file, and opens it in the default web browser
- # for H1, ## for H2, etc.
- *FIXME* for italics
- *** for horizontal rule
- numbers for lists (quite annoying)
- more on markdown usage
- I accidentally found an eBook creation software that works from text like this markdown text, so I'll just leave a link here for notice.
When saving as MOBI or EPUB there will be Table of Contents and navigation generated from the headings. The book Start will be set to the first heading.
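Before building HTML or an eBook from the generated markdown, it may help to list the headings and the paragraphs still marked for fixing. The sketch below assumes the markers look exactly as described above (# for headings, a literal *FIXME* in front of flagged paragraphs); review_markdown is a made-up name, not a PdfMasher feature.

```python
import re


def review_markdown(text: str):
    """Return (headings, fixme_lines) for a markdown string.

    headings    -> list of (level, title) tuples
    fixme_lines -> 1-based line numbers containing the *FIXME* marker
    """
    headings, fixme_lines = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        match = re.match(r"(#{1,6})\s+(.*)", line)
        if match:
            headings.append((len(match.group(1)), match.group(2).strip()))
        elif "*FIXME*" in line:
            fixme_lines.append(number)
    return headings, fixme_lines
```

The heading list doubles as a preview of the table of contents that would be generated from the headings.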
Summary:
- To export all images from your file for further usage, use pdf2htmlEX --embed cfijo example.pdf (the output can be opened in LibreOffice).
- To get a very simple html with proper paragraphs, endnotes, and headings, but without font formatting, use PdfMasher (GUI).
- To get a fair html code with most of the images and font formatting but messed up paragraphing, use pdftohtml -enc UTF-8 -noframes -p -q example.pdf (the output cannot be opened in LibreOffice).
Probably the best way would be to learn sed and write an HTML cleaning script for myself, but that's likely in the distant future.