Alice@Ubuntu: kindle

February 27, 2015

PDF to HTML for eBook creation

My general goal is to pass PDF files through an intermediate HTML/XML stage before making eBooks (mobi/epub) from them.
So the goal is to somehow generate a clean HTML file out of a PDF.
Some of my general needs are:

keep paragraphs together (without confusing them with line breaks)
get images and text together
keep character formatting
handle multiple columns (convert to single column)
skip page numbers

First of all, here's what I found important from the Google search results:

Thomas Levine's Parsing PDF files walk-through
Tools to use:

Basic file analysis tools (ls or another language’s equivalent)
PDF metadata tools (pdfinfo or an equivalent)
pdftotext
pdftohtml -xml
Inkscape via pdf2svg
PDFMiner

My own experiments:
With this PDF file, and another one that I made for this purpose.

PDFtoHTML
$ pdftohtml example.pdf

incorrectly displayed character encoding

<meta http-equiv="CONTENT-TYPE" content="text/html; charset=utf-8"> has to be entered in the HTML HEAD for proper character encoding
could put a -enc UTF-8 (Case Sensitive!) as an option if needed, but the header meta still has to be entered manually.
the meta gets entered in the document if the -noframes option is used.

no paragraphs, only breaks
a navigation html file was generated with two frames: one for a page index html, and one for the actual text.

this can be avoided using the -noframes option
or with the -s option, which generates a single output file (however, this only concatenates the HTML files, and the links are not corrected)

pages are separated with horizontal rules
images are processed all right
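
The missing charset declaration noted above can be patched in after the conversion. A minimal sketch, assuming the generated file opens its head element on a line of its own; add_charset_meta is a made-up helper name, not something pdftohtml provides:

```shell
# Hypothetical helper (not part of pdftohtml): reads HTML on stdin and
# prints it with a UTF-8 charset declaration added right after the line
# that opens <head>. Assumes the head is not closed on that same line.
add_charset_meta() {
  awk '
    { print }
    !done && tolower($0) ~ /<head[ >]/ {
      print "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
      done = 1
    }'
}
```

Possible usage: add_charset_meta < example.html > example_utf8.html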

$ pdftohtml -c example.pdf

formatting is mostly strictly preserved

styles by css
absolute positions of paragraphs

some paragraphs are kept together, some are not recognized properly

left text alignment is the only one that is kept, everything else is shown with absolute left position.
bold and italics are preserved, but underline is rendered incorrectly
columns are preserved okay
font face changes do not show.

images are embedded in page background
every page is a separate html file

$ pdftohtml -xml example.pdf

stores formatting info about absolute positions, font size and line height
no info about paragraphs, images

Overall experience with PDFtoHTML: it is almost useless unless its output is post-processed properly.

Postprocessing a file generated with the pdftohtml -enc UTF-8 -noframes -p -q example.pdf command:
Structure:

the style element and some of the meta elements in the html head are unnecessary.
<a name=[pagenumber]></a> marks the beginning of every page
<br/> in the middle of a line marks a line break
<br/> at the end of a line marks a paragraph break
<hr/> marks the end of every page
the last line before the end of a page is probably a page number
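
The structure notes above can be turned into a first cleaning pass. This is only a sketch under assumptions: the break markers are taken to be <br/> tags, a page number is guessed to be a digits-only line right before the <hr/>, and clean_pdftohtml is a hypothetical name of my own. It is meant for the body text lines, not the whole file with its head.

```shell
# Hypothetical cleaner for `pdftohtml -noframes` body text (stdin to stdout).
clean_pdftohtml() {
  awk '
    # Print the held line; a trailing <br/> marks it as a paragraph.
    function emit() {
      if (held == "") return
      if (sub(/<br ?\/>[ \t]*$/, "", held)) {
        gsub(/[ \t]*<br ?\/>[ \t]*/, " ", held)   # inner breaks are soft wraps
        held = "<p>" held "</p>"
      }
      print held
      held = ""
    }
    { gsub(/<a name="?[0-9]+"?><\/a>/, "") }       # drop page anchors
    /^[ \t]*<hr ?\/?>[ \t]*$/ {                    # page boundary
      # A digits-only line right before the rule is taken as a page number.
      if (held ~ /^[ \t]*[0-9]+[ \t]*(<br ?\/>)?[ \t]*$/) held = ""
      emit()
      next
    }
    { emit(); held = $0 }
    END { emit() }
  '
}
```

Possible usage: clean_pdftohtml < example.html > example_clean.html. Each line is held back until the next one is read, so the page-number guess can still be dropped when the page boundary shows up.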

PDFMiner
Download and Install PDFMiner.
Review the command line tools and their capacity.

$ pdf2txt.py -o example.html example.pdf
or
$ pdf2txt.py -Y normal -o example.html example.pdf

no paragraphs, only breaks
(paragraphs are not collected -- no easy way to restore them, unlike in pdftohtml)
formatting is not css but html span tags

HTML Tidy can collect the formatting at the top of the html file as css, making it easier to review and modify:
tidy -utf8 -c -o example_tidy.html example.html

bold and italics are kept as font family style in span tag, underline is taken as an image (?)
images not processed
display is messy, text displayed on top of each other
code is quite all right.

$ pdf2txt.py -o example.xml example.pdf
or
$ pdf2txt.py -Y exact -o example.html example.pdf

stores exact position of every single character
holds space for images

$ pdf2txt.py -t tag -o example.txt example.pdf

stores pdf page data, with unformatted text content

$ pdf2txt.py -Y loose -o example.html example.pdf

does not keep the line breaks; only the span style is present to indicate text changes. Paragraphs are not recognized properly
messy code

Overall experience with PDFMiner: this is not what I'm looking for. It stores either too much or too little information for my purposes, so in my case it is useless unless its output is post-processed properly.

pdf2htmlEX
$ pdf2htmlEX example.pdf

omg wow amazing pretty output view!

everything looks exactly like the pdf file

(in exchange for an) extremely messy code :)

one page is one line, identified by a (div) id.
formatting is kept in classes (div, span, img, ...) by CSS:

@font-face {}
@media {}
.ff: font-family
(t) m: transformation matrix
v: vertical-align
ls: letter-spacing
sc: text-shadow
ws: word-spacing
_: display and width or margin-left
fc: color
fs: font-size
y: bottom
h: height
w: width
x: left

all file embedding can be turned off with --embed cfijo (will generate separate output files)

Overall, useless for my purposes. However, it is the best choice if your purpose is to display a pdf file as an html page on the web.

PdfMasher (GUI)
Does not keep formatting or images, but is specialized to keep proper order of the text.

as said, does not keep font formatting or image placeholders
with a little manual adjustment:

can be set to ignore page numbers (amazing!)
can be set to collect and link footnotes to be endnotes (amazing!)

html code is quite clear
paragraphs are kept together where possible

Overall, this is so far the best tool to prepare a simple text pdf for eBook creation.
Usage:
There are five types of elements that can be set in Edit mode:

Normal: will be default text
Title: becomes heading tag H1 when pressed once, H2 when pressed twice, etc.

best to be filtered with sorting by Font Size

Footnote: will search for the reference and link it as endnote

best to be filtered with sorting by Font Size (or X or Y, respectively)

Ignore: will be ignored

Page numbers, footers and headers are best to be filtered with sorting by Y (or X, respectively)

To Fix: puts a FIXME sign in front of the paragraph. In HTML this becomes italic text.

These types can be set on the Table or on the Page tab.
Build options are:

Generate Markdown: generates a plain text file in the pdf directory with marks specifying the Title and To Fix parts, Ignored elements already ignored, and Footnotes already linked.
Edit Markdown: opens the markdown text file
Reveal Markdown: opens the directory in the default file browser containing the markdown
View HTML: generates the html file out of the markdown file, and opens it in the default web browser

Markdown signs:

# for H1, ## for H2, etc.
*FIXME* for italics
*** for horizontal rule
numbers for lists (quite annoying)
more on markdown usage
I accidentally found an eBook creation software that works from text like this markdown text, so I'll just leave a link here for notice.

Formatting of the text can be fixed manually in the markdown form or in the html form.
When saving as MOBI or EPUB there will be Table of Contents and navigation generated from the headings. The book Start will be set to the first heading.
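
Since the Table of Contents is built from the headings, it can help to preview them before building. A small sketch, assuming the markdown file uses the # heading style listed above; md_outline is a hypothetical helper, not a PdfMasher feature:

```shell
# Hypothetical helper: print an indented outline of the "#" headings
# found in markdown text read from stdin.
md_outline() {
  awk '/^[#]+ / {
    level = index($0, " ") - 1        # number of leading # characters
    title = substr($0, level + 2)     # text after the "# " prefix
    indent = ""
    for (i = 1; i < level; i++) indent = indent "  "
    print indent title
  }'
}
```

Possible usage: md_outline < example.txt, where example.txt stands for whatever file Generate Markdown produced.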

Summary:

To export all images from your file for further usage,
use pdf2htmlEX --embed cfijo example.pdf.

can be opened in LibreOffice

To get a very simple html with proper paragraphs, endnotes, and headings, but without font formatting,
use PDFMasher (GUI).
To get a fair html code with most of the images and font formatting but messed up paragraphing,
use pdftohtml -enc UTF-8 -noframes -p -q example.pdf

cannot be opened in LibreOffice

That's it so far.
Probably the best way would be to learn SED and create a html cleaning script for myself, but that's likely distant future.

February 19, 2015

Using NCX-generator on Wine

NCX-generator is a Windows Command Line Interface utility to prepare e-books for processing with Kindlegen.

ncx-generator can be used through Wine. It requires .NET Framework 4 to be installed.

ncx-generator has to be run through Wine's Console User Interface (wineconsole - installed with the wine package).

This is the help file:

ncxgen [options] filename

  -h, -?, --help             Display this help.
      --toc                  Generate the html Table of Contents.
      --ncx                  Generate the NCX Global Navigation.
      --opf                  Create the opf file package.
  -a, --all                  Create both html ToC, ncx and opf files.
  -q, --query=VALUE          The XPath query to find the ToC items. Use
                               multiple times to add levels in the ToC.
  -l, --level=VALUE          Number of levels to collapse to generate the NCX
                               file - used with -ncx or -all.
  -e                         Place the generated TOC at the end of the book
      --toc-title=VALUE      Name of the Table of Contents
      --author=VALUE         Author name.
      --title=VALUE          Book title.
  -v, --verbose              Turn on verbose output
  -i                         Convert <PRE class='image'> tages to PNG images
      --overwrite            Overwrite files without any prompt

Example:
         "ngen.exe -all -q "//h1" -q "//h2[@class='toc']" source.xhtml"
This expression will parse the xhtml file source.xhtml looking for the tag h1 and the tag h2 with an attribute class set to 'toc'. It will then create the html Table of Contents, the NCX Global Navigation file and the OPF file using the items found.

The difference between the Ubuntu Terminal and the WineConsole is that in WineConsole you cannot use TAB to complete filenames, and you have to refer to subdirectories of the current directory as ./directory_name/ instead of simply directory_name/ as in the Ubuntu Terminal.

To open the Windows Command Line Interface, run wineconsole cmd
To skip opening the Windows CLI yourself, run the command directly through wineconsole, as if you were using the Windows CLI. This way it is possible to use TAB to complete filenames, which is a pleasure. WineConsole opens the Windows CLI to run the command and closes it when the command finishes. If the command prompts for keyboard input, the Windows CLI stays open and waits for it.

~$ wineconsole ncxGen.0.2.6.exe -a -q "//h1" --author="Book Author Name" --title="Book Title" --toc-title="TOC Title" ./Test_Book/Book.html

This way NCX-generator can be integrated into bash scripts :)
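
A rough sketch of what such a bash integration could look like. The function name build_toc and the NCXGEN_RUNNER variable are my own inventions for illustration; only the ncxGen options are taken from the command above. Setting NCXGEN_RUNNER=echo prints the command instead of running it, which is handy for a dry run.

```shell
# Hypothetical wrapper around the ncxGen call shown above.
# Arguments: book HTML path, author, title.
build_toc() {
  book=$1
  author=$2
  title=$3
  # Relative paths get the ./ prefix used in the example above.
  case $book in
    ./*|/*) ;;
    *) book=./$book ;;
  esac
  # NCXGEN_RUNNER is an assumed override; it defaults to wineconsole.
  "${NCXGEN_RUNNER:-wineconsole}" ncxGen.0.2.6.exe -a -q "//h1" \
    --author="$author" --title="$title" "$book"
}
```

Possible usage: build_toc Test_Book/Book.html "Book Author Name" "Book Title"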

Running a command like the above (with -a option) having a Book.html and a Cover.jpg to begin with, will create the following files in the file directory:

Book.ncx
Booktoc.html
Book.opf
Bookout.html

NOTE: if you name the cover image Cover.jpg, and put it in the same directory as the book HTML, it will be automatically recognized and included properly in the OPF file, when you run ncxGen. By specifying the --author and --title options too, you will not have to edit the OPF manually at all before building your book with KindleGen.
However, some UTF-8 characters are not properly recognized by ncxGen, so if you use e.g. Hungarian characters like ő or ű in the title, toc-title or author, you'd better check the OPF and correct it.

February 15, 2015

Installing MobiPocket Creator on Ubuntu

First of all, install the latest version of wine.
Second, set it up with a 32-bit Wine prefix:

remove the current Wine settings directory:
sudo rm -r ~/.wine
and then create 32 bit prefix with this command:
WINEARCH=win32 WINEPREFIX=~/.wine winecfg 

Now for Mobipocket Creator to run properly, run this in terminal and follow any instructions:

winetricks ie6 vcrun2005

~~Apparently creating new mobi books works, but an error pops up every time you do something. Just ignore these errors and build the book anyway.~~

UPDATE:
do this to have a clear GUI for Mobipocket Creator:

winetricks ie8

This solves the pop-up errors and the incomplete user interface!

Install MobiPocket Creator:
Download from official page,
double click to install with Wine.

Usage

January 5, 2014

MobiPocket Creator usage

Download Mobipocket Creator

Follow these instructions to install Publisher Version through WINE

To create e-books:
Follow instructions

Follow User Manual

Be prepared for continuously ignoring software errors while creating e-books.

What Works?

This simple process works okay, e-book is built:

Create new publication
Add Content:

Insert HTML file
Insert Image file(s)

Add Cover Image
Add Metadata
(Save publication)
Build e-book

What does not work?
Build fails with "error(htmlparser) no BODY tag found in content file"

Add Table Of Contents

KindleGen Usage

Download Kindlegen for Linux
Read publishing guidelines

Extract package anywhere

docs/english/Readme.txt content (relevant):

Creating Kindle ebooks - Advanced users:
-------------------------------------------
Advanced users can use the command line tool to convert EPUB/HTML to Kindle ebooks. This interface is available in Windows, Mac and Linux platform. This tool can be used for automated bulk conversions.

KindleGen for Linux 2.6 i386 :
1. Download the KindleGen tar.gz from www.amazon.com/kindleformat/kindlegen to a folder such as Kindlegen in home directory (~/KindleGen).
2. Extract the contents of the file to '~/KindleGen'. Open the terminal, move to folder containing the downloaded file using command "cd ~/KindleGen" and then use command "tar xvfz kindlegen_linux_2.6_i386_v2.tar.gz" to extract the contents.
3. Open the Terminal application and type ~/KindleGen/kindlegen. Instructions on how to run KindleGen are displayed.
4. Conversion Example: To convert a file called book.html, go to the directory where the book is located, such as cd desktop, and type ~/KindleGen/kindlegen book.html. If the conversion was successful, a new file called book.mobi displays on the desktop.
5. Please note: It is recommended to follow these steps to run KindleGen. Double-clicking the KindleGen icon does not launch this program. Run the above commands without quotes

Instructions on how to run KindleGen:
Navigate in terminal to folder
type ./kindlegen for usage information:

*************************************************************
Amazon kindlegen(Linux) V2.9 build 0730-890adc2
A command line e-book compiler
Copyright Amazon.com and its Affiliates 2013
*************************************************************
Usage : kindlegen [filename.opf/.htm/.html/.epub/.zip or directory] [-c0 or -c1 or c2] [-verbose] [-western] [-o <file name>]
Note:
zip formats are supported for XMDF and FB2 sources
directory formats are supported for XMDF sources
Options:
-c0: no compression
-c1: standard DOC compression
-c2: Kindle huffdic compression
-o <file name>: Specifies the output file name. Output file will be created in the same directory as that of input file. <file name> should not contain directory path.
-verbose: provides more information during ebook conversion
-western: force build of Windows-1252 book
-releasenotes: display release notes
-gif: images are converted to GIF format (no JPEG in the book)
-locale <locale option> : To display messages in selected language
en: English
de: German
fr: French
it: Italian
es: Spanish
zh: Chinese
ja: Japanese
pt: Portuguese
ru: Russian
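
The Readme mentions automated bulk conversions; a loop along these lines could do that. It is only a sketch: convert_all and the KINDLEGEN override variable are names I made up, the default path follows the ~/KindleGen folder from the Readme, and -c1 is just one of the compression options listed above.

```shell
# Hypothetical batch converter: runs kindlegen on every .html and .epub
# file in the given directory (default: current directory).
convert_all() {
  dir=${1:-.}
  for src in "$dir"/*.html "$dir"/*.epub; do
    [ -e "$src" ] || continue            # skip patterns that matched nothing
    base=${src##*/}                      # -o takes a bare file name, no path
    # KINDLEGEN is an assumed override for the path to the binary.
    "${KINDLEGEN:-$HOME/KindleGen/kindlegen}" "$src" -c1 -o "${base%.*}.mobi"
  done
}
```

Possible usage: convert_all ~/books. Note that a book.html and a book.epub in the same directory would both be written to book.mobi, so keep only one source per book.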

First impressions:
This program should be able to convert .html/.htm and .epub files...
it converts them to .mobi with a file size at least double the original (depending on images and compression)
Uploaded to Kindle, all the files seem to work fine. Text formatting is kept in some way - not perfect, but readable. Kindle shows Title and author for the epub, and title set for the html (not filename!)

Seems okay, but I do not have a real chance to generate a beautiful book this way easily... maybe converting from epub might be a chance to keep the book beautiful...

...or should really read the publishing guidelines to learn the proper formatting.