
January 4, 2021

Note: How to password-protect a PDF file on Ubuntu

This is a copy of this post for safe-keeping: 


In a terminal, type:

sudo apt-get install pdftk

Then, to add a password to a PDF file, type:

pdftk <input-file> output <output-file> user_pw <password>

Example:

pdftk input.pdf output output.pdf user_pw 1234
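pdftk can also set an owner password instead, which leaves the file openable without a password but restricts operations unless they are explicitly allowed:

pdftk input.pdf output output.pdf owner_pw 5678 allow printing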

June 11, 2017

Creating searchable PDFs on Ubuntu, 3rd try

Following up on the second try:


My requirements:

  • image layer over text layer
  • good character encoding for Hungarian ű and ő chars
  • good placement of words and lines
  • reasonably good recognition
  • handling multi-column layouts

Tesseract:


Set up Tesseract:
  • sudo apt-get install tesseract-ocr
  • from https://github.com/tesseract-ocr/tessdata download the language data you need and put it in the tessdata directory (/usr/share/tesseract-ocr/tessdata). E.g. for Hungarian:
    cd /usr/share/tesseract-ocr/tessdata
    sudo wget https://github.com/tesseract-ocr/tessdata/raw/master/hun.traineddata
  • add the TESSDATA_PREFIX environment variable, pointing to the directory that contains the tessdata directory, if you get an error that the language data cannot be found.
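For example (assuming the default install path from above; adjust if your tessdata lives elsewhere):

export TESSDATA_PREFIX=/usr/share/tesseract-ocr/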

Tesseract supported input image formats:
...are the ones supported by Leptonica:
JPEG, PNG, TIFF, BMP, PNM, GIF, WEBP, JP2

Tesseract searchable pdf output
Example usage with specified language (-l):
tesseract input.png outbase -l hun pdf
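For a pile of scans, a small loop makes one searchable pdf per image (a minimal sketch, assuming png input in the current directory):

#!/bin/bash
# OCR every png in the current directory into a searchable pdf named after the image
for img in *.png; do
 tesseract "$img" "${img%.png}" -l hun pdf
done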

Results:
  • image layer over text layer - YES
  • good character encoding for Hungarian ű and ő chars - YES
  • good placement of words and lines - YES 
  • reasonably good recognition - YES: it depends a lot on input quality.
  • handling multi-column layouts - SO-SO: sometimes works (e.g. half the page is OK, while the other half is treated as a single column)

This is good enough for me now, so I'm not investigating further.

July 1, 2015

Removing text watermark from PDF

Following this guide, the solution for me was:
  1. Fix your PDF, just in case:
    pdftk original.pdf output fixed.pdf
  2. Uncompress your PDF for text manipulation:
    pdftk fixed.pdf output uncompressed.pdf uncompress
  3. Remove text watermark with SED:
    sed "s/Wow! eBook <WoweBook.Com>/ /g" uncompressed.pdf > unwatermarked.pdf
  4. Compress the edited PDF:
    pdftk unwatermarked.pdf output compressed.pdf compress
As usual, I had trouble using SED.
It turned out that sed -e "s/Wow! eBook <WoweBook.Com>/ /" did not work for me, but the version without the -e option and with the /g flag did (presumably because the watermark occurs more than once on some lines, so the /g flag is what actually matters).
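The whole thing as one throwaway script (just the four steps above in one place):

#!/bin/bash
# strip a known text watermark from a pdf
pdftk original.pdf output fixed.pdf                 # 1. fix
pdftk fixed.pdf output uncompressed.pdf uncompress  # 2. uncompress
sed "s/Wow! eBook <WoweBook.Com>/ /g" uncompressed.pdf > unwatermarked.pdf  # 3. remove
pdftk unwatermarked.pdf output compressed.pdf compress  # 4. compress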

April 30, 2015

color to grayscale PDF

The source file:
a PDF with page size US letter portrait
with a text layer (text color and background vary)
with color images as background (many images add up to one background picture)

The quest: convert the source PDF to:
a PDF with page size A4 portrait
with or without text layer (flatten or not - but keep text readable)
with grayscale and/or black and white (but recognizable) images as background

The purpose: print out the pdf on a regular black-and-white printer:
keep it readable, and esthetically enjoyable
do not use more ink than necessary

***
The routes I tried:

1.) extract the images (from under the text layer), convert them, put them back.

Toolkit:

Extract images:
pdf2htmlEX --embed cfijo example.pdf

Convert images:
mogrify -type Grayscale -format ps *.png

...convert back to pdf with ps2pdf, join with pdfjoin or pdftk

Extract text layer:  
cpdf -draft example.pdf -o example_text.pdf

Put background images behind text layer:
pdftk  example_text.pdf multibackground example_images.pdf output modified.pdf


2.) flatten PDF to images, edit images.

Toolkit:

Burst pdf to single pages:
pdftk example.pdf burst

Convert to PostScript:
pdftops (and not pdf2ps)

Flatten ps to pnm and edit image
mogrify -format pnm -density 200x200 -type grayscale *.ps

...convert back to pdf with ps2pdf, join with pdfjoin or pdftk
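Put together, route 2 looks roughly like this (a sketch of the commands above; pdftk burst names the pages pg_0001.pdf, pg_0002.pdf, and so on):

#!/bin/bash
# flatten a color pdf into grayscale page images, then rebuild a pdf
pdftk example.pdf burst
for page in pg_*.pdf; do
 pdftops "$page"              # pg_0001.pdf -> pg_0001.ps
done
mogrify -format pnm -density 200x200 -type grayscale pg_*.ps  # flatten to grayscale bitmaps
mogrify -format ps pg_*.pnm   # back to PostScript (overwrites the color .ps files)
for page in pg_*.ps; do
 ps2pdf "$page"               # pg_0001.ps -> pg_0001.pdf (overwrites the color page)
done
pdfjoin pg_*.pdf --outfile example_gray.pdf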



***
In the end I chose the second route, because I had some trouble with the fonts in the pdf: they were not exported properly and left black text boxes in the images (making the text unreadable after putting the images back behind the text layer).

Also, the first route would have involved a lot of learning about PDF structure, layer manipulation, and so on.

February 27, 2015

PDF to HTML for eBook creation

My general goal is to process PDF files through an HTML/XML state before making eBooks (mobi/epub) from them.
So the goal is to somehow generate a clean HTML file out of a PDF.
Some of my general needs are:
  • keep paragraphs together (not mixing up <br /> with <p></p>)
  • get images and text together
  • keep character formatting
  • handle multiple columns (convert to single column)
  • skip page numbers
First of all, here's what I found important from the Google search results:

Thomas Levine's Parsing PDF files walk-through
Tools to use:
  • Basic file analysis tools (ls or another language’s equivalent)
  • PDF metadata tools (pdfinfo or an equivalent)
  • pdftotext
  • pdftohtml -xml
  • Inkscape via pdf2svg
  • PDFMiner
My own experiments:
With this PDF file, and another one that I made for this purpose.


PDFtoHTML
$ pdftohtml example.pdf
  • incorrectly displayed character encoding
    • <meta http-equiv="CONTENT-TYPE" content="text/html; charset=utf-8"> has to be entered in the HTML HEAD for proper character encoding 
    • could put a -enc UTF-8 (Case Sensitive!) as an option if needed, but the header meta still has to be entered manually. 
    • the meta gets entered in the document if the -noframes option is used.
  • no paragraphs, only breaks
  • a navigation html file was generated with two frames: one for a page index html, and one for the actual text.
    • this can be avoided using the -noframes option
    • the -s option generates a single output (however, this only concatenates the per-page html files, and links are not corrected)
  • pages are separated with horizontal rulers
  • images are processed all right
 $ pdftohtml -c example.pdf
  • formatting is mostly strictly preserved
    • styles by css
    • absolute positions of paragraphs
      • some paragraphs are kept together, some are not recognized properly
    • left text alignment is the only alignment that is kept; everything else is rendered with an absolute left position.
    • bold and italics are preserved, but underline shows up wrong
    • columns are preserved okay
    • font face changes do not show.
  • images are embedded in page background
  • every page is a separate html file
$ pdftohtml -xml example.pdf
  • stores formatting info about absolute positions, font size and line height
  • no info about paragraphs, images
Altogether, my experience with PDFtoHTML: it's almost good for nothing unless it is post-processed properly afterwards.

Postprocessing a file generated with the pdftohtml -enc UTF-8 -noframes -p -q example.pdf command:
Structure:
  • style element and some meta element in the html head are unnecessary.
  • <a name=[pagenumber]></a> marks the beginning of every page
  • <br/> in the middle of a line marks line break
  • <br/> at the end of the line marks paragraph break
  • <hr/> marks the end of every page
  • the last line before the end of page is probably a page number
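Based on that structure, a first stab at a cleaning script could look like this (a rough sketch with GNU sed, certainly incomplete):

#!/bin/bash
# rough cleanup of pdftohtml -noframes output, per the observations above
sed -e 's|<a name="[0-9]*"></a>||g' \
    -e 's|<hr/>||g' \
    -e 's|<br/>$|</p>\n<p>|g' \
    example.html > cleaned.html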
PDFMiner
Download and Install PDFMiner.
Review the command line tools and their capabilities.

$ pdf2txt.py -o example.html  example.pdf
or
$ pdf2txt.py -Y normal -o example.html  example.pdf 
  • no paragraphs, only breaks
    (paragraphs are not collected -- no easy way to restore them, unlike in pdftohtml)
  • formatting is not css but html span tags
    • Html Tidy can collect the formatting to the front of the html file as css, making it easier to review and modify:
      tidy -utf8 -c -o example_tidy.html  example.html
  • bold and italics are kept as font family style in span tag, underline is taken as an image (?)
  • images not processed
  • display is messy, text displayed on top of each other
  • code is quite all right.
$ pdf2txt.py -o example.xml  example.pdf
or
$ pdf2txt.py -Y exact -o example.html  example.pdf 
  • stores exact position of every single character
  • holds space for images
$ pdf2txt.py -t tag -o example.txt  example.pdf
  • stores pdf page data, with unformatted text content
$ pdf2txt.py -Y loose -o example.html  example.pdf
  • does not keep the line breaks; only the span style is present to indicate text changes. Paragraphs are not recognized properly
  • messy code
Altogether, my experience with PDFMiner: this is not what I'm looking for. It either stores too much or too little information for my purposes, so in my case it's actually good for nothing without proper post-processing.


pdf2htmlEX
$ pdf2htmlEX example.pdf
  • omg wow amazing pretty output view!
    • everything looks exactly like the pdf file
  • (in exchange for an) extremely messy code :)
    • one page is one line, identified by a (div) id.
    • formatting is kept in classes (div, span, img, ...) by CSS:
      • @font-face {}
      • @media {}
      • .ff: font-family
      • (t) m: transformation matrix
      • v: vertical-align
      • ls: letter-spacing
      • sc: text-shadow
      • ws: word-spacing
      • _: display and width or margin-left
      • fc: color
      • fs: font-size
      • y: bottom
      • h: height
      • w: width
      • x: left
  • all file embedding can be turned off with --embed cfijo (will generate separate output files)
Altogether useless for my purposes. However, it's the best choice if your purpose is to display a pdf file as an html page on the web.

PdfMasher (GUI)
Does not keep formatting or images, but is specialized to keep proper order of the text.
  • as said, does not keep font formatting or image placeholders
  • with a little manual adjustment:
    • can be set to ignore page numbers (amazing!)
    • can be set to collect and link footnotes to be endnotes (amazing!)
  • html code is quite clear
  • paragraphs are kept well where possible
Altogether, so far this is the best tool to prepare a simple text pdf for eBook creation.
Usage:
There are five types of elements that can be set in Edit mode:
  • Normal: will be default text
  • Title: becomes heading tag H1 when pressed once, H2 when pressed twice, etc.
    • best to be filtered with sorting by Font Size
  • Footnote: will search for the reference and link it as endnote
    • best to be filtered with sorting by Font Size (or X or Y, respectively)
  • Ignore: will be ignored
    • Page numbers, footers and headers are best to be filtered with sorting by Y (or X, respectively)
  • To Fix: puts a FIXME sign in front of the paragraph. In HTML this becomes italic text.
These types can be set on the Table or on the Page tab.
Build options are:
  • Generate Markdown: generates a plain text file in the pdf directory with marks specifying the Title and To Fix parts, Ignored elements already ignored, and Footnotes already linked.
  • Edit Markdown: opens the markdown text file
  • Reveal Markdown: opens the directory in the default file browser containing the markdown
  • View HTML: generates the html file out of the markdown file, and opens it in the default web browser
Markdown signs:
  • # for H1, ## for H2, etc.
  • *FIXME* for italics
  • *** for horizontal ruler  
  • numbers for lists (quite annoying)
  • more on markdown usage
  • I accidentally found an eBook creation software that works from text like this markdown text, so I'll just leave a link here for notice.
Formatting of the text can be fixed manually in the markdown form or in the html form.
When saving as MOBI or EPUB there will be Table of Contents and navigation generated from the headings. The book Start will be set to the first heading.

Summary:
  • To export all images from your file for further usage,
    use pdf2htmlEX --embed cfijo example.pdf.
    • can be opened in LibreOffice
  • To get a very simple html with proper paragraphs, endnotes, and headings, but without font formatting,
    use PDFMasher (GUI).
  • To get a fair html code with most of the images and font formatting but messed up paragraphing,
    use pdftohtml -enc UTF-8 -noframes -p -q example.pdf
    • cannot be opened in LibreOffice
That's it so far.
Probably the best way would be to learn SED and create an html cleaning script for myself, but that's likely in the distant future.

February 25, 2015

PDF smart rename and split based on content


I had been planning to write these scripts for some time to ease my work with pdf files.

The original idea was to grep some text from the pdf file and do a file manipulation based on the text.

This StackExchange post suggests the following methods to find a piece of text in a pdf:
  • pdftotext + grep
  • pdfgrep
  • strings + grep
  • pdftohtml + grep
I used pdftotext in the following examples.

I also needed to learn some more Bash scripting:
1st Quest
I have a monthly employee data sheet (payroll), which I get in one big multipage pdf file, having a single page for each employee. The content for each page is: 1st line: title, 2nd line: month, 3rd line: employee name.
I wanted to burst this pdf file into single pages and rename each page to the title, month, and employee name.
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits a pdf file into single pages, naming the output files using the first three lines of each page

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

pagecount=1
datestamp=1

# burst and rename pages to separate files
for file in "${filelist[@]}"; do
 pagecount=$(pdfinfo "$file" | grep "Pages" | awk '{ print $2 }')
 for (( pageindex=1; pageindex<=$pagecount; pageindex+=1 )); do
  # new name: first three lines of the page, up to the first "("
  newname=$(pdftotext -f $pageindex -l $pageindex "$file" - | head -n 3 | tr '\n' ' ' | cut -f1 -d"(")
  datestamp=$(date +%s%N) # to avoid overwriting files with the same new name
  pdftk "$file" cat $pageindex output "$newname"$datestamp.pdf
 done
done
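To use such a script as a right-click action, it goes into the Nautilus scripts directory (~/.gnome2/nautilus-scripts on older GNOME, ~/.local/share/nautilus/scripts on newer versions) and has to be made executable (chmod +x).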

2nd Quest
I have files I would like to rename based on the content of the file. Each file is some kind of employee data file, and has the employee name in the content, but not necessarily in the file name.
I wanted to rename the file based on the employee name and the page title (first line of the file).
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically renames pdf files based on content matching a list of names

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

names=( "Aaa Aaa" "Bbb Bbb" "Ccc Ccc") # names to match in the files

# process files
for file in "${filelist[@]}"; do
 # find name from names in the file
 foundname=''
 for name in "${names[@]}"; do
  testname=$(pdftotext "$file" - | grep "$name")
  if [[ $testname != "" ]]; then
   foundname=$name
   break
  fi
 done

 # rename file based on found name
 title=`pdftotext -f 1 -l 1 "$file" - | head -n 1`
 let "datestamp =`date +%s%N`"
 mv "$file" "${file%/*}/$foundname $title $datestamp.pdf"
done

3rd Quest
I have a yearly employee data sheet (personal tax data), which I get in one big multipage pdf file, with multiple pages for each employee. This year it is three pages per employee, and the tax ID number is the only common point on the three pages (it is a 10-digit number beginning with 8). The first line of each page is a header, and it is different on each of a single employee's pages.
I wanted to split the pdf to employees and rename it with the header of the pages.
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits a pdf file into multi-page chunks based on a search criterion, naming the output files using the criterion and some of the pdf text.

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

# process files
for file in "${filelist[@]}"; do
 pagecount=$(pdfinfo "$file" | grep "Pages" | awk '{ print $2 }')
 # MY SEARCH CRITERION is a 10-digit ID number that begins with 8:
 storedid=$(pdftotext -f 1 -l 1 "$file" - | egrep '8[0-9]{9}')
 pattern=''
 pagetitle=''
 datestamp=''

 for (( pageindex=1; pageindex<=$pagecount; pageindex+=1 )); do

  header=$(pdftotext -f $pageindex -l $pageindex "$file" - | head -n 1)
  pageid=$(pdftotext -f $pageindex -l $pageindex "$file" - | egrep '8[0-9]{9}')
  datestamp=$(date +%s%N) # to avoid overwriting files with the same new name

  # match ID found on the page to the stored ID
  if [[ "$pageid" == "$storedid" ]]; then
   pattern+="$pageindex " # adds number as text to variable separated by spaces
   pagetitle+="$header+"

   if [[ $pageindex == $pagecount ]]; then #process last output of the file 
    pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
    storedid=0
    pattern=''
    pagetitle=''
   fi
  else 
   #process previous set of pages to output
   pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
   storedid=$pageid
   pattern="$pageindex "
   pagetitle="$header+"
  fi
 done
done

4th Quest
I wanted an alternate version of the 3rd Quest in case I have to split some other type of file into equal sets of pages, without a tax ID as a common point in them.
So I wanted to split the multipage pdf based on a manually set block size, e.g. 3 pages per set.
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits a pdf file into blocks of a given size, naming the output files with the original filename plus a datestamp

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

# process files
for file in "${filelist[@]}"; do

 # variables
 blocksize=3 # ADD BLOCK SIZE MANUALLY HERE
 blocks=()
 filename=${file%.pdf} # remove extension from filename

 # calculate page range blocks
 pagecount=$(pdfinfo "$file" | grep "Pages" | awk '{ print $2 }')
 let setcount=$pagecount/$blocksize 
 for (( setindex=1; setindex<=$setcount; setindex+=1 )); do

  if [[ $setindex -lt $setcount ]]; then
   let lastpage=$setindex*$blocksize
   let firstpage=$lastpage-$blocksize+1

  elif [[ $setindex -eq $setcount ]]; then # the last block absorbs any remainder pages
   let lastpage=$pagecount
   let firstpage=$setindex*$blocksize-$blocksize+1

  fi
  blocks+=("$firstpage-$lastpage")
 done

 # process ranges to output
 for block in "${blocks[@]}"; do
  let "datestamp =`date +%s%N`" # to avoid overwriting with same new name
  pdftk $file cat $block output "$filename $datestamp.pdf"
 done
done

That's it. Could be better, but is functioning all right.

February 19, 2014

Modify PDF pages as images, then reintegrate

What just happened?!

I successfully reintegrated a modified PDF in the original file!

wow! that is definitely a level up in image-pdf editing for me.

pdftk input.pdf cat 2 output page2.pdf

pdfinfo page2.pdf
  Page size: 595 x 842 pts (A4)

pdfimages page2.pdf img

identify -verbose img-000.pbm
  Geometry: 2496x3440+0+0
  Resolution: 72x72
  Page geometry: 2496x3440+0+0

Modified the image through Gimp: Open, Edit, Save -> Export -> click OK whenever needed.

convert img-000.pbm -resample 300x300 -resize 2496x3440 img2.ps

identify -verbose img2.ps
  Geometry: 599x826+0+0
  Resolution: 72x72
  Page geometry: 599x826+0+0

ps2pdf12 -sPAPERSIZE=a4 -dFIXEDRESOLUTION img2.ps page2_v2.pdf

pdfinfo page2_v2.pdf
  Page size: 595 x 842 pts (A4)

pdftk A=input.pdf B=page2_v2.pdf cat A1 B A3-end output output.pdf

December 29, 2012

Small size Color PDF

After years of struggling with color PDFs, I realized today that color PS exists, which led to a fast solution to the given problem.

Here's how it goes:

  1. GSCAN2PDF
    scan image
    at 8-bit depth and 300 dpi resolution.
    save in PNM file format.
    this results in a file of approx. 20MB.
  2. CONVERT
    resize
    to 50%
    convert bigimg.pnm -resize 50% smallimg.pnm
    file size abt. 5MB
  3. PNMTOPS
    convert to ps
    pnmtops -equalpixels smallimg.pnm > smallimg.ps
    file size abt. 10MB
  4. PS2PDF
    convert to pdf
    ps2pdf12 smallimg.ps smallimg.pdf
    (Sadly, I could not manage to get the pdf cropped to the proper size with ps2pdf, so I have to do it manually in the next step. However, if you know the solution to this problem, do not hold back on sharing it with me!)
    file size abt. 200KB
  5. PDFCROP
    crop pdf of unnecessary margin
    pdfcrop smallimg.pdf
    (resulting in smallimg-crop.pdf)
    final file size abt. 200KB
This is how you do a good quality color pdf from good quality large images.
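The whole chain for one scanned page, in one place (the commands from the steps above):

#!/bin/bash
# scanned pnm -> small color pdf
convert bigimg.pnm -resize 50% smallimg.pnm      # 2. resize
pnmtops -equalpixels smallimg.pnm > smallimg.ps  # 3. to ps
ps2pdf12 smallimg.ps smallimg.pdf                # 4. to pdf
pdfcrop smallimg.pdf                             # 5. crop -> smallimg-crop.pdf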

...And a couple of months later (19 May 2013) I realized that my method of producing black-and-white pdfs, which I have been using for years, is just as good for producing color pdfs as the above, or even better: with the old method I do not have to use pdfcrop, and the pdf page size of the color cover will be compatible with the b&w pages...

August 26, 2012

Watermarking PDFs

Program used:
PDFTK (get from Synaptic)
* Apply a Background Watermark or a Foreground Stamp

Method:


pdftk <input file> stamp <stamp file> output <output file>

What happens?

I made a transparent background GIF with GIMP, converted it to pdf with Mogrify, and applied it to the pdf.

If the watermark image has a different size than the original pdf page, pdftk scales the watermark to fit the page height, keeps the watermark's aspect ratio, and centers it on the original page.
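Concretely, something like this should do it (file names are just examples; stamp puts the mark over the page, background puts it behind):

mogrify -format pdf watermark.gif                           # watermark.gif -> watermark.pdf
pdftk input.pdf stamp watermark.pdf output stamped.pdf      # foreground stamp
pdftk input.pdf background watermark.pdf output marked.pdf  # background watermark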

November 13, 2010

image to pdf

I really can't handle ImageMagick when it comes to pdf files. It creates such huge files that they are completely useless.

I tried out another way: pbm -> ps -> pdf:

first I convert the pbm files with ImageMagick:
mogrify -format ps -density 300x300 *.pbm
then put them into ps2pdf:
#!/bin/bash
lista=($(find . -iname '*.ps' | sort))
for i in "${lista[@]}"
do
ps2pdf -r300 -g1880x2688 "$i" "$i.pdf"
done

and bind them:
pdfjoin *.pdf --outfile output.pdf

It worked brilliantly. This way my pdf is 8 times smaller than my original pdf, and it's only 1.5 times bigger than the djvu I made from the same images.

Djvu was made directly from pbm files, converted with cjb2 and bound with djvm. No special options were used.
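For reference, the djvu side was something like this (a sketch from memory):

for f in *.pbm; do cjb2 "$f" "${f%.pbm}.djvu"; done  # one single-page djvu per pbm
djvm -c book.djvu *.djvu                             # bundle the pages into one document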

March 30, 2009

Notice: pdfnup --trim

It removes a given amount of margin on each side, with the syntax 'left bottom right top'.
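For example, to cut 1 cm off every side (output goes to input-nup.pdf; --clip true makes the trim actually remove the margin):

pdfnup --nup 1x1 --trim '1cm 1cm 1cm 1cm' --clip true input.pdf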

January 3, 2009

ImageMagick converting multiple files

*I do not use this method anymore, although it was a great start.*

After being a book-worm for half a day, I found out how I can write a working script to solve my imagemagick output problem.
It looks like this:


#!/bin/sh

for i in $(find . -iname '*.jpg')
do
convert "$i" -background white -flatten -colorspace Gray -negate -edge 1 -negate -normalize -sharpen 1 -threshold 50% -despeckle -blur 0x.5 -normalize +dither -posterize 8 "$i.gif"
done
1.) open your favorite text editor
2.) copy paste the script
3.) save as anyname.sh
4.) make it executable

Usage:
1.) copy anyname.sh into the folder containing only the image files you want to simplify
2.) go (cd) to this folder in the terminal
3.) type ./anyname.sh
4.) wait and see.

How to convert this to pdf?

mogrify -format pdf *.JPG

How to join these single-page pdfs?

pdfjoin *.pdf --outfile newname.pdf

September 23, 2008

Converting Tiff to PDF

sudo apt-get install libtiff-tools

tiff2pdf -z -o pdf_whatever_file.pdf tiff_whatever_file.tiff
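And for a whole directory, a minimal loop:

for f in *.tiff; do tiff2pdf -z -o "${f%.tiff}.pdf" "$f"; done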