
January 4, 2021

Note: How to password-protect a PDF file on Ubuntu

This is a copy of this post for safe-keeping: 


In a terminal, type:

sudo apt-get install pdftk

Then, to add a password to a PDF file, type:

pdftk <input-file> output <output-file> user_pw <password>

Example:

pdftk input.pdf output output.pdf user_pw 1234
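pdftk can also set an owner password instead, which leaves the file openable without a password but restricts operations unless they are explicitly allowed:

pdftk input.pdf output output.pdf owner_pw 5678 allow printing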

June 11, 2017

Creating searchable PDFs on Ubuntu, 3rd try

Following up on the second try:


My requirements:

  • image layer over text layer
  • good character encoding for Hungarian ű and ő chars
  • good placement of words and lines
  • reasonably good recognition
  • handling multi-column layouts

Tesseract:


Set up Tesseract:
  • sudo apt-get install tesseract-ocr
  • from https://github.com/tesseract-ocr/tessdata download the language data you need and put it in the tessdata directory (/usr/share/tesseract-ocr/tessdata). E.g. for Hungarian:
    cd /usr/share/tesseract-ocr/tessdata
    sudo wget https://github.com/tesseract-ocr/tessdata/raw/master/hun.traineddata
  • add the TESSDATA_PREFIX environment variable, pointing to the directory that contains the tessdata directory, if you get an error that the language data cannot be found.
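For example (assuming the default install path from above; adjust if your tessdata lives elsewhere):

export TESSDATA_PREFIX=/usr/share/tesseract-ocr/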

Tesseract supported input image formats:
...are the ones supported by Leptonica:
JPEG, PNG, TIFF, BMP, PNM, GIF, WEBP, JP2

Tesseract searchable pdf output
Example usage with specified language (-l):
tesseract input.png outbase -l hun pdf
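For a pile of scans, a small loop makes one searchable pdf per image (a minimal sketch, assuming png input in the current directory):

#!/bin/bash
# OCR every png in the current directory into a searchable pdf named after the image
for img in *.png; do
 tesseract "$img" "${img%.png}" -l hun pdf
done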

Results:
  • image layer over text layer - YES
  • good character encoding for Hungarian ű and ő chars - YES
  • good placement of words and lines - YES 
  • reasonably good recognition - YES: it depends a lot on input quality.
  • handling multi-column layouts - SO-SO: sometimes works (e.g. half the page is OK, while the other half is treated as a single column)

This is good enough for me now, so I'm not investigating further.

July 1, 2015

Removing text watermark from PDF

Following this guide, the solution for me was:
  1. Fix your PDF, just in case:
    pdftk original.pdf output fixed.pdf
  2. Uncompress your PDF for text manipulation:
    pdftk fixed.pdf output uncompressed.pdf uncompress
  3. Remove text watermark with SED:
    sed "s/Wow! eBook <WoweBook.Com>/ /g" uncompressed.pdf > unwatermarked.pdf
  4. Compress the edited PDF:
    pdftk unwatermarked.pdf output compressed.pdf compress
As usual, I had trouble using SED.
It turned out that sed -e "s/Wow! eBook <WoweBook.Com>/ /" did not work for me, but the version without the -e option and with the /g flag did (presumably because the watermark occurs more than once on some lines, so the /g flag is what actually matters).
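The whole thing as one throwaway script (just the four steps above in one place):

#!/bin/bash
# strip a known text watermark from a pdf
pdftk original.pdf output fixed.pdf                 # 1. fix
pdftk fixed.pdf output uncompressed.pdf uncompress  # 2. uncompress
sed "s/Wow! eBook <WoweBook.Com>/ /g" uncompressed.pdf > unwatermarked.pdf  # 3. remove
pdftk unwatermarked.pdf output compressed.pdf compress  # 4. compress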

April 30, 2015

color to grayscale PDF

The source file:
a PDF with page size US letter portrait
with a text layer (text color and background vary)
with color images as background (many images add up to one background picture)

The quest: convert the source PDF to:
a PDF with page size A4 portrait
with or without text layer (flatten or not - but keep text readable)
with grayscale and/or black and white (but recognizable) images as background

The purpose: print out the pdf on a regular black-and-white printer:
keep it readable, and esthetically enjoyable
do not use more ink than necessary

***
The routes I tried:

1.) extract the images (from under the text layer), convert them, put them back.

Toolkit:

Extract images:
pdf2htmlEX --embed cfijo example.pdf

Convert images:
mogrify -type Grayscale -format ps *.png

...convert back to pdf with ps2pdf, join with pdfjoin or pdftk

Extract text layer:  
cpdf -draft example.pdf -o example_text.pdf

Put background images behind text layer:
pdftk  example_text.pdf multibackground example_images.pdf output modified.pdf


2.) flatten PDF to images, edit images.

Toolkit:

Burst pdf to single pages:
pdftk example.pdf burst

Convert to PostScript:
pdftops (and not pdf2ps)

Flatten ps to pnm and edit image
mogrify -format pnm -density 200x200 -type grayscale *.ps

...convert back to pdf with ps2pdf, join with pdfjoin or pdftk
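Put together, route 2 looks roughly like this (a sketch of the commands above; pdftk burst names the pages pg_0001.pdf, pg_0002.pdf, and so on):

#!/bin/bash
# flatten a color pdf into grayscale page images, then rebuild a pdf
pdftk example.pdf burst
for page in pg_*.pdf; do
 pdftops "$page"              # pg_0001.pdf -> pg_0001.ps
done
mogrify -format pnm -density 200x200 -type grayscale pg_*.ps  # flatten to grayscale bitmaps
mogrify -format ps pg_*.pnm   # back to PostScript (overwrites the color .ps files)
for page in pg_*.ps; do
 ps2pdf "$page"               # pg_0001.ps -> pg_0001.pdf (overwrites the color page)
done
pdfjoin pg_*.pdf --outfile example_gray.pdf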



***
In the end I chose the second route, because I had some trouble with the fonts in the pdf: they were not exported properly and left black text boxes in the images (making the text unreadable after putting the images back behind the text layer).

Also, the first route would have involved a lot of learning about PDF structure, layer manipulation, and so on.

February 27, 2015

PDF to HTML for eBook creation

My general goal is to process PDF files through an HTML/XML state before making eBooks (mobi/epub) from them.
So the goal is to somehow generate a clean HTML file out of a PDF.
Some of my general needs are:
  • keep paragraphs together (not mixing up <br /> with <p></p>)
  • get images and text together
  • keep character formatting
  • handle multiple columns (convert to single column)
  • skip page numbers
First of all, here's what I found important from the Google search results:

Thomas Levine's Parsing PDF files walk-through
Tools to use:
  • Basic file analysis tools (ls or another language’s equivalent)
  • PDF metadata tools (pdfinfo or an equivalent)
  • pdftotext
  • pdftohtml -xml
  • Inkscape via pdf2svg
  • PDFMiner
My own experiments:
With this PDF file, and another one that I made for this purpose.


PDFtoHTML
$ pdftohtml example.pdf
  • incorrectly displayed character encoding
    • <meta http-equiv="CONTENT-TYPE" content="text/html; charset=utf-8"> has to be entered in the HTML HEAD for proper character encoding 
    • could put a -enc UTF-8 (Case Sensitive!) as an option if needed, but the header meta still has to be entered manually. 
    • the meta gets entered in the document if the -noframes option is used.
  • no paragraphs, only breaks
  • a navigation html file was generated with two frames: one for a page index html, and one for the actual text.
    • this can be avoided using the -noframes option
    • the -s option generates a single output (however, this only concatenates the per-page html files, and links are not corrected)
  • pages are separated with horizontal rulers
  • images are processed all right
 $ pdftohtml -c example.pdf
  • formatting is mostly strictly preserved
    • styles by css
    • absolute positions of paragraphs
      • some paragraphs are kept together, some are not recognized properly
    • left text alignment is the only alignment that is kept; everything else is rendered with an absolute left position.
    • bold and italics are preserved, but underline shows up wrong
    • columns are preserved okay
    • font face changes do not show.
  • images are embedded in page background
  • every page is a separate html file
$ pdftohtml -xml example.pdf
  • stores formatting info about absolute positions, font size and line height
  • no info about paragraphs, images
Altogether, my experience with PDFtoHTML: it's almost good for nothing unless it is post-processed properly afterwards.

Postprocessing a file generated with the pdftohtml -enc UTF-8 -noframes -p -q example.pdf command:
Structure:
  • style element and some meta element in the html head are unnecessary.
  • <a name=[pagenumber]></a> marks the beginning of every page
  • <br/> in the middle of a line marks line break
  • <br/> at the end of the line marks paragraph break
  • <hr/> marks the end of every page
  • the last line before the end of page is probably a page number
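Based on that structure, a first stab at a cleaning script could look like this (a rough sketch with GNU sed, certainly incomplete):

#!/bin/bash
# rough cleanup of pdftohtml -noframes output, per the observations above
sed -e 's|<a name="[0-9]*"></a>||g' \
    -e 's|<hr/>||g' \
    -e 's|<br/>$|</p>\n<p>|g' \
    example.html > cleaned.html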
PDFMiner
Download and Install PDFMiner.
Review the command line tools and their capabilities.

$ pdf2txt.py -o example.html  example.pdf
or
$ pdf2txt.py -Y normal -o example.html  example.pdf 
  • no paragraphs, only breaks
    (paragraphs are not collected -- no easy way to restore them, unlike in pdftohtml)
  • formatting is not css but html span tags
    • Html Tidy can collect the formatting to the front of the html file as css, making it easier to review and modify:
      tidy -utf8 -c -o example_tidy.html  example.html
  • bold and italics are kept as font family style in span tag, underline is taken as an image (?)
  • images not processed
  • display is messy, text displayed on top of each other
  • code is quite all right.
$ pdf2txt.py -o example.xml  example.pdf
or
$ pdf2txt.py -Y exact -o example.html  example.pdf 
  • stores exact position of every single character
  • holds space for images
$ pdf2txt.py -t tag -o example.txt  example.pdf
  • stores pdf page data, with unformatted text content
$ pdf2txt.py -Y loose -o example.html  example.pdf
  • does not keep the line breaks; only the span style is present to indicate text changes. Paragraphs are not recognized properly
  • messy code
Altogether, my experience with PDFMiner: this is not what I'm looking for. It either stores too much or too little information for my purposes, so in my case it's actually good for nothing without proper post-processing.


pdf2htmlEX
$ pdf2htmlEX example.pdf
  • omg wow amazing pretty output view!
    • everything looks exactly like the pdf file
  • (in exchange for an) extremely messy code :)
    • one page is one line, identified by a (div) id.
    • formatting is kept in classes (div, span, img, ...) by CSS:
      • @font-face {}
      • @media {}
      • .ff: font-family
      • (t) m: transformation matrix
      • v: vertical-align
      • ls: letter-spacing
      • sc: text-shadow
      • ws: word-spacing
      • _: display and width or margin-left
      • fc: color
      • fs: font-size
      • y: bottom
      • h: height
      • w: width
      • x: left
  • all file embedding can be turned off with --embed cfijo (will generate separate output files)
Altogether useless for my purposes. However, it's the best choice if your purpose is to display a pdf file as an html page on the web.

PdfMasher (GUI)
Does not keep formatting or images, but is specialized to keep proper order of the text.
  • as said, does not keep font formatting or image placeholders
  • with a little manual adjustment:
    • can be set to ignore page numbers (amazing!)
    • can be set to collect and link footnotes to be endnotes (amazing!)
  • html code is quite clear
  • paragraphs are kept well where possible
Altogether, so far this is the best tool to prepare a simple text pdf for eBook creation.
Usage:
There are five types of elements that can be set in Edit mode:
  • Normal: will be default text
  • Title: becomes heading tag H1 when pressed once, H2 when pressed twice, etc.
    • best to be filtered with sorting by Font Size
  • Footnote: will search for the reference and link it as endnote
    • best to be filtered with sorting by Font Size (or X or Y, respectively)
  • Ignore: will be ignored
    • Page numbers, footers and headers are best to be filtered with sorting by Y (or X, respectively)
  • To Fix: puts a FIXME sign in front of the paragraph. In HTML this becomes italic text.
These types can be set on the Table or on the Page tab.
Build options are:
  • Generate Markdown: generates a plain text file in the pdf directory with marks specifying the Title and To Fix parts, Ignored elements already ignored, and Footnotes already linked.
  • Edit Markdown: opens the markdown text file
  • Reveal Markdown: opens the directory in the default file browser containing the markdown
  • View HTML: generates the html file out of the markdown file, and opens it in the default web browser
Markdown signs:
  • # for H1, ## for H2, etc.
  • *FIXME* for italics
  • *** for horizontal ruler  
  • numbers for lists (quite annoying)
  • more on markdown usage
  • I accidentally found an eBook creation software that works from text like this markdown text, so I'll just leave a link here for notice.
Formatting of the text can be fixed manually in the markdown form or in the html form.
When saving as MOBI or EPUB there will be Table of Contents and navigation generated from the headings. The book Start will be set to the first heading.

Summary:
  • To export all images from your file for further usage,
    use pdf2htmlEX --embed cfijo example.pdf.
    • can be opened in LibreOffice
  • To get a very simple html with proper paragraphs, endnotes, and headings, but without font formatting,
    use PDFMasher (GUI).
  • To get a fair html code with most of the images and font formatting but messed up paragraphing,
    use pdftohtml -enc UTF-8 -noframes -p -q example.pdf
    • cannot be opened in LibreOffice
That's it so far.
Probably the best way would be to learn SED and create an html cleaning script for myself, but that's likely in the distant future.

February 25, 2015

PDF smart rename and split based on content


I had been planning to write these scripts for some time to ease my work with pdf files.

The original idea was to grep some text from the pdf file and do a file manipulation based on the text.

This StackExchange post suggests the following methods to find a piece of text in a pdf:
  • pdftotext + grep
  • pdfgrep
  • strings + grep
  • pdftohtml + grep
I used pdftotext in the following examples.

I also needed to learn some more Bash scripting:
1st Quest
I have a monthly employee data sheet (payroll), which I get in one big multipage pdf file, having a single page for each employee. The content for each page is: 1st line: title, 2nd line: month, 3rd line: employee name.
I wanted to burst this pdf file into single pages and rename each page to the title, month, and employee name.
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits a pdf file into single pages, naming the output files using the first three lines of each page

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

pagecount=1
datestamp=1

# burst and rename pages to separate files
for file in "${filelist[@]}"; do
 pagecount=$(pdfinfo "$file" | grep "Pages" | awk '{ print $2 }')
 for (( pageindex=1; pageindex<=$pagecount; pageindex+=1 )); do
  # new name: first three lines of the page, up to the first "("
  newname=$(pdftotext -f $pageindex -l $pageindex "$file" - | head -n 3 | tr '\n' ' ' | cut -f1 -d"(")
  datestamp=$(date +%s%N) # to avoid overwriting files with the same new name
  pdftk "$file" cat $pageindex output "$newname"$datestamp.pdf
 done
done
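To use such a script as a right-click action, it goes into the Nautilus scripts directory (~/.gnome2/nautilus-scripts on older GNOME, ~/.local/share/nautilus/scripts on newer versions) and has to be made executable (chmod +x).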

2nd Quest
I have files I would like to rename based on the content of the file. Each file is some kind of employee data file, and has the employee name in the content, but not necessarily in the file name.
I wanted to rename the file based on the employee name and the page title (first line of the file).
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically renames pdf files based on content matching a list of names

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

names=( "Aaa Aaa" "Bbb Bbb" "Ccc Ccc") # names to match in the files

# process files
for file in "${filelist[@]}"; do
 # find name from names in the file
 foundname=''
 for name in "${names[@]}"; do
  testname=$(pdftotext "$file" - | grep "$name")
  if [[ $testname != "" ]]; then
   foundname=$name
   break
  fi
 done

 # rename file based on found name
 title=`pdftotext -f 1 -l 1 "$file" - | head -n 1`
 let "datestamp =`date +%s%N`"
 mv "$file" "${file%/*}/$foundname $title $datestamp.pdf"
done

3rd Quest
I have a yearly employee data sheet (personal tax data), which I get in one big multipage pdf file, with multiple pages for each employee. This year it is three pages per employee, and the tax ID number is the only common point on the three pages (it is a 10-digit number beginning with 8). The first line of each page is a header, and it is different on each of a single employee's pages.
I wanted to split the pdf to employees and rename it with the header of the pages.
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits a pdf file into multi-page chunks based on a search criterion, naming the output files using the criterion and some of the pdf text.

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

# process files
for file in "${filelist[@]}"; do
 pagecount=$(pdfinfo "$file" | grep "Pages" | awk '{ print $2 }')
 # MY SEARCH CRITERION is a 10-digit ID number that begins with 8:
 storedid=$(pdftotext -f 1 -l 1 "$file" - | egrep '8[0-9]{9}')
 pattern=''
 pagetitle=''
 datestamp=''

 for (( pageindex=1; pageindex<=$pagecount; pageindex+=1 )); do

  header=$(pdftotext -f $pageindex -l $pageindex "$file" - | head -n 1)
  pageid=$(pdftotext -f $pageindex -l $pageindex "$file" - | egrep '8[0-9]{9}')
  datestamp=$(date +%s%N) # to avoid overwriting files with the same new name

  # match ID found on the page to the stored ID
  if [[ "$pageid" == "$storedid" ]]; then
   pattern+="$pageindex " # adds number as text to variable separated by spaces
   pagetitle+="$header+"

   if [[ $pageindex == $pagecount ]]; then #process last output of the file 
    pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
    storedid=0
    pattern=''
    pagetitle=''
   fi
  else 
   #process previous set of pages to output
   pdftk "$file" cat $pattern output "$storedid $pagetitle $datestamp.pdf"
   storedid=$pageid
   pattern="$pageindex "
   pagetitle="$header+"
  fi
 done
done

4th Quest
I wanted an alternate version of the 3rd Quest in case I have to split some other type of file into equal sets of pages, without a tax ID as a common point in them.
So I wanted to split the multipage pdf based on a manually set block size, e.g. 3 pages per set.
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits a pdf file into blocks of a given size, naming the output files with the original filename plus a datestamp

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset IFS

# process files
for file in "${filelist[@]}"; do

 # variables
 blocksize=3 # ADD BLOCK SIZE MANUALLY HERE
 blocks=()
 filename=${file%.pdf} # remove extension from filename

 # calculate page range blocks
 pagecount=$(pdfinfo "$file" | grep "Pages" | awk '{ print $2 }')
 let setcount=$pagecount/$blocksize 
 for (( setindex=1; setindex<=$setcount; setindex+=1 )); do

  if [[ $setindex -lt $setcount ]]; then
   let lastpage=$setindex*$blocksize
   let firstpage=$lastpage-$blocksize+1

  elif [[ $setindex -eq $setcount ]]; then # the last block absorbs any remainder pages
   let lastpage=$pagecount
   let firstpage=$setindex*$blocksize-$blocksize+1

  fi
  blocks+=("$firstpage-$lastpage")
 done

 # process ranges to output
 for block in "${blocks[@]}"; do
  let "datestamp =`date +%s%N`" # to avoid overwriting with same new name
  pdftk $file cat $block output "$filename $datestamp.pdf"
 done
done

That's it. Could be better, but is functioning all right.

February 19, 2014

Modify PDF pages as images, then reintegrate

What just happened?!

I successfully reintegrated a modified PDF in the original file!

wow! that is definitely a level up in image-pdf editing for me.

pdftk input.pdf cat 2 output page2.pdf

pdfinfo page2.pdf
  Page size: 595 x 842 pts (A4)

pdfimages page2.pdf img

identify -verbose img-000.pbm
  Geometry: 2496x3440+0+0
  Resolution: 72x72
  Page geometry: 2496x3440+0+0

Modified the image through Gimp: Open, Edit, Save -> Export -> click OK whenever needed.

convert img-000.pbm -resample 300x300 -resize 2496x3440 img2.ps

identify -verbose img2.ps
  Geometry: 599x826+0+0
  Resolution: 72x72
  Page geometry: 599x826+0+0

ps2pdf12 -sPAPERSIZE=a4 -dFIXEDRESOLUTION img2.ps page2_v2.pdf

pdfinfo page2_v2.pdf
  Page size: 595 x 842 pts (A4)

pdftk A=input.pdf B=page2_v2.pdf cat A1 B A3-end output output.pdf

December 29, 2012

Small size Color PDF

After years of struggling with color PDFs, I realized today that color PS exists, which led to a fast solution to the given problem.

Here's how it goes:

  1. GSCAN2PDF
    scan image
    at 8-bit depth and 300 dpi resolution.
    save in PNM file format.
    this results in a file of approx. 20MB.
  2. CONVERT
    resize
    to 50%
    convert bigimg.pnm -resize 50% smallimg.pnm
    file size abt. 5MB
  3. PNMTOPS
    convert to ps
    pnmtops -equalpixels smallimg.pnm > smallimg.ps
    file size abt. 10MB
  4. PS2PDF
    convert to pdf
    ps2pdf12 smallimg.ps smallimg.pdf
    (Sadly, I could not manage to get the pdf cropped to the proper size with ps2pdf, so I have to do it manually in the next step. However, if you know the solution to this problem, do not hold back on sharing it with me!)
    file size abt. 200KB
  5. PDFCROP
    crop pdf of unnecessary margin
    pdfcrop smallimg.pdf
    (resulting in smallimg-crop.pdf)
    final file size abt. 200KB
This is how you do a good quality color pdf from good quality large images.
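The whole chain for one scanned page, in one place (the commands from the steps above):

#!/bin/bash
# scanned pnm -> small color pdf
convert bigimg.pnm -resize 50% smallimg.pnm      # 2. resize
pnmtops -equalpixels smallimg.pnm > smallimg.ps  # 3. to ps
ps2pdf12 smallimg.ps smallimg.pdf                # 4. to pdf
pdfcrop smallimg.pdf                             # 5. crop -> smallimg-crop.pdf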

...And a couple of months later (19 May 2013) I realized that my method of producing black-and-white pdfs, which I have been using for years, is just as good for producing color pdfs as the above, or even better: with the old method I do not have to use pdfcrop, and the pdf page size of the color cover will be compatible with the b&w pages...

August 26, 2012

Watermarking PDFs

Program used:
PDFTK (get from Synaptic)
* Apply a Background Watermark or a Foreground Stamp

Method:


pdftk <input file> stamp <stamp file> output <output file>

What happens?

I made a transparent background GIF with GIMP, converted it to pdf with Mogrify, and applied it to the pdf.

If the watermark image has a different size than the original pdf page, pdftk scales the watermark to fit the page height, keeps the watermark's aspect ratio, and centers it on the original page.
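Concretely, something like this should do it (file names are just examples; stamp puts the mark over the page, background puts it behind):

mogrify -format pdf watermark.gif                           # watermark.gif -> watermark.pdf
pdftk input.pdf stamp watermark.pdf output stamped.pdf      # foreground stamp
pdftk input.pdf background watermark.pdf output marked.pdf  # background watermark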

November 13, 2010

image to pdf

I really can't handle ImageMagick when it comes to pdf files. It creates such huge files that they are completely useless.

I tried out another way: pbm -> ps -> pdf:

first I convert the pbm files with ImageMagick:
mogrify -format ps -density 300x300 *.pbm
then put them into ps2pdf:
#!/bin/bash
lista=($(find . -iname '*.ps' | sort))
for i in "${lista[@]}"
do
ps2pdf -r300 -g1880x2688 "$i" "$i.pdf"
done

and bind them:
pdfjoin *.pdf --outfile output.pdf

It worked brilliantly. This way my pdf is 8 times smaller than my original pdf, and it's only 1.5 times bigger than the djvu I made from the same images.

Djvu was made directly from pbm files, converted with cjb2 and bound with djvm. No special options were used.
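For reference, the djvu side was something like this (a sketch from memory):

for f in *.pbm; do cjb2 "$f" "${f%.pbm}.djvu"; done  # one single-page djvu per pbm
djvm -c book.djvu *.djvu                             # bundle the pages into one document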

March 30, 2009

Notice: pdfnup --trim

It removes a given amount of margin on each side, with the syntax 'left bottom right top'.
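For example, to cut 1 cm off every side (output goes to input-nup.pdf; --clip true makes the trim actually remove the margin):

pdfnup --nup 1x1 --trim '1cm 1cm 1cm 1cm' --clip true input.pdf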

January 3, 2009

ImageMagick converting multiple files

*I do not use this method anymore, although it was a great start.*

After being a book-worm for half a day, I found out how I can write a working script to solve my imagemagick output problem.
It looks like this:


#!/bin/sh

for i in $(find . -iname '*.jpg')
do
convert "$i" -background white -flatten -colorspace Gray -negate -edge 1 -negate -normalize -sharpen 1 -threshold 50% -despeckle -blur 0x.5 -normalize +dither -posterize 8 "$i.gif"
done
1.) open your favorite text editor
2.) copy paste the script
3.) save as anyname.sh
4.) make it executable

Usage:
1.) copy anyname.sh into the folder containing only the image files you want to simplify
2.) go (cd) to this folder in the terminal
3.) type ./anyname.sh
4.) wait and see.

How to convert this to pdf?

mogrify -format pdf *.JPG

How to join these single-page pdfs?

pdfjoin *.pdf --outfile newname.pdf

September 23, 2008

Converting Tiff to PDF

sudo apt-get install libtiff-tools

tiff2pdf -z -o pdf_whatever_file.pdf tiff_whatever_file.tiff
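And for a whole directory, a minimal loop:

for f in *.tiff; do tiff2pdf -z -o "${f%.tiff}.pdf" "$f"; done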