February 25, 2015

PDF smart rename and split based on content


I had been planning to write these scripts for some time to make working with PDF files easier.

The original idea was to grep some text out of a PDF file and then manipulate the file based on that text.

This StackExchange post suggests the following methods to find a piece of text in a pdf:
  • pdftotext + grep
  • pdfgrep
  • strings + grep
  • pdftohtml + grep
I used pdftotext in the following examples.
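
The pdftotext + grep method can be wrapped in a small helper. This is only a minimal sketch, assuming pdftotext (from poppler-utils) is installed; pdf_contains and payroll.pdf are names made up for the illustration:

```shell
#!/bin/bash
# pdf_contains is a hypothetical helper name, not part of any tool.
# It succeeds (exit status 0) if the text of the PDF contains the given string.
pdf_contains() {
  local pdf=$1 needle=$2
  # "-" makes pdftotext write the extracted text to stdout.
  # grep -q reports only through its exit status; -F treats the needle
  # as a literal string rather than a regular expression.
  pdftotext "$pdf" - | grep -qF -- "$needle"
}

# Usage (payroll.pdf is a placeholder file name):
#   pdf_contains payroll.pdf "Aaa Aaa" && echo "found"
```

Quoting "$needle" matters here, because a name with a space in it would otherwise be split into a pattern and a file name.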

I also needed to learn some more Bash scripting along the way.
1st Quest
I have a monthly employee data sheet (payroll), which I get as one big multipage PDF file with a single page for each employee. Each page starts with the same three lines: the 1st line is the title, the 2nd the month, the 3rd the employee name.
I wanted to burst this PDF file into single pages and name each page after its title, month and employee name.
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits a pdf file into single pages, naming each output file after the first three lines of that page

# read the selected files (one path per line) into an array
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS")

# burst and rename pages to separate files
for file in "${filelist[@]}"; do
 pagecount=$(pdfinfo "$file" | awk '/^Pages:/ { print $2 }')
 for (( pageindex=1; pageindex<=pagecount; pageindex+=1 )); do
  # join the first three lines with spaces and cut at the first "("; "/" cannot appear in a file name
  newname=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - | head -n 3 | tr '\n' ' ' | cut -f1 -d"(" | tr '/' '-')
  datestamp=$(date +%s%N) # to avoid overwriting with same new name
  pdftk "$file" cat "$pageindex" output "$newname$datestamp.pdf"
 done
done
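
To see what the naming pipeline produces, it can be tried on plain text without any PDF. A small sketch; page_name is a hypothetical helper name, and the trailing sed (which trims the space left behind by tr) is an addition for the illustration:

```shell
#!/bin/bash
# page_name is a hypothetical helper: it reads page text on stdin and
# prints a file-name-friendly string built from the first three lines.
page_name() {
  head -n 3 |          # title, month, employee name
    tr '\n' ' ' |      # join the three lines with spaces
    cut -f1 -d'(' |    # drop everything from the first "(" on
    tr '/' '-' |       # "/" is not allowed inside a file name
    sed 's/ *$//'      # trim trailing spaces
}

printf 'Payroll\nJanuary 2015\nAaa Aaa (123)\nmore text\n' | page_name
# prints: Payroll January 2015 Aaa Aaa
```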

2nd Quest
I have files I would like to rename based on their content. Each file is some kind of employee data file and has the employee name in its content, but not necessarily in its file name.
I wanted to rename each file based on the employee name and the page title (the first line of the file).
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically renames pdf files based on content matching a list of names

# read the selected files (one path per line) into an array
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS")

names=( "Aaa Aaa" "Bbb Bbb" "Ccc Ccc" ) # names to match in the files

# process files
for file in "${filelist[@]}"; do
 # find the first name from names that occurs in the file
 foundname=''
 text=$(pdftotext "$file" -) # extract the text once instead of once per name
 for name in "${names[@]}"; do
  # the quotes keep a name with a space as one pattern; -F matches it literally
  if grep -qF -- "$name" <<< "$text"; then
   foundname=$name
   break
  fi
 done

 # leave the file alone if none of the names was found
 [[ -z $foundname ]] && continue

 # rename file based on found name and the first line of the first page
 title=$(pdftotext -f 1 -l 1 "$file" - | head -n 1)
 datestamp=$(date +%s%N)
 mv -- "$file" "${file%/*}/$foundname $title $datestamp.pdf"
done
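
The name lookup itself can be tested on plain text. A minimal sketch, where first_match is a hypothetical helper name and the sample text is made up:

```shell
#!/bin/bash
# first_match is a hypothetical helper: it reads text on stdin and prints
# the first of its arguments that occurs in that text.
# It returns 1 if none of the names occurs.
first_match() {
  local text name
  text=$(cat)
  for name in "$@"; do
    # -F: match the name literally; -q: report only via exit status
    if printf '%s\n' "$text" | grep -qF -- "$name"; then
      printf '%s\n' "$name"
      return 0
    fi
  done
  return 1
}

printf 'Tax sheet\nEmployee: Bbb Bbb\n' | first_match "Aaa Aaa" "Bbb Bbb" "Ccc Ccc"
# prints: Bbb Bbb
```

The non-zero return value makes it easy to skip a file when no name matches, instead of renaming it with an empty name.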

3rd Quest
I have a yearly employee data sheet (personal tax data), which I get as one big multipage PDF file with multiple pages for each employee. This year it is three pages per employee, but the tax ID number is the only thing the three pages have in common (it is a 10-digit number beginning with 8). The first line of each page is a header, and each of an employee's pages has a different header.
I wanted to split the PDF by employee and name each output file after the headers of its pages.
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits a pdf file into sets of consecutive pages that share the same ID, naming each output file after the ID and the page headers

# read the selected files (one path per line) into an array
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS")

# write the pages collected so far to a new file (does nothing if there are none)
flush() {
 (( ${#pages[@]} )) || return
 datestamp=$(date +%s%N) # to avoid overwriting with same new name
 pdftk "$file" cat "${pages[@]}" output "$storedid $pagetitle $datestamp.pdf"
}

# process files
for file in "${filelist[@]}"; do
 pagecount=$(pdfinfo "$file" | awk '/^Pages:/ { print $2 }')
 storedid=''
 pages=()
 pagetitle=''

 for (( pageindex=1; pageindex<=pagecount; pageindex+=1 )); do
  pagetext=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" -)
  header=$(head -n 1 <<< "$pagetext")
  # MY SEARCH CRITERIA is a 10 digit long ID number that begins with number 8
  # (-o prints only the number, -w rejects matches inside a longer number)
  pageid=$(grep -owE '8[0-9]{9}' <<< "$pagetext" | head -n 1)

  # a different ID starts a new employee: output the previous set of pages first
  if [[ $pageid != "$storedid" ]]; then
   flush
   storedid=$pageid
   pages=()
   pagetitle=''
  fi
  pages+=("$pageindex")
  pagetitle+="$header+"
 done
 flush # output the last set of pages of the file
done
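
The two ideas in this script, picking the ID out of the page text and grouping consecutive pages with the same ID, can be tried without a PDF. A rough sketch; extract_id, group_pages and the sample IDs are made up for the illustration, and grep -o / -w are assumed to behave as in GNU grep:

```shell
#!/bin/bash
# extract_id and group_pages are hypothetical helper names used only here.

# Print the first standalone 10-digit number starting with 8 found on stdin.
extract_id() {
  grep -owE '8[0-9]{9}' | head -n 1
}

# Take one ID per page (in page order) and print "ID: page page ..." for
# each run of consecutive pages that share the same ID.
group_pages() {
  local current='' pages='' page=0 id
  for id in "$@"; do
    page=$(( page + 1 ))
    if [ "$id" != "$current" ] && [ -n "$pages" ]; then
      printf '%s: %s\n' "$current" "$pages"
      pages=''
    fi
    current=$id
    pages="${pages:+$pages }$page"
  done
  # the last run is only complete once the loop has ended
  if [ -n "$pages" ]; then
    printf '%s: %s\n' "$current" "$pages"
  fi
}

printf 'Header\nTax ID: 8123456789\n' | extract_id
# prints: 8123456789

group_pages 8000000001 8000000001 8000000001 8000000002 8000000002
# prints:
# 8000000001: 1 2 3
# 8000000002: 4 5
```

Printing the last run after the loop is the important part: a set of pages is only known to be complete when the ID changes or the file ends.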

4th Quest
I wanted an alternate version of the 3rd Quest for splitting any other type of file into equal sets of pages when the pages do not share something like the tax ID.
So I wanted to split the multipage PDF based on a manually set block size, e.g. 3 pages per set.
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits a pdf file into blocks of a given number of pages, and names the output files using the original filename plus datestamp

# read the selected files (one path per line) into an array
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS")

# process files
for file in "${filelist[@]}"; do

 # variables
 blocksize=3 # ADD BLOCK SIZE MANUALLY HERE
 blocks=()
 filename=${file%.pdf} # remove extension from filename

 # calculate page range blocks
 pagecount=$(pdfinfo "$file" | awk '/^Pages:/ { print $2 }')
 setcount=$(( pagecount / blocksize ))
 # fewer pages than one block: keep them as a single set instead of producing nothing
 (( setcount == 0 && pagecount > 0 )) && setcount=1
 for (( setindex=1; setindex<=setcount; setindex+=1 )); do
  firstpage=$(( (setindex - 1) * blocksize + 1 ))
  if (( setindex < setcount )); then
   lastpage=$(( setindex * blocksize ))
  else
   lastpage=$pagecount # the last set also takes any leftover pages
  fi
  blocks+=("$firstpage-$lastpage")
 done

 # process ranges to output
 for block in "${blocks[@]}"; do
  datestamp=$(date +%s%N) # to avoid overwriting with same new name
  pdftk "$file" cat "$block" output "$filename $datestamp.pdf"
 done
done
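
In the script above, leftover pages end up in the last set, so 10 pages with a block size of 3 give the ranges 1-3, 4-6 and 7-10. If the leftover pages should rather form their own shorter set, the ranges could be computed like this. A sketch only; block_ranges is a hypothetical helper name:

```shell
#!/bin/bash
# block_ranges is a hypothetical helper: it prints one "first-last" page
# range per line; the last range is shorter if the page count is not a
# multiple of the block size.
block_ranges() {
  local pagecount=$1 blocksize=$2 first=1 last
  while [ "$first" -le "$pagecount" ]; do
    last=$(( first + blocksize - 1 ))
    if [ "$last" -gt "$pagecount" ]; then
      last=$pagecount # do not run past the end of the file
    fi
    printf '%s-%s\n' "$first" "$last"
    first=$(( last + 1 ))
  done
}

block_ranges 10 3
# prints:
# 1-3
# 4-6
# 7-9
# 10-10
```

Which of the two behaviours is better depends on the file: for sheets with a fixed number of pages per employee, a leftover page usually means the block size is wrong.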

That's it. The scripts could be better, but they work all right.
