I had been planning to write these scripts for some time to make my work with pdf files easier.
The original idea was to grep some text from a pdf file and then manipulate the file based on that text.
This StackExchange post suggests the following methods to find a piece of text in a pdf:
- pdftotext + grep
- pdfgrep
- strings + grep
- pdftohtml + grep
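As a small sketch of the first method, the helper below reports whether a pdf contains a given piece of text. The function name `pdf_contains` is made up for this example, and it assumes that `pdftotext` (from poppler-utils) is installed.

```shell
# Sketch of "pdftotext + grep" wrapped in a helper.
# pdf_contains is a hypothetical name; pdftotext is assumed to be on the PATH.
pdf_contains() {
    # $1 = pdf path, $2 = literal text to look for
    # "-" sends the extracted text to stdout; -F treats the text as a fixed
    # string and -q suppresses output, so only the exit status is used.
    pdftotext "$1" - | grep -qF -- "$2"
}

# Usage (the file name is only an example):
#   if pdf_contains payroll.pdf "Aaa Aaa"; then echo "found"; fi
```

Because the function only returns an exit status, it can be used directly in an `if` without capturing any output.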
I also needed to learn some more Bash scripting:
1st Quest
I have a monthly employee data sheet (payroll), which I get as one big multipage pdf file with a single page for each employee. Each page starts with the same three lines: 1st line: title, 2nd line: month, 3rd line: employee name.
I wanted to burst this pdf file into single pages and name each page after its title, month and employee name.
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits pdf file to single pages while renaming the output files using the first three lines of the pdf
# read files (IFS is changed only for the read command, so there is nothing to unset afterwards)
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS")
# burst and rename pages to separate files
for file in "${filelist[@]}"; do
    pagecount=$(pdfinfo "$file" | awk '/^Pages:/ { print $2 }')
    for (( pageindex=1; pageindex<=pagecount; pageindex+=1 )); do
        # first three lines of the page, joined with spaces and cut at the first "("
        newname=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - | head -n 3 | tr '\n' ' ' | cut -f1 -d"(")
        datestamp=$(date +%s%N) # to avoid overwriting with same new name
        pdftk "$file" cat "$pageindex" output "$newname$datestamp.pdf"
    done
done
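Since the new file name comes straight from the page text, it can contain characters such as `/` that do not belong in a file name. One possible guard is sketched below; the function name `sanitize_name` is hypothetical and is not part of the script above.

```shell
# sanitize_name is a hypothetical helper, not part of the original script.
# It replaces "/" and control characters with spaces, squeezes runs of
# spaces into one, and trims a leading or trailing space.
sanitize_name() {
    printf '%s' "$1" \
        | tr '/' ' ' \
        | tr '[:cntrl:]' ' ' \
        | tr -s ' ' \
        | sed 's/^ //; s/ $//'
}

# Usage:
#   newname=$(sanitize_name "$newname")
```

The cleaned name could then be passed to `pdftk` in place of the raw text.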
2nd Quest
I have files I would like to rename based on their content. Each file is some kind of employee data file and has the employee name in the content, but not necessarily in the file name.
I wanted to rename each file based on the employee name and the page title (first line of the file).
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically renames pdf file based on content matching a list of names
# read files (IFS is changed only for the read command, so there is nothing to unset afterwards)
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS")
names=( "Aaa Aaa" "Bbb Bbb" "Ccc Ccc" ) # names to match in the files
# process files
for file in "${filelist[@]}"; do
    # find name from names in the file
    foundname=''
    for name in "${names[@]}"; do
        # the name must be quoted, otherwise grep takes its second word as a file name
        if pdftotext "$file" - | grep -qF -- "$name"; then
            foundname=$name
            break
        fi
    done
    # rename file based on found name
    title=$(pdftotext -f 1 -l 1 "$file" - | head -n 1)
    datestamp=$(date +%s%N) # to avoid overwriting with same new name
    mv "$file" "${file%/*}/$foundname $title $datestamp.pdf"
done
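The inner loop above runs `pdftotext` once for every name in the list. A possible variation, sketched here, extracts the text once and checks all names against it. The helper name `find_name` is made up for this example; its non-zero exit status also makes it easy to skip files where no name matches.

```shell
# find_name is a hypothetical helper: it reads already extracted text on
# stdin and prints the first of its arguments that occurs in that text.
find_name() {
    local text name
    text=$(cat)                      # read the whole text once
    for name in "$@"; do
        if printf '%s\n' "$text" | grep -qF -- "$name"; then
            printf '%s\n' "$name"
            return 0
        fi
    done
    return 1                         # no name matched
}

# Usage (names are placeholders, as in the script above):
#   if foundname=$(pdftotext "$file" - | find_name "${names[@]}"); then
#       echo "matched: $foundname"
#   else
#       echo "no name found in $file"
#   fi
```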
3rd Quest
I have a yearly employee data sheet (personal tax data), which I get as one big multipage pdf file with multiple pages for each employee. This year it is three pages per employee, but the tax ID number is the only thing the three pages have in common (it is a 10-digit number beginning with 8). The first line of each page is a header, and it is different on each of an employee's pages.
I wanted to split the pdf by employee and name each output file using the headers of its pages.
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits pdf file to multiple pages based on search criteria while renaming the output files using the search criteria and some of the pdf text.
# read files (IFS is changed only for the read command, so there is nothing to unset afterwards)
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS")
# process files
for file in "${filelist[@]}"; do
    pagecount=$(pdfinfo "$file" | awk '/^Pages:/ { print $2 }')
    storedid=''
    pattern=()
    pagetitle=''
    for (( pageindex=1; pageindex<=pagecount; pageindex+=1 )); do
        header=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - | head -n 1)
        # MY SEARCH CRITERIA is a 10 digit long ID number that begins with number 8:
        # '8[0-9]{9}' is an 8 followed by nine more digits; -o prints only the ID itself
        pageid=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - | grep -oE '8[0-9]{9}' | head -n 1)
        # a different ID starts a new set: write out the previous set of pages first
        if [[ $pageindex -gt 1 && "$pageid" != "$storedid" ]]; then
            datestamp=$(date +%s%N) # to avoid overwriting with same new name
            pdftk "$file" cat "${pattern[@]}" output "$storedid $pagetitle $datestamp.pdf"
            pattern=()
            pagetitle=''
        fi
        storedid=$pageid
        pattern+=("$pageindex") # collect the page numbers of the current set
        pagetitle+="$header+"
    done
    # write out the last set of the file
    if (( ${#pattern[@]} > 0 )); then
        datestamp=$(date +%s%N)
        pdftk "$file" cat "${pattern[@]}" output "$storedid $pagetitle $datestamp.pdf"
    fi
done
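The grouping step can be tried out without any pdf at all. The sketch below takes one ID per line, in page order, and prints each run of consecutive pages that share an ID. The name `group_pages` and the output format are assumptions made for this example, not something the script above uses.

```shell
# group_pages is a hypothetical helper.
# Input:  one line per page, in page order, holding the ID found on that page.
# Output: one line per run of consecutive pages with the same ID,
#         in the form "ID: page page ...".
group_pages() {
    local id='' pageid pages='' page=0
    while IFS= read -r pageid; do
        page=$((page + 1))
        # a different ID closes the previous run
        if [ "$page" -gt 1 ] && [ "$pageid" != "$id" ]; then
            printf '%s:%s\n' "$id" "$pages"
            pages=''
        fi
        id=$pageid
        pages="$pages $page"
    done
    # print the last run, if there was any input
    if [ "$page" -gt 0 ]; then
        printf '%s:%s\n' "$id" "$pages"
    fi
}

# Usage with made-up IDs:
#   printf '%s\n' 8111111111 8111111111 8222222222 | group_pages
# To feed it from a pdf, print one line per page (an empty line if a page has no ID):
#   for (( p=1; p<=pagecount; p++ )); do
#       id=$(pdftotext -f "$p" -l "$p" "$file" - | grep -oE '8[0-9]{9}' | head -n 1)
#       printf '%s\n' "$id"
#   done | group_pages
```

Each output line holds the page list that would go to `pdftk ... cat`, so the grouping can be checked before any file is written.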
4th Quest
I wanted an alternate version of the 3rd Quest in case I have to split some other type of file into equal sets of pages, where there is no tax ID to tie the pages together.
So I wanted to split the multipage pdf based on a manually set block size, e.g. 3 pages per set.
My solution is:
#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits pdf file to multiple pages based on given block size, and renames the files using the original filename plus datestamp
# read files (IFS is changed only for the read command, so there is nothing to unset afterwards)
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS")
# process files
for file in "${filelist[@]}"; do
    # variables
    blocksize=3 # ADD BLOCK SIZE MANUALLY HERE
    blocks=()
    filename=${file%.pdf} # remove extension from filename
    # calculate page range blocks
    pagecount=$(pdfinfo "$file" | awk '/^Pages:/ { print $2 }')
    setcount=$(( pagecount / blocksize )) # integer division, leftover pages go into the last block
    (( setcount == 0 )) && setcount=1 # file is shorter than one block
    for (( setindex=1; setindex<=setcount; setindex+=1 )); do
        firstpage=$(( (setindex - 1) * blocksize + 1 ))
        if (( setindex < setcount )); then
            lastpage=$(( setindex * blocksize ))
        else # last block runs to the last page
            lastpage=$pagecount
        fi
        blocks+=("$firstpage-$lastpage")
    done
    # process ranges to output
    for block in "${blocks[@]}"; do
        datestamp=$(date +%s%N) # to avoid overwriting with same new name
        pdftk "$file" cat "$block" output "$filename $datestamp.pdf"
    done
done
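The range arithmetic is easier to check on its own. The sketch below prints the page ranges for a given page count and block size; the function name `page_blocks` is invented for this example. As in the script, the last block takes any leftover pages, so 10 pages with a block size of 3 give 1-3, 4-6 and 7-10. A file shorter than one block is treated as a single block.

```shell
# page_blocks is a hypothetical helper: it prints one "first-last" page
# range per line. $1 = page count, $2 = block size.
page_blocks() {
    local pagecount=$1 blocksize=$2 setcount setindex firstpage lastpage
    setcount=$(( pagecount / blocksize ))   # integer division
    if [ "$setcount" -eq 0 ]; then
        setcount=1                          # fewer pages than one block
    fi
    setindex=1
    while [ "$setindex" -le "$setcount" ]; do
        firstpage=$(( (setindex - 1) * blocksize + 1 ))
        if [ "$setindex" -lt "$setcount" ]; then
            lastpage=$(( setindex * blocksize ))
        else
            lastpage=$pagecount             # last block absorbs the remainder
        fi
        printf '%s-%s\n' "$firstpage" "$lastpage"
        setindex=$(( setindex + 1 ))
    done
}

# Usage:
#   page_blocks 10 3
```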
That's it. It could be better, but it works all right.