OK, so I was looking for a way to delete perceptually blank pages from PDF files in bulk. I found scripts that remove truly blank pages, but that doesn’t help with scanned pages, where a single stray pixel is enough to make a page count as non-blank.
The script below, which I run on CentOS 7, checks each page of a PDF file for ink coverage of less than or equal to 0.1% and then removes those pages.
The trick here is to use the new(ish) inkcov device in Ghostscript (selected with -sDEVICE=inkcov). It reports how much of each page is covered by each of the four CMYK colours, as a fraction where 1 means 100%. I add these four values together with awk to get the total coverage of the page.
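As a rough sketch of the summing step on its own: the snippet below assumes inkcov prints one line per page with four numbers followed by the word CMYK. The helper name total_coverage and the sample values are made up for illustration, so no Ghostscript is needed to try it.

```shell
#!/bin/sh
# Hypothetical helper: reads inkcov-style output on stdin and prints the
# summed C+M+Y+K coverage of the line(s) tagged CMYK.
total_coverage() {
    awk '/CMYK/ { sum += $1 + $2 + $3 + $4 } END { printf "%.5f\n", sum }'
}

# Made-up sample resembling what gs prints for one nearly blank page.
printf 'Page 1\n 0.00000  0.00000  0.00000  0.00012 CMYK OK\n' | total_coverage
# prints 0.00012
```

Because the four channels are added together, a page with ink in several channels is counted more than once, so the total is a rough measure rather than an exact percentage of the page area.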
If that total is at or below the threshold, the page number is not added to the list of pages to keep.
Finally, pass the list of pages to keep to pdftk and create a new PDF with them.
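To make the last two steps concrete, here is a minimal sketch that filters a list of "page coverage" pairs and shows the pdftk command that would result. The helper pages_to_keep, the coverage values, and the names in.pdf and out.pdf are all placeholders for illustration. The command is only printed, not run.

```shell
#!/bin/sh
# Hypothetical helper: reads "page coverage" pairs on stdin and prints the
# page numbers whose coverage is above the threshold (default 0.001).
pages_to_keep() {
    awk -v threshold="${1:-0.001}" '$2 + 0 > threshold + 0 { print $1 }'
}

# Made-up coverage values for a four-page scan; pages 2 and 4 are at or
# below the threshold and are dropped.
keep=$(printf '1 0.21840\n2 0.00004\n3 0.05310\n4 0.00100\n' | pages_to_keep | tr '\n' ' ')

# Print the pdftk command instead of running it.
echo "pdftk in.pdf cat ${keep}output out.pdf"
# prints: pdftk in.pdf cat 1 3 output out.pdf
```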
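#!/bin/sh
# Pass the input PDF as the first argument.
# The output is written to the current directory as <basename>.pdf, so run
# this from a directory other than the one holding the input file.
IN="$1"
filename=$(basename "${IN}")
filename="${filename%.*}"
# Total number of pages in the input
PAGES=$(pdfinfo "${IN}" | grep ^Pages: | tr -dc '0-9')

# Print the number of every page whose summed ink coverage exceeds 0.1%
non_blank() {
    for i in $(seq 1 "$PAGES")
    do
        # Sum the four CMYK coverage values reported by the inkcov device
        PERCENT=$(gs -o - -dFirstPage="${i}" -dLastPage="${i}" -sDEVICE=inkcov "${IN}" | grep CMYK | awk 'BEGIN { sum=0; } {sum += $1 + $2 + $3 + $4;} END { printf "%.5f\n", sum } ')
        if [ "$(echo "$PERCENT > 0.001" | bc)" -eq 1 ]
        then
            echo "$i"
            #echo $i 1>&2
        fi
        # Progress indicator on stderr (printf is more portable than echo -n)
        printf . 1>&2
    done | tee "${filename}.tmp"
    echo 1>&2
}

set +x
# $(non_blank) is left unquoted on purpose so each page number becomes its own argument
pdftk "${IN}" cat $(non_blank) output "${filename}.pdf"
if [ $? -eq 0 ]
then
    rm "${filename}.tmp"
    # Uncomment the line below to delete the input file
    # rm "${IN}"
fi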
Very good – Works well!