Left Brain Tinkering: Scanning Double-Sided Documents With a Single Sheet Scanner Under Linux

I like to keep documents. Receipts, statements, manuals, pamphlets, you name it. I file things away in hanging folders for later reference.

Since the kiddos came along it has gotten harder and harder to make time for all of this organization, and with a huge pile of unfiled documents I knew I had to find a way to make this easier, or at least require less space.

Digitizing the documents seemed like a great answer. I got an OfficeJet 6500A about a year ago, and I've created a couple of scripts to make scanning a little easier on the Linux desktop.

I've broken down document scanning into a few separate tasks, most of which are scripted:

Scan physical documents into TIFF format
Compress/convert TIFF files into JPEG files to reduce space
Convert JPEG files into PDF (mainly for portability)
Create double-sided PDF documents (so that the front side and back side of a single page are in one PDF file)
Concatenate the double-sided PDF files into a single PDF file.

Prerequisites for Linux:

convert (package: imagemagick). The ImageMagick package provides a commandline tool called 'convert' that will handle image conversion.
scanimage (package: sane (or sometimes sane-utils)). This will actually handle the scanning.
pdftk (package: pdftk). The pdf toolkit will handle just about any PDF-related digital activity you can think of.
(In Ubuntu/Mint: sudo apt-get install imagemagick sane pdftk. If your release has no sane package, substitute sane-utils.)
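
If you want a quick check that all three tools are on your PATH before going further, something like the following should do. This is a minimal Python 3 sketch and not part of my scripts; the function name missing_tools is made up for illustration.

```python
import shutil

# Tool names come from the prerequisites list above.
REQUIRED_TOOLS = ("convert", "scanimage", "pdftk")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All prerequisites found.")
```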

Setup

In my case I'm using an HP Officejet 6500A Plus. Make sure you've set up the printer with hp-setup and that you can print to it from other applications.

Test your setup by typing "scanimage" at the command line. You should see something like:

scanimage: output is not a file, exiting

If you don't have any error messages, you can skip past the following troubleshooting steps.

Troubleshooting

Potential problem 1:

wskellenger@marquette ~ $ scanimage
scanimage: no SANE devices found

Solution:

Uncomment hpaio in /etc/sane.d/dll.conf:

...snip from dll.conf...

#umax_pp
umax1220u
#v4l    #don't care about this
hpaio   #uncomment this line
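
If you would rather not edit the file by hand, the same change can be sketched in Python. This is only an illustration that works on the text of the file; the function name enable_backend is hypothetical, and it assumes the backend sits on a line of its own behind a leading #. Writing the result back to /etc/sane.d/dll.conf needs root and is left to you.

```python
def enable_backend(conf_text, backend="hpaio"):
    """Uncomment the line that names `backend` in dll.conf-style text."""
    result = []
    for line in conf_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            body = stripped.lstrip("#").strip()
            # Compare the first word only, so a trailing note on the line is ignored.
            if body.split()[:1] == [backend]:
                line = backend
        result.append(line)
    return "\n".join(result) + "\n"


if __name__ == "__main__":
    sample = "#umax_pp\numax1220u\n#v4l\n#hpaio\n"
    print(enable_backend(sample))
```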

Potential problem 2:

Note the very last line, which contains the error:

wskellenger@marquette ~ $ scanimage
MIB search path: /home/wskellenger/.snmp/mibs:/usr/share/snmp/mibs:/usr/share/snmp/mibs/iana:/usr/share/snmp/mibs/ietf:/usr/share/mibs/site:/usr/share/snmp/mibs:/usr/share/mibs/iana:/usr/share/mibs/ietf:/usr/share/mibs/netsnmp
Cannot find module (SNMPv2-TC): At line 10 in /usr/share/snmp/mibs/UCD-DLMOD-MIB.txt
Cannot find module (SNMPv2-SMI): At line 34 in /usr/share/snmp/mibs/UCD-SNMP-MIB.txt
Cannot find module (SNMPv2-TC): At line 37 in /usr/share/snmp/mibs/UCD-SNMP-MIB.txt
Did not find 'enterprises' in module #-1 (/usr/share/snmp/mibs/UCD-SNMP-MIB.txt)
Did not find 'DisplayString' in module #-1 (/usr/share/snmp/mibs/UCD-SNMP-MIB.txt)
Did not find 'TruthValue' in module #-1 (/usr/share/snmp/mibs/UCD-SNMP-MIB.txt)
Unlinked OID in UCD-SNMP-MIB: ucdavis ::= { enterprises 2021 }
Undefined identifier: enterprises near line 39 of /usr/share/snmp/mibs/UCD-SNMP-MIB.txt
Did not find 'DisplayString' in module #-1 (/usr/share/snmp/mibs/UCD-DLMOD-MIB.txt)
Did not find 'ucdExperimental' in module UCD-SNMP-MIB (/usr/share/snmp/mibs/UCD-DLMOD-

..... stuff removed .....

Cannot adopt OID in UCD-SNMP-MIB: dskUsed ::= { dskEntry 8 }
Cannot adopt OID in UCD-SNMP-MIB: dskAvail ::= { dskEntry 7 }
Cannot adopt OID in UCD-SNMP-MIB: dskTotal ::= { dskEntry 6 }
Cannot adopt OID in UCD-SNMP-MIB: dskMinPercent ::= { dskEntry 5 }
Cannot adopt OID in UCD-SNMP-MIB: dskMinimum ::= { dskEntry 4 }
Cannot adopt OID in UCD-SNMP-MIB: dskDevice ::= { dskEntry 3 }
Cannot adopt OID in UCD-SNMP-MIB: dskPath ::= { dskEntry 2 }
Cannot adopt OID in UCD-SNMP-MIB: dskIndex ::= { dskEntry 1 }
Cannot adopt OID in UCD-DISKIO-MIB: diskIOTable ::= { ucdDiskIOMIB 1 }
Cannot adopt OID in NET-SNMP-AGENT-MIB: nsLoggingGroup ::= { nsConfigGroups 2 }
Cannot adopt OID in NET-SNMP-AGENT-MIB: nsDebugGroup ::= { nsConfigGroups 1 }
Cannot adopt OID in UCD-SNMP-MIB: snmperrErrMessage ::= { snmperrs 101 }
Cannot adopt OID in UCD-SNMP-MIB: snmperrErrorFlag ::= { snmperrs 100 }
Cannot adopt OID in UCD-SNMP-MIB: snmperrNames ::= { snmperrs 2 }
Cannot adopt OID in UCD-SNMP-MIB: snmperrIndex ::= { snmperrs 1 }
Cannot adopt OID in NET-SNMP-AGENT-MIB: nsTransactionTable ::= { nsTransactions 1 }
Cannot adopt OID in NET-SNMP-AGENT-MIB: nsLogStatus ::= { nsLoggingEntry 5 }
Cannot adopt OID in NET-SNMP-AGENT-MIB: nsLogMaxLevel ::= { nsLoggingEntry 4 }
Cannot adopt OID in NET-SNMP-AGENT-MIB: nsLogType ::= { nsLoggingEntry 3 }
Cannot adopt OID in NET-SNMP-AGENT-MIB: nsLogToken ::= { nsLoggingEntry 2 }
Cannot adopt OID in NET-SNMP-AGENT-MIB: nsLogLevel ::= { nsLoggingEntry 1 }
Cannot adopt OID in NET-SNMP-EXTEND-MIB: nsExtendResult ::= { nsExtendOutput1Entry 4 }
Cannot adopt OID in NET-SNMP-EXTEND-MIB: nsExtendOutNumLines ::= { nsExtendOutput1Entry 3 }
Cannot adopt OID in NET-SNMP-EXTEND-MIB: nsExtendOutputFull ::= { nsExtendOutput1Entry 2 }
Cannot adopt OID in NET-SNMP-EXTEND-MIB: nsExtendOutput1Line ::= { nsExtendOutput1Entry 1 }
Cannot adopt OID in NET-SNMP-EXTEND-MIB: nsExtendOutLine ::= { nsExtendOutput2Entry 2 }
Cannot adopt OID in NET-SNMP-EXTEND-MIB: nsExtendLineIndex ::= { nsExtendOutput2Entry 1 }
Cannot adopt OID in NET-SNMP-AGENT-MIB: nsNotifyStart ::= { netSnmpNotifications 1 }
Cannot adopt OID in NET-SNMP-AGENT-MIB: nsNotifyShutdown ::= { netSnmpNotifications 2 }
Cannot adopt OID in NET-SNMP-AGENT-MIB: nsNotifyRestart ::= { netSnmpNotifications 3 }
Cannot adopt OID in UCD-SNMP-MIB: laErrMessage ::= { laEntry 101 }
Cannot adopt OID in UCD-SNMP-MIB: laErrorFlag ::= { laEntry 100 }
Cannot adopt OID in UCD-SNMP-MIB: laLoadFloat ::= { laEntry 6 }
Cannot adopt OID in UCD-SNMP-MIB: laLoadInt ::= { laEntry 5 }
Cannot adopt OID in UCD-SNMP-MIB: laConfig ::= { laEntry 4 }
Cannot adopt OID in UCD-SNMP-MIB: laLoad ::= { laEntry 3 }
Cannot adopt OID in UCD-SNMP-MIB: laNames ::= { laEntry 2 }
Cannot adopt OID in UCD-SNMP-MIB: laIndex ::= { laEntry 1 }
scanimage: open of device hpaio:/net/Officejet_6500_E710n-z?ip=192.168.1.125 failed: Error during device I/O

Solution:

I had this issue when my wireless printer was set to get an IP address automatically using DHCP. In that case the HP Device Manager may also indicate that it is having trouble communicating with the printer.

Do the following:

Drop the printer from the HP Device Manager or from CUPS

Configure the printer for a static IP address

Add the printer to HP Device Manager or CUPS again

Afterwards the HP Device Manager should indicate that all is well.

When you run scanimage now, you may still get a bunch of "Cannot adopt OID" warnings, but it should still work and hopefully you can make a scan from the command line. You should *not* get the "Error during device I/O" message shown on the last line of the output above.
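
With that much warning noise it is easy to miss the one line that matters. Here is a small Python sketch that filters the MIB warnings out of captured scanimage output and checks for the I/O error. The prefixes are taken from the output shown above; the function names are made up, and other setups may print warnings this list does not cover.

```python
# Prefixes of the SNMP MIB warnings seen in the output above.
MIB_NOISE = (
    "MIB search path:",
    "Cannot find module",
    "Did not find",
    "Unlinked OID",
    "Undefined identifier",
    "Cannot adopt OID",
)


def significant_lines(output):
    """Drop blank lines and SNMP MIB warnings, keep everything else."""
    return [
        line for line in output.splitlines()
        if line.strip() and not line.startswith(MIB_NOISE)
    ]


def device_io_failed(output):
    """True if the remaining lines contain the device I/O error."""
    return any("Error during device I/O" in line
               for line in significant_lines(output))
```

To use it, capture what scanimage prints (for example by redirecting its output to a file) and pass the text to device_io_failed.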

You need to have "scanimage" working at this point to continue.

Let's scan some stuff.

Installing the Scripts

You can get the scripts from my git repo here. Note, you need Python 2.x to run them. Clone the repo and then find your file manager below:

Nautilus: Use the master branch of the scripts and copy everything into your ~/.gnome2/nautilus-scripts/ directory.
Thunar: Use the master branch of the scripts. I'm pretty sure you have to edit the "Custom Actions" in the Edit menu and add each script manually. I tested it and it worked, but it was in Arch Linux and I've since gone back to Mint. So I don't have screenshots of that configuration but it wasn't too difficult.
Nemo: Use the nemo branch of the repo and copy everything into your ~/.local/share/nemo/scripts directory (this replaces the older ~/.gnome2/nemo-scripts/ location).
Other: ??? The scripts are pretty generic, so they should work with almost any extendable file manager.
In order for pdf conversions to work properly in newer (2018+) versions of convert, you need to change the policy file in /etc/ImageMagick-6/policy.xml. See here.
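
For what it's worth, the change people usually make is to give the PDF coder read and write rights. The sketch below shows that edit on the text of the policy file. It assumes the stock entry looks like <policy domain="coder" rights="none" pattern="PDF" />, which is what I would expect on Ubuntu-style installs but may differ on yours; the function name allow_pdf is made up. Editing the real file needs root.

```python
import re

# Assumed form of the stock entry:
#   <policy domain="coder" rights="none" pattern="PDF" />
PDF_POLICY = re.compile(
    r'(<policy\s+domain="coder"\s+rights=")none("\s+pattern="PDF"\s*/>)'
)


def allow_pdf(policy_xml):
    """Return the policy text with the PDF coder set to read|write."""
    return PDF_POLICY.sub(r"\g<1>read|write\g<2>", policy_xml)


if __name__ == "__main__":
    entry = '<policy domain="coder" rights="none" pattern="PDF" />'
    print(allow_pdf(entry))
```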

Workflow

This example uses four double-sided pages
Pages are numbered (front/back) 1/2, 3/4, 5/6, 7/8.
Load the documents into the scanner in order, with page one facing up (page one will be scanned first), and run the script "Scan sheet feeder documents to PDF"
The documents will now be face down, with the last page (8) at the top. Grab the stack of documents as is, rotate it 180 degrees, and put it into the sheet feeder again.
The sheets are loaded exactly as they came out of the scanner: the back of the last page should be at the top, facing up. Run the script "Scan sheet feeder documents to PDF" again.
The sheets are now in the original order. You are done handling the paper.
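
If the paper shuffling is hard to picture, here is a tiny Python simulation of the two passes for the four-sheet example. It is only an illustration (the function name scan_passes is made up), but it shows the order in which the pages reach the scanner.

```python
def scan_passes(sheets):
    """sheets: list of (front_page, back_page) tuples in reading order."""
    # First pass: the fronts feed in reading order.
    first_pass = [front for front, _ in sheets]
    # Second pass: the stack goes back in as it came out,
    # so the back of the last sheet feeds first.
    second_pass = [back for _, back in reversed(sheets)]
    return first_pass, second_pass


sheets = [(1, 2), (3, 4), (5, 6), (7, 8)]
first, second = scan_passes(sheets)
print(first)   # [1, 3, 5, 7]
print(second)  # [8, 6, 4, 2]
```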

Postprocessing

Now we need to duplex the output (put the fronts and backs together) and concatenate the resulting documents (put them together).

Duplex the front and back pages of the output:

Select all of the pages you scanned and create duplex PDFs (this will put together pages 1+2, 3+4, 5+6, 7+8)

Finally, concatenate the duplexed pages into complete documents, as desired:

Select some or all of the above duplexed documents and concatenate into single documents as necessary.

If you're confused, watch this:

Why so many steps?

One of the goals was to be able to take a huge stack of receipts, statements, whatever documents, and scan them all at once. When finished there is some tedious renaming, but for me this is easier than manual filing in a file cabinet.

Notes:

If you only scan a bunch of single-sided documents, you can skip the duplex step and just concatenate all of the pages immediately, or do nothing and you have a bunch of single sided scans.
If you have a bunch of mismatched documents, as I described, you can select three or four of your duplexed documents, and concatenate only those if you wish.

Scan some stuff

In Nautilus (or Nemo or Thunar), navigate to the folder where you want the scans to go. From here right click and select Scripts --> 1-Scan_sheet_feeder_docs_to_PDF. It may take a few seconds, but your scanner should fire up and start scanning. The files will first be scanned to .tiff, then compressed to .jpg, then converted to .pdf. I took the extra step of .jpg compression to reduce the file size before converting to .pdf. The .tiff may be 24 MB in size while a .jpg of the same file might only be 600 KB. The extra step is worth it.

Once the files are scanned, you'll have a bunch of files in the destination directory like so:
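
To give a rough idea of what happens under the hood, here is a Python sketch that builds the kind of command lines involved. It is not lifted from my scripts. The --source ADF and --resolution options depend on the scanner backend, and the resolution and JPEG quality values are placeholders, so treat all of those as assumptions (scanimage --help lists what your device accepts).

```python
import os


def scan_command(stamp, resolution=300):
    """Batch-scan from the sheet feeder into scan<stamp>-<n>.tiff files."""
    return [
        "scanimage",
        "--format=tiff",
        "--batch=scan%d-%%d.tiff" % stamp,  # %d is filled in by scanimage
        "--source", "ADF",                  # assumed backend option
        "--resolution", str(resolution),    # assumed backend option
    ]


def conversion_commands(tiff_path, quality=75):
    """TIFF -> JPEG (to shrink the file), then JPEG -> PDF."""
    base, _ = os.path.splitext(tiff_path)
    jpg_path = base + ".jpg"
    pdf_path = base + ".pdf"
    return [
        ["convert", tiff_path, "-quality", str(quality), jpg_path],
        ["convert", jpg_path, pdf_path],
    ]


if __name__ == "__main__":
    print(" ".join(scan_command(1409007022)))
    for command in conversion_commands("scan1409007022-1.tiff"):
        print(" ".join(command))
```

Each list can be handed to subprocess.check_call as is, which avoids any shell quoting problems with odd file names.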

scan1409007022-1.pdf
scan1409007022-2.pdf
scan1409007022-3.pdf
scan1409007022-4.pdf
scan1409007022-5.pdf
scan1409007022-6.pdf
scan1409007851-1.pdf
scan1409007851-2.pdf
scan1409007851-3.pdf
scan1409007851-4.pdf
scan1409007851-5.pdf
scan1409007851-6.pdf

The naming convention is scan(start_time)-(number).pdf.

The first six scans are all of the front sides of the pages, the second six scans are the back sides.

The front side files are all numbered scan1409007022 while the back side files are all numbered scan1409007851.
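
Splitting a folder listing into the two batches is straightforward given that naming convention. A minimal Python sketch, assuming the file names look exactly like the ones above (the function name group_scans is made up):

```python
import re
from collections import defaultdict

SCAN_NAME = re.compile(r"^scan(\d+)-(\d+)\.pdf$")


def group_scans(filenames):
    """Group scan files by start time; earlier batches come first."""
    groups = defaultdict(list)
    for name in filenames:
        match = SCAN_NAME.match(name)
        if match:
            stamp, number = int(match.group(1)), int(match.group(2))
            groups[stamp].append((number, name))
    # Sort batches by start time and files within a batch by page number.
    return [[name for _, name in sorted(groups[stamp])]
            for stamp in sorted(groups)]


if __name__ == "__main__":
    names = ["scan1409007851-2.pdf", "scan1409007022-1.pdf",
             "scan1409007851-1.pdf", "scan1409007022-2.pdf"]
    fronts, backs = group_scans(names)
    print(fronts)
    print(backs)
```

Sorting on the number as an integer matters once a batch has ten or more pages, since a plain text sort would put -10 before -2.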

Because you scan all of the front sides first, then flip the stack over and scan the back sides, the back sides come out in reverse order. The files therefore need to be duplexed in a special sequence (don't worry, I handle this automatically!):

front side file #1 + back side file #6
front side file #2 + back side file #5
front side file #3 + back side file #4
front side file #4 + back side file #3
front side file #5 + back side file #2
front side file #6 + back side file #1
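
In code, that pairing is just the front list zipped against the reversed back list. A short Python sketch (the function name duplex_pairs is made up, and my script may go about it differently):

```python
def duplex_pairs(fronts, backs):
    """Pair the i-th front with the i-th back counted from the end."""
    if len(fronts) != len(backs):
        raise ValueError("front and back batches differ in length")
    return list(zip(fronts, reversed(backs)))


print(duplex_pairs([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6]))
# [(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)]
```

The length check is there because a misfeed in either pass would silently pair the wrong sides together.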

Now we just need to select all of the files you just scanned, and then right click, and select scripts --> 2-Make_Duplex_PDFs. The script will combine the files as described above.

Now you've got:

dplx-scan1409007022-01-06.pdf
dplx-scan1409007022-02-05.pdf
dplx-scan1409007022-03-04.pdf
dplx-scan1409007022-04-03.pdf
dplx-scan1409007022-05-02.pdf
dplx-scan1409007022-06-01.pdf
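
The output names follow from the two input names. Here is a Python sketch that builds the pdftk command for one front/back pair. The name pattern is inferred from the listing above, and the function name duplex_command is made up; pdftk's cat operation simply joins the inputs in the order given.

```python
import re

SCAN_NAME = re.compile(r"^scan(\d+)-(\d+)\.pdf$")


def duplex_command(front, back):
    """Build the pdftk call that joins one front page with its back page."""
    f = SCAN_NAME.match(front)
    b = SCAN_NAME.match(back)
    if not f or not b:
        raise ValueError("unexpected scan file name")
    # Front start time, then zero-padded front and back numbers.
    output = "dplx-scan%s-%02d-%02d.pdf" % (
        f.group(1), int(f.group(2)), int(b.group(2)))
    return ["pdftk", front, back, "cat", "output", output]


if __name__ == "__main__":
    print(" ".join(duplex_command("scan1409007022-1.pdf",
                                  "scan1409007851-6.pdf")))
```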

We're almost done!

At this point, if the stack of papers you scanned are all from the same document, just select them all, right click, and select 3-Concatenate_Selected_PDFs. The script will ask you if you want to rename the result, and then it will ask you if it should delete the files you've selected.

If the stack of papers you scanned are from DIFFERENT documents, you can select the group of duplexed documents that represent one document, and then concatenate those. Repeat until you've concatenated the files into the appropriate document.

For example:

dplx-scan1409007022-01-06.pdf
dplx-scan1409007022-02-05.pdf
dplx-scan1409007022-03-04.pdf

might be a T-Mobile bill, while:

dplx-scan1409007022-04-03.pdf
dplx-scan1409007022-05-02.pdf

...is a bank statement and:

dplx-scan1409007022-06-01.pdf

...is a cable bill.

You proceed as above, select the documents that belong together, and then right click, select scripts, then 3-Concatenate_Selected_PDFs.
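
For the curious, the concatenation itself boils down to one pdftk call per document. A Python sketch, with a made-up function name and a made-up output file name:

```python
def concatenate_command(selected, output_name):
    """Build the pdftk call that joins the selected PDFs into one file."""
    if not selected:
        raise ValueError("nothing selected")
    # The zero-padded dplx names sort into page order as plain text.
    return ["pdftk"] + sorted(selected) + ["cat", "output", output_name]


if __name__ == "__main__":
    bill = ["dplx-scan1409007022-02-05.pdf",
            "dplx-scan1409007022-01-06.pdf",
            "dplx-scan1409007022-03-04.pdf"]
    # "tmobile-bill.pdf" is a hypothetical name for the result.
    print(" ".join(concatenate_command(bill, "tmobile-bill.pdf")))
```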

And that's it.

I've added some additional scripts that I find useful:

4-Rotate_Selected_PDFs
5-Scan_flatbed_docs_to_PDF
6-Scan_flatbed_photo_to_JPG

They pretty much do what they say.

Tuesday, March 31, 2015
