PDF for Academics

I have often criticized how badly PDF scans are handled in academia. Perhaps it’s time that I document those critiques more seriously and offer a few suggestions on how to have a happy PDF experience.

First of all, avoid scanning if you can. If the resource is available as a web page or as a PDF provided by the publisher, using that will save you the time and effort of scanning and give your students a better experience. Also, if you have a cooperative ILL librarian or other library staff who will scan for you, let them do it. They have experience making professional-quality scans.

Second, scan one page at a time. It takes longer, but your students are then more likely to read the document on an electronic device instead of printing it out, which makes environmental and economic sense. It also spares them the time of printing the pages and the frustration of constantly panning around on a PC or tablet. It may be tempting to scan more than one page at a time, but it just isn’t worth it, unless of course you intend to split the pages electronically later.

This brings me to my next point, in sort of an odd way. If you are scanning a page at a time, scan your document straight. This is a special annoyance for me, as I have received documents scanned at a 35° angle. It’s annoying, messy and unprofessional. Just don’t do it. If you can’t get your book or document straight with one page at a time, then put the whole thing on the platen and select the area that you need for each page.

Finally, crop it. Do not scan the entire platen and then distribute the raw result. Either before or after scanning, use the scanner software to select the region of the document you wish to save.

When you have finished scanning, check the pages and ensure that you have actually gotten all of the text you meant to get. Are there pages missing? Does the text fall off into the dark area at the inside edge, or perhaps fall off the edge of the scan?

Now, apart from lost words or lost pages, anything above that you failed to do can be corrected afterward. You must use Adobe Acrobat Pro (not Reader, but the full professional Acrobat application). Most academic libraries have access to Adobe Acrobat, and some departments do, but the school pays for each license, so it is unlikely you will be able to have it on your personal office computer (unless you have other needs for it).

You can do a lot from the tools pane in Acrobat.

OCR is both the first and the last thing to do. Select “Recognize Text, In this document” from the tools pane. Use this not only to provide machine-readable text from your document, but also to straighten it. This tool will run through and detect text on each page, store it in the file’s background data and then align the page so that the text appears straight (on average). It is a good idea to do this at the beginning to straighten all of the pages for cropping, and then again at the end to ensure that the spatial mapping of the recognized text is correct.

The other task that needs to be done, if it wasn’t done during scanning, is to crop the pages. Use Pages > Crop. This option will allow you to select a region of the page to keep and discard the rest. Turn on the crop tool, select the region, then press Enter to finish the process.
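If you would rather crop by script than by hand, the same idea can be sketched in Acrobat JavaScript. This is only a rough sketch under a few assumptions: that page boxes come back as [left, top, right, bottom] in points, and that you want the same margin trimmed from every edge of every page. The helper names insetBox and cropAllPages are my own inventions; getPageBox and setPageBoxes are the document methods the script relies on.

```javascript
// Sketch only: shrink a page box by the same margin on every side.
// Assumes boxes are [left, top, right, bottom] in points, with top > bottom.
function insetBox(box, margin) {
  var left = box[0] + margin;
  var top = box[1] - margin;
  var right = box[2] - margin;
  var bottom = box[3] + margin;
  if (right <= left || top <= bottom) {
    throw new Error("Margin is too large for this page");
  }
  return [left, top, right, bottom];
}

// Apply the same inset crop to every page of `doc`.
// `doc` is expected to behave like an Acrobat document object
// (numPages, getPageBox, setPageBoxes); `margin` is in points (72 per inch).
function cropAllPages(doc, margin) {
  for (var p = 0; p < doc.numPages; p++) {
    var media = doc.getPageBox("Media", p);
    doc.setPageBoxes("Crop", p, p, insetBox(media, margin));
  }
}
```

Run from an action or the JavaScript console, something like cropAllPages(this, 36) should trim half an inch from each edge. Real scans rarely have even margins, so treat the single margin value as a starting point rather than a rule.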

If you scanned double pages, you will need to break them up. Start with the even pages (the right-hand halves, which end up in the even positions of the finished file). Select the even side of the first page in the document, press Enter, apply the crop to all pages in the document (even and odd), then save the document (usually as DocumentName_even.pdf). Once it is saved, undo the crop (Edit > Undo). Now select the odd (left) side of the page and again apply the crop to all pages.

You may now rejoin the even pages to the document by selecting Pages > Insert from file. Select the _even.pdf file, and add the pages AFTER the LAST page in the document. Finally, move the pages into the correct order. You can turn on the thumbnail panel and drag them manually, or you can set up an automated task (see end of post for code) to do this.

To do it by automation, you will first need to set up a task by selecting “action wizard” and then “create a new action”. From the action creation wizard, select “more tools”, then “execute javascript”. Select “options” to insert the code. You only need to do this once; after that you can simply reuse the existing action. Note that the code only works for documents with an even number of pages. If your document has an odd number of pages, go to the Pages tool, Insert pages, more insert options, and add a blank page AFTER the LAST page in the document.
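If it is not obvious why moving the appended pages one by one restores the reading order, the moves can be tried out on a plain array first. The following is only a simulation, not something to paste into Acrobat: the movePage function here is my own stand-in that mimics moving one page to directly after another, and the labels L1, R1 and so on are made up to mark left and right halves.

```javascript
// Stand-in for a "move page nPage to directly after page nAfter" operation,
// acting on a plain array instead of a real PDF (indexes are 0-based).
function movePage(pages, nPage, nAfter) {
  var moved = pages.splice(nPage, 1)[0];
  // Removing an earlier page shifts the target one place to the left.
  var target = nPage < nAfter ? nAfter : nAfter + 1;
  pages.splice(target, 0, moved);
}

// Left-hand halves sit at the front, right-hand halves were appended
// at the end; each right half is moved in behind its left partner.
function interleave(pages) {
  var n = pages.length;
  var nOdd = Math.floor(n / 2); // left halves (odd positions in the result)
  var nEven = n - nOdd;         // right halves (even positions in the result)
  for (var i = 0; i < nEven; i++) {
    movePage(pages, nOdd + i, i * 2);
  }
  return pages;
}

var result = interleave(["L1", "L2", "L3", "R1", "R2", "R3"]);
console.log(result.join(" "));
```

For the six labels above this prints L1 R1 L2 R2 L3 R3. Each move only disturbs pages that come after the spot being filled, which is why the simple index arithmetic keeps working from one step to the next.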

After you have finished working on your document, go to File > Properties and give your document a title, author information and perhaps a few keywords. This metadata will make your file easier to locate and index. Once that’s done, save the file.
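If you process a lot of scans, the same properties can probably be filled in by script rather than through the dialog. This is a sketch that assumes the document object exposes a writable info object with Title, Author, Subject and Keywords properties, as the full Acrobat application does; applyMetadata is a helper name I made up, and the values in the usage line are placeholders.

```javascript
// Copy the supplied metadata fields onto doc.info, skipping anything
// that is missing or empty so existing values are not wiped out.
function applyMetadata(doc, meta) {
  var fields = ["Title", "Author", "Subject", "Keywords"];
  for (var i = 0; i < fields.length; i++) {
    var name = fields[i];
    if (typeof meta[name] === "string" && meta[name].length > 0) {
      doc.info[name] = meta[name];
    }
  }
  return doc.info;
}
```

In an action this might be called as applyMetadata(this, { Title: "Chapter 3", Author: "Example Author", Keywords: "course reading; scan" }), with your own values substituted.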

Code for putting a document back together:

(if the code below causes an error, use the version at https://cmkularski.net/special/Interleave.txt instead)

function InterleavePages() {
var n = this.numPages;
var nOdd = Math.floor(n / 2);
var nEven = n - nOdd;
var x;
var y;
var i;
for(i = 0; i < nEven; i++) {
                         // movePage x, toAfterPage y
                         // note page numbers are 0-indexed
    x = nOdd + (i);      //
    y = i * 2     ;      //