PDF scanning and processing

Goals and assumptions

We want these to be “definitive” scans (i.e. we do not want to have to return to the paper copies and scan them again).
Disk space is cheap and networks are getting faster. Hence we emphasize retaining detail over minimizing file size (though we do strive to make files as small as possible).
The scan should faithfully reproduce the original. For example, we try to avoid rotated pages in the final PDF.
The PDFs should be reasonably searchable. Hence we perform optical character recognition (OCR) on the files. To preserve the original, we save the OCR text as an invisible layer beneath the page images. The goal of the OCR is to permit reasonable searching (e.g. for species names), not to produce a perfect transcription. Therefore we use the best OCR software we have been able to find, but do not attempt to correct recognition errors.

Procedures

Enter bibliographic data
- It is important to enter bibliographic data first whenever possible, since the scan files can then be named using their bibliographic ID number from the outset. This number will ultimately be used to link the PDF to the bibliographic metadata on the web. Using that number as the filename from the beginning saves a lot of renaming and confusion.
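As a rough illustration of the naming idea, the following Python sketch builds a scan filename from a bibliographic ID number. The function name `scan_filename`, the six-digit zero padding, and the optional suffix are assumptions made for this example only; the real ID format is whatever the bibliographic database assigns.

```python
def scan_filename(biblio_id: int, suffix: str = "") -> str:
    """Return a PDF filename based on a bibliographic ID number.

    The six-digit zero padding and the optional suffix (e.g. "_plates"
    for separately scanned plates) are hypothetical conventions, not
    part of the documented procedure.
    """
    if biblio_id <= 0:
        raise ValueError("bibliographic ID must be a positive integer")
    return f"{biblio_id:06d}{suffix}.pdf"


print(scan_filename(1234))             # 001234.pdf
print(scan_filename(1234, "_plates"))  # 001234_plates.pdf
```

Deriving every filename from the ID in one place means the text scan, the plate scan, and the final PDF can all be matched back to the same bibliographic record without renaming.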
Scan documents
- Scan the text and line drawings in monochrome at high contrast, 400 dpi, on one of:
  - sheet-feed scanners
  - flatbed copier/scanners
- Scan plates and images in greyscale or color, 600 dpi, on one of:
  - sheet-feed scanners
  - flatbed option on sheet-feed scanner
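To give a feel for what these resolutions mean in raw data, here is a small Python sketch that computes the pixel dimensions and uncompressed size of one scanned page. The page size (US Letter, 8.5 x 11 inches) and the bit depths (1 bit for monochrome, 8 for greyscale, 24 for color) are assumptions for the example; compressed PDF files will be far smaller than these raw figures.

```python
def raw_page_size(width_in: float, height_in: float, dpi: int, bits_per_pixel: int):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return width_px, height_px, total_bits // 8


# Assumed page size: US Letter, 8.5 x 11 inches.
for label, dpi, bits in [
    ("text, monochrome", 400, 1),
    ("plate, greyscale", 600, 8),
    ("plate, color", 600, 24),
]:
    w, h, size = raw_page_size(8.5, 11, dpi, bits)
    print(f"{label}: {w} x {h} px, about {size / 1e6:.1f} MB uncompressed")
```

Under these assumptions a monochrome text page is 3400 x 4400 pixels, while a 600 dpi greyscale plate is 5100 x 6600 pixels and needs roughly eighteen times as much raw data, which is why only plates and images are scanned at the higher setting.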
Use Acrobat to assemble a complete PDF (More detail)
This step requires the “full” or “Pro” version of Acrobat, not just the Reader. Acrobat can also combine the text and picture pages from different files, if needed. At this stage, rotate pages so that all text is horizontal (to make OCR processing easier).
Optical character recognition
- Use ABBYY FineReader Corporate Edition to do OCR on the files.
  FineReader does a much more accurate job than the OCR that is built into Acrobat. The "Corporate Edition" includes a batch processor that can go through entire directories of files, reading input PDFs and generating a series of output PDFs that have the OCR layer added invisibly beneath the page images.
- Use FineReader to split double pages.
  If the scanning captured two pages of the original document (e.g. a photocopied book) on a single PDF page, FineReader can split each such image into two single pages (prior to the actual OCR run).
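The geometry of splitting a double page can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of how FineReader works internally: the function name `split_double_page` is hypothetical, and the sketch assumes the gutter between the two pages runs vertically through the horizontal center of the image unless a different position is given.

```python
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def split_double_page(width: int, height: int,
                      gutter: Optional[int] = None) -> Tuple[Box, Box]:
    """Return crop rectangles for the left and right halves of a double page.

    If no gutter position is given, the image is assumed to split at its
    horizontal center.
    """
    if gutter is None:
        gutter = width // 2
    if not 0 < gutter < width:
        raise ValueError("gutter must lie inside the image")
    left = (0, 0, gutter, height)
    right = (gutter, 0, width, height)
    return left, right


# A double-page scan 6800 px wide and 4400 px tall.
left_box, right_box = split_double_page(6800, 4400)
print(left_box)   # (0, 0, 3400, 4400)
print(right_box)  # (3400, 0, 6800, 4400)
```

In practice the gutter is rarely exactly centered, so the result of any automatic split should be checked by eye before the OCR run.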
Final processing with Acrobat (More detail for Mac OS X and Windows)
- Rotate pages that are “sideways” so that all pages are now upright, as in the original document.
- Crop pages to exclude extraneous margins.
- Number pages in larger books and monographs, where it is helpful to have the PDF page numbers match the book page (and plate) numbers. This step also serves as an easy way to check for missing pages: if a page is missing, the Acrobat page numbers will stop matching the page numbers printed in the scanned document.
- Embed all thumbnails to have the tiny page images used for navigation pre-stored in the PDF file (the increase in file size is negligible).
- “Optimize” the PDF to minimize file size and make sure that the internal indexing is complete. This is done by selecting Advanced / PDF Optimizer... with appropriate settings (for Mac OS X Acrobat 6 or for Windows Acrobat 7).
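The missing-page check mentioned in the page-numbering step can also be expressed as a short Python sketch. It assumes that the printed page numbers found in the scan have already been collected into a list (by hand or otherwise); the function name `missing_pages` and the sample data are hypothetical.

```python
def missing_pages(scanned, first, last):
    """Return the page numbers in first..last that do not appear in scanned."""
    present = set(scanned)
    return [page for page in range(first, last + 1) if page not in present]


# Printed page numbers found in a scan of a document that runs from 1 to 10.
pages_found = [1, 2, 3, 5, 6, 9]
print(missing_pages(pages_found, 1, 10))  # [4, 7, 8, 10]
```

An empty result means every expected page number was found; it does not prove that unnumbered plates or inserts are all present.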
Rate the quality of the PDF
PDF files (particularly ones that are donated to us) are of variable quality and completeness. We rate the PDF quality so that some indication can be provided to web users and so that we can prioritize rescanning poor PDFs.
Move files to web server
The final step is to move the PDF to the web server. Discussion of the whole web infrastructure that manages references, PDFs, and full-text searches is (as they say) beyond the scope of this document.