PDF scanning and processing

Goals and assumptions


  1. Enter bibliographic data
    • It is important to enter bibliographic data first whenever possible, since the scan files can then be named using their bibliographic ID number from the outset. This number will ultimately be used to link the PDF to the bibliographic metadata on the web. Using that number as the filename from the beginning saves a lot of renaming and confusion.
  2. Scan documents
  3. Use Acrobat to assemble a complete PDF (More detail)
    This requires the “full” or “Pro” version of Acrobat, not just the Reader. Acrobat can also combine text pages and picture pages that were scanned into different files, if needed. At this stage, rotate pages so that all text is horizontal, which makes OCR processing easier.
  4. Optical character recognition
    • Use ABBYY FineReader Corporate Edition to do OCR on the files.
      FineReader does a much more accurate job than the OCR that is built into Acrobat. The “Corporate Edition” includes a batch processor that can go through entire directories of files, reading input PDFs and generating a matching series of output PDFs in which the OCR text layer is added invisibly beneath the page images.
    • Use FineReader to split double pages.
      If each scan captured two facing pages of the original document (e.g. a photocopied book) on a single PDF page, FineReader can split the images into single pages before the actual OCR run.
  5. Final processing with Acrobat (More detail for Mac OS X and Windows)
    • Rotate any pages that are still “sideways” so that all pages are upright, as in the original document.
    • Crop pages to exclude extraneous margins.
    • Number pages in larger books and monographs, where it is helpful to have the PDF page numbers match the book page (and plate) numbers. This step also serves as an easy check for missing pages: wherever a page is missing, the Acrobat page numbers stop matching the page numbers printed in the scanned document.
    • Embed all thumbnails so that the small page images used for navigation are stored in the PDF file itself (the increase in file size is negligible).
    • “Optimize” the PDF to minimize file size and make sure that the internal indexing is complete. This is done by selecting Advanced / PDF Optimizer... with appropriate settings (for Mac OS X Acrobat 6 or for Windows Acrobat 7).
  6. Rate the quality of the PDF
    PDF files (particularly ones that are donated to us) are of variable quality and completeness. We rate the PDF quality so that some indication can be provided to web users and so that we can prioritize rescanning poor PDFs.
  7. Move files to web server
    The final step is to move the PDF to the web server. Discussion of the whole web infrastructure that manages references, PDFs, and full-text searches is (as they say) beyond the scope of this document.
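
Two of the steps above lend themselves to small helper scripts: naming scan files by bibliographic ID (step 1) and checking for missing pages (step 5). The following Python sketch is only an illustration, not part of the workflow described here. It assumes that the filename is simply the ID number followed by ".pdf" and that the printed page numbers have already been typed in or extracted as a list of integers; the function names scan_filename, misnamed_scans, and missing_pages are hypothetical.

```python
from pathlib import Path


def scan_filename(bib_id: int) -> str:
    """Build a scan filename from a bibliographic ID.

    Assumed convention (not stated in the procedure): bare ID plus ".pdf".
    """
    if bib_id <= 0:
        raise ValueError("bibliographic ID must be a positive integer")
    return f"{bib_id}.pdf"


def misnamed_scans(directory: Path) -> list[Path]:
    """List PDFs in a directory whose name is not purely a number.

    These are the files that would still need renaming before they can be
    linked to their bibliographic record.
    """
    return sorted(p for p in directory.glob("*.pdf") if not p.stem.isdigit())


def missing_pages(printed_numbers: list[int]) -> list[int]:
    """Return page numbers absent between the lowest and highest number seen."""
    if not printed_numbers:
        return []
    present = set(printed_numbers)
    return [n for n in range(min(present), max(present) + 1) if n not in present]


if __name__ == "__main__":
    print(scan_filename(1234))                    # 1234.pdf
    print(missing_pages([1, 2, 3, 5, 6, 8]))      # [4, 7]
```

A gap reported by missing_pages only flags a page to look at by hand: unnumbered plates or blank pages in the original would also show up as gaps.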