Scripts for book scanning

I had the pleasure of using Noisebridge’s book scanner to “rip” a couple of rare or unusual books I had lying around. So far as I could tell, none of these had been digitized yet, so that’s kind of exciting. Here’s a picture of me at work, taken by Maira.


Anyway, the scanner is really great, and a work in progress, so for now all it does is dump the photographs—high res images taken by DSLRs affixed to the machine—as sequentially numbered JPEGs in a directory on an attached computer. For a 600 page book, that’s over a gigabyte of photos, and not terribly useful as a digitized book.

So I used ScanTailor, which is a great piece of free software, to take a page that looked like the one on the left and (basically running on autopilot) output one like that on the right.


But then I had a directory full of TIFFs, which looked great but were not as portable as a PDF or OCRed to be treated like text. With a little help from smart people like Eric and Seth, I was able to write a script to loop through that directory and output a big PDF, and another to run the images through Tesseract and output an OCRed text file.

These are really really simple scripts (which is good because I’ve got no idea what I’m doing), but in case those are useful to anybody I’ve put them up on GitHub (and even released those three-line loops into the public domain, ha.)

The next step, I think, is manual proofreading. Unfortunately, I don’t think that can be scripted.


  1. Posted 4 December, 2013 at 11:58 | Permalink

    Congrats on getting those books scanned! That *is* super exciting!

  2. Posted 20 January, 2014 at 21:18 | Permalink

    Just in case you wanted to off load this work to us, here is the book scanning link:

One Trackback

Post a Comment

Your email is never published nor shared. Required fields are marked *

You may use these HTML tags and attributes <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>