Bookscanner

From Wiki-Fou
Revision as of 23:42, 30 October 2016 by Maxigas (talk | contribs) (Copy workflow PDF to wiki page)

There are many ways to scan books; this is how we do it. :)

maxigas explained the workflow to different people a number of times and decided to put together a how-to based on those conversations. There should be a biblio-graphy at the end of this document…

Scanning

The amount of work in the postproduction phase depends on the quality of the images you make in the scanning phase!

  1. Set up the cameras. This is the most important part.

Caveats:

  • the camera should look at the page at a right angle
  • the whole page should be in the image
  • camera settings: fully automatic, perhaps with manual focus
  • back up and empty the SD cards in the cameras
  • the most subtle mistake: one camera sees the letters bigger than the other camera does
  2. Push the big button on the scanner to scan.
  • you may have to press a finger on the side of the plexiglass closest to you when it is “down”, because the plexiglass does not always sit at exactly the same angle as the book pages
  3. Download the images from the SD cards and put the scanner to sleep.
  • from the camera on the left, copy the images to a folder called “odd”
  • from the camera on the right, copy the images to a folder called “even”
  • upload the two folders to the ftp://seldon.calafou/HackTheBiblio/scanning/$bookname--$yourname/ folder
  • remember to delete the pictures from the SD cards, put the cards back into the cameras, and maybe put the camera batteries on charge
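
The download step above could be sketched as a small bash function. This is only an illustration, not part of the original workflow: the function name collect_scans and the card mount points are assumptions, and the count check assumes that both cameras fire on every button press.

```shell
#!/bin/bash
# Hypothetical helper: copy the photos from the two SD cards into the
# "odd" and "even" folders.
# Usage: collect_scans LEFT_CARD_DIR RIGHT_CARD_DIR DEST_DIR
collect_scans() {
    local left_card right_card dest odd_count even_count
    left_card=$1
    right_card=$2
    dest=$3
    mkdir -p "$dest/odd" "$dest/even" || return 1
    # Left camera goes to "odd", right camera goes to "even".
    find "$left_card" -type f -iname '*.jpg' -exec cp {} "$dest/odd/" \;
    find "$right_card" -type f -iname '*.jpg' -exec cp {} "$dest/even/" \;
    # $(( ... )) strips the padding some wc implementations add.
    odd_count=$(( $(find "$dest/odd" -type f | wc -l) ))
    even_count=$(( $(find "$dest/even" -type f | wc -l) ))
    echo "odd: $odd_count images, even: $even_count images"
    # Assumption: both cameras fire together, so the counts should match.
    if [ "$odd_count" -ne "$even_count" ]; then
        echo "warning: the two folders hold different numbers of images" >&2
    fi
}
```

It might be called like this, with the two mount points replaced by wherever the cards show up on your machine: collect_scans /media/left-card /media/right-card "$bookname--$yourname". A mismatch in the counts usually means one camera missed a shot, which is cheaper to notice now than in postproduction.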

Postproduction

You start with two folders containing files with names like IMG_1234.JPG.

The basic workflow is like this:

  1. [program] ➔ [output]
  2. gprename ➔ 1.jpg, 2.jpg, …
  3. scantailor ➔ 1.tif, 2.tif, …
  4. tesseract ➔ 1.pdf, 2.pdf, …
  5. pdftk ➔ book.pdf
  6. calibre ➔ book.epub
  7. libgen.org ➔ http://libgen.org/book/index.php?md5=B6916395FDE00D91DB4F52DCB8F069BF
  8. etc.
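
The gprename step in the workflow above could also be scripted. The following is a minimal sketch, not a tested replacement: the function name number_pages and the output folder "pages" are hypothetical, and it assumes the camera file names sort in shooting order (which holds for IMG_1234.JPG style names as long as the camera counter did not wrap around).

```shell
#!/bin/bash
# Hypothetical helper: copy the images of one folder to DEST_DIR as
# START.jpg, START+2.jpg, START+4.jpg, ...
# Usage: number_pages SRC_DIR START DEST_DIR
number_pages() {
    local src n dest f
    src=$1
    n=$2
    dest=$3
    mkdir -p "$dest" || return 1
    # The glob expands in sorted name order.
    for f in "$src"/*.[Jj][Pp][Gg]; do
        [ -e "$f" ] || continue      # no matching files at all
        cp -- "$f" "$dest/$n.jpg" || return 1
        n=$((n + 2))                 # leave room for the other folder
    done
}
```

Running number_pages odd 1 pages and then number_pages even 2 pages would interleave the two folders into pages/1.jpg, pages/2.jpg, … The sketch copies instead of renaming so that the original camera files stay untouched if something goes wrong.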

There are some bash one-liners which can be useful (on Debian-based systems):

  1. Install the above-mentioned programs:

    sudo apt-get install gprename scantailor tesseract-ocr \
          tesseract-ocr-eng tesseract-ocr-spa pdftk calibre
  2. FIXME: we can probably write a script to rename the files properly. In gprename, select the “Numerical” tab and set start = 1 for right pages and 2 for left pages, always with step = 2.

  3. You can rotate the images appropriately (which is called “fix orientation” in scantailor) in the left/right folders before you import them. This is probably faster than doing it in scantailor.

    sudo apt-get install imagemagick
    cd left
    mogrify -verbose -rotate 270 *
    cd ../right
    mogrify -verbose -rotate 90 *
  4. Runs Optical Character Recognition (OCR) on all images in the folder:

    time for i in *tif; do b=`basename "$i" .tif`; tesseract -l spa "$i" "$b" pdf; done

    for i in *.tif; do /usr/local/bin/tesseract "$i" "`basename "$i" .tif`" -l spa pdf; done

  5. Merges all the PDF files in the folder into one single file:

    pdftk *pdf cat output book.pdf
  6. Exports the PDF metadata to a text file, to edit:

    pdftk book.pdf dump_data output report.txt
  7. Imports the metadata of report.txt back into the PDF:

    pdftk book.pdf update_info report.txt output bookcopy.pdf
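
One thing to watch in the merge step: the shell expands *pdf in lexical order, so with names like 1.pdf, 2.pdf, … 10.pdf the file 10.pdf comes before 2.pdf, and an existing book.pdf in the same folder would be picked up as well. A possible workaround, sketched here with a hypothetical helper called numbered_pdfs, is to list only the numbered files in numeric order:

```shell
#!/bin/bash
# Hypothetical helper: print the page PDFs of the current folder
# (1.pdf, 2.pdf, ...) one per line, in numeric order.
numbered_pdfs() {
    # If nothing matches, the glob stays literal and grep drops it.
    printf '%s\n' [0-9]*.pdf | grep -E '^[0-9]+\.pdf$' | sort -n
}
```

The merge could then be written as pdftk $(numbered_pdfs) cat output book.pdf. Leaving $(numbered_pdfs) unquoted is safe here only because the helper prints names made of digits and ".pdf", so there is nothing for the shell to split or expand.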

Distribution

Think about how people who would be interested in this book could know about it!

Repositories:

You may consider spreading the word on relevant mailing lists, social media, etc.

Biblio-graphy

About our book scanner

English

Spanish

Principal sources

Reading And Leading With One Laptop Per Child