Bookscanner
There are many ways to scan; this is the current state of the art in Calafou.
Building a scanner
Here are two links to the public documentation of our scanner, built by Voja Antonic:
https://www.memoryoftheworld.org/blog/2012/10/28/our-beloved-bookscanner-2/
https://hackaday.io/project/5604-diy-book-scanner
We are trying to build something similar during the Kunlabora event in Calafou: https://calafou.org/en/content/kunlabora-ephimeral-projects-kooperative
Ideas for the building
The electronics are not really documented (which means that they are hard to reproduce) and they are built from basic parts (which means that it takes a lot of time to put them together). So we are trying to use an Arduino-based solution instead. Arduino is a general-purpose programmable microcontroller board that already has many of the functions/parts we need built in. The idea is that this makes it easier for us to build the scanner and for others to reproduce it. We also have more experience working with Arduino than with only basic electronic components.
Parts and sources
- 2 x digital cameras (see below)
- 2 x camera stand
- 1 x plexiglass sheet (bent 90 degrees in the middle?) from a company in Igualada that J. found.
For electronic parts a good shop in Barcelona is Diotronic: https://diotronic.com/
Cameras
Summary of research about cameras for book scanners:
Basically there are three categories of cameras that can be used for book scanners (from cheapest to most expensive).
1. Remote control support
The cheapest option is any camera with remote trigger support, so we can take pictures without pushing the button on the camera. This is important because when you press the button, the physical pressure may knock the camera out of position.
2. CHDK firmware
The middle category is cameras compatible with the CHDK firmware. CHDK is a third-party open source firmware that allows the customisation of cameras. It is for Canon PowerShot cameras, which are Canon's cheaper compact digital camera product line. We have 200 euros in the budget for cameras, so we will probably go with this option.
3. Magic Lantern support
Magic Lantern is a third-party open source firmware that is more advanced. However, it only works with Canon DSLR cameras (these are the cameras that have a reflex mirror so you can look at the shot through the viewfinder before you take the picture, and they usually have big lenses). The scanner we have now uses Canon 1100D cameras, which are the cheapest type supported by Magic Lantern, but they still cost a few hundred euros.
How to put them together
Scanning
The amount of work in the postproduction phase depends on the quality of the images you take in the scanning phase!
- Setting up the cameras: the most important part.
Caveats:
- the camera should face the page at a right angle
- the whole page should be in the image
- camera settings: full automatic, perhaps with manual focus
- back up and empty the SD cards in the cameras
- most subtle mistake: one camera sees letters bigger than the other camera
- Push the big button on the scanner to scan.
- maybe you have to put your finger to the side of the plexiglass which is closer to you when it is “down”, because the plexiglass is not always exactly the same angle as the book pages
- Download the images from the SD cards and put the scanner to sleep.
- from the camera on the left, copy the images to a folder called “odd”
- from the camera on the right, copy the images to a folder called “even”
- upload the two folders to the ftp://seldon.calafou/HackTheBiblio/scanning/$bookname--$yourname/ folder
- remember to delete the pictures from the SD cards and put the cards back into the cameras, and maybe put the camera batteries on charge
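Assuming the big button triggers both cameras at once, the two folders should normally hold the same number of pictures. A minimal sketch of a check that could be run before uploading, assuming the folders are called odd and even as above; the function names count_images and check_pairs are made up for this example:

```shell
#!/bin/sh
# Hypothetical helpers, not part of any existing tool.

# Count the JPEG files directly inside the folder given as $1.
count_images() {
    find "$1" -maxdepth 1 -type f \( -iname '*.jpg' -o -iname '*.jpeg' \) | wc -l | tr -d ' '
}

# Compare two folders; return 0 if they hold the same number of images,
# 1 otherwise.
check_pairs() {
    odd_count=$(count_images "$1")
    even_count=$(count_images "$2")
    if [ "$odd_count" -eq "$even_count" ]; then
        echo "OK: $odd_count images in each folder"
    else
        echo "MISMATCH: $1 has $odd_count images, $2 has $even_count"
        return 1
    fi
}

# Usage, from the folder that contains both subfolders:
#   check_pairs odd even
```

A mismatch usually means that one camera missed a shot, which is much easier to sort out before the files are renamed.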
Dependencies
Using an up-to-date Debian operating system, you can install the following programs for the postproduction steps:
- scantailor
- gprename
- pdftk
- tesseract-ocr
- tesseract-ocr-eng
- tesseract-ocr-spa
- calibre
You can install all these programs with the following invocation:
sudo apt install scantailor gprename pdftk tesseract-ocr tesseract-ocr-eng tesseract-ocr-spa calibre
Postproduction
You start with two folders with files like IMG_1234.JPG
The basic workflow is like this:
- [program] ➔ [output]
- gprename ➔ 1.jpg, 2.jpg, …
- scantailor ➔ 1.tif, 2.tif, …
- tesseract ➔ 1.pdf, 2.pdf, …
- pdftk ➔ book.pdf
- calibre ➔ book.epub
- libgen.org ➔ http://libgen.org/book/index.php?md5=B6916395FDE00D91DB4F52DCB8F069BF
- etc.
There are some bash one-liners which can be useful (on Debian-based systems):
FIXME: we can probably write a script to rename the files properly. For now, in gprename select the “Numerical” tab and set start = 1 for right-hand pages and start = 2 for left-hand pages, always with step = 2.
You can rotate the images appropriately (scantailor calls this “fix orientation”) in the left/right folders before you import them. I think this is faster than doing it in scantailor, although scantailor offers the same operation in a more user-friendly way.
sudo apt-get install imagemagick
cd left
mogrify -verbose -rotate 270 *
cd ../right
mogrify -verbose -rotate 90 *
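The renaming that gprename does by hand (start = 1 or 2, step = 2) could also be scripted. The following is only a sketch under a few assumptions: the camera file names sort in shooting order (as IMG_1234.JPG names normally do), the two source folders are called odd and even as in the download step, and the pictures are copied into a new folder called pages rather than renamed in place. The function name number_pages and the folder name pages are made up for this example.

```shell
#!/bin/sh
# number_pages SRC_DIR DEST_DIR START   (hypothetical helper)
# Copies the pictures in SRC_DIR, in file name order, to DEST_DIR as
# START.jpg, START+2.jpg, START+4.jpg, ...
number_pages() {
    src=$1
    dest=$2
    n=$3
    mkdir -p "$dest"
    for f in "$src"/*.[Jj][Pp][Gg]; do
        [ -e "$f" ] || continue   # the pattern matched nothing
        cp "$f" "$dest/$n.jpg"
        n=$((n + 2))
    done
}

# Usage, from the folder that contains odd/ and even/:
#   number_pages odd pages 1    # gives 1.jpg, 3.jpg, 5.jpg, ...
#   number_pages even pages 2   # gives 2.jpg, 4.jpg, 6.jpg, ...
```

Copying instead of moving keeps the original camera files untouched, so a wrong start number can be fixed by deleting the pages folder and running the two commands again.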
Does Optical Character Recognition (OCR) on all images in folder:
time for i in *.tif; do b=$(basename "$i" .tif); tesseract -l spa "$i" "$b" pdf; done
Merges all the pdf files in folder into one single file:
pdftk *pdf cat output book.pdf
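One thing to watch out for: the shell expands *pdf in alphabetical order, so if the pages are called 1.pdf, 2.pdf, … 10.pdf without leading zeros, page 10 ends up before page 2 in the merged book. A possible workaround is sketched below, assuming that the page files are named with digits only; the function names list_pages and merge_pages are made up for this example.

```shell
#!/bin/sh
# Hypothetical helpers for merging pages in numeric order.

# List the page PDFs of the current folder in numeric order (1, 2, ..., 10).
# Files whose names are not purely numeric, such as an already merged
# book.pdf, are left out.
list_pages() {
    ls | grep -E '^[0-9]+\.pdf$' | sort -n
}

# Merge the pages into the file named by $1.
merge_pages() {
    # $(list_pages) is unquoted on purpose: the names contain only digits
    # and ".pdf", so word splitting is safe here.
    pdftk $(list_pages) cat output "$1"
}

# Usage, inside the folder with the page PDFs:
#   merge_pages book.pdf
```

Running list_pages on its own first shows the order in which the pages will be merged.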
Exports the pdf metadata to a text file, to edit:
pdftk book.pdf dump_data output report.txt
Imports the metadata of report.txt back on the pdf:
pdftk book.pdf update_info report.txt output bookcopy.pdf
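Instead of editing the whole report.txt, it should also work to give update_info a small file that contains only the fields you want to set. The sketch below writes such a file for the title and the author. It assumes a reasonably recent pdftk that uses InfoBegin / InfoKey / InfoValue records in its dump_data output (compare with your own report.txt); the function name write_info and the file name info.txt are made up for this example.

```shell
#!/bin/sh
# write_info TITLE AUTHOR   (hypothetical helper)
# Prints two Info records in the syntax used by pdftk dump_data.
write_info() {
    printf 'InfoBegin\nInfoKey: Title\nInfoValue: %s\n' "$1"
    printf 'InfoBegin\nInfoKey: Author\nInfoValue: %s\n' "$2"
}

# Usage:
#   write_info "Title of the book" "Name of the author" > info.txt
#   pdftk book.pdf update_info info.txt output bookcopy.pdf
```

For titles with accents or other non-ASCII characters, pdftk also has dump_data_utf8 and update_info_utf8, which read and write the same records as UTF-8 text.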
Distribution
Think about how people who would be interested in this book could know about it!
Repositories:
- General “educational materials”: Library Genesis
- Public library, radical books: Memory of the World
- Academic radical: Aaaaarg
- Artist radical: Monoskop
- Anarchist (including fanzines): Anarchist Library
- There are many Zine Libraries you can find on the Internet…
You may consider spreading the word on relevant mailing lists, social media, etc.