Revision as of 19:30, 24 January 2019

There are many ways to scan; this is the current state of the art in Calafou. We use only free software and the documentation is written for Debian GNU/Linux, but it should work with small modifications on any Unix-based system running the bash shell. In some parts proprietary software such as ABBYY FineReader can be more effective. However, this workflow produces near-perfect books in PDF format that we are very happy with. One thing we could definitely improve is the size of the final PDF file, which is quite big (it can exceed 100 megabytes).

Scanning

The amount of work in the postproduction phase depends on the quality of the images you make in the scanning phase!

Setting up the cameras (calibration): the most important part.

open the book in the middle (at a central page with normal text on both sides)
the camera should look directly at the middle of the page, parallel to the cradle, at 45 degrees from the horizontal
the whole page should be in the image, but it is not a problem if things outside of the book are also visible
check if the pages fold/curve; if so, place something underneath to straighten them (like a sponge, or another book…)
camera settings: fully automatic, perhaps with manual focus
back up and empty the SD cards in the cameras
most subtle mistake: one camera sees letters bigger than the other camera (this can be a difference in the zoom level or the distance between camera and page)
use a post-it or similar to mark the exact position of the book in relation to the lower edge of the cradle, to ensure it remains in the same position throughout the scanning

Push the big button on the scanner to scan.

you may have to press a finger on the side of the plexiglass that is closer to you when it is “down”, because the plexiglass is not always at exactly the same angle as the book pages

Download the images from the SD cards and put the scanner to sleep.

from the camera on the left, copy the images to a folder called “odd”
from the camera on the right, copy the images to a folder called “even”
upload the two folders to the ftp://omnius.calafou/HackTheBiblio/scanning/$bookname--$yourname/ folder
remember to delete the pictures from the SD cards and put the cards back into the cameras, and maybe put the camera batteries on charge
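
Before uploading, it can help to check that both cameras produced the same number of pictures, since a missed shot on one side shifts every later page. The following is a minimal sketch, assuming the pictures sit directly inside the two folders with a .JPG or .jpg extension; the function names are made up for this example.

```shell
#!/usr/bin/env bash
# Count the JPG files directly inside a directory (extension matched case-insensitively).
count_images() {
    find "$1" -maxdepth 1 -type f -iname '*.jpg' | wc -l | tr -d ' '
}

# Print both counts and succeed only when the two folders hold the same number of pictures.
check_pair_counts() {
    local odd_count even_count
    odd_count=$(count_images "$1")
    even_count=$(count_images "$2")
    echo "odd: $odd_count, even: $even_count"
    [ "$odd_count" -eq "$even_count" ]
}
```

Call it as check_pair_counts odd even from the folder that contains both; a non-zero exit status means the counts differ and the pictures should be inspected before going on.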

Additional information on using the Marron scanner

Check before starting that the SD cards are locked: the external trigger that controls the cameras requires the SD cards to be locked. If they are not locked, the pictures are not saved when using the external trigger.
Camera settings: we use two IXUS 175 set to automatic with menu/lamp setting set to "off" to avoid the use of the red light
While taking pictures, if you need to check the last picture taken: long press the green play button to enter slideshow mode, long press the green play button to go back to picture mode (half pressure on the camera trigger also works)
If you decide to use the zoom of the camera (not the digital zoom), be careful not to turn off the camera or you will lose your zoom setting

Dependencies

Using an up-to-date Debian operating system, you can install the following programs for the postproduction steps:

scantailor
gprename
pdftk
tesseract-ocr
tesseract-ocr-eng
tesseract-ocr-spa
calibre

You can install all these programs with the following invocation from the command line (also called the terminal):

sudo apt install scantailor gprename pdftk tesseract-ocr tesseract-ocr-eng tesseract-ocr-spa calibre
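
To confirm that everything is in place before starting the postproduction, a small check like the one below can be used. It is only a sketch: the function name is made up, and the executable names are assumptions (the tesseract-ocr package is expected to provide tesseract, and calibre to provide ebook-convert).

```shell
#!/usr/bin/env bash
# Print the name of every given command that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed executable names for the packages listed above; prints nothing when all are installed.
missing_tools scantailor gprename pdftk tesseract ebook-convert
```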

Postproduction

You start with two folders such as odd and even with files like IMG_1234.JPG. It is not good to talk about right and left because it can be very confusing: are you talking about the image from the right camera that takes pictures of the left page of the book, or the image of the left page of the book that is from the right camera? On the other hand, odd (1, 3, 5, …) and even (2, 4, 6, …) are good words for describing what is on the image without ambiguity!

The basic workflow is like this:

[process] → [program] → [output]
Merge pictures from the two cameras → gprename → 1.jpg, 2.jpg, …
Edit the pictures to adjust contents → scantailor → 1.tif, 2.tif, …
Character recognition → tesseract → 1.pdf, 2.pdf, …
Create the pdf file → pdftk → book.pdf
Create the ebook → calibre → book.epub
Disseminate → libgen.org → http://libgen.org/book/index.php?md5=B6916395FDE00D91DB4F52DCB8F069BF
etc.

There are some bash one-liners which can be useful (on Debian-based systems):

gprename
- start gprename from a terminal
- go to the directory with the odd files
- select all files
- go to the numerical tab
- set starting number to 1 and increment by 2
- set the naming pattern
- repeat the operation for even files
- merge the two folders
- FIXME we can probably write a script to rename the files properly…
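
One possible answer to the FIXME above is sketched here. It assumes that the file name order inside each folder matches the order in which the pages were photographed (as with IMG_1234.JPG style names), and it copies rather than renames so the originals stay untouched. The function name and the merged folder are made up for this example.

```shell
#!/usr/bin/env bash
# Copy the pictures of one camera into a shared folder, numbering them
# start, start+2, start+4, ... in file name order.
number_pages() {
    local source_dir="$1" target_dir="$2" page="$3" image
    mkdir -p "$target_dir"
    for image in "$source_dir"/*.[Jj][Pp][Gg]; do
        [ -e "$image" ] || continue   # nothing matched: the folder has no pictures
        cp -- "$image" "$target_dir/$page.jpg"
        page=$((page + 2))
    done
}
```

Running number_pages odd merged 1 followed by number_pages even merged 2 should leave 1.jpg, 2.jpg, 3.jpg, … in the merged folder, which covers both the renaming and the merge step.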
scantailor

You can edit the captures appropriately with scantailor. It invites you to follow these steps:
- 1st step: Fix orientation. All odd pages need to be turned in one direction, while even pages need to be turned in the other direction.
Rotate image nr 1 and click on "apply to every other page". Then select image nr 2, rotate it in the opposite direction so that it is upright, and also click on "apply to every other page".
[Fix Orientation Manual -> https://github.com/scantailor/scantailor/wiki/Fix-Orientation]
- 2nd step: Split pages. If you import the renamed files, odd and even pages will be recognized as single pages, so this step is just to confirm that the edges of the pages are set properly; drag the rectangles to fit the page's area.
[Split pages manual -> https://github.com/scantailor/scantailor/wiki/Split-Pages]
- 3rd step: Deskew. Drag to set the angle by which the page needs to be turned so that the text and images are properly horizontal
[Deskew manual -> https://github.com/scantailor/scantailor/wiki/Deskew]
- 4th step: Select content. Frame all elements to be shown as content, within one single area (beware of including for example page numbers). The outer limit of these margins affects the size of the output file.
[Select content manual -> https://github.com/scantailor/scantailor/wiki/Select-Content]
- 5th step: Margins. Check all the margins and place the content so that it reads as centered on the page.
- 6th step: Output. Consider the visibility/readability of pages with images and/or mixed img-txt, managing the thickness slider.

Run Optical Character Recognition (OCR) on all the images in the folder (-l spa selects Spanish; use -l eng for books in English):

 time for i in *.tif; do b=$(basename "$i" .tif); tesseract -l spa "$i" "$b" pdf; done

Merges all the pdf files in folder into one single file:
```
 pdftk *pdf cat output book.pdf
```
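
Note that the shell expands *pdf in alphabetical order, so 10.pdf is placed before 2.pdf, and a book.pdf left over from an earlier run would be merged in as well. A minimal sketch of a merge in page order follows, assuming the pages are named 1.pdf, 2.pdf, … as in the workflow table; the function names are made up for this example.

```shell
#!/usr/bin/env bash
# List the per-page PDFs of the current folder in numeric order
# (1.pdf, 2.pdf, ..., 10.pdf), leaving out files such as book.pdf.
pages_in_order() {
    ls | grep -E '^[0-9]+\.pdf$' | sort -n
}

# Merge the pages into book.pdf. The unquoted substitution is deliberate:
# each listed name becomes one argument, which is safe for digit-only names.
merge_pages() {
    pdftk $(pages_in_order) cat output book.pdf
}
```

Run merge_pages inside the folder that holds the page PDFs.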
Exports the pdf metadata to a text file, to edit:
```
 pdftk book.pdf  dump_data output report.txt
```

Imports the metadata of report.txt back to the PDF:

 pdftk book.pdf update_info report.txt output bookcopy.pdf
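
Instead of editing the full report.txt, a small metadata file can also be written from scratch. The sketch below assumes the record layout that dump_data produces (an InfoBegin line followed by InfoKey and InfoValue lines); the function names and the info.txt file name are made up for this example.

```shell
#!/usr/bin/env bash
# Print one metadata record in the text layout used by pdftk dump_data.
info_record() {
    printf 'InfoBegin\nInfoKey: %s\nInfoValue: %s\n' "$1" "$2"
}

# Write a metadata file holding only a title and an author.
# Arguments: title, author, output file.
write_metadata() {
    {
        info_record Title "$1"
        info_record Author "$2"
    } > "$3"
}
```

For example, write_metadata "Book title" "Author name" info.txt creates the file, which can then be imported with pdftk book.pdf update_info info.txt output bookcopy.pdf. For titles with accented characters, pdftk also offers the dump_data_utf8 and update_info_utf8 variants.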

Distribution

Think about how people who would be interested in this book could know about it!

Repositories:

General “educational materials”: Library Genesis
Academic radical: Aaaaarg
Artist radical: Monoskop
Anarchist (including fanzines): Anarchist Library
There are many Zine Libraries you can find on the Internet…

You may consider spreading the word on relevant mailing lists, social media, etc.

Biblio-graphy

Scanning Manual from Memory of the World: a quite long document in PDF
Memory of the World: Digital Public Libraries
Spanish pages on Memory of the World: Digital Public Libraries in Spanish
Reading And Leading With One Laptop Per Child: Book digitalisation manual

