Bookscanning: Difference between revisions

From Wiki-Fou
(Migrate relevant parts from the Bookscanner page)
 
(Rewrite workflow offline and import it here)
Line 1: Line 1:
There are many ways to scan, this is the current state of the art in Calafou. We use only free software and the documentation is for Debian GNU/Linux, but it should work with some small modifications on any UNIX based system running the bash shell. There are some parts where proprietary software such as ABBYY FineReader can be more effective. However, this workflow produces near perfect books in PDF format that we are very happy with. One thing we could definitely improve is the size of the final PDF file, which is quite big (can be more than 100 megabytes).


= Scanning =
= Scanning =
Line 6: Line 7:
# Setting up the cameras (calibration): the most important part.
# Setting up the cameras (calibration): the most important part.


Caveats:
* open the book in the middle (at a central page with normal text on both sides)
 
* '''camera should look directly on the middle of the page, parallel to the cradle, at 45 degrees compared to horizontal'''
* open the book right in the middle (at the central page) to calibrate the cameras.
* all the page should be in the image, but it is not a problem if more things outside of the book are visible
* camera should look at right angle on the page. Make sure the cameras are parallel to the angles of the cradle.
* check if the pages fold/curve; if so, place something underneath to straighten it (like a sponge, or another book…)
* all the page should be in the image
* camera settings: fully automatic, perhaps with manual focus
* check if the pages fold/curve; if so, place something underneath to straighten it (like a sponge, or another book...)
* camera settings: fully automatic, perhaps with manual focus.
* back up and empty the SD cards in the cameras
* back up and empty the SD cards in the cameras
* most subtle mistake: one camera sees letters bigger than the other camera
* most subtle mistake: one camera sees letters bigger than the other camera (this can be a difference in the zoom level or the distance between camera and page)
* use a post-it or similar to mark the exact position of the book in relation to the lower edge of the cradle, to ensure it remains in the same position throughout the scanning.
* use a post-it or similar to mark the exact position of the book in relation to the lower edge of the cradle, to ensure it remains in the same position throughout the scanning


<ol start="2" style="list-style-type: decimal;">
<ol start="2" style="list-style-type: decimal;">
Line 27: Line 26:
* from the camera on the left, copy the images to a folder called “odd”
* from the camera on the left, copy the images to a folder called “odd”
* from the camera on the right, copy the images to a folder called “even”
* from the camera on the right, copy the images to a folder called “even”
* upload the two folders now to the ftp://seldon.calafou/HackTheBiblio/scanning/$bookname--$yourname/ folder
* upload the two folders now to the <code>ftp://seldon.calafou/HackTheBiblio/scanning/$bookname--$yourname/</code> folder
* remember to delete the pictures from the SD cards and put them back to the cameras, and maybe put the camera batteries to charge
* remember to delete the pictures from the SD cards and put them back to the cameras, and maybe put the camera batteries to charge


= Dependencies =
= Dependencies =
There are many ways to scan, this is the current state of the art in Calafou.


Using an up-to-date Debian operating system, you can install the following programs for the postproduction steps:
Using an up-to-date Debian operating system, you can install the following programs for the postproduction steps:
Line 44: Line 41:
* calibre
* calibre


You can install all these programs with the following invocation:
You can install all these programs with the following invocation from the command line (also called the terminal):


<pre>sudo apt install scantailor gprename pdftk tesseract-ocr \
<pre>sudo apt install scantailor gprename pdftk tesseract-ocr \
         tesseract-ocr-eng tesseract-ocr-spa calibre</pre>
         tesseract-ocr-eng tesseract-ocr-spa calibre</pre>
= Postproduction =
= Postproduction =


You start with two folders with files like IMG_1234.JPG
You start with two folders such as <code>odd</code> and <code>even</code> with files like IMG_1234.JPG. It is not good to talk about <code>right</code> and <code>left</code> because it can be very confusing: are you talking about the image from the right camera that takes pictures of the left page of the book, or the image of the left page of the book that is from the right camera? On the other hand, <code>odd</code> (1, 3, 5, …) and <code>even</code> (2, 4, 6, …) are good words for describing what is on the image without ambiguity!
 
The basic workflow is like this:


<ol start="0" style="list-style-type: decimal;">
The basic workflow is like this: → 0. [program] [output] 1. gprename 1.jpg, 2.jpg, … 2. scantailor 1.tif, 2.tif, … 3. tesseract 1.pdf, 2.pdf, … 4. pdftk book.pdf 5. calibre book.epub 6. libgen.org http://libgen.org/book/index.php?md5=B6916395FDE00D91DB4F52DCB8F069BF 7. etc.
<li>[program] [output]</li>
<li>gprename 1.jpg, 2.jpg, …</li>
<li>scantailor 1.tif, 2.tif, …</li>
<li>tesseract 1.pdf, 2.pdf, …</li>
<li>pdftk book.pdf</li>
<li>calibre book.epub</li>
<li>libgen.org http://libgen.org/book/index.php?md5=B6916395FDE00D91DB4F52DCB8F069BF</li>
<li>etc.</li></ol>


There are some bash oneliners which can be useful (on Debian based systems):
There are some bash oneliners which can be useful (on Debian based systems):


<ol style="list-style-type: decimal;">
<ol style="list-style-type: decimal;">
<li><p><code>FIXME</code> we can probably write a script to rename the files properly… but for now, in gprename select the “numerical” tab, start = 1 for right-pages and 2 for left-pages, always step = 2.</p></li>
<li><p><code>FIXME</code> we can probably write a script to rename the files properly… but for now, in <code>gprename</code> select the “numerical” tab, start = 1 for odd pages and 2 for even pages, always step = 2.</p></li>
<li><p>You can rotate the images appropriately (which is called “fix orientation” in scantailor) in the left/right folders before you import them. This is faster than in scantailor I think. However, you can also make the same operation in scantailor in a more user friendly way.</p>
<li><p>You can rotate the images appropriately (which is called “fix orientation” in scantailor) in the left/right folders before you import them. This is faster than in <code>scantailor</code> I think. However, you can also make the same operation in <code>scantailor</code> in a more user friendly way.</p>
<pre>sudo apt-get install imagemagick
<pre> sudo apt-get install imagemagick
cd left
cd odd
mogrify -verbose -rotate 270 *
mogrify -verbose -rotate 270 *
cd ../right
cd ../even
mogrify -verbose -rotate 90 *</pre></li>
mogrify -verbose -rotate 90 *</pre></li>
<li><p>Does Optical Character Recognition (OCR) on all images in folder:</p>
<li><p>Does Optical Character Recognition (OCR) on all images in folder:</p>
<pre>time for i in *.tif; do b=$(basename "$i" .tif); tesseract -l spa "$i" "$b" pdf; done</pre></li>
<pre> time for i in *.tif; do b=$(basename "$i" .tif); tesseract -l spa "$i" "$b" pdf; done</pre></li>
<li><p>Merges all the pdf files in folder into one single file:</p>
<li><p>Merges all the pdf files in folder into one single file:</p>
<pre>pdftk *pdf cat output book.pdf</pre></li>
<pre> pdftk *pdf cat output book.pdf</pre></li>
<li><p>Exports the pdf metadata to a text file, to edit:</p>
<li><p>Exports the pdf metadata to a text file, to edit:</p>
<pre>pdftk book.pdf  dump_data output report.txt</pre></li>
<pre> pdftk book.pdf  dump_data output report.txt</pre></li>
<li><p>Imports the metadata of report.txt back on the pdf:</p>
<li><p>Imports the metadata of report.txt back to the PDF:</p>
<pre>pdftk book.pdf update_info report.txt output bookcopy.pdf</pre></li></ol>
<pre> pdftk book.pdf update_info report.txt output bookcopy.pdf</pre></li></ol>


= Distribution =
= Distribution =
Line 91: Line 77:


* General “educational materials”: [https://libgen.io/ Library Genesis]
* General “educational materials”: [https://libgen.io/ Library Genesis]
* Public library, radical books: [https://library.memoryoftheworld.org/ Memory of the World]
* Academic radical: [https://aaaaarg.org/ Aaaaarg]
* Academic radical: [https://aaaaarg.org/ Aaaaarg]
* Artist radical: [https://monoskop.org/ Monoskop]
* Artist radical: [https://monoskop.org/ Monoskop]
Line 98: Line 83:


You may consider spreading the word on relevant mailing lists, social media, etc.
You may consider spreading the word on relevant mailing lists, social media, etc.
= Biblio-graphy =
* [https://www.memoryoftheworld.org/wp-content/uploads/2014/12/scanning_manual_v1.2.pdf Scanning Manual from Memory of the World]: a quite long document in PDF
* [https://www.memoryoftheworld.org/ Memory of the World]: Digital Public Libraries
* [https://www.memoryoftheworld.org/es/ Spanish pages on Memory of the World]: Digital Public Libraries in Spanish
* [http://en.flossmanuals.net/e-book-enlightenment/ Reading And Leading With One Laptop Per Child]: Book digitalisation manual

Revision as of 14:46, 18 October 2018

There are many ways to scan; this is the current state of the art in Calafou. We use only free software and the documentation is written for Debian GNU/Linux, but it should work with small modifications on any UNIX-based system running the bash shell. In some parts proprietary software such as ABBYY FineReader can be more effective. However, this workflow produces near-perfect books in PDF format that we are very happy with. One thing we could definitely improve is the size of the final PDF file, which is quite big (it can be more than 100 megabytes).

Scanning

The amount of work in the postproduction phase depends on the quality of the images you make in the scanning phase!

  1. Setting up the cameras (calibration): the most important part.
  • open the book in the middle (at a central page with normal text on both sides)
  • the camera should point directly at the middle of the page, parallel to the cradle, at 45 degrees to the horizontal
  • the whole page should be in the image, but it is not a problem if things outside the book are also visible
  • check if the pages fold or curve; if so, place something underneath to straighten them (like a sponge, or another book…)
  • camera settings: fully automatic, perhaps with manual focus
  • back up and empty the SD cards in the cameras
  • most subtle mistake: one camera sees letters bigger than the other camera (this can be a difference in the zoom level or the distance between camera and page)
  • use a post-it or similar to mark the exact position of the book in relation to the lower edge of the cradle, to ensure it remains in the same position throughout the scanning
  2. Push the big button on the scanner to scan.
  • you may have to put your finger on the side of the plexiglass that is closer to you when it is “down”, because the plexiglass is not always at exactly the same angle as the book pages
  3. Download the images from the SD cards and put the scanner to sleep.
  • from the camera on the left, copy the images to a folder called “odd”
  • from the camera on the right, copy the images to a folder called “even”
  • upload the two folders now to the ftp://seldon.calafou/HackTheBiblio/scanning/$bookname--$yourname/ folder
  • remember to delete the pictures from the SD cards and put the cards back in the cameras, and maybe put the camera batteries on charge

Dependencies

Using an up-to-date Debian operating system, you can install the following programs for the postproduction steps:

  • scantailor
  • gprename
  • pdftk
  • tesseract-ocr
  • tesseract-ocr-eng
  • tesseract-ocr-spa
  • calibre

You can install all these programs with the following invocation from the command line (also called the terminal):

sudo apt install scantailor gprename pdftk tesseract-ocr \
         tesseract-ocr-eng tesseract-ocr-spa calibre
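
Before starting postproduction it can help to confirm that the tools are actually on the PATH. The following is only a minimal sketch and not part of the original workflow: missing_tools is a hypothetical helper name, and the command names you pass to it are assumptions (for example, the command-line converter shipped with calibre is assumed to be ebook-convert).

```shell
# Sketch: print the name of every given command that is not found on the PATH.
# missing_tools is a hypothetical helper, not part of any of the packages above.
missing_tools() {
    local tool
    for tool in "$@"; do
        # command -v succeeds only if the shell can find the command
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, missing_tools scantailor gprename pdftk tesseract ebook-convert prints nothing when every command is found. To see which OCR languages are installed, tesseract --list-langs should list spa and eng.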

Postproduction

You start with two folders such as odd and even with files like IMG_1234.JPG. It is not good to talk about right and left because it can be very confusing: are you talking about the image from the right camera that takes pictures of the left page of the book, or the image of the left page of the book that is from the right camera? On the other hand, odd (1, 3, 5, …) and even (2, 4, 6, …) are good words for describing what is on the image without ambiguity!

The basic workflow is like this:

  0. [program] → [output]
  1. gprename → 1.jpg, 2.jpg, …
  2. scantailor → 1.tif, 2.tif, …
  3. tesseract → 1.pdf, 2.pdf, …
  4. pdftk → book.pdf
  5. calibre → book.epub
  6. libgen.org → http://libgen.org/book/index.php?md5=B6916395FDE00D91DB4F52DCB8F069BF
  7. etc.

There are some bash oneliners which can be useful (on Debian based systems):

  1. FIXME we can probably write a script to rename the files properly… but for now, in gprename select the “numerical” tab, start = 1 for odd pages and 2 for even pages, always step = 2.

  2. You can rotate the images appropriately (which is called “fix orientation” in scantailor) in the odd/even folders before you import them. I think this is faster than doing it in scantailor, although scantailor offers the same operation in a more user-friendly way.

     sudo apt-get install imagemagick
     # mogrify overwrites the files in place, so keep a backup of the originals
     cd odd
     mogrify -verbose -rotate 270 *
     cd ../even
     mogrify -verbose -rotate 90 *
  3. Run Optical Character Recognition (OCR) on all images in the folder:

     time for i in *.tif; do b=$(basename "$i" .tif); tesseract -l spa "$i" "$b" pdf; done
  4. Merge all the PDF files in the folder into one single file (the shell expands *pdf in alphabetical order, so 10.pdf sorts before 2.pdf unless the page numbers are zero-padded):

     pdftk *pdf cat output book.pdf
  5. Export the PDF metadata to a text file, to edit:

     pdftk book.pdf dump_data output report.txt
  6. Import the metadata of report.txt back to the PDF:

     pdftk book.pdf update_info report.txt output bookcopy.pdf
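
The renaming step marked FIXME above could be scripted along these lines. This is a sketch under stated assumptions, not a tested replacement for gprename: interleave_pages and number_pages are hypothetical helper names, the images are assumed to be JPEGs whose alphabetical order matches the shooting order (as with IMG_1234.JPG, provided the camera counter did not wrap around), and the output uses zero-padded names such as 0001.jpg instead of 1.jpg.

```shell
# Sketch: copy every file of one folder into out_dir, numbering them
# start, start+2, start+4, ... with zero padding (0001.jpg, 0003.jpg, ...).
# number_pages and interleave_pages are hypothetical helper names.
number_pages() {
    local src_dir="$1" start="$2" out_dir="$3"
    local n="$start" f
    for f in "$src_dir"/*; do
        [ -f "$f" ] || continue          # skip subfolders and an empty glob
        # cp leaves the originals untouched; the .jpg extension is assumed
        cp -- "$f" "$out_dir/$(printf '%04d' "$n").jpg"
        n=$((n + 2))
    done
}

# Odd pages get 1, 3, 5, ... and even pages get 2, 4, 6, ...
interleave_pages() {
    local odd_dir="$1" even_dir="$2" out_dir="$3"
    mkdir -p "$out_dir" || return 1
    number_pages "$odd_dir" 1 "$out_dir"
    number_pages "$even_dir" 2 "$out_dir"
}
```

Running interleave_pages odd even pages would fill a new pages folder that can be imported into scantailor. Zero-padded names also keep a later *pdf glob in page order.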

Distribution

Think about how people who would be interested in this book could find out about it!

Repositories:

You may consider spreading the word on relevant mailing lists, social media, etc.

Biblio-graphy