It is still a long way from the fiche reader screen shots to a small, multipage PDF document that is good enough for OCR.
I spent more time developing the image processing than building the scanner and scanning the fiches.
The "FilterChain" program
After experiments with Photoshop and other image processing tools, I decided to write my own program to process my fiche images. I called it "FilterChain". It is a Windows program, written in Delphi.
"FilterChain" applies a sequence of selected processing filters to an input file. It integrates the best filters from these sources: the commercial ImageEn library, the free ImageMagick, tesseract-ocr, and self-written goodies. It has a batch mode and diagnostic features useful for filter development.
The processing chain
In theory, processing the images should be quite easy: cut off the border, invert, adjust the contrast automatically ... that's it. But remember the "gallery of flaws"? The worst 5% of these images cause 95% of all the post-processing effort (and the worst 0.1% added another 95%!)
After numerous trial-and-error runs, FilterChain now applies these filters to a raw fiche reader screen photo:
1. "Cropping": A part of the microfiche reader appears on each image. This must be cropped off, so that only an image of the reader's screen remains. Since the reader has been painted black in those areas, cropping the border off automatically is easy.
After this processing step, images have a resolution of about 3600x2800 pixels. Image dimensions remain unchanged by all further processing steps. If the original printouts were on 15" wide sheets, the image resolution is about 200dpi.
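As a rough illustration of the cropping step (not FilterChain's actual Delphi code), the sketch below finds the outermost rows and columns that contain anything brighter than the black-painted border. The image is a plain list of rows of gray values to keep the example dependency-free; the function name and the `black_level` cut-off of 40 are assumptions made for this example.

```python
def crop_black_border(image, black_level=40):
    """Return the sub-image bounded by the outermost non-black pixels.

    image: list of rows, each a list of gray values 0..255.
    black_level: assumed cut-off; pixels at or below it count as border.
    """
    rows = [y for y, row in enumerate(image) if max(row) > black_level]
    cols = [x for x in range(len(image[0]))
            if max(row[x] for row in image) > black_level]
    if not rows or not cols:
        return []  # nothing but border
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Tiny 5x6 frame: a bright 2x3 "screen" surrounded by the black reader body.
frame = [
    [5, 5, 5, 5, 5, 5],
    [5, 5, 200, 210, 190, 5],
    [5, 5, 180, 220, 205, 5],
    [5, 5, 5, 5, 5, 5],
    [5, 5, 5, 5, 5, 5],
]
screen = crop_black_border(frame)
```

A real photo would need a cut-off tuned to the camera exposure, and probably some tolerance for isolated bright specks in the border.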
2. The resulting color image is converted to gray levels.
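A minimal sketch of such a conversion, assuming the common luma weights 0.299, 0.587 and 0.114 for red, green and blue. Which weighting FilterChain really uses is not stated here, so treat both the weights and the function name as assumptions.

```python
def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of gray values 0..255."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_image
    ]


# One row: a white, a black and a pure red pixel.
gray = to_gray([[(255, 255, 255), (0, 0, 0), (255, 0, 0)]])
```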
3. The lighting levels of the reader screen are not uniform: the image is darker at the edges. This is corrected by subtracting the image of an empty fiche from the fiche being processed. The brightness of the subtracted background image is adjusted so that the background of the resulting image is (almost) pure black.
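The background correction can be sketched as a pixel-by-pixel subtraction with clamping at zero. The `gain` parameter is a hypothetical stand-in for the brightness adjustment of the empty-fiche image; how FilterChain actually picks that adjustment is not described here.

```python
def subtract_background(image, empty, gain=1.0):
    """Subtract a scaled empty-fiche image pixel by pixel, clamping at 0."""
    return [
        [max(0, round(p - gain * b)) for p, b in zip(row, bg_row)]
        for row, bg_row in zip(image, empty)
    ]


# Empty fiche: brighter in the centre, darker toward the edges.
empty = [[20, 40, 20],
         [40, 60, 40],
         [20, 40, 20]]
# Same lighting, plus one bright text pixel in the centre.
page = [[20, 40, 20],
        [40, 200, 40],
        [20, 40, 20]]
flat = subtract_background(page, empty)
```

In this toy case the uneven background drops to zero everywhere and only the text pixel remains.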
4. Brightness of the document text may still have different levels at different places in the page (remember the "gallery of flaws"?) To equalize:
- the image is separated into tiles;
- for each tile the brightness of the foreground text is calculated. Text is judged with a mix of histogram logic and OCR runs, to separate true text from other structures;
- brightness for each tile is individually adjusted, so finally all tiles have the same brightness for text structures.
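The three tile steps above might look roughly like this. The text-brightness estimate is a deliberately simple stand-in (mean of the brightest few pixels in a tile) for the histogram and OCR logic described in the list; the tile size, the `target` level and all names are assumptions for illustration. At this stage text is still bright on a dark background.

```python
def text_level(tile_pixels, fraction=0.05):
    """Estimate text brightness as the mean of the brightest pixels.

    Simplified stand-in for the histogram/OCR judgement in the text.
    """
    ordered = sorted(tile_pixels, reverse=True)
    count = max(1, int(len(ordered) * fraction))
    return sum(ordered[:count]) / count


def equalize_tiles(image, tile=2, target=200):
    """Scale each tile x tile block so its estimated text level hits target."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            coords = [(y, x)
                      for y in range(y0, min(y0 + tile, height))
                      for x in range(x0, min(x0 + tile, width))]
            level = text_level([image[y][x] for y, x in coords])
            if level <= 0:
                continue  # completely black tile: nothing to scale
            gain = target / level
            for y, x in coords:
                out[y][x] = min(255, round(image[y][x] * gain))
    return out


# Left tile has bright text (200), right tile has faded text (100).
page = [[0, 200, 0, 100],
        [0,   0, 0,   0]]
balanced = equalize_tiles(page, tile=2, target=200)
```

A tile that contains only noise would get that noise amplified by this naive estimate, which is presumably one reason why the real filter checks for true text first.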
5. The image is inverted, so now text is black and background is white.
Now the scans have optimal quality. The images still have 256 gray levels and the format is still JPG. This makes the final PDFs very big, and OCR is difficult.
6. The images from this processing step are packed into PDFs with "Adobe Acrobat XI". Documents with more than 208 pages were originally split over several fiches; these are gathered into one single PDF. So the 432 fiches result in 330 PDFs. The PDF document names are generated from the meta-data sampled while scanning the fiches. Example: The fiche title
is saved as file "AH-E122A-MC__PDP-11__DIAGNOSTIC_USER_GUIDE__CZUGAA0__(C)1978.pdf"
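Judging only from the file name in this example, the naming rule seems to be: spaces inside a field become single underscores, and fields are joined with double underscores. The sketch below reproduces that pattern; the split of the title into these five fields is my guess from the resulting name, and `pdf_name` is a hypothetical helper, not part of FilterChain.

```python
def pdf_name(fields):
    """Join title fields with '__', replacing inner whitespace with '_'."""
    parts = ["_".join(field.split()) for field in fields]
    return "__".join(parts) + ".pdf"


# Field boundaries are assumed from the example file name.
name = pdf_name(["AH-E122A-MC", "PDP-11", "DIAGNOSTIC USER GUIDE",
                 "CZUGAA0", "(C)1978"])
```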
There are other tools to pack images into PDFs. Adobe Acrobat has good built-in optimizations to reduce file size and enhance image quality.
7. The resulting gray level image is converted to 1 bit black & white. This is done by applying a "threshold" brightness. All pixels darker than the threshold become black, all pixels brighter than the threshold become white. The choice of the threshold impacts the shape of the letters: since more pixels fall below a higher threshold, a higher threshold produces darker (and fatter) letters, while a lower threshold results in thinner letter shapes. Normally, a fixed threshold of 128 is used, which is just in the middle of a 256 gray level range.
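To make the effect of the threshold concrete, here is a small sketch that binarizes one scan line across a soft-edged black stroke on white paper and counts the black pixels at three threshold levels. The function names and the sample values are made up for this example.

```python
def threshold(image, level=128):
    """Pixels darker than level become black (0), all others white (255)."""
    return [[0 if p < level else 255 for p in row] for row in image]


def black_pixels(bitmap):
    """Count the black pixels in a 1 bit image."""
    return sum(row.count(0) for row in bitmap)


# One scan line across a stroke: dark core, gray edges, white paper.
line = [[250, 180, 90, 20, 90, 180, 250]]
counts = {level: black_pixels(threshold(line, level))
          for level in (64, 128, 192)}
```

With the rule "darker than the threshold becomes black", the stroke is 1, 3 and 5 pixels wide at thresholds 64, 128 and 192.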
But the document quality can be improved by finding the optimal threshold (some image processing tools like Photoshop have a nice slider for threshold adjustment). The target is to produce letter shapes recognizable by OCR. So for automatic optimization the threshold is regulated by a feedback loop with an in-place OCR module (tesseract again). Tesseract produces not only the recognized text, but also a quality value for each recognized letter. The threshold is regulated so as to maximize the overall OCR quality. The recognized text itself is ignored, because even at the best threshold it is almost unusable.
Since the image is always inhomogeneous, the threshold is calculated separately for different tiles of the image.
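The feedback loop can be pictured as a search over threshold values, run once per tile, that keeps whichever value scores best. The sketch below uses a simple hill climb with a shrinking step; whether FilterChain regulates the threshold this way is not stated, so the search strategy is an assumption. `score` is a placeholder for one OCR run that returns an overall quality number, and `toy_score` merely fakes such a number from the black pixel count so that the example runs without tesseract.

```python
def binarize(image, level):
    """Pixels darker than level become black (0), the rest white (255)."""
    return [[0 if p < level else 255 for p in row] for row in image]


def tune_threshold(image, score, start=128, step=32, max_runs=20):
    """Hill-climb toward the threshold that maximizes score(bitmap).

    score stands in for one OCR run returning an overall quality value.
    max_runs caps the number of scoring calls.
    """
    best, best_quality = start, score(binarize(image, start))
    runs = 1
    while step >= 1 and runs < max_runs:
        improved = False
        for candidate in (best - step, best + step):
            if not 0 <= candidate <= 255 or runs >= max_runs:
                continue
            quality = score(binarize(image, candidate))
            runs += 1
            if quality > best_quality:
                best, best_quality = candidate, quality
                improved = True
                break
        if not improved:
            step //= 2  # no gain in either direction: search more finely
    return best


def toy_score(bitmap, wanted_black=2):
    """Fake OCR quality: best when exactly wanted_black pixels are black."""
    black = sum(row.count(0) for row in bitmap)
    return -abs(black - wanted_black)


tile = [[10, 60, 110, 160, 210, 250]]
level = tune_threshold(tile, toy_score)
```

A hill climb like this can stall on flat stretches of the quality curve, so a real loop would need a more careful strategy.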
The OCR feedback loop for the threshold gives optimal letter shapes, which turn out to be quite "light". It needs an incredible amount of processor power. To reach the optimal threshold, about 20 iteration steps are required, so for each of the 50,000+ images 20 OCR operations are performed ... this project is a million page OCR! And tesseract OCR is quite slow, because it is fed with tons of semi-random gray-level images.
8. Finally, the thresholded version of the images is packed into PDFs again, as in step 6. Now we have an "original" version in gray levels, and a small, OCRable version in black & white. Both should be archived.
9. The resulting PDFs are copied to a public server, and a proud announcement to the retrocomputing community is released ...
Running the filters
Because the filter chain is so elaborate and so slow, processing all 50,000 images would take several months (imagine the shock when I first calculated that number!)
Luckily I have five PCs around here: my desktop, my notebook, a test machine for my job, another desktop in my electronics lab, and the controller for the automatic fiche scanner rig. I managed to write a special program (called "BatchConverter") which can run many filter chains in parallel:
- one BatchConverter can run multiple threads on one machine,
- multiple BatchConverters can run on different PCs, sharing work through a system of lock files on a central network file system.
This way I could run the filter chain on all 20 processor cores, reducing processing time to 10-12 days.
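The lock file idea can be sketched in a few lines: a worker claims a job by creating a lock file in exclusive mode, and if the file already exists, some other worker was faster. This is an illustration only, not BatchConverter's code; the function names, the ".lock" suffix and the job names are made up, and whether exclusive creation is truly atomic depends on the network file system in use.

```python
import os
import tempfile


def try_claim(lock_dir, job_name):
    """Create <job_name>.lock exclusively; True means this worker owns the job."""
    path = os.path.join(lock_dir, job_name + ".lock")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another worker already claimed this job
    with os.fdopen(fd, "w") as handle:
        handle.write(f"{os.getpid()}\n")  # record the owner for diagnostics
    return True


def claim_next(lock_dir, job_names):
    """Return the first job this worker manages to claim, or None."""
    for name in job_names:
        if try_claim(lock_dir, name):
            return name
    return None


# Demo: a temporary directory plays the role of the shared network folder.
with tempfile.TemporaryDirectory() as shared:
    jobs = ["fiche_001", "fiche_002"]
    first = claim_next(shared, jobs)
    second = claim_next(shared, jobs)
    third = claim_next(shared, jobs)
```

Every thread on every PC would run the same claim loop, so no central scheduler is needed; the shared folder itself is the work queue.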
Running this massively parallel task was very cool at the beginning, but quickly got boring ... and it's still running while I'm writing this ...