Question:
We are using ImageMagick and Tesseract to read scanned documents, but we have not found the right configuration and combination of the two tools to clean up the original scanned TIFF and have Tesseract turn it into a PDF with the extracted text.
We first digitize the document on a scanner at 300 dpi, which produces a TIFF of about 170 KB.
We then pre-process the image with ImageMagick before passing it to Tesseract 3.0.3 to produce a searchable PDF.
The first command we use is the following:
convert page.tiff -respect-parenthesis -compress LZW -density 300 \
-bordercolor black -border 1 -fuzz 1% -trim +repage -fill white -draw \
"color 0,0 floodfill" -alpha off -shave 1x1 -bordercolor black -border 2 \
-fill white -draw "color 0,0 floodfill" -alpha off -shave 0x1 -fuzz 1% \
-deskew 40 +repage temp.tiff
And then we run Tesseract like this:
tesseract -l spa temp.tiff temp pdf
This produces a very large PDF (900 KB, example in the link), but Tesseract cannot read the data that sits inside table cells or just below shaded headers.
https://drive.google.com/open?id=0B3CPIZ_TyzFXd2UtWldfajR4SVU
We then tried this command:
convert page.tiff -compress LZW -fuzz 1% -trim -alpha off -shave 1x1 temp.tiff
This produces a lighter PDF (130 KB, example in the link), but we still have the same problems.
https://drive.google.com/open?id=0B3CPIZ_TyzFXWFEwT3JucDBTVVU
Could someone tell us how to optimize the image so we can extract information like that in the examples? Or point us to guidelines for pre-processing images to improve Tesseract's accuracy?
The documents we need to process vary widely, with different fonts and sizes.
–Added– Here is the script we run to transform a TIFF image. It runs on an Ubuntu machine with ImageMagick and Tesseract installed. We call the script with the name of the image and the output name for the PDF.
Example: ./transformTesseractOCR.sh nombre-del-tiff.tiff nombre-del-pdf
#!/bin/bash
# Get the number of pages in the tiff file. "%n" is printed once per page
# (e.g. "333" for a 3-page file), so keep only the first line.
PAGES=$(identify -format "%n\n" "$1" | head -n 1)
FILE="${2%%.*}"
BUCLE=$((PAGES-1))
cadena=""
# Arrays that will hold the names of the temporary files
declare -a array_temp_tiff
declare -a array_temp_pdf
# Function that returns a name made of 24 random characters + date + "temp"
get_random_name(){
    now=$(date +"%m-%d-%Y")
    random=$(tr -dc 'a-zA-Z0-9' < /dev/urandom | fold -w 24 | head -n 1)
    echo "temp-$random-$now"
}
# For each page of the tiff file, pre-process it with ImageMagick and run Tesseract on it to produce a PDF
for i in $(seq 0 $BUCLE); do
    # Generate random names for the temporary files
    temp_pdf=$(get_random_name)
    temp_tiff="$(get_random_name).tiff"
    # Store the names in the arrays
    array_temp_pdf[$i]="$temp_pdf.pdf"
    array_temp_tiff[$i]=$temp_tiff
    # First clean up the tiff page with ImageMagick
    convert "$1[$i]" -compress LZW -fuzz 1% -trim -alpha off -shave 1x1 "$temp_tiff"
    #convert "$1[$i]" -respect-parenthesis -compress LZW -density 300 -bordercolor black -border 1 -fuzz 1% -trim +repage -fill white -draw "color 0,0 floodfill" -alpha off -shave 1x1 -bordercolor black -border 2 -fill white -draw "color 0,0 floodfill" -alpha off -shave 0x1 -fuzz 1% -deskew 40 +repage page$i.tiff
    # Now run Tesseract
    /usr/bin/tesseract -l spa "$temp_tiff" "$temp_pdf" pdf
done
# Build a string holding all the temporary pdf names so we can merge them into a single pdf
for element in "${array_temp_pdf[@]}"; do
    cadena+="$element "
done
# Merge the temporary pdfs into a single pdf ($cadena is deliberately unquoted so it splits into one argument per file)
/usr/bin/pdftk $cadena cat output "$FILE.pdf"
# Delete the temporary pdf files
for element in "${array_temp_pdf[@]}"; do
    rm "$element"
done
# Delete the temporary tiff files
for element in "${array_temp_tiff[@]}"; do
    rm "$element"
done
Answer:
One likely problem is that the "greys" of the header-cell shading become "noise" in the monochrome image: the grey is simulated by a pattern of black dots. That almost certainly makes those headers unreadable, and probably also "confuses" the OCR on the fields just below them, which sit so close to the dither pattern that the engine cannot isolate the text. The difference is visible with the "concept" fields, which are a bit further from the shading and are recognized much better.
Ideally, if the goal is only to extract the text, you should start from a "cleaner" image. To achieve that, either at the scanner (if it is the scanner that delivers the monochrome TIFF) or in a later step if you receive a grayscale or colour image, you need to find a cut-off "level" that decides which pixels become black dots and which become white. On the scanner this can often be controlled through the brightness or contrast settings; in software a similar thresholding algorithm can be applied. The idea is to turn the shading into white, but this has to be done on a grayscale or colour image; with the monochrome TIFF in the link there is nothing more that can be done. Finding the optimal level is very much a matter of trial and error: a level that removes the shading may also wipe out important information, and a level that is good for one type of document may be bad for another.
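As a concrete sketch of that thresholding step, assuming the scanner can be set to deliver a grayscale TIFF instead of a monochrome one (the file name and the 60% cut-off here are only placeholders to tune by trial and error):

```shell
# Binarize the grayscale scan ourselves instead of letting the scanner
# dither: pixels lighter than the 60% cut-off become white (which should
# wipe out the light cell shading), darker pixels become black (the text).
convert gray-page.tiff -colorspace Gray -threshold 60% \
        -compress Group4 temp.tiff

# Then OCR the cleaned page as before
tesseract -l spa temp.tiff temp pdf
```

Group4 compression is the usual choice for 1-bit black-and-white TIFFs and should also keep the resulting files small.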
A separate comment: a little more than 15 years ago I worked for a long time with high-volume Kodak scanners. At the time, our experience applying OCR to heterogeneous documents was not very good, even with the advanced engines of the day. In general, OCR was only usable on documents of the same type, applied to "masks" or specific sectors of the page, to extract a few relevant fields.
Good luck!