Restricting the recognized characters in portable Tesseract

Question:

I'm currently using the portable version of Tesseract, integrated with Java, to recognize some characters, but I'm running into problems such as the following.

Some fields contain only a date, for example: 01/02/2013

But the output looks something like this: 0Il0S/S013

The errors don't follow any pattern. Does anyone know whether there is a way to restrict recognition to a fixed set of characters, such as 0-9 and / ?

Note: I know this can be done from C, but I haven't found how to do it with the portable version.

Answer:

I've only used Tesseract on Linux, either from the command line or from scripts that call the command line…

1) Create a configuration file named mydata listing the valid characters:

tessedit_char_whitelist 0123456789/-

2) Then invoke tesseract as:

tesseract f.png zzz mydata

This produces zzz.txt containing only digits, '/' and '-'.
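
Since the question is about Java, the two steps above can be sketched from Java by writing the configuration file and then launching the command-line program. This is only a minimal sketch: the class and method names (WhitelistConfig, writeConfig, command) are made up for illustration, and it assumes the tesseract executable is on the PATH.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of Tesseract or of any Java binding.
final class WhitelistConfig {

    // Step 1: write a config file with a single tessedit_char_whitelist line.
    static Path writeConfig(Path file, String allowedChars) throws IOException {
        String line = "tessedit_char_whitelist " + allowedChars + "\n";
        return Files.write(file, line.getBytes(StandardCharsets.UTF_8));
    }

    // Step 2: build the same command line as in the answer:
    //   tesseract <image> <outputBase> <configFile>
    static List<String> command(String image, String outputBase, Path config) {
        return List.of("tesseract", image, outputBase, config.toString());
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 2) {
            System.out.println("usage: WhitelistConfig <image> <outputBase>");
            return;
        }
        Path config = writeConfig(Path.of("mydata"), "0123456789/-");
        // Assumes "tesseract" can be found on the PATH.
        Process p = new ProcessBuilder(command(args[0], args[1], config))
                .inheritIO()
                .start();
        System.out.println("tesseract exit code: " + p.waitFor());
    }
}
```

Running it with the arguments f.png and zzz should leave the recognized text in zzz.txt, exactly as the manual invocation does.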

For good results it is worth investing in the quality (resolution) of the input image…

If the text to recognize is broader than this, it is probably useful to specify the language as well.

The Java, C, etc. interfaces presumably offer some way to define such whitelists too.

It is even possible to retrain Tesseract, although I doubt that is justified here.
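
I have not checked what a particular Java binding offers, so the sketch below avoids bindings altogether: it passes the whitelist to the command-line program through its -c option (which sets a configuration variable, so no separate file is needed) and then checks that the result has the dd/mm/yyyy shape from the question. The class and method names (DateFieldOcr, command, looksLikeDate, recognize) are hypothetical, and it again assumes tesseract is on the PATH.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tesseract or of any Java binding.
final class DateFieldOcr {

    // Characters allowed in a date field such as 01/02/2013.
    static final String ALLOWED = "0123456789/";

    // Two digits, slash, two digits, slash, four digits.
    private static final Pattern DATE = Pattern.compile("\\d{2}/\\d{2}/\\d{4}");

    // tesseract <image> <outputBase> -c tessedit_char_whitelist=<chars>
    static List<String> command(String image, String outputBase) {
        return List.of("tesseract", image, outputBase,
                "-c", "tessedit_char_whitelist=" + ALLOWED);
    }

    // Sanity check on the recognized text; surrounding whitespace is ignored.
    static boolean looksLikeDate(String text) {
        return DATE.matcher(text.strip()).matches();
    }

    // Runs tesseract and returns the trimmed contents of <outputBase>.txt.
    static String recognize(String image, String outputBase)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command(image, outputBase))
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new IOException("tesseract exited with an error");
        }
        Path out = Path.of(outputBase + ".txt");
        return Files.readString(out, StandardCharsets.UTF_8).strip();
    }
}
```

A caller could run recognize("f.png", "zzz") and reject or re-scan the field whenever looksLikeDate returns false, since a whitelist limits which characters can appear but does not guarantee they are the right ones.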
