So I'm training the adaptive classifier (the default engine in Tesseract), but I'm having a bit of trouble with this, as the documentation is very fragmented and/or missing.
I'm training on a very small data set to start with; I thought I would just start out using Arial Black until I gather more data on my subject. I would like to recognize labels on, say, cosmetics (in Danish), where each label is just a list of comma-separated words. And only very specific words, in particular:
smør, ost, yoghurt, ymer, ylette, fløde, milkshake, laktose, mælkesukker, animalsk fedtstof, animalsk olie, smørolie, bagermargarine, margarine, minarine, risbagemel, inddampet mælk, mælkebestanddele, mælketørstof, tørmælk, mælkepulver, skummetmælkspulver, sødmælkspulver, mælkeprotein, lactalbumin, kasein, kaseinat, calciumkaseinat, kaliumkaseinat, natriumkaseinat, valle, valleprotein, vallepulver, mælk,
And the same words starting with a capital letter (for example, "Vallepulver"). But I keep having trouble figuring out a proper config file for this type of morphology. I thought that I should probably use the DAWG system, as accuracy and speed are very important.
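A minimal sketch of how the word list, including the capitalised variants, could be generated before it is handed to wordlist2dawg. The function names and the output file name are hypothetical, and only a few entries from the list above are shown. Multi-word entries such as "animalsk fedtstof" are kept as-is here, which may or may not be what wordlist2dawg expects.

```python
from pathlib import Path

# A few entries from the list above; the real file would contain all of them.
BASE_WORDS = ["smør", "ost", "mælk", "animalsk fedtstof", "vallepulver"]


def with_capitalised_variants(words):
    """Return each word followed by its capitalised form, without duplicates."""
    seen = set()
    result = []
    for word in words:
        word = word.strip()
        if not word:
            continue
        # Upper-case only the first character, so "mælk" becomes "Mælk".
        # str.capitalize() would also lower-case the rest of the entry.
        for variant in (word, word[0].upper() + word[1:]):
            if variant not in seen:
                seen.add(variant)
                result.append(variant)
    return result


def write_wordlist(words, path):
    """Write one entry per line, UTF-8 encoded (hypothetical helper)."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")
```

Calling `write_wordlist(with_capitalised_variants(BASE_WORDS), "wordlist.txt")` would then produce the input file for the dictionary step; the file name is an assumption.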
So far I have taken the following steps:

1. Used jTessBoxEditor to generate a .box file.
2. Converted the .box file to a .tr file with: tesseract imagefile filename.exp0.box nobatch box.train
3. Extracted the unicharset with: unicharset_extractor filename.exp0.box
4. Created a font property file with the following content: arial 1 0 0 0 0
5. Clustered the character features with "mftraining" and "cntraining".
6. Renamed all the files to my chosen language name.
7. Created a wordlist containing the above list.
8. Converted the wordlist to a lang.words.dawg with wordlist2dawg.
9. Finally, combined the data with: combine_tessdata lang.

But I'm still experiencing very inaccurate results (I'm using Scan Tailor to preprocess the images before feeding them to Tesseract). Here's the image (in .tif format) that I'm currently testing Tesseract on:
https://drive.google.com/file/d/0B8e0HDFGiNZOOXpWbUQwc0l3N2xqYlE3SGN4d1BPcHlxQVRn/view?usp=sharing
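For reference, the training steps listed earlier can be written down as a sequence of command lines. This is only a sketch that builds and prints the commands without running them. The argument order and the mftraining flags follow the Tesseract 3.x training guide as far as I can tell, and should be checked against the installed version. The file names (`filename.exp0.tif`, `font_properties`, `wordlist.txt`) are assumptions, and the DAWG name defaults to the one used in this post, which may differ from the name combine_tessdata expects.

```python
import shlex


def training_commands(base="filename.exp0", lang="lang",
                      wordlist="wordlist.txt", dawg_name="lang.words.dawg"):
    """Return the training steps as argv lists, in the order listed in the post."""
    return [
        # Feature file (.tr) from the image and its .box file.
        # Assumption: the second argument is the output base, without extension.
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        # Character set from the box file.
        ["unicharset_extractor", f"{base}.box"],
        # Clustering of the character features.
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
        # (The clustering outputs are renamed with the "lang." prefix here; not shown.)
        # Dictionary built from the word list.
        ["wordlist2dawg", wordlist, dawg_name, f"{lang}.unicharset"],
        # Combine everything that starts with "lang." into lang.traineddata.
        ["combine_tessdata", f"{lang}."],
    ]


if __name__ == "__main__":
    for argv in training_commands():
        print(shlex.join(argv))
```

Keeping the steps as data makes it easy to compare them line by line with the training guide before running anything.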
The system is only supposed to recognize words from the above list (the only match between the list and the image would therefore be "milk").
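Since only words from the list matter, one option besides the dictionary is to filter Tesseract's text output after recognition. The sketch below is plain post-processing and not part of Tesseract itself. The function name is hypothetical, the allowed set is a subset of the list above, and it assumes the OCR output is a comma-separated string.

```python
import re

# Subset of the list above; matching is done case-insensitively, which also
# covers the capitalised variants.
ALLOWED = {"smør", "ost", "mælk", "milkshake", "vallepulver", "animalsk fedtstof"}


def match_allowed(ocr_text, allowed=ALLOWED):
    """Return the allowed words found in comma-separated OCR output, in order."""
    lookup = {word.lower(): word for word in allowed}
    found = []
    for token in re.split(r"[,\n]", ocr_text):
        # Collapse internal whitespace and drop stray punctuation at the edges.
        key = " ".join(token.split()).strip(" .;:()").lower()
        if key in lookup and lookup[key] not in found:
            found.append(lookup[key])
    return found
```

For example, `match_allowed("Aqua, Glycerin, Mælk , parfum.")` would keep only the entry for "mælk" and discard everything else on the label.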
Any suggestions as to what I could be doing wrong or could improve (especially in my nonexistent config) would be very much appreciated, as I have been struggling with this for quite a while now.
Sincerely, a desperate fellow nerd.