ocr - Including unicharambigs in the [lang].traineddata file (Tesseract)


I'm facing a problem training Tesseract OCR for Kannada fonts (Lohit Kannada, Kedage) when it comes to numerals.

For example, 0 is getting recognized as 8 (and ನ as ವ). I need help including a unicharambigs file (the documentation on GitHub describes only the format). My output.txt file has not changed, despite including the unicharambigs file.

Suppose [lang] corresponds to kan. Will the following command include the unicharambigs file in the kan.traineddata file?

combine_tessdata kan.

In case it doesn't, I'd appreciate any help on how to proceed.

It is difficult to answer without knowing which version of Tesseract and which kan.traineddata you're using.

You can unpack kan.traineddata to see which version of kan.unicharambigs is included in it, and recombine after editing the file.

See https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc for the command syntax.

  • Use the -u option to unpack:

    • -u .traineddata pathprefix: unpacks the .traineddata using the provided prefix.
  • Use the -o option to overwrite unicharambigs:

    • -o .traineddata file…: overwrites the specified components of the .traineddata file with those provided on the command line.
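
Put together, the unpack / edit / overwrite cycle might look like the sketch below. It assumes kan.traineddata sits in the current directory and that the unpacked component is named kan.unicharambigs; the backup copy and the variable names are illustrative additions, not part of the tool. The commands are skipped if combine_tessdata or the traineddata file is missing.

```shell
#!/bin/sh
# Sketch only: file names are assumptions based on the "kan" example.
TESS_LANG="kan"
TRAINEDDATA="${TESS_LANG}.traineddata"
PREFIX="${TESS_LANG}."
AMBIGS="${PREFIX}unicharambigs"

if command -v combine_tessdata >/dev/null 2>&1 && [ -f "$TRAINEDDATA" ]; then
    # Keep a backup so the original traineddata can be restored.
    cp "$TRAINEDDATA" "${TRAINEDDATA}.bak"

    # 1. Unpack every component using the prefix "kan."
    #    (produces kan.unicharset, kan.unicharambigs, ... where present).
    combine_tessdata -u "$TRAINEDDATA" "$PREFIX"

    # 2. Edit kan.unicharambigs by hand at this point, then continue.

    # 3. Overwrite only the unicharambigs component inside kan.traineddata.
    if [ -f "$AMBIGS" ]; then
        combine_tessdata -o "$TRAINEDDATA" "$AMBIGS"
    else
        echo "$AMBIGS was not unpacked; this traineddata has no such component"
    fi
else
    echo "combine_tessdata or $TRAINEDDATA not found; nothing done"
fi
```

Run the first step on its own to inspect what is currently packed, and rerun recognition after the third step to check whether output.txt changes.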

Please note that https://github.com/tesseract-ocr/langdata/blob/master/kan/kan.unicharambigs seems to be a copy of eng.unicharambigs.

