Get better result after human validation #36
As you can see, the expected result is the second output in all three cases. Adding more training data might not help here. Since Roman is not the original script for Hindi, one can choose any spelling; for example, for the word बहुत the actual pronunciation is bahut, but most Hindi speakers (including me) prefer the spelling bohat. So bohut, bahut and bohat are all correct for me. Saying that the above transliterations are erroneous doesn't seem right; they are simply some of the possible transliterations. The system can of course still fail in some cases. After all, it is a machine that learned its parameters from training data, not a human doing the transliteration, so expecting 100% accuracy does not seem reasonable.
Ok, thanks for the clear response.
I don't think so. You can use back-transliteration to estimate the quality of the target transliteration by comparing its back-transliteration with the source word.
As you can see, in the above example, the second
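A minimal sketch of this back-transliteration check using the indictrans Python API (the Transliterator class, the build_lookup flag and the 'hin'/'eng' codes follow the project's README; treat them as assumptions if your installed version differs):

```python
# Sketch: estimate transliteration quality via back-transliteration.
# Assumes the indictrans package exposes Transliterator with
# source/target language codes 'hin' and 'eng'.
from indictrans import Transliterator

hin2eng = Transliterator(source='hin', target='eng', build_lookup=True)
eng2hin = Transliterator(source='eng', target='hin', build_lookup=True)

source_word = 'बहुत'
candidate = hin2eng.transform(source_word)   # forward transliteration, e.g. 'bahut'
back = eng2hin.transform(candidate)          # back-transliterate to Hindi

# If the round trip reproduces the source word, the candidate is likely good.
print(candidate, back, back == source_word)
```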
Thank you very much!
You can tokenize Hindi text using polyglot-tokenizer.
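A short sketch of tokenizing Hindi text with polyglot-tokenizer (the Tokenizer class and the 'hi' language code are assumptions based on that package's documentation; check its README for the exact constructor):

```python
# Sketch: tokenize Hindi text before transliteration.
from polyglot_tokenizer import Tokenizer

tok = Tokenizer(lang='hi')
tokens = tok.tokenize('मैं बहुत खुश हूँ।')
print(tokens)
```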
Great! I think I have all I need to continue. Regards
Sorry, one last question. How can I integrate polyglot-tokenizer with indictrans so that it works with the command-line binary?
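One possible workaround, sketched below, is to skip the command-line binary and do the tokenization and transliteration in a small Python script instead; the class and parameter names follow the two packages' documented APIs, so treat the details as assumptions:

```python
# Sketch: tokenize with polyglot-tokenizer, then transliterate each token
# with indictrans, instead of calling the indictrans CLI directly.
from polyglot_tokenizer import Tokenizer
from indictrans import Transliterator

tok = Tokenizer(lang='hi')
trn = Transliterator(source='hin', target='eng', build_lookup=True)

text = 'मैं बहुत खुश हूँ।'
romanized = ' '.join(trn.transform(token) for token in tok.tokenize(text))
print(romanized)
```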
Hi, I have this word OUTPUT= So how do I choose the "best" one? What I do is what you suggest: I back-transliterate every word contained in OUTPUT and check that each back-transliteration corresponds to the initial input word, in this case. Thanks
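A rough sketch of that selection step, assuming you already have a list of candidate romanizations for a source word (the candidate list below is hypothetical, and the Transliterator usage is taken from the project's README):

```python
# Sketch: pick the candidate whose back-transliteration matches the source word.
from indictrans import Transliterator

eng2hin = Transliterator(source='eng', target='hin', build_lookup=True)

source_word = 'बहुत'
candidates = ['bahut', 'bohot', 'bhut']   # hypothetical OUTPUT list

# Keep only candidates whose back-transliteration reproduces the source word;
# fall back to the first candidate if none survive the check.
validated = [c for c in candidates if eng2hin.transform(c) == source_word]
best = validated[0] if validated else candidates[0]
print(best)
```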
Based on the original question, my suggestion was not to add more training data but rather to find some workaround; that was because the language pair under consideration was
Ok, so I think this is the link: http://irshadbhat.github.io/rom-ind/
If you read the blog, I have mentioned a couple of sources from where I collected/generated the training data. Apart from those, you can search online for additional data. Regarding your second question, "which language pairs are considered reliable?", the reliability of the model is highly relative: it depends on the downstream task whether you consider the output good or bad. But since you have asked, the best-performing models are
Hi,
we are using indic-trans to transliterate from Hindi to Roman/English.
After applying your model, we get good results in general, but there are still some errors, as some Hindi speakers have pointed out to us.
Any advice on how to get better results? For example, should we increase the training set, or choose between BeamSearch and Viterbi?
Thanks