Add language information to the TSV output (fixes #1861) #4168

Open
wants to merge 3 commits into base: main


DrDub commented Dec 13, 2023

Also make the font_info flag work on TSV output.

The existing code was reading hocr_font_info but not using it when producing output. This PR fixes that too, but the feature is effectively undocumented unless you read the source code. I wonder where to document it. Maybe as a comment in the TSV example config?

Contributor

zdenop commented Dec 28, 2023

font_info does not provide reliable information (see e.g. https://github.com/tesseract-ocr/tesseract/issues?q=is%3Aissue+Font+Attribute+), so I doubt it is a good idea to promote it in the output.

Author

DrDub commented Dec 28, 2023

Alright, do you want me to remove it from this PR?

Contributor

zdenop commented Dec 28, 2023

Yes.

What is a use case for lang_info?

Author

DrDub commented Dec 28, 2023

Nothing in particular. It was already present in the code and it is in the example TSV linked in #1861.

Author

DrDub commented Dec 28, 2023

Let me know if you want the commits to be squashed. (I do hope font information can be brought back into the engine at some point.)

@@ -564,7 +564,7 @@ class TESS_API TessBaseAPI {
* page_number is 0-based but will appear in the output as 1-based.
* Returned string must be freed with the delete [] operator.
*/
char *GetTSVText(int page_number);
Contributor

This is an API change. It requires a new major version (Tesseract 6.0.0) and changes in other software, for example tesserocr.

Therefore we cannot simply merge this pull request.

Author

Understood. If you have a suggestion for how to provide this functionality without modifying the API, I could steer the PR in that direction.

Contributor

Use an overload instead of a default argument. Change

char *GetTSVText(int page_number, bool lang_info=false);
to
char *GetTSVText(int page_number, bool lang_info);

Author

I have pushed the overload approach; it no longer breaks the API.
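
For readers following the thread, the overload approach can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `FakeApi` is a hypothetical stand-in for `TessBaseAPI`, and the row layout and the `eng` value are invented for the example. The idea is that the original one-argument `GetTSVText` keeps its exact signature (so existing callers and bindings still see the same function), while a second overload without a default argument carries the new `lang_info` flag.

```cpp
#include <cstring>
#include <string>

// Hypothetical stand-in for TessBaseAPI; not the real Tesseract class.
class FakeApi {
 public:
  // Original signature, left untouched so existing callers keep working.
  // It simply forwards to the new overload with lang_info disabled.
  char *GetTSVText(int page_number) {
    return GetTSVText(page_number, false);
  }

  // New overload with no default argument, so a one-argument call
  // unambiguously resolves to the original function above.
  // The returned string must be freed with delete[].
  char *GetTSVText(int page_number, bool lang_info) {
    // Invented row layout: 1-based page number, then a word.
    std::string row = std::to_string(page_number + 1) + "\tword";
    if (lang_info) {
      row += "\teng";  // Assumed language column, for illustration only.
    }
    char *out = new char[row.size() + 1];
    std::memcpy(out, row.c_str(), row.size() + 1);
    return out;
  }
};
```

With this shape, `api.GetTSVText(0)` behaves as before and `api.GetTSVText(0, true)` opts in to the extra column. Giving the second overload a default value for `lang_info` would make the one-argument call ambiguous, which is presumably why the suggestion drops it.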

Author

DrDub commented Apr 3, 2024

> Yes.
>
> What is a use case for lang_info?

Sorry, I misunderstood your question. The use case for lang_info (not for the font information) is in #1861.

4 participants