Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Input iterator's column data member doesn't account for multibyte characters (UTF-8) #369

Open
catsidhe opened this issue Sep 6, 2024 · 4 comments
Assignees

Comments

@catsidhe
Copy link

catsidhe commented Sep 6, 2024

Unless I'm mistaken, column currently counts bytes. I ended up re-calculating the column by walking back from data until I see a newline character or reach the beginning of the content, counting characters while ignoring non-first UTF-8 bytes. Would've been nicer if it was handled by PEGTL.

@ColinH ColinH self-assigned this Sep 11, 2024
@ColinH
Copy link
Member

ColinH commented Sep 11, 2024

Am I right to assume that what you are counting are codepoints? That might be possible to add for PEGTL 4.0, though we have been apprehensive to go down that road because the next step would be to take composed characters and grapheme clusters into consideration. That's a big can of worms to open.

@catsidhe
Copy link
Author

catsidhe commented Sep 11, 2024

Yes, just the code points, bytes outside 0x80..0xbf range. I figured it'd cover the majority of use cases, while being simple and cheap. But I understand your concern, that can of worms might be best left closed.

Another dimension to this is double-width and non-printable characters - those with wcwidth(c) values other than 1. Looks like most editors, gcc and clang, take wcwidth into account.

@ColinH
Copy link
Member

ColinH commented Sep 14, 2024

Are you using lazy or eager position tracking?

@catsidhe
Copy link
Author

catsidhe commented Sep 16, 2024

Sorry, I don't quite know how to answer that. Maybe because I use it in a tree? Probably incorrectly, too!

auto from_utf8(std::string_view) -> std::u32string;
inline auto line_width(std::u32string_view sv) -> int {
  return std::accumulate(sv.begin(), sv.end(), 0, [](int w, char32_t uc) {
    return w + std::max(width(uc), 0);
  });
}
class node {
  // ...
  template<typename, typename Input>
  void start(Input const &in) noexcept {
    auto i = in.iterator();
    line = i.line;
    // Can I do this? Maybe not for all inputs.
    column = line_width(from_utf8({i.data - (i.column - 1), i.data})) + 1;
  }
};

Actually, I'm no longer sure how to best calculate columns. Consider this snippet:

int /*あいうえお*/ 123;

According to VSCode, the digit 1 is on column 15. It treats each character as occupying exactly one column.

Clang thinks it's on column 25 (counting bytes, like PEGTL):

test.cpp:1:25: error: expected unqualified-id
    1 | int /*あいうえお*/ 123;
      |                    ^

GCC says 20 (presumably using wcwidth):

test.cpp:1:20: error: expected unqualified-id before numeric constant
   1 | int /*あいうえお*/ 123;
     |                    ^~~

(Arrows point to 1 in both examples above in the terminal.)

Nano: 20
Emacs: 19 (it counts from 0)
Vim shows two numbers: 25-20

I'm leaning towards using wcwidth. But I don't think PEGTL should (since it's not in the standard library).

For best flexibility, perhaps a user-supplied character traits class is needed, that gets to decide how many columns each code point takes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants