WIP/proof of concept: Unmangle MAST strings #607

Open
wants to merge 5 commits into
base: main


samcv
Collaborator

@samcv samcv commented Mar 29, 2020

This is done in multiple steps:

  1. Handle resolving mangled lexical strings (so latin1 encoded as utf8 and utf8 encoded as latin1)
  2. Regenerate stage0 bootstrap
  3. Don't encode lexicals as latin-1 anymore, only encode to utf8
  4. Regenerate stage0 bootstrap
  5. Disable handling of mangled lexical strings

This is incomplete, but is a proof of concept for fixing us having both utf8 and latin-1 encoded lexicals.
@samcv samcv changed the title Unmangle mast strings WIP/proof of concept: Unmangle MAST strings Mar 29, 2020
nqp::setmethcache($buf, nqp::hash('new', method () {nqp::create($buf)}));
$buf;
}
my $handle-mangled-strings := 0;
Contributor

Is this a debugging thing? I don't understand the if just after it?

Collaborator Author

Yes it is. It's also so it's easy to see how this proof of concept was achieved when viewing the pull request.

@jnthn
Contributor

jnthn commented Mar 29, 2020

I'm missing the overall goal of this PR? Bytecode files store strings in a couple of different ways, in order to reduce memory use and startup decoding time (because if you know it's just in the latin-1 range, a whole load of things simply cannot happen).
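
One property that may be part of what is meant here can be sketched in Python (an illustration only, not MoarVM's decoder; `decode_latin1` and `is_valid_utf8` are hypothetical names): Latin-1 decoding maps each byte straight to a code point, so there are no multi-byte sequences and no invalid input, whereas a UTF-8 decoder has to validate what it reads.

```python
def decode_latin1(data: bytes) -> str:
    """Latin-1 decoding: every byte value is itself the code point."""
    return "".join(chr(b) for b in data)


all_bytes = bytes(range(256))
# Any byte sequence is valid Latin-1, and the result has one character per byte.
assert decode_latin1(all_bytes) == all_bytes.decode("latin-1")
assert len(decode_latin1(all_bytes)) == 256


def is_valid_utf8(data: bytes) -> bool:
    """UTF-8 decoding can fail, so a decoder must check its input."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"\x24\xa2"))      # False: the Latin-1 bytes of '$¢'
print(is_valid_utf8(b"\x24\xc2\xa2"))  # True: the UTF-8 bytes of '$¢'
```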

@jnthn
Contributor

jnthn commented Mar 29, 2020

> is a proof of concept for fixing us having both utf8 and latin-1 encoded lexicals

I don't understand why there's anything to fix. To me this looks like it's removing an optimization.

@samcv
Collaborator Author

samcv commented Mar 30, 2020

@jnthn if there is a good reason for it, then it can stay. But let me try and explain in more detail what I currently understand. Some of this may be incorrect, so feel free to correct/expand on this.

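| String | Latin-1 | UTF-8 | Latin-1 and UTF-8 roundtrip identically? |
|--------|---------|-------|------------------------------------------|
| $¢ | 24 A2 | 24 C2 A2 | No |
| $foo | 24 66 6F 6F | 24 66 6F 6F | Yes |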

It is not clear to me why we should be storing strings that are not valid UTF-8. The one of concern inside nqp is '$¢'. I am guessing rakudo also goes through this path.

In my opinion we should only be storing and decoding strings as UTF-8. $foo could still go through the Latin-1 encoder/decoder, since that gives the same result as the UTF-8 encoder/decoder. But '$¢' is encoded as 0x24, 0xA2 in Latin-1 and as 0x24, 0xC2, 0xA2 in UTF-8, so we end up storing two incompatible encodings in the same blob.
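
The byte values quoted above can be checked with a short Python sketch (Python is used here purely for illustration; `hex_bytes` is a hypothetical helper):

```python
def hex_bytes(s: str, encoding: str) -> str:
    """Return the encoded bytes of s as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in s.encode(encoding))


for name in ("$¢", "$foo"):
    latin1 = hex_bytes(name, "latin-1")
    utf8 = hex_bytes(name, "utf-8")
    # The two encodings agree only when every character is ASCII.
    print(f"{name!r}: latin-1 = {latin1}, utf-8 = {utf8}, identical = {latin1 == utf8}")
```

Only the ASCII-only name produces the same bytes under both encodings; any character in the range U+0080 to U+00FF takes one byte in Latin-1 and two in UTF-8.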

If the reason for this optimization is to avoid using the full utf-8 encoder/decoder, I think it would make sense to change this pull request so we use the ASCII decoder/encoder on ASCII strings only.
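
A minimal sketch of that alternative, in Python rather than NQP and under the assumption that each stored string carries an encoding tag (the `"ascii"`/`"utf8"` tags and both function names are hypothetical, not the actual bytecode format):

```python
from typing import Tuple


def encode_for_storage(s: str) -> Tuple[str, bytes]:
    """Pick the cheap ASCII path when possible, UTF-8 otherwise."""
    if s.isascii():
        return "ascii", s.encode("ascii")
    return "utf8", s.encode("utf-8")


def decode_from_storage(tag: str, data: bytes) -> str:
    """Decode using the fast path indicated by the tag."""
    return data.decode("ascii" if tag == "ascii" else "utf-8")


for name in ("$foo", "$¢"):
    tag, data = encode_for_storage(name)
    assert decode_from_storage(tag, data) == name
    # Whichever path was taken, the stored bytes are also valid UTF-8,
    # so the blob never mixes incompatible encodings.
    assert data.decode("utf-8") == name
    print(name, tag, data.hex(" "))
```

The design point is that ASCII is a strict subset of UTF-8, so the fast path stays an optimization over the same byte representation instead of a second, incompatible encoding.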

I hope this makes my intentions here a bit clearer.

@coke coke changed the base branch from master to main April 19, 2023 13:38
3 participants