WIP/proof of concept: Unmangle MAST strings #607

samcv · 2020-03-29T10:12:41Z

This is done in multiple steps:

1. Handle resolving mangled lexical strings (so latin1 encoded as utf8 and utf8 encoded as latin1)
2. Regenerate stage0 bootstrap
3. Don't encode lexicals as latin-1 anymore, only encode to utf8
4. Regenerate stage0 bootstrap
5. Disable handling of mangled lexical strings

This is incomplete, but is a proof of concept for fixing us having both utf8 and latin-1 encoded lexicals.
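
As an illustration of what "handling mangled strings" in step 1 could look like, here is a minimal Python sketch (not the NQP code from this pull request). It assumes a stored name may be either utf8 or latin-1 bytes and tries strict utf8 first. The helper name `decode_lexical_name` is hypothetical.

```python
def decode_lexical_name(raw: bytes) -> str:
    """Hypothetical helper: decode a stored name that may be utf8 or latin-1."""
    try:
        # Strict utf8 decoding rejects a lone high byte such as 0xA2.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is valid latin-1, so this fallback cannot fail.
        return raw.decode("latin-1")


# Both stored forms of the same two-character name resolve to the same string.
utf8_form = b"\x24\xc2\xa2"   # '$' followed by U+00A2 encoded as utf8
latin1_form = b"\x24\xa2"     # '$' followed by U+00A2 encoded as latin-1
print(decode_lexical_name(utf8_form) == decode_lexical_name(latin1_form))
```

The fallback is only a heuristic: a latin-1 string whose bytes happen to form valid utf8 would be misread, so an approach like this is best treated as a transitional measure.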

lizmat · 2020-03-29T10:14:51Z

src/vm/moar/QAST/QASTCompilerMAST.nqp

+                nqp::setmethcache($buf, nqp::hash('new', method () {nqp::create($buf)}));
+                $buf;
+            }
+            my $handle-mangled-strings := 0;


Is this a debugging thing? I don't understand the `if` just after it.

samcv · 2020-03-29

Yes, it is. It also makes it easy to see how this proof of concept was achieved when viewing the pull request.

jnthn · 2020-03-29T13:17:53Z

I'm missing the overall goal of this PR? Bytecode files store strings in a couple of different ways, in order to reduce memory use and startup decoding time (because if you know it's just in the latin-1 range, a whole load of things simply cannot happen).

jnthn · 2020-03-29T13:21:15Z

> is a proof of concept for fixing us having both utf8 and latin-1 encoded lexicals

I don't understand why there's anything to fix. To me this looks like it's removing an optimization.

samcv · 2020-03-30T06:50:29Z

@jnthn if there is a good reason for it, then it can stay. But let me try and explain in more detail what I currently understand. Some of this may be incorrect, so feel free to correct/expand on this.

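| String | Latin-1 (hex bytes) | UTF-8 (hex bytes) | Latin-1 and UTF-8 roundtrip identically? |
|--------|---------------------|-------------------|------------------------------------------|
| `$¢`   | 24 A2               | 24 C2 A2          | No                                       |
| `$foo` | 24 66 6F 6F         | 24 66 6F 6F       | Yes                                      |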

It is not clear to me why we should be storing strings that are not valid utf8. The one of concern inside nqp is '$¢'. I am guessing rakudo also goes through this path.

In my opinion we should only be storing/decoding strings as utf8. $foo could still go through the latin-1 encoder/decoder, since for that string the result is the same as with the utf8 encoder/decoder. But '$¢' is encoded as 0x24, 0xA2 in latin-1 and as 0x24, 0xC2, 0xA2 in utf-8, so we end up storing two incompatible encodings in the same blob.
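
The byte values above can be reproduced with a short Python sketch (Python is used purely for illustration, not NQP). It also shows what happens when bytes written with one encoding are read back with the other. The helper `hex_bytes` is a hypothetical name.

```python
CENT_VAR = "$\u00a2"  # the lexical name '$' + CENT SIGN discussed above
FOO_VAR = "$foo"


def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


for name in (CENT_VAR, FOO_VAR):
    print(hex_bytes(name, "latin-1"), "|", hex_bytes(name, "utf-8"))

# utf8 bytes read back as latin-1: no error, but the name gains an extra character.
mangled = CENT_VAR.encode("utf-8").decode("latin-1")

# latin-1 bytes read back as utf8: 0xA2 is not a valid start byte, so this fails.
try:
    CENT_VAR.encode("latin-1").decode("utf-8")
    latin1_as_utf8_ok = True
except UnicodeDecodeError:
    latin1_as_utf8_ok = False
```

The ASCII-only name produces the same bytes under both encodings, which is why only names with code points above 0x7F are affected.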

If the reason for this optimization is to avoid using the full utf-8 encoder/decoder, I think it would make sense to change this pull request so we use the ASCII decoder/encoder on ASCII strings only.
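
A minimal sketch of that idea, written in Python rather than NQP and not taken from the pull request: pick the ASCII encoder only when every character is ASCII, and fall back to utf8 otherwise. The function name `encode_name` and the returned labels are assumptions made for the example.

```python
from typing import Tuple


def encode_name(name: str) -> Tuple[str, bytes]:
    """Hypothetical helper: return (encoding label, encoded bytes) for a name."""
    if name.isascii():
        # Cheap path: one byte per character, no multi-byte sequences possible.
        return "ascii", name.encode("ascii")
    # Anything outside ASCII goes through the full utf8 encoder.
    return "utf8", name.encode("utf-8")


for name in ("$foo", "$\u00a2"):
    label, raw = encode_name(name)
    # Either way the stored bytes are valid utf8, so one decoder reads them all.
    print(label, raw.decode("utf-8") == name)
```

Because ASCII is a strict subset of utf8, every blob produced this way is valid utf8, so the two paths cannot disagree about what a given byte sequence means.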

I hope this makes my intentions here a bit clearer.

samcv added 2 commits March 29, 2020 11:59

Handle mangled lexicals

e3490a8

Update stage0 with support for handling mangled strings

10af899

samcv changed the title ~~Unmangle mast strings~~ WIP/proof of concept: Unmangle MAST strings Mar 29, 2020

lizmat reviewed Mar 29, 2020


samcv added 3 commits March 30, 2020 09:43

Only encode as utf8 or ascii

cf68f48

update stage0

6903d0f

Don't handle mangled strings anymore

9aca235

samcv force-pushed the unmangle-mast-strings branch from aef8971 to 9aca235 Compare March 30, 2020 07:47

coke changed the base branch from master to main April 19, 2023 13:38
