This is an archive of the discontinued Mercurial Phabricator instance.

tests: make test-alias.t pass with re2
ClosedPublic

Authored by valentin.gatienbaron on Nov 19 2018, 1:40 PM.

Details

Summary

Locally, these "non-ASCII character in alias" errors don't show up,
though I get them when the alias is defined at the command line rather
than in an hgrc.
The brokenness comes from the fact that hgrcs are parsed with regexes,
and re/re2 differ in this way:

$ python -c 'import re; print(re.compile("(.*)").match("aaa\xc0bbbb").groups())'
('aaa\xc0bbbb',)
$ python -c 'import re2; print(re2.compile("(.*)").match("aaa\xc0bbbb").groups())'
('aaa',)

Apparently re2 stops when it encounters invalid utf8 (which I suppose makes sense
given that '.' matches what appears to be a codepoint rather than a byte). This is
presumably a bug in hg, but not very important, so just change the test to stick
to valid utf8.
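
As a rough illustration of the point above (not part of the committed change), the sketch below uses only the standard library to show that the byte string from the example is not valid UTF-8, while a UTF-8 encoding of the same character is. It also shows that the stdlib re module, given a bytes pattern, matches byte by byte and so keeps the whole input. The helper name is_valid_utf8 and the variable names are hypothetical, and re2 itself is deliberately not imported here.

```python
import re


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The input from the summary: 0xC0 followed by an ASCII byte is not valid UTF-8.
invalid_input = b"aaa\xc0bbbb"
# Hypothetical replacement: U+00C0 encoded as the two UTF-8 bytes 0xC3 0x80.
valid_input = b"aaa\xc3\x80bbbb"

print(is_valid_utf8(invalid_input))
print(is_valid_utf8(valid_input))

# With a bytes pattern, the stdlib re module treats "." as "any byte except
# newline", so the invalid byte does not stop the match.
print(re.compile(b"(.*)").match(invalid_input).groups())
```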

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

yuja added a subscriber: yuja.Nov 20 2018, 7:22 AM

Queued, thanks.

This revision was automatically updated to reflect the committed changes.

Thanks. FWIW, re2's options allow choosing an encoding, either latin1 or utf8 (https://github.com/google/re2/blob/master/re2/re2.h#L609). Presumably latin1 means that "." matches a byte, which would seem more compatible with re, but the Python bindings don't provide a way to choose this encoding.

yuja added a comment.Nov 21 2018, 9:22 AM

Indeed. We could pass a fat unicode array (i.e. each byte as a latin-1 char)
to re2, but that would sacrifice performance.

Thanks for the info.
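
For readers of this archive, the "fat unicode array" idea discussed above can be sketched roughly as follows. Decoding bytes as latin-1 maps every byte to exactly one code point, so a matcher that works on code points never sees an invalid sequence. This is only a sketch under stated assumptions: the stdlib re module stands in for re2, the function name match_bytes_via_latin1 is hypothetical, and whether the re2 Python bindings would behave the same way has not been verified here.

```python
import re


def match_bytes_via_latin1(pattern, data):
    """Match a bytes pattern against bytes by round-tripping through latin-1.

    Each byte becomes one code point, so a "." that matches a code point
    effectively matches a single byte. The stdlib re module is used as a
    stand-in for a code-point-based engine.
    """
    text = data.decode("latin-1")  # never fails: all 256 byte values are mapped
    match = re.compile(pattern.decode("latin-1")).match(text)
    if match is None:
        return None
    # Convert the captured groups back to bytes for the caller.
    return tuple(
        group.encode("latin-1") if group is not None else None
        for group in match.groups()
    )


print(match_bytes_via_latin1(b"(.*)", b"aaa\xc0bbbb"))
```

The extra decode and encode steps copy the whole input and every captured group, which is one way to read the performance concern raised in the comment.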