This is an archive of the discontinued Mercurial Phabricator instance.

tests: make test-alias.t pass with re2
ClosedPublic

Authored by valentin.gatienbaron on Nov 19 2018, 1:40 PM.

Details

Summary

Locally, these "non-ASCII character in alias" errors don't show up,
though I get them when the alias is defined at the command line rather
than in an hgrc.
The brokenness comes from the fact that hgrcs are parsed with regexes,
and re/re2 differ in this way:

$ python -c 'import re; print(re.compile("(.*)").match("aaa\xc0bbbb").groups())'
('aaa\xc0bbbb',)
$ python -c 'import re2; print(re2.compile("(.*)").match("aaa\xc0bbbb").groups())'
('aaa',)

Apparently re2 stops matching when it encounters invalid UTF-8 (which I suppose
makes sense, given that '.' appears to match a codepoint rather than a byte).
This is presumably a bug in hg, but not a very important one, so just change the
test to stick to valid UTF-8.
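
The stdlib half of this comparison can be reproduced without re2 installed. The following is a minimal Python 3 sketch, not code from hg: the helper names `stdlib_match` and `first_invalid_utf8_offset` are hypothetical. It shows that `re` on a bytes pattern lets '.' match any byte, and that the byte 0xc0 is the first offset where UTF-8 decoding fails, which coincides with where the re2 match above stopped.

```python
import re

DATA = b"aaa\xc0bbbb"  # 0xc0 can never start a valid UTF-8 sequence


def stdlib_match(data):
    """Match (.*) with the stdlib re module on bytes.

    With a bytes pattern, '.' matches any single byte except a newline,
    so invalid UTF-8 does not end the match.
    """
    return re.compile(rb"(.*)").match(data).groups()


def first_invalid_utf8_offset(data):
    """Return the offset of the first undecodable byte, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


print(stdlib_match(DATA))               # (b'aaa\xc0bbbb',)
offset = first_invalid_utf8_offset(DATA)
print(offset)                           # 3
print(DATA[:offset])                    # b'aaa'
```

Whether a given re2 binding truncates at exactly this offset is an assumption based on the output quoted above, not something this sketch verifies.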

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

yuja added a subscriber: yuja. Nov 20 2018, 7:22 AM

Queued, thanks.

This revision was automatically updated to reflect the committed changes.

Thanks. FWIW, re2's options allow choosing an encoding, either latin1 or utf8 (https://github.com/google/re2/blob/master/re2/re2.h#L609). Presumably latin1 means that "." matches a byte, which would seem more compatible with re, but the Python bindings don't provide a way to choose this encoding.

yuja added a comment. Nov 21 2018, 9:22 AM
Indeed. We could pass a fat unicode array (i.e. each byte as a latin-1 char)
to re2, but that would sacrifice performance.
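
A minimal Python 3 sketch of the "fat unicode" idea described above, under stated assumptions: the stdlib `re` module stands in for re2 (no re2 binding API is assumed here), and the helper name `match_bytes_via_latin1` is hypothetical. Decoding as latin-1 maps every byte to exactly one codepoint, so a codepoint-oriented '.' then behaves like a byte-oriented one.

```python
import re


def match_bytes_via_latin1(pattern, data):
    """Match a bytes pattern against bytes by going through latin-1 text.

    latin-1 decoding never fails: each byte 0x00..0xff becomes the
    codepoint U+0000..U+00FF, so '.' matching a codepoint is equivalent
    to '.' matching a byte.
    """
    text = data.decode("latin-1")
    match = re.compile(pattern.decode("latin-1")).match(text)
    if match is None:
        return None
    # Convert the groups back to bytes; unmatched optional groups stay None.
    return tuple(
        None if group is None else group.encode("latin-1")
        for group in match.groups()
    )


data = b"aaa\xc0bbbb"
print(match_bytes_via_latin1(rb"(.*)", data))  # (b'aaa\xc0bbbb',)

# The cost: an engine that consumes UTF-8 sees every byte >= 0x80 as two bytes.
print(len(data), len(data.decode("latin-1").encode("utf-8")))  # 8 9
```

The extra decode and encode passes, plus the larger input for non-ASCII data, illustrate the performance concern raised in the comment; actual overhead with re2 is not measured here.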

Thanks for the info.