This is an archive of the discontinued Mercurial Phabricator instance.

chg: force-set LC_CTYPE on server start to actual value from the environment
ClosedPublic

Authored by spectral on Jan 29 2020, 5:23 PM.

Details

Summary

Python 3.7+ will "coerce" the LC_CTYPE variable in many instances, and this can
cause issues with chg being able to start up. D7550 attempted to fix this, but a
combination of a misreading of the way that python3.7 does the coercion and an
untested state (LC_CTYPE being set to an invalid value) meant that this was
still not quite working.

This change will cause differences between chg and hg: hg will have the LC_CTYPE
environment variable coerced, while chg will not. This is unlikely to cause any
detectable behavior differences in what Mercurial itself outputs, but it does
have two known effects:

  • When using hg, the coerced LC_CTYPE will be passed to subprocesses, even non-python ones. Using chg will remove the coercion, and this will not happen. This is arguably more correct behavior on chg's part.
  • On macOS, if you set your region to Brazil but your language to English, this isn't representable in locale strings, so macOS sets LC_CTYPE=UTF-8. If this value is passed along when ssh'ing to a non-macOS machine, some functions (such as locale.setlocale()) may raise an exception due to an unsupported locale setting. This is most easily encountered when doing an interactive commit/split/etc. when using ui.interface=curses.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

spectral created this revision.Jan 29 2020, 5:23 PM
yuja added a subscriber: yuja.Jan 30 2020, 10:05 AM
  • # Python3 has some logic to "coerce" the C locale to a UTF-8 capable
  • # one, and it sets LC_CTYPE in the environment to C.UTF-8 if none of
  • # 'LC_CTYPE', 'LC_ALL' or 'LANG' are set (to any value). This can be
  • # disabled with PYTHONCOERCECLOCALE=0 in the environment.
  • #
  • # When fromui is called via _inithashstate, python has already set
  • # this, so that's in the environment right when we start up the hg
  • # process. Then chg will call us and tell us to set the environment to
  • # the one it has; this might NOT have LC_CTYPE, so we'll need to
  • # carry-forward the LC_CTYPE that was coerced in these situations.
  • #
  • # If this is not handled, we will fail config+env validation and fail
  • # to start chg. If this is just ignored instead of carried forward, we
  • # may have different behavior between chg and non-chg.

Can you move and rephrase this comment?

@@ -730,6 +696,11 @@

  1. environ cleaner. if b'CHGINTERNALMARK' in encoding.environ: del encoding.environ[b'CHGINTERNALMARK']

+ if b'CHGORIG_LC_CTYPE' in encoding.environ:
+ encoding.environ[b'LC_CTYPE'] = encoding.environ[b'CHGORIG_LC_CTYPE']
+ del encoding.environ[b'CHGORIG_LC_CTYPE']
+ elif b'CHG_CLEAR_LC_CTYPE' in encoding.environ:
+ del encoding.environ[b'LC_CTYPE']

would crash if LC_CTYPE wasn't set, and probably needs to delete
CHG_CLEAR_LC_CTYPE.

diff --git a/hg b/hg

  • a/hg

+++ b/hg
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3

Unrelated change.

quark added a subscriber: quark.Jan 30 2020, 10:50 AM

What do you think about this approach:

  1. The server detects that LC_TYPE is coerced.
  2. When handling the "validate" command, the server sends back "invalidate this server, and fallback to original hg" response.

This makes chg/non-chg behave consistently with some startup overhead in mis-configured environment. The chg client can potentially print a warning to remind the user to fix their environment.

yuja added a comment.Jan 30 2020, 12:02 PM
What do you think about this approach:
1. The server detects that LC_TYPE is coerced.
2. When handling the "validate" command, the server sends back "invalidate this server, and fallback to original hg" response.
This makes chg/non-chg behave consistently with some startup overhead in mis-configured environment. The chg client can potentially print a warning to remind the user to fix their environment.

That could be, but if we do want to make chg/hg behavior consistent, maybe we
can adjust the hash computation logic?

  1. client sends CHG_ORIG_LC_CTYPE or CHG_UNSET_LC_CTYPE when spawning server
  2. they're kept in environ dict, but override LC_CTYPE while computing hash, and excluded from the hash
  3. client does not send these variables over setenv command, but passes the validation because {CHG_ORIG_LC_CTYPE: x, LC_CTYPE: y} == {LC_CTYPE: x}.

If we had a sha1 logic in C, we could compute the env hash at client side.
Python 3 can't be trusted.

In D8039#118719, @yuja wrote:
What do you think about this approach:
1. The server detects that LC_TYPE is coerced.
2. When handling the "validate" command, the server sends back "invalidate this server, and fallback to original hg" response.
This makes chg/non-chg behave consistently with some startup overhead in mis-configured environment. The chg client can potentially print a warning to remind the user to fix their environment.

That could be, but if we do want to make chg/hg behavior consistent, maybe we
can adjust the hash computation logic?

  1. client sends CHG_ORIG_LC_CTYPE or CHG_UNSET_LC_CTYPE when spawning server
  2. they're kept in environ dict, but override LC_CTYPE while computing hash, and excluded from the hash
  3. client does not send these variables over setenv command, but passes the validation because {CHG_ORIG_LC_CTYPE: x, LC_CTYPE: y} == {LC_CTYPE: x}.

I think that's almost what I was doing in the original stack of three commits ending in D8023, though I used a different encoding than two separate environment variables (one for clear, one for set), I have no strong preference between the two encoding methods. There's still a bit of a issue with the setenv command: it'll clear the environment, and replace it with the one that came from chg. So we need to somehow keep it from clearing/modifying CHG_ORIG_LC_CTYPE and LC_CTYPE, which isn't difficult (D8023 already does this, just with a more complicated mechanism). With a small change to D8023 I think D8021 could be dropped, and D8022 could be changed to only do this in execcmdserver. Done.

If we had a sha1 logic in C, we could compute the env hash at client side.

I think sha1dc has been added to the repo, so we do have a sha1 implementation in C. :) Computing the hash on the client avoids the restart loop, but doesn't avoid the behavior difference - chg will still call setenv and replace the modified version with the original one.

Python 3 can't be trusted.

Yeah :(

Sorry about the delay responding here, I was sick with the flu.

spectral updated this revision to Diff 19876.Feb 4 2020, 4:57 PM
yuja added a comment.Feb 6 2020, 9:29 AM

I like the simplicity of this patch. Queued, thanks.

Computing the hash on the client avoids the restart loop, but doesn't avoid the behavior difference - chg will still call setenv and replace the modified version with the original one.

Indeed.

This revision was not accepted when it landed; it landed in state Needs Review.
This revision was automatically updated to reflect the committed changes.