This is an archive of the discontinued Mercurial Phabricator instance.

narrow: escape includepats/excludepats when sending over the wire
AbandonedPublic

Authored by spectral on Mar 13 2020, 12:12 AM.

Details

Reviewers
durin42
martinvonz
marmoute
Group Reviewers
hg-reviewers
Summary

The escaping chosen was RFC 4180-like in that it wraps the values that are to be
escaped in double quote (") characters, and doubles any embedded double quote
characters. It differs from RFC 4180 in what is escaped; the escaping is only
applied if the item contains a , character or starts with a " character. RFC
4180 states that " characters embedded in a value that isn't escaped are not
allowed, and it also states that \r and \n characters should trigger an item to
be escaped.

The differences from RFC 4180 are intentional:

  • the include and exclude patterns can't contain a , character today (that's what's being fixed), so surrounding them in " characters won't really change anything.
  • the include and exclude patterns today can't start with a " character, so by *not* escaping all strings that have an embedded " unless they start with a " character (for future compatibility), we maintain BC.
  • the include and exclude patterns aren't multi-line, so \r and \n are irrelevant (and technically allowed in the patterns, though Mercurial is generally unable to reason about filenames that contain newlines).

Alternatives considered:

Full RFC 4180 encoding and decoding:

  • this would have been a breaking change for any include/exclude pattern that includes a " in the name if a new client talks to an old server. This is quite unlikely to cause problems in practice, but was easy to avoid.
  • this would have required the decoder to handle \r and \n characters specially, which would complicate it significantly.
  • the csv module in Python would have provided this to us, but it requires str on python3, and we only have bytes. It's not generically safe to convert these patterns to str, though perhaps roundtripping the string through fsencode and fsdecode would have been acceptable?

Other escaping mechanisms:

  • the original version of this change did use a different (home-grown) escaping mechanism. This received pushback during review.
  • using \ escaping (\->\\, and , -> \,) is annoying to decode, though overall pretty similar to what we have to do here. It would introduce significantly more changes for cross-platform and cross-version compatibility, however, as a server that encounters a string with \, can't know if the client is escaping a comma, or if the client is old and the path ends in a backslash (among other ambiguities).
  • using the existing batch escaping would likely have worked, but would introduce cross-version ambiguities if any path contained a : character followed by one of the four characters used in that escaping mechanism. The author does not know if their deployment has any paths with these character sequences in them, so decided to err on the side of caution.

Diff Detail

Repository
rHG Mercurial
Branch
default
Lint
No Linters Available
Unit
No Unit Test Coverage

Event Timeline

spectral created this revision.Mar 13 2020, 12:12 AM

The Windows path changes seem like a good idea.

Would quoting paths with commas eliminate the need for custom escaping? I don't feel strongly about it, but custom escaping always feels weird to me. (I fact, a coworker did some homebrew escaping for CSV files a few days ago, but I forget how it ultimately ended up.)

spectral updated this revision to Diff 20767.Mar 13 2020, 1:21 AM

The Windows path changes seem like a good idea.
Would quoting paths with commas eliminate the need for custom escaping? I don't feel strongly about it, but custom escaping always feels weird to me. (I fact, a coworker did some homebrew escaping for CSV files a few days ago, but I forget how it ultimately ended up.)

Let me play with that a bit, I think it'll work and be detectable on the server since the first character can't currently be a double-quote, so there wouldn't really be any BC issues aside from the pconvert (which wouldn't be as important anymore, but still probably a good idea?)

The Windows path changes seem like a good idea.
Would quoting paths with commas eliminate the need for custom escaping? I don't feel strongly about it, but custom escaping always feels weird to me. (I fact, a coworker did some homebrew escaping for CSV files a few days ago, but I forget how it ultimately ended up.)

Let me play with that a bit, I think it'll work and be detectable on the server since the first character can't currently be a double-quote, so there wouldn't really be any BC issues aside from the pconvert (which wouldn't be as important anymore, but still probably a good idea?)

It's a little weird to me that there's no step corresponding to pconvert() in the decoding function. Maybe we should do the pconvert() step outside encodecsvpaths(), maybe in narrowspec.normalizepattern()? However, that would make it impossible to have backslashes in paths in the narrowspec on Windows even after this patch (because they would always be converted to slashes), including narrowspecs returned from the server. However^2, maybe it wouldn't work to check out such paths on Windows anyway, so it doesn't really matter (so it's fine to call pconvert())?

marmoute requested changes to this revision.Mar 13 2020, 4:58 AM
marmoute added a subscriber: marmoute.

The escaping scheme is a bit puzzling to me. Coudl we use something more standard for this ? (like urlencode).

(requesting change of the function name. Now that we can, lets make them readable)

mercurial/wireprototypes.py
182

Let's name it encode_csv_paths now that _ are permitted. It woul dbe easier to read.

219

Let's name it decode_csv_paths now that _ are permitted. It woul dbe easier to read.

This revision now requires changes to proceed.Mar 13 2020, 4:58 AM

The Windows path changes seem like a good idea.
Would quoting paths with commas eliminate the need for custom escaping? I don't feel strongly about it, but custom escaping always feels weird to me. (I fact, a coworker did some homebrew escaping for CSV files a few days ago, but I forget how it ultimately ended up.)

Let me play with that a bit, I think it'll work and be detectable on the server since the first character can't currently be a double-quote, so there wouldn't really be any BC issues aside from the pconvert (which wouldn't be as important anymore, but still probably a good idea?)

I haven't played with narrow yet, so I'm not sure of the context. Are these user input paths that would end up being ignored/rejected if a Windows user used path\to\file when talking to a Unix server? Or are these stored in a tracked file? (Which I think could still cause problems.) I can't think of a good reason to stay inconsistent, and it is still experimental, so it still seems like a good idea while we still can fix it.

The Windows path changes seem like a good idea.
Would quoting paths with commas eliminate the need for custom escaping? I don't feel strongly about it, but custom escaping always feels weird to me. (I fact, a coworker did some homebrew escaping for CSV files a few days ago, but I forget how it ultimately ended up.)

Let me play with that a bit, I think it'll work and be detectable on the server since the first character can't currently be a double-quote, so there wouldn't really be any BC issues aside from the pconvert (which wouldn't be as important anymore, but still probably a good idea?)

I haven't played with narrow yet, so I'm not sure of the context. Are these user input paths that would end up being ignored/rejected if a Windows user used path\to\file when talking to a Unix server? Or are these stored in a tracked file? (Which I think could still cause problems.) I can't think of a good reason to stay inconsistent, and it is still experimental, so it still seems like a good idea while we still can fix it.

They are stored in .hg/store/narrowspec. They usually get into that file via hg clone --narrow --include and similar, but the server may also send them. We should ideally do the conversion early when the user provides them. I think the pconvert in this patch is to handle existing repos as well as possible.

spectral retitled this revision from narrow: escape includepats/excludepats when sending over the wire (BC) to narrow: escape includepats/excludepats when sending over the wire.Mar 17 2020, 5:25 PM
spectral edited the summary of this revision. (Show Details)
spectral updated this revision to Diff 20813.
spectral updated this revision to Diff 20818.Mar 17 2020, 5:50 PM
spectral updated this revision to Diff 20819.Mar 17 2020, 5:59 PM

Since narrow is still experimental, I don't think we should try too hard for backward compatibility. We could introduce a new end-point for a new encoding and drop the old one in a couple of version.

I am really not enthousiatic with having our own version of an encoding. Because this means extra overhead for people working on this in the future. Especially if it needs to be reimplemented (eg: in rust). If we drop the hard BC constraint on this, like starting from scratch. What would be your (@spectral ) encoding of choice ? could we got for something simple but widely available like url-encode ?

Since narrow is still experimental, I don't think we should try too hard for backward compatibility. We could introduce a new end-point for a new encoding and drop the old one in a couple of version.

+0, honestly. I won't require it, but I'd really rather we shaved this yak _now_ rather than when narrow has even more users.

Since narrow is still experimental, I don't think we should try too hard for backward compatibility. We could introduce a new end-point for a new encoding and drop the old one in a couple of version.

+0, honestly. I won't require it, but I'd really rather we shaved this yak _now_ rather than when narrow has even more users.

I'm getting a bit frustrated with how much time I've spent on this, made worse by the fact that I agree with everything everyone's saying and so it's not like I'm frustrated at the review process, just how slow I've been at accomplishing this.

So, before I go down another rabbit hole, here's what I'm thinking:

  • Server emits a new capability narrow-exp-1-escaped (in addition to the current narrow-exp-1, this is not replacing the existing capability)
  • wireprototypes's map will change these items from csv to csv.escaped
  • Compatible clients will detect this capability from the server and send items of type csv.escaped during getbundle with keys like <previousname>.escaped (ex: include.escaped). If the server doesn't support csv.escaped, the client sends with the old names (unescaped).
  • The escaping will be urllibcompat.quote
  • The server will strip the .escaped suffix on the keys, split on comma, and urllibcompat.unquote the individual items
  • I'm *not* expecting to do anything about \ -> / conversion.

Since these are part of getbundle, I haven't found a way of doing this that's not one of:

  • a custom escaping mechanism that's backwards compatible
  • adding a capability and renaming the keys that are sent (so the server can tell when it needs to unescape)
  • having the client always send duplicate items (i.e. send include and include.escaped). I'm not even sure that older servers would tolerate receiving keys they aren't expecting.
  • having the client only escape when necessary (i.e. it includes a comma), and then always send the path as include.escaped (which runs into the problem of old servers rejecting).
  • having the server always unescape and the client always escape. This breaks the server's ability to interact with older clients that aren't escaping (which we'll need to support for at least a week or two).

For the non-getbundle parts (I think the wireproto command is 'widen'), we can easily make a widen2 or something, but it's probably easier to just keep the same command name and do the same thing as in getbundle: detect the capability, send as foo.escaped if supported.

Since narrow is still experimental, I don't think we should try too hard for backward compatibility. We could introduce a new end-point for a new encoding and drop the old one in a couple of version.

+0, honestly. I won't require it, but I'd really rather we shaved this yak _now_ rather than when narrow has even more users.

I'm getting a bit frustrated with how much time I've spent on this, made worse by the fact that I agree with everything everyone's saying and so it's not like I'm frustrated at the review process, just how slow I've been at accomplishing this.
So, before I go down another rabbit hole, here's what I'm thinking:

  • Server emits a new capability narrow-exp-1-escaped (in addition to the current narrow-exp-1, this is not replacing the existing capability)

nit: I *think* the "1" in the name was supposed to be a version number, so the new capability's name would be narrow-exp-2.

  • wireprototypes's map will change these items from csv to csv.escaped
  • Compatible clients will detect this capability from the server and send items of type csv.escaped during getbundle with keys like <previousname>.escaped (ex: include.escaped). If the server doesn't support csv.escaped, the client sends with the old names (unescaped).
  • The escaping will be urllibcompat.quote
  • The server will strip the .escaped suffix on the keys, split on comma, and urllibcompat.unquote the individual items
  • I'm *not* expecting to do anything about \ -> / conversion.

This all sounds good to me.

Since these are part of getbundle, I haven't found a way of doing this that's not one of:

  • a custom escaping mechanism that's backwards compatible
  • adding a capability and renaming the keys that are sent (so the server can tell when it needs to unescape)
  • having the client always send duplicate items (i.e. send include and include.escaped). I'm not even sure that older servers would tolerate receiving keys they aren't expecting.
  • having the client only escape when necessary (i.e. it includes a comma), and then always send the path as include.escaped (which runs into the problem of old servers rejecting).
  • having the server always unescape and the client always escape. This breaks the server's ability to interact with older clients that aren't escaping (which we'll need to support for at least a week or two).

For the non-getbundle parts (I think the wireproto command is 'widen'), we can easily make a widen2 or something, but it's probably easier to just keep the same command name and do the same thing as in getbundle: detect the capability, send as foo.escaped if supported.

Maybe no one uses the "widen" command so we don't even need to worry about compatibility there?

  • Server emits a new capability narrow-exp-1-escaped (in addition to the current narrow-exp-1, this is not replacing the existing capability)

nit: I *think* the "1" in the name was supposed to be a version number, so the new capability's name would be narrow-exp-2.

Yes, I had assumed that as well. This isn't really a new version of the protocol, though, just a minor tweak, and it's primarily to the 'csv' type used in getbundle (see the current version of this patch that adds 'qcsv' for the actual locations it's used). Honestly I went back and forth between announcing it as getbundle-csv-escaped and something related to narrow (and ended up on narrow, as you see, since while it's generically useful besides narrow, nothing else needs it today, and future things wouldn't need this to be announced forever - they'll always have used foo.escaped and been transmitted as escaped).

Maybe no one uses the "widen" command so we don't even need to worry about compatibility there?

I don't know how often it's used :) I just know that there's something *not* getbundle-related in narrowwirepeer.py (looks like it's called narrow_widen that needed to be modified or else the tests wouldn't pass. I honestly don't even know if we're using it internally at Google right now. If not, that's fewer things to change, which I'm OK with :)

  • I'm *not* expecting to do anything about \ -> / conversion.

So would there be some interoperability issue between Windows and not-Windows if paths aren't pconverted, if paths can also come from the server as Martin mentioned? Is there anything here that makes it more difficult to pconvert in the future? (I assume it only came up in the first place to allow the custom escaping. I understand your frustration, so I'm not looking to sign you up for more work. But I only know about narrow from a very high conceptual level, so I figure I might as well ask now and save this info for later.)

Since narrow is still experimental, I don't think we should try too hard for backward compatibility. We could introduce a new end-point for a new encoding and drop the old one in a couple of version.

+0, honestly. I won't require it, but I'd really rather we shaved this yak _now_ rather than when narrow has even more users.

I'm getting a bit frustrated with how much time I've spent on this, made worse by the fact that I agree with everything everyone's saying and so it's not like I'm frustrated at the review process, just how slow I've been at accomplishing this.

I know the feeling, thanks a lot for revisiting you original plan.

So, before I go down another rabbit hole, here's what I'm thinking:

  • Server emits a new capability narrow-exp-1-escaped (in addition to the current narrow-exp-1, this is not replacing the existing capability)
  • wireprototypes's map will change these items from csv to csv.escaped
  • Compatible clients will detect this capability from the server and send items of type csv.escaped during getbundle with keys like <previousname>.escaped (ex: include.escaped). If the server doesn't support csv.escaped, the client sends with the old names (unescaped).
  • The escaping will be urllibcompat.quote
  • The server will strip the .escaped suffix on the keys, split on comma, and urllibcompat.unquote the individual items

This looks overall good to me.

  • I'm *not* expecting to do anything about \ -> / conversion.

Does this means the client side is expected to enforce using / as the directory separator ?

Since these are part of getbundle, I haven't found a way of doing this that's not one of:

  • a custom escaping mechanism that's backwards compatible
  • adding a capability and renaming the keys that are sent (so the server can tell when it needs to unescape)
  • having the client always send duplicate items (i.e. send include and include.escaped). I'm not even sure that older servers would tolerate receiving keys they aren't expecting.

It would not work. Server would reject unknown arguments to getbundle.

  • having the client only escape when necessary (i.e. it includes a comma), and then always send the path as include.escaped (which runs into the problem of old servers rejecting).

In my opinion, I don't think we get much benefit of conditional escaping. So keeping things simple seems better.

  • having the server always unescape and the client always escape. This breaks the server's ability to interact with older clients that aren't escaping (which we'll need to support for at least a week or two).

As much as I think we don't need strong BC on narrow because it is experimental, have a couple of version that can still speak to each other is preferable.

For the non-getbundle parts (I think the wireproto command is 'widen'), we can easily make a widen2 or something, but it's probably easier to just keep the same command name and do the same thing as in getbundle: detect the capability, send as foo.escaped if supported.

marmoute requested changes to this revision.Apr 22 2020, 12:11 PM

If I understood the situation correctly, a rewrite in planned.

This revision now requires changes to proceed.Apr 22 2020, 12:11 PM
spectral abandoned this revision.Oct 29 2021, 5:44 PM