The escaping chosen was RFC 4180-like in that it wraps the values that are to be
escaped in double quote (") characters, and doubles any embedded double quote
characters. It differs from RFC 4180 in what is escaped; the escaping is only
applied if the item contains a , character or starts with a " character. RFC
4180 states that " characters embedded in a value that isn't escaped are not
allowed, and it also states that \r and \n characters should trigger an item to
be escaped.
The differences from RFC 4180 are intentional:
- the include and exclude patterns can't contain a , character today (that's what's being fixed), so surrounding them in " characters won't really change anything.
- the include and exclude patterns today can't start with a " character, so by *not* escaping all strings that have an embedded " unless they start with a " character (for future compatibility), we maintain BC.
- the include and exclude patterns aren't multi-line, so \r and \n are irrelevant (and technically allowed in the patterns, though Mercurial is generally unable to reason about filenames that contain newlines).
Alternatives considered:
Full RFC 4180 encoding and decoding:
- this would have been a breaking change for any include/exclude pattern that includes a " in the name if a new client talks to an old server. This is quite unlikely to cause problems in practice, but was easy to avoid.
- this would have required the decoder to handle \r and \n characters specially, which would complicate it significantly.
- the csv module in Python would have provided this to us, but it requires str on python3, and we only have bytes. It's not generically safe to convert these patterns to str, though perhaps roundtripping the string through fsencode and fsdecode would have been acceptable?
Other escaping mechanisms:
- the original version of this change did use a different (home-grown) escaping mechanism. This received pushback during review.
- using \ escaping (\->\\, and , -> \,) is annoying to decode, though overall pretty similar to what we have to do here. It would introduce significantly more changes for cross-platform and cross-version compatibility, however, as a server that encounters a string with \, can't know if the client is escaping a comma, or if the client is old and the path ends in a backslash (among other ambiguities).
- using the existing batch escaping would likely have worked, but would introduce cross-version ambiguities if any path contained a : character followed by one of the four characters used in that escaping mechanism. The author does not know if their deployment has any paths with these character sequences in them, so decided to err on the side of caution.
Let's name it encode_csv_paths now that _ are permitted. It woul dbe easier to read.