Page MenuHomePhabricator

rust: fix non-utf8 char in requirements.rs
ClosedPublic

Authored by marmoute on Nov 28 2020, 9:00 AM.

Details

Summary

Apparently Phabricator detect rust/hg-core/src/requirements.rs file as non
utf8 ‽, and mark it as binary. During application it ended up being non-utf8
and this made Rust (and as a result heptapod) very angry::

error: couldn't read hg-core/src/requirements.rs: stream did not contain valid UTF-8
  --> hg-core/src/lib.rs:11:9
   |
11 | pub mod requirements;
   |         ^^^^^^^^^^^^

error: aborting due to previous error

error: could not compile `hg-core`.

The after "fixing", the file content has no character matching the following
regexp:

[^0-9-a-zA-Z /(|).,{}!\[\]:"&=>?_*-;<`'#]

So we should be fine, unless Phabricator does something funny again.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

marmoute created this revision.Nov 28 2020, 9:00 AM

I would recommend pulling this changeset from Mercurial to avoid Phabricator weirdness. It is available using:
hg pull https://foss.heptapod.net/octobus/mercurial-devel --rev d787baae01b5

mharbison72 accepted this revision.Nov 28 2020, 3:09 PM
mharbison72 added a subscriber: mharbison72.

Queued, thanks

This revision is now accepted and ready to land.Nov 28 2020, 3:09 PM
This revision was automatically updated to reflect the committed changes.
SimonSapin added a subscriber: SimonSapin.EditedNov 28 2020, 6:42 PM

I’m pretty sure that this U+2019 RIGHT SINGLE QUOTATION MARK character was correctly encoded in UTF-8. Calling it "non-UTF-8" is incorrect, although it is indeed non-ASCII. The regex [^0-9-a-zA-Z /(|).,{}!\[\]:"&=>?_*-;<\'#]` is testing for ASCII "printable" characters.

If source files really should be entirely ASCII, that is news to me but I can try to do keep them that way. But I strongly feel that requirement should be revisited.

(By the way, marmoute’s message above uses an U+203D INTERROBANG non-ASCII character.)

I’m pretty sure that this U+2019 RIGHT SINGLE QUOTATION MARK character was correctly encoded in UTF-8.

Can you check out f3a001cfe4b8 in your local repo to see? (At least that's what's listed as the commit in the commit tab, when clicking on diff1 in the history tab. If that isn't it, we can walk backward using the predecessors() revset to find the last predecessor without the differential line in the commit message.)

If source files really should be entirely ASCII, that is news to me but I can try to do keep them that way. But I strongly feel that requirement should be revisited.

To my knowledge, binary and non utf-8 stuff should make the round trip unharmed because the patch that is stored is different from the files listed in the web UI. It just makes it so that you can't easily view it on the web UI. IDK the details, but the reason it is marked binary is this:

https://www.mercurial-scm.org/repo/hg/file/tip/hgext/phabricator.py#l856

I apologize, in my previous message I misunderstood what was going on.

My editor is configured as UTF-8 and the typographical apostrophe on my keyboard layout is U+2019 which is [0xE7, 0xA8, 0x9A] in UTF-8. On https://phab.mercurial-scm.org/D9398?vs=23670&id=23692#toc the History tab of Phabricator shows two diffs. I don’t know why, I did not edit this patch after first submitting it. The first diff correctly shows the apostrophe. The difference between these two diffs shows that the apostrophe and the following lower case letter t have been replaced with CJK ideogram 稚, U+7A1A, [0xE7 0xA8 0x9A] in UTF-8.

My local repository does not have a commit f3a001cfe4b8. It has a2eda1ff22aa, where the apostrophe was replaced with a single 0x92 byte (and the letter t following it is there).

I’m not very familiar with Phabricator and I don’t know the details of how data flows within it and hg phabsend and back to repositories, bit it looks like there are at least two bugs somewhere mangling non-ASCII content.

Kwan added subscribers: pulkit, Kwan.Nov 30 2020, 5:29 PM

My editor is configured as UTF-8 and the typographical apostrophe on my keyboard layout is U+2019 which is [0xE7, 0xA8, 0x9A] in UTF-8. On https://phab.mercurial-scm.org/D9398?vs=23670&id=23692#toc the History tab of Phabricator shows two diffs. I don’t know why, I did not edit this patch after first submitting it.

This is usual and expected, the second diff is the landed version. All DREVs get closed by daemon and get a new diff reflecting the landed commit when the commit gets picked up by Diffusion.

The first diff correctly shows the apostrophe. The difference between these two diffs shows that the apostrophe and the following lower case letter t have been replaced with CJK ideogram 稚, U+7A1A, [0xE7 0xA8 0x9A] in UTF-8.
My local repository does not have a commit f3a001cfe4b8. It has a2eda1ff22aa, where the apostrophe was replaced with a single 0x92 byte (and the letter t following it is there).

0x92 is the RIGHT SINGLE QUOTATION MARK in Windows-1252, which is what the requirements.rs file gets detected as prior to this commit. I don't know how phab mangles that into the ideogram, but locally the quotation mark is there, just not a UTF-8 one.

Given that if I hack phabread to return the old diff (s/max/min/ x2 in readpatch()) and import D9398 fresh on top of ead435aa5294 everything seems to go okay (on Linux) I think the next point of enquiry would be what system is @pulkit on and how did they import that patch (hg phabread 9398 | hg import -, hg phabimport), since obslog seems to show they were the one who did so. It they are on Windows I could see that maybe mangling it if they did the old piping method.