This is an archive of the discontinued Mercurial Phabricator instance.

rust-hgpath: add HgPath and HgPathBuf structs to encapsulate handling of paths
ClosedPublic

Authored by Alphare on Aug 29 2019, 10:40 AM.

Details

Summary

This change is a direct consequence of this discussion on the mailing list:
https://www.mercurial-scm.org/pipermail/mercurial-devel/2019-August/133574.html

The implementations of HgPath and HgPathBuf are, for the most part, taken
directly from OsStr and OsString respectively from the standard library.

What this change does *not* yet do is implement the Windows MBCS to WTF8
conversion, but it lays the basis for a very flexible interface for paths.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Alphare created this revision.Aug 29 2019, 10:40 AM
Alphare updated this revision to Diff 16338.Aug 30 2019, 5:27 AM
kevincox requested changes to this revision.Aug 30 2019, 5:38 AM

It looks good overall. I just would like to have a bit more strict definition of what an HgPath can contain and preferably some init-time validation.

rust/hg-core/src/utils/hg_path.rs
20

You should probably add a note about what a valid path is. My initial assumption is "A sequence of non-null characters. / separates directories, consecutive / are forbidden". It would be good to write this down, provide an is_valid function and add a debug_assert! on creation/modification to check that everything is as expected.

44

Please call this bytes() to highlight that it iterates over bytes, not characters or directory components.

137

I would also call this HgPathByteIterator

205

This is almost certainly unnecessary.

This revision now requires changes to proceed.Aug 30 2019, 5:38 AM
Alphare updated this revision to Diff 16339.Aug 30 2019, 10:38 AM
Alphare marked 3 inline comments as done.Aug 30 2019, 10:41 AM
Alphare added inline comments.
rust/hg-core/src/utils/hg_path.rs
205

It is useful for some methods already. The more we add, the more it is useful.

kevincox accepted this revision.Aug 30 2019, 11:13 AM
yuja added a subscriber: yuja.Aug 31 2019, 8:20 AM

+#[derive(Eq, Ord, PartialEq, PartialOrd, Debug, Hash)]
+pub struct HgPath {
+ inner: [u8],
+}

I found std::path::Path has a well-written inline comment about
the unsafe pointer cast. Let's copy it so we won't introduce a
memory issue.

yuja added a comment.Aug 31 2019, 8:20 AM

(resend to get around Phabricator's bad email processing)

+ pub fn as_bytes(&self) -> &[u8] {
+ unsafe { &*(&self.inner as *const _ as *const [u8]) }
+ }

&self.inner. No unsafe trick is needed.

+ / Checks for errors in the path, short-circuiting at the first one.
+
/ Useful to get finer-grained errors. To simply check if the path is
+ /// valid, use is_valid.
+ pub fn check_state(&self) -> Result<(), HgPathError> {
+ if self.len() == 0 {
+ return Ok(());
+ }
+ let bytes = self.as_bytes();
+ let mut previous_byte = None;
+
+ if bytes[0] == b'/' {
+ return Err(HgPathError::LeadingSlash);
+ }
+ for (index, byte) in bytes.iter().enumerate() {
+ match byte {
+ 0 => return Err(HgPathError::ContainsNullByte(index)),
+ b'/' => {
+ if previous_byte.is_some() && previous_byte == Some(b'/') {

previous_byte.is_some() is redundant.

+ return Err(HgPathError::ConsecutiveSlashes(index));
+ }
+ }
+ _ => (),
+ };
+ previous_byte = Some(*byte);
+ }
+ Ok(())
+ }
+ pub fn is_valid(&self) -> bool {
+ self.check_state().is_ok()
+ }

I'm not sure if this HgPath type is designed to be always validated or not.
If yes, these functions can be private, and we'll need a conservative
HgPath::from_bytes(s) -> Result<&HgPath> and unsafe fn from_bytes_unchecked(s)
functions. If not, maybe HgPath::new() shouldn't validate the bytes at all.

I feel debug_assert() implies that the condition must always be met,
so the caller has to guarantee that.

+impl Index<usize> for HgPath {
+ type Output = u8;
+
+ fn index(&self, i: usize) -> &Self::Output {
+ &self.inner[i]
+ }
+}

[...]

Better to not implement path[...] since it can be considered returning
a path component at the specified index. We can always do
path.as_bytes()[...] explicitly.

+#[derive(Debug)]
+pub struct HgPathBytesIterator<'a> {
+ path: &'a HgPath,
+}

Can't we simply implement it as a slice::Iter<> wrapper?

+ pub fn as_vec(&self) -> &Vec<u8> {
+ &self.inner
+ }
+ pub fn as_ref(&self) -> &[u8] {
+ self.inner.as_ref()
+ }

Do we need as_vec()? I think as_ref() can be used in most cases.

+impl<T: ?Sized + AsRef<HgPath>> From<&T> for HgPathBuf {
+ fn from(s: &T) -> HgPathBuf {
+ let new = s.as_ref().to_owned();
+ debug_assert_eq!(Ok(()), new.check_state());
+ new
+ }

No need to do new.check_state()?
s should be already _checked_ if the type supports s.as_ref().

+impl TryInto<PathBuf> for HgPathBuf {
+ type Error = std::io::Error;
+
+ fn try_into(self) -> Result<PathBuf, Self::Error> {
+ let os_str;
+ #[cfg(unix)]
+ {
+ use std::os::unix::ffi::OsStrExt;
+ os_str = std::ffi::OsStr::from_bytes(&self.inner);
+ }
+ #[cfg(windows)]
+ {
+ TODO: convert from Windows MBCS (ANSI encoding) to WTF8.
+
Perhaps, the return type would have to be Result<PathBuf>.
+ unimplemented!();
+ }
+
+ Ok(Path::new(os_str).to_path_buf())
+ }

As I said, the encoding of HgPathBuf will be either MBCS or UTF-8 on
Windows, and the encoding will be selected at runtime. We can't define
a single conversion between HgPathBuf and PathBuf/OsString.

In D6773#99433, @yuja wrote:

+#[derive(Eq, Ord, PartialEq, PartialOrd, Debug, Hash)]
+pub struct HgPath {
+ inner: [u8],
+}

I found std::path::Path has a well-written inline comment about
the unsafe pointer cast. Let's copy it so we won't introduce a
memory issue.

I'm not 100% sure which comment you're referring to, sorry.

In D6773#99434, @yuja wrote:

I'm not sure if this HgPath type is designed to be always validated or not.
If yes, these functions can be private, and we'll need a conservative
HgPath::from_bytes(s) -> Result<&HgPath> and unsafe fn from_bytes_unchecked(s)
functions. If not, maybe HgPath::new() shouldn't validate the bytes at all.
I feel debug_assert() implies that the condition must always be met,
so the caller has to guarantee that.

I feel like both the performance implications and the ergonomics of checking everything are going to be a problem. @kevincox's idea of using debug_assert! seemed like a decent enough check to ease development. These functions could also be used during conversions to OsString and the like.

+impl Index<usize> for HgPath {
+ type Output = u8;
+
+ fn index(&self, i: usize) -> &Self::Output {
+ &self.inner[i]
+ }
+}

[...]
Better to not implement path[...] since it can be considered returning
a path component at the specified index. We can always do
path.as_bytes()[...] explicitly.

I'd leave Index<RangeFull>since it it not ambiguous and is a neat shorthand.

yuja added a comment.Aug 31 2019, 10:17 PM
> I found `std::path::Path` has a well-written inline comment about
> the unsafe pointer cast. Let's copy it so we won't introduce a
> memory issue.
I'm not 100% sure which comment you're referring to, sorry.

This.

https://github.com/rust-lang/rust/commit/740f8db85572aef58d0734fc60bc2b54862ebbb0#diff-120c5906285d41e976dc4176265d7cbaR1754

Apparently it was added quite recently.

> I'm not sure if this `HgPath` type is designed to be always validated or not.
> If yes, these functions can be private, and we'll need a conservative
> `HgPath::from_bytes(s) -> Result<&HgPath>` and `unsafe fn from_bytes_unchecked(s)`
> functions. If not, maybe `HgPath::new()` shouldn't validate the bytes at all.
> I feel `debug_assert()` implies that the condition must always be met,
> so the caller has to guarantee that.
I feel like both the performance implications and the ergonomics of checking everything are going to be a problem. @kevincox's idea of using `debug_assert!` seemed like a decent enough check to ease development. These functions could also be used during conversions to `OsString` and the like.

@kevincox said "it would be good to [...] add a debug_assert! on
creation/modification to check that everything is as expected."
IIUC, his idea is that the debug_assert!() condition must always be met,
which is the guarantee that the HgPath type provides. In order to ensure
that, we'll have to validate all paths read from dirstate, for example,
prior to casting to HgPath type. Otherwise debug_assert!() would fail,
meaning the implementation had a bug.

So, my question is, do we really need such strong guarantee by default?

Instead, we could lower the bar to "HgPath is merely a path represented in
platform-specific encoding though it's supposed to be a canonical path",
and add e.g. is_canonicalized() to check if the path meets the strong
requirement.

>> +impl Index<usize> for HgPath {
>> +    type Output = u8;
>> +
>> +    fn index(&self, i: usize) -> &Self::Output {
>> +        &self.inner[i]
>> +    }
>> +}
>
> [...]
> Better to not implement `path[...]` since it can be considered returning
> a path component at the specified index. We can always do
> `path.as_bytes()[...]` explicitly.
I'd leave `Index<RangeFull>`since it it not ambiguous and is a neat shorthand.

What shorthand would it be used for?

I feel it's weird only path[..] is allowed.

@kevincox said "it would be good to [...] add a debug_assert! on
creation/modification to check that everything is as expected."
IIUC, his idea is that the debug_assert!() condition must always be met,
which is the guarantee that the HgPath type provides. In order to ensure
that, we'll have to validate all paths read from dirstate, for example,
prior to casting to HgPath type. Otherwise debug_assert!() would fail,
meaning the implementation had a bug.
So, my question is, do we really need such strong guarantee by default?

I indeed do not think we do, for the reasons I gave in my previous comment.

Instead, we could lower the bar to "HgPath is merely a path represented in
platform-specific encoding though it's supposed to be a canonical path",
and add e.g. is_canonicalized() to check if the path meets the strong
requirement.

Is it really represented in a platform-specific encoding or it just treated as ASCII until file system operations are needed?
I would use the strong check when converting to platform-specific types like OsStr or Path.

>> +impl Index<usize> for HgPath {
>> +    type Output = u8;
>> +
>> +    fn index(&self, i: usize) -> &Self::Output {
>> +        &self.inner[i]
>> +    }
>> +}
>
> [...]
> Better to not implement `path[...]` since it can be considered returning
> a path component at the specified index. We can always do
> `path.as_bytes()[...]` explicitly.
I'd leave `Index<RangeFull>`since it it not ambiguous and is a neat shorthand.

What shorthand would it be used for?
I feel it's weird only path[..] is allowed.

This is useful in Deref for example: &self[..]. But I can remove it if needed.

yuja added a comment.Sep 1 2019, 9:25 AM
> @kevincox said "it would be good to [...] add a debug_assert! on
> creation/modification to check that everything is as expected."
> IIUC, his idea is that the `debug_assert!()` condition must always be met,
> which is the guarantee that the HgPath type provides. In order to ensure
> that, we'll have to validate all paths read from dirstate, for example,
> prior to casting to HgPath type. Otherwise `debug_assert!()` would fail,
> meaning the implementation had a bug.
> So, my question is, do we really need such strong guarantee by default?
I indeed do not think we do, for the reasons I gave in my previous comment.

Okay, let's remove the debug_asserts, and update the documentation
accordingly.

> Instead, we could lower the bar to "HgPath is merely a path represented in
> platform-specific encoding though it's supposed to be a canonical path",
> and add e.g. `is_canonicalized()` to check if the path meets the strong
> requirement.
Is it really represented in a platform-specific encoding or it just treated as ASCII until file system operations are needed?

It's a chunk of bytes. I just said "platform-specific encoding" to clarify
that the encoding might be different from OsStr. Sorry for confusion.

I would use the strong check when converting to platform-specific types like `OsStr` or `Path`.

Perhaps, that's similar to what the pathauditor would do in Python.

>>   >> +impl Index<usize> for HgPath {
>>   >> +    type Output = u8;
>>   >> +
>>   >> +    fn index(&self, i: usize) -> &Self::Output {
>>   >> +        &self.inner[i]
>>   >> +    }
>>   >> +}
>>   >
>>   > [...]
>>   > Better to not implement `path[...]` since it can be considered returning
>>   > a path component at the specified index. We can always do
>>   > `path.as_bytes()[...]` explicitly.
>>   I'd leave `Index<RangeFull>`since it it not ambiguous and is a neat shorthand.
>
> What shorthand would it be used for?
> I feel it's weird only `path[..]` is allowed.
This is useful in `Deref` for example: `&self[..]`. But I can remove it if needed.

Maybe it can be written as HgPath::new(&self.inner).

Alphare updated this revision to Diff 16364.Sep 2 2019, 4:41 AM
Alphare updated this revision to Diff 16366.Sep 3 2019, 4:58 AM
Alphare updated this revision to Diff 16371.Sep 5 2019, 8:01 AM

I would love for this series to be queued. Is there anything I didn't address left to do?

Alphare updated this revision to Diff 16529.Sep 13 2019, 6:06 AM
kevincox accepted this revision.Sep 13 2019, 6:49 AM
kevincox added inline comments.
rust/hg-core/src/utils/hg_path.rs
176

debug_assert check after the mutation.

243

debug_assert check after the mutation.

yuja added a comment.Sep 21 2019, 11:15 PM

The series looks good as a start. Queued, thanks!

+impl ToString for HgPathError {

Nit: std::fmt::Display can be implemented instead, and the HgPathError
can be a std::error::Error.

+impl From<HgPathError> for std::io::Error {
+ fn from(e: HgPathError) -> Self {
+ std::io::Error::new(std::io::ErrorKind::InvalidData, e.to_string())
+ }

Then, the e itself can be embedded in io::Error.

+impl HgPath {
+ pub fn new<S: AsRef<[u8]> + ?Sized>(s: &S) -> &Self {
+ unsafe { &*(s.as_ref() as *const [u8] as *const Self) }
+ }

I moved the declaration of the struct HgPath close to this because it's
so important that HgPath is basically a newtype of [u8].

+ pub fn is_empty(&self) -> bool {
+ self.inner.len() == 0
+ }

s/.len() == 0/.is_empty()/ in flight.

+ pub fn join<T: ?Sized + AsRef<HgPath>>(&self, other: &T) -> HgPathBuf {
+ let mut inner = self.inner.to_owned();
+ if inner.len() != 0 && inner.last() != Some(&b'/') {
+ inner.push(b'/');
+ }
+ inner.extend(other.as_ref().bytes());
+ HgPathBuf::from_bytes(&inner)
+ }

Perhaps, it's simpler to branch by self.is_empty() first.

if is_empty() {
    other.as_ref().to_owned()
} else {
    # concatenate
}

+ / Checks for errors in the path, short-circuiting at the first one.
+
/ This generates fine-grained errors useful for debugging.
+ /// To simply check if the path is valid during tests, use is_valid.
+ pub fn check_state(&self) -> Result<(), HgPathError> {

Nit: the word state seems confusing. I don't have a good naming skill,
but I feel validate() or validate_canonical_form() is better.

+ b'/' => {
+ if previous_byte.is_some() && previous_byte == Some(b'/') {
+ return Err(HgPathError::ConsecutiveSlashes(
+ bytes.to_vec(),
+ index,
+ ));
+ }
+ }

Nit: useless check for previous_byte.is_some()

+ pub fn push(&mut self, byte: u8) {
+ self.inner.push(byte);
+ }

+impl Extend<u8> for HgPathBuf {
+ fn extend<T: IntoIterator<Item = u8>>(&mut self, iter: T) {
+ self.inner.extend(iter);
+ }
+}

path.push() and path.extend(data) seem weird since they don't care
the path separator. Can we remove them?

This revision was not accepted when it landed; it landed in state Needs Review.
This revision was automatically updated to reflect the committed changes.