Page MenuHomePhabricator

rust-dirstate-status: add first Rust implementation of `dirstate.status`
ClosedPublic

Authored by Alphare on Oct 11 2019, 11:44 AM.

Details

Summary

Note: This patch also added the rayon crate as a Cargo dependency. It will
help us immensely in making Rust code parallel and easy to maintain. It is
a stable, well-known, and supported crate maintained by people on the Rust
team.

The current dirstate.status method has grown over the years through bug
reports and new features to the point where it got too big and too complex.

This series does not yet improve the logic, but adds a Rust fast-path to speed
up certain cases.

Tested on mozilla-try-2019-02-18 with zstd compression:

  • hg diff on an empty working copy:
    • c: 1.64(+-)0.04s
    • rust+c before this change: 2.84(+-)0.1s
    • rust+c: 849(+-)40ms
  • hg commit when creating a file:
    • c: 5.960s
    • rust+c before this change: 5.828s
    • rust+c: 4.668s
  • hg commit when updating a file:
    • c: 4.866s
    • rust+c before this change: 4.371s
    • rust+c: 3.855s
  • hg status -mard
    • c: 1.82(+-)0.04s
    • rust+c before this change: 2.64(+-)0.1s
    • rust+c: 896(+-)30ms

The numbers are clear: the current Rust dirstatemap implementation is super
slow, its performance needs to be addressed.
This will be done in a future series, immediately after this one, with the goal
of getting Rust to be at least to the speed of the Python + C implementation
in all cases before the 5.2 freeze. At worse, we gate dirstatemap to only be used
in those cases.

Cases where the fast-path is not executed:

  • for commands that need ignore support (status, for example)
  • if subrepos are found (should not be hard to add, but winter is coming)
  • any other matcher than an alwaysmatcher, like patterns, etc.
  • with extensions like sparse and fsmonitor

The next step after this is to rethink the logic to be closer to
Jane Street's Valentin Gatien-Baron's Rust fast-path which does a lot less
work when possible.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Alphare created this revision.Oct 11 2019, 11:44 AM
Alphare edited the summary of this revision. (Show Details)Oct 11 2019, 6:05 PM
Alphare edited the summary of this revision. (Show Details)Oct 11 2019, 6:08 PM

I've slowly worked my way through walk_explicit() so far, but I had a few comments there, so I'll let you take a look at those first while I take a break.

rust/hg-core/src/dirstate/status.rs
23

Does this need to be public?

23–29

I'm completely new to Rust (just finished going through the tutorial), so please excuse naive questions...

What's the advantage of splitting up the definition of P like this? Why not constrain it as AsRef<Path> + Sync either on line 22 or line 28?

34

Looks like this will end up getting returned. Do we want that? It seems surprising to me.

37–39

Isn't this the same as vec!(HgPathBuf::new())? Do we dislike the vec! macro? Or macros in general?

37–39

What is this case about? If no files are specified, we want to return stat information for the (repo) root directory? That seems surprising. Does the caller depend on that? Does it have to?

41

It already is a vector, so no need to call .to_vec() here?

44

Please add a comment explaining what the type is. What does the outer Result mean? What does the Option mean? What are the two HgPathBufs? What does the inner Result mean?

45

Do we want to keep the trailing comma?

55–56

Do you think it would be easier to read and reason about this whole function if we did up to here in one step and the symlink_metadata() in a separate step after? By "step", I mean either in a separate .map() or in a separate par_iter(). I think it might be easier to follow if any errors from hg_path_to_path_buf() were checked earlier.

56

It doesn't matter yet, but should this be normalized instead of filename?

64

https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.collect talks about using a "turbofish". I don't know how specific you *want* to be about the type here (do you want all the details on line 44? If not, you can do .collect::<std::io::Result<Vec<_>>>(); here instead.

66

The tutorial seemed to put the ? where the expression was defined, i.e. at the end of line 63 here. Seems generally better to me so the variable gets a simpler type and so it's only evaluated once even if the variable is used in multiple places. (Maybe changing it requires the turbofish?)

67

Are we going to use the second value (the unnormalized filename) later? Otherwise I'd prefer to remove it from the tuple. Actually, I'd prefer to do that now anyway unless you already have patches that start using it.

79–81

This looks just like the is_dir() case. Should we merge them (i.e. just check if stat.is_file() { ... } else { ... })?

84

Are we going to do more here later? If not, we can just drop the Option-wrapping in stats_res (perhaps by adding a .filter() step to the iteration).

85

Are we going to care about the specific error type/message soon? Otherwise, could we replace it by an Option?

85

Call it normalized here too for consistency (and normalized is clearer than nf)

Alphare marked 3 inline comments as done.Oct 14 2019, 6:05 AM

@martinvonz First of all, thanks a lot for starting the Rust reviews, your feedback is very helpful.

For most of your comments, I tried cutting out the parts of the implementation that were not needed yet but did not go far enough. I'll send a new version in a few minutes.

rust/hg-core/src/dirstate/status.rs
23

It does not, I used to export it to Python to test it.

23–29

I added the where clause well after defining the signature, it's indeed better to do either one of the two syntax.

34

It gets removed by the caller. I tried to replicate Python code to get close to feature-parity, but the more I dive into this, the less convinced I am that the Python implementation should really be trusted. I just removed it completely since it has no real effect on the current code.

37–39

It's the same and much cleaner, I just forgot about using this macro since most other collections don't have one.

37–39

I removed this block of code for the same reason as the sentinel above, it's not useful yet, and it might never be.

44

Although the inner type should be explained (which I did in my next patch), most of the time Result is easily identifiable by looking at the different ? within the associated scope, which - I think - makes it not really worth it to comment.

45

No. You would think that rustfmt should remove those, but apparently not.

55–56

For performance reasons I don't want to do any more loops than I need to. I'm already resorting to a sequential for loop right after because I haven't found an easy way of collecting into 3 different collections in parallel.
Overall, I think that the lack of comments is the real issue, this is not too complex.

56

Good catch!

64

I figured that having the full type written out was more helpful to understand the context. I still do, but I need comments as well. You're right that collecting into Result<Vec<_>> is cleaner and compiles just as well however. :)

I like to avoid the turbofish when possible because I think that it's harder to read that a normal variable type declaration.

66

It is indeed preferable to put the ? where the expression is defined, however in certain cases such as this one, the type inference is not good enough. Here, we're collecting into a Result; changing it to collect? into a Vec causes the compiler to ask us for type annotations.

I could also have a line right after to do let stats_res = stats_res?; but I don't think it's really useful since we're only using it once.

84

Collecting into a Result prevents us from doing a filter_map during collect. Implementing a ParallelIterator adapter that would do a filter_map with Result handling is possible, but would require more work.

In this specific case, as I eluded to above, I just removed the sentinel value so the question is not relevant anymore.

85

The symlink_metadata returns a Result, we don't care about the error here. It's the Python equivalent of a try except or try finally fallback.

Alphare updated this revision to Diff 17132.Oct 14 2019, 6:15 AM
Alphare updated this revision to Diff 17133.Oct 14 2019, 6:23 AM
yuja added a subscriber: yuja.Oct 14 2019, 11:21 AM

Just quickly scanned. Not reviewed the core logic.

+/ Get name in the case stored in the filesystem
+
/ The name should be relative to root, and be normcase-ed for efficiency.
+/
+
/ Note that this function is unnecessary, and should not be
+ called, for case-sensitive filesystems (simply because it's expensive).
+
/ The root should be normcase-ed, too.
+pub fn filesystem_path<P: AsRef<HgPath>>(root: &HgPath, path: P) -> HgPathBuf {

Unused?

I think path: impl AsRef<HgPath> syntax is easier to follow unless we need
a type variable.

+ // TODO path for case-insensitive filesystems, for now this is transparent
+ root.to_owned().join(path.as_ref())
+}

filesystems_path() sounds like returning a std::path::PathBuf, but
HgPath should be a repo-relative path.

+/ Returns true if path refers to an existing path.
+
/ Returns true for broken symbolic links.
+/// Equivalent to exists() on platforms lacking lstat.
+pub fn lexists<P: AsRef<Path>>(path: P) -> bool {

Unused?

+ if !path.as_ref().exists() {
+ return read_link(path).is_ok();
+ }
+ true
+}

Maybe you want fs::symlink_metadata()?
I don't know which is faster, but the behavior should be closer to lstat.

https://doc.rust-lang.org/std/fs/fn.symlink_metadata.html

+impl HgMetadata {
+ #[cfg(unix)]
+ pub fn from_metadata(metadata: Metadata) -> Self {
+ use std::os::linux::fs::MetadataExt;

unix != linux. Maybe you want std::os::unix::fs::MetadataExt.

+pub fn status<P: AsRef<Path>>(
+ mut dmap: &mut DirstateMap,
+ root_dir: P,
+ files: Vec<HgPathBuf>,
+ list_clean: bool,
+ last_normal_time: i64,
+ check_exec: bool,
+) -> std::io::Result<(Vec<HgPathBuf>, StatusResult)>
+where
+ P: Sync,

P: AsRef<path> + Sync.

In D7058#103954, @yuja wrote:

Just quickly scanned. Not reviewed the core logic.

+/ Get name in the case stored in the filesystem
+
/ The name should be relative to root, and be normcase-ed for efficiency.
+/
+
/ Note that this function is unnecessary, and should not be
+ called, for case-sensitive filesystems (simply because it's expensive).
+
/ The root should be normcase-ed, too.
+pub fn filesystem_path<P: AsRef<HgPath>>(root: &HgPath, path: P) -> HgPathBuf {

Unused?

Indeed, it was for a future patch, removed.

I think path: impl AsRef<HgPath> syntax is easier to follow unless we need
a type variable.

I changed to impl Trait in all relevant places in this patch.

+ // TODO path for case-insensitive filesystems, for now this is transparent
+ root.to_owned().join(path.as_ref())
+}

filesystems_path() sounds like returning a std::path::PathBuf, but
HgPath should be a repo-relative path.

+/ Returns true if path refers to an existing path.
+
/ Returns true for broken symbolic links.
+/// Equivalent to exists() on platforms lacking lstat.
+pub fn lexists<P: AsRef<Path>>(path: P) -> bool {

Unused?

Same as above, removed.

+ if !path.as_ref().exists() {
+ return read_link(path).is_ok();
+ }
+ true
+}

Maybe you want fs::symlink_metadata()?
I don't know which is faster, but the behavior should be closer to lstat.
https://doc.rust-lang.org/std/fs/fn.symlink_metadata.html

+impl HgMetadata {
+ #[cfg(unix)]
+ pub fn from_metadata(metadata: Metadata) -> Self {
+ use std::os::linux::fs::MetadataExt;

unix != linux. Maybe you want std::os::unix::fs::MetadataExt.

Thanks. I also caught a typo in the name of the non-unix impl. It's now split it in two impls.

+pub fn status<P: AsRef<Path>>(
+ mut dmap: &mut DirstateMap,
+ root_dir: P,
+ files: Vec<HgPathBuf>,
+ list_clean: bool,
+ last_normal_time: i64,
+ check_exec: bool,
+) -> std::io::Result<(Vec<HgPathBuf>, StatusResult)>
+where
+ P: Sync,

P: AsRef<path> + Sync.

Alphare updated this revision to Diff 17165.Oct 15 2019, 5:24 AM
Alphare updated this revision to Diff 17167.Oct 15 2019, 5:29 AM
kevincox requested changes to this revision.Oct 15 2019, 9:12 AM
kevincox added inline comments.
rust/hg-core/src/dirstate/status.rs
24

This doesn't appear to be consumed by the function so you should take a &[HgPathBuf]. Or since I don't think you use the path either probably a &[HgPath].

84

I don't follow. can't this .filter_map(...) be replaced with a .filter(...).map(...)? I think this would make the code much more clear.

135

Why are there two pairs of parens?

169

This seems like a very complicated way to write this. Why not separate the match state and do it first. Or put it in an else block? Furthermore why do you have two separate match state? Why not handle them all in one match?

241

Do you need to do this to root_dir? The walk_explicit function seems to only need AsRef<Path> so you should be able to just pass it. You might need to update the argument to be a reference.

rust/hg-core/src/utils/files.rs
109

Isn't it better to avoid implementing this so that we get a compile error rather than a runtime error?

This revision now requires changes to proceed.Oct 15 2019, 9:12 AM
Alphare marked an inline comment as done.Oct 15 2019, 10:43 AM
Alphare added inline comments.
rust/hg-core/src/dirstate/status.rs
24

Done with &[impl AsRef<HgPath> + Sync] which is more generic.

169

I could indeed probably simplify it. I'll update this part later today (Paris time) .

241

I had to add a Copy trait bound for it to work, otherwise it would be moved by walk_explicit before usage in stat_dmap_entries

rust/hg-core/src/utils/files.rs
109

It is indeed a much better idea. I'll also send a patch for the other uses of unimplemented!() in the rest of the codebase

Alphare updated this revision to Diff 17176.Oct 15 2019, 10:44 AM
kevincox accepted this revision.Oct 15 2019, 10:48 AM

Thanks, looks good.

martinvonz added inline comments.Oct 15 2019, 1:11 PM
rust/hg-core/src/dirstate/status.rs
25

Do we need to borrow it mutably?

50

Should this be stat.file_type().is_file() || stat.file_type().is_symlink() as below?

157–161

This nested match was hard for me to read. I think it would be simpler if you merged this one with the next one inside a simple None => match block, like this:

None => {
    match state {
        EntryState::Normal
        | EntryState::Merged
        | EntryState::Added => {
            deleted.push(filename);
        },
        EntryState::Removed => {
            removed.push(filename);
        },
        _ => {}
    }
}
Alphare added inline comments.Oct 15 2019, 3:12 PM
rust/hg-core/src/dirstate/status.rs
25

We don't anymore, thanks. I'll be reworking the overly complicated match expression within the next hour.

Alphare updated this revision to Diff 17184.Oct 15 2019, 3:47 PM
Alphare updated this revision to Diff 17196.Oct 16 2019, 4:01 AM
martinvonz added inline comments.Oct 16 2019, 1:48 PM
rust/hg-core/src/dirstate/status.rs
104–111

What does this case mean? I.e. why would we get "NotADirectory"? Oh, I guess if e.g. dir/file is in the dirstate, but dir is now a file? Could you add a comment here saying that that is what this is about?

Alphare updated this revision to Diff 17252.Oct 16 2019, 2:56 PM
martinvonz accepted this revision.Oct 16 2019, 5:16 PM
This revision is now accepted and ready to land.Oct 16 2019, 5:16 PM