( )⚙ D2057 rust implementation of hg status

This is an archive of the discontinued Mercurial Phabricator instance.

rust implementation of hg status
Needs RevisionPublic

Authored by Ivzhh on Feb 5 2018, 5:31 PM.
Tags
None
Subscribers

Details

Reviewers
kevincox
baymax
Group Reviewers
hg-reviewers
Summary
  • implementation of revlog v1
  • parsing changelog, manifest, dirstate
  • use .hgignore in repo root
  • comparable performance with current hg status (Linux & Mac: slightly faster, Windows: slightly slower)
  • use hg r-status as subcommand, in this case, bypass python engine

Diff Detail

Repository
rHG Mercurial
Branch
phab-submit-D2057-2018-02-05 (bookmark) on default (branch)
Lint
No Linters Available
Unit
No Unit Test Coverage

Event Timeline

Ivzhh created this revision.Feb 5 2018, 5:31 PM

I'd be curious to see what @indygreg has to say about this, maybe wait on his input before doing any work in response to my feedback?

I do wonder if we should have at least three crates:

  1. hgcli
  2. libmercurial
  3. hgcext

The first one would be the command-line entry point, the last could use the cpython API, and libmercurial would be "pure rust" and open the door to eventually having a libhg or something that exports C functions and would be suitable for cffi and linking into other binaries?

rust/hgcli/src/hgext/base85.rs
22 ↗(On Diff #5238)

I think I'd like to separate things a bit more and have a Python-free module, and then a glue module that we can use to call into the pure Rust. Part of the reason is that in my perfect world we won't use the cpython crate for speedups so they can be used from pypy as well. Separating them at least makes it easier to have an extern "C" version of the method that can be used from cffi instead of only through the CPython API.

(Not sure what opinions others have. It's likely that I'll attempt this approach in the near future as part of a continued attempt to speed up hg diff.)

Ivzhh added a comment.EditedFeb 6 2018, 9:54 AM

I am open to the three-crates plan. Oirginally I have hgcli and hgext separately, and I was planning to use CFFI mode. I am a pypy user too, so I will be willing to provide a python C API free crate for pypy and others.

Yes, we should definitely split things into multiple crates. Small, narrowly-focused crates does seem to be the Rust way, after all.

hgcli should be for things specific to the Rust implementation of hg. I think this can also include the feature set of chg (once we've ported chg to Rust).

I definitely support separating the "pure Rust" from the "Python Rust" via a crate boundary. It is generally useful to have Rust that isn't bound to Python because it will facilitate reuse outside of Python contexts. For example, someone could implement a Mercurial wire protocol server in pure Rust without needing to worry about Python. Of course, we're likely to encounter areas where we really want tight coupling in order to achieve optimal performance in Python. So we may have to design APIs on the pure Rust side to facilitate CPython use. I'm OK with that.

As for how many crates to have, I don't have super strong opinions. I could see us putting every little component/subsystem in its own crate. I could also see us putting everything in one large crate. I don't think it is worth deciding at this early juncture. API design and ability to be reused outside its originally intended purpose is the important property to strive for. I think that has more to do with how the code is authored rather than which crates things are in.

A missing piece of this patch is the build system and module loader integration. We have a module policy that dictates which implementation of a Python module we use. We probably want to introduce a rust policy that uses Rust-based modules where available and falls back to the cext modules/policy if a Rust module isn't available. We also need to figure out how to integrate Rust into setup.py. But I think the build system bit can be deferred until we're actually ready to ship Rust, which is still a bit of ways off. I'm happy for the workflow to be run cargo in order to load Rust modules for the time being. But if you can implement Makefile and/or setup.py integration to build these Rust extensions, that would be awesome.

Ivzhh added a comment.Feb 6 2018, 9:24 PM

Sure, thank you for the comments! I can definitely prepare makefile and setup.py to make the building process work with rust part. I am planning to change the policy.py module to support and try to load rust modules and run all the tests. I will submit a new patch after finishing these two tasks.

After reading wiki/OxidationPlan again, I plan to change to cffi for better compatibility (pypy and others), and try to build algorithms in pure rust. Shall I wait till migrating to cffi based solution now and resubmit this patch with all three changes (building, testing, and cffi)?

Thank you!

We generally prefer that patches to Mercurial be small and do a single thing. This makes it easier to review and understand changes, since each change can be evaluated in isolation. If you submit changesets together using hg phabsend, they automatically show up as a stack in Phabricator. And if changesets at the bottom of the stack are ready to land, we generally land those without waiting for the entire stack to land. This enables forward progress to be made and this is generally better for everyone than waiting until a series of commits is perfect before adding any of them.

What that means is you should ideally split this work into smaller parts. For example:

  1. Add the pure Rust code/crate
  2. Add the Python Rust code/crate
  3. Build system / module policy changes

I'm not sure of the order of things though. Since this is the first Rust extension, it's not clear what needs to be implemented in what order. I'm fine looking at a large commit if things are too tightly coupled to separate. But you should strive to make smaller commits.

Ivzhh added a comment.Feb 6 2018, 9:53 PM

Thank you @indygreg for your detailed explanation!

I understand the process now, and I will go back reading the developer's guide thoroughly again. I will try my best to provide a relatively clean stack of patches.

Thank you for you time!

I agree with the splitting comments :) In fact there might already be a base85 crate which can be used: https://docs.rs/zero85. Either way I'll hold off on the review, feel free to ping me when you are ready for me to take a look.

What would be the advantage of taking this? Since we already have the C implementation, it's not likely to gain us any performance. On the other hand, it might make a good test case for integrating Rust and Python, finding the right API boundaries and experimenting with different approaches, precisely because we already have a C implementation. @indygreg @durin42 what are your thoughts about it?

Ivzhh added a comment.Feb 8 2018, 1:12 AM

As the author of this patch, actually I have the same concern. I started to translate base85 as baby steps to find a way of integrating rust and cpython, on my side, Today I modify setup.py, policy.py and makefile to run hg's test suit with the new base85. For myself, it is only proof of concept.

Maybe I should take another way: translate more python modules into CFFI-style, and let CFFI call rust implementation. And gradually change more implementations of python modules with corresponding cffi-style, while keep the python interface the same. My own hope is the rust routines will be able to call each other and eventually run some basic tasks without calling python part. And the rust still lazily provides info to python interface for extensions etc.

I am exploring this way now, and hope the findings will be useful for community to make decision.

Thank you all for the comments!

To be honest, we're not yet sure what we'll decide for the Python -> Rust bridge. The problem is summarized in the Rust <=> Python Interop section on https://www.mercurial-scm.org/wiki/OxidationPlan.

I suspect at some level we'll need a CPython extension for CPython for performance reasons (especially for high volume function calls). PyPy obviously uses CFFI. I think the ideal outcome is we can write Rust that exposes a C API and use CFFI natively on PyPy and something like cbindgen + Milksnake to auto-generate a CPython extension that acts as a wrapper around the C API exposed by Rust. I'm not sure if anyone has invented this exact wheel yet. If not, it's probably faster to use rust-cpython. Maybe several months from now we have enough Rust and maintaining rust-cpython is painful enough that we pursue the auto-generated CPython extension route.

What I'm trying to say is you have a green field to explore! But at this juncture, perfect is the enemy of done. We'll be happy with any forward progress, even failed experiments.

Ivzhh added a comment.Feb 8 2018, 2:35 AM

Thank you @indygreg!

The OxidationPlan is my best reference when I started to make a move, and this thread is even more helpful. I am really interested in exploring this ;-) In 2014 I was trying to change the hg backend storage to Postgres, a silly and failed experiment.

Anyway, I will save everyone's time and stop talking. I will come back later with a more meaningful implementation.

Ivzhh updated this revision to Diff 6724.Mar 8 2018, 1:30 AM
  • merge with stable
  • translate base85.c into rust code
  • move hgbase85 into independent module
  • add hgstorage crate
  • hg status implementation in rust
Ivzhh retitled this revision from translate base85.c into rust code to rust implementation of hg status.Mar 8 2018, 1:33 AM
Ivzhh edited the summary of this revision. (Show Details)
Ivzhh added a comment.Mar 8 2018, 1:42 AM

Hi all,

Based on the discussion a few weeks ago, I come up with a solution for review and discussion. After reading the Oxidation plan, the first thought is to bypass python engine and current plugin system IFF on request. If user (maybe background checker of IDE) request r-* subcommands, then hg gives rust implementations instead of python's. So I try to make hg r-status as fast as possible. The submitted version has comparable performance (as an example of the performance, not evidence, on my MacBook, in hg's own repo, hg r-status 150ms, and hg status 220ms). I am using CodeXL to profile the performance, and plan to use Future.rs to make the loading parallel and maybe 30ms faster.

The implementation originates from hg python implementation, because the python version is really fast. I tried to split into small changes, however, I eventually to combine all hgstorage module as one commit.

Thank you for your comments!

First of all - wow! Thanks for writing all this code. There's definitely a lot to work with. And work with it we will!

This is definitely too big to review as one commit. If you could do *any* work to split it up, it would be greatly appreciated. I'd focus on the pure Rust pieces first. Everything needed to open revlogs would be great!

You may find our custom Phabricator extensions (linked from https://www.mercurial-scm.org/wiki/Phabricator) useful for submitting series of commits to Phabricator.

Regarding the performance, that's pretty good! The dirstate code is some of the most optimized code in Mercurial. There are some gnarly Python C hacks to make it fast. Some of those tricks involve using special system calls to walk directories to minimize the number of system calls. I'm not sure if the crate you imported has those optimizations. (I wouldn't be surprised either way.) I wouldn't worry too much about performance at this juncture. But I suspect we could make the Rust code another 50% faster with some tweaking. It would also be interesting to test on a larger repo, say https://hg.mozilla.org/mozilla-unified. Also, I believe there are hooks in the dirstate code to use Watchman (fsmonitor). Those hooks are critical in order to achieve peak performance on large repositories.

Since you seem to be proficient at writing lots of Rust code, if you are looking for another project, may I suggest porting chg to Rust? That code is in contrib/chg. That might be the easiest component to actually ship in Rust since it is a standalone binary that doesn't link against Python. But we shouldn't get ahead of ourselves :)

Anyway, it is late for me and I need to detach from my computer. I'm sure others will have things to say as well...

rust/hgbase85/build.rs
1

I see this file was copied. There's nothing wrong with that. But does this mean we will need a custom build.rs for each Rust package doing Python? If that's the case, then I would prefer to isolate all our rust-cpython code to a single package, if possible. I'm guessing that could be challenging due to crossing create boundaries. I'm sure there are placed where we don't want to expose symbols outside the crate.

I'm curious how others feel about this.

rust/hgcli/src/main.rs
233–261

This is definitely nifty and an impressive achievement \o/

The r- commands for testing pure Rust code paths are an interesting idea!

I think I'm OK with including support for this in hgcli. But I think the code should live in a separate file so it doesn't pollute main(). And it should be behind a Cargo feature flag so we maintain compatibility with hg as much as possible by default.

Also, Mercurial's command line parser is extremely wonky and has some questionable behavior. If the intent is to make rhg compatible with hg, we would need to preserve this horrible behavior. We'll likely have to write a custom argument parser because of how quirky Mercurial's argument parser is :(

rust/hgstorage/src/config.rs
95

I would not worry about supporting v0 or v2 at this time. v0 is only important for backwards compatibility with ancient repos. And v2 never got off the ground.

rust/hgstorage/src/revlog_v1.rs
279

IIRC, core Mercurial keeps an open file handle on revlogs and ensures we don't run out of file handles by not keeping too many revlogs open at the same time. For scanning operations, not having to open and close the file handles all the time will make a difference for performance. Also, core Mercurial loads the entirety of the .i file into memory. That's a scaling problem for large revlogs. But it does make performance of index lookups really fast.

290–293

A thread pool to help with zlib decompression should go a long way here.

Probably too early to think about this, but we'll likely eventually want a global thread pool for doing I/O and CPU expensive tasks, such as reading chunks from a revlog and decompressing them.

FWIW, we're going to radically alter the storage format in order to better support shallow clones. But that work hasn't started yet. I still think there is a benefit to implementing the revlog code in Rust though.

Doesn't mononoke have code to read revlogs already?

kevincox requested changes to this revision.Mar 8 2018, 12:30 PM

I will try to finish the review later, but it might be quicker if you incorporate some of the changes first since a lot of them are repeated many times. Overall it looks good, there are a couple of things that i would highlight to make the code easier to read.

  • Prefer more descriptive variable names.
  • If you can, avoid "pointer" arithmetic. Cursors and slices are nice and have great convenience methods.
  • Group your flow control and filtering more.
  • Try to keep your types straight. In rust using the right types helps catch errors. So be aware of char vs u8 vs String vs Vec<char> vs Vec<u8>.

On a higher level, all of these code appears to be treating file names as strings. This isn't really true and will disallow some valid file names. Maybe we should stick with bytes throughout. Of course this makes windows filepaths difficult because they are actually (utf16) strings.

rust/hgbase85/build.rs
1

If this is going to be reused I would move it into it's own crate. It seems like everything here could be boiled down to a single function call in main.

rust/hgbase85/src/base85.rs
21

I prefer something like this: https://play.rust-lang.org/?gist=5ca18a5b95335600e911b8f9310ea5c7&version=stable

I doubt lazy_static is too slow. Otherwise we can stay like this until const functions get implemented.

Either way note:

  • I changed the type of B85CHARS to an array as opposed to an array ref.
  • The loop condition is much nicer.
23

Would it be possible to separate the decode from the python objects. I'm thinking two helper functions.

fn b85_required_len(text: &str) -> usize
fn b85_encode(text: &str, pad: i32, out: &mut [u8]) -> Result<()>
23

&str can only hold valid utf8 data? Does it make more sense to have &[u8] here for a list of bytes?

23

IIUC pad is only ever checked == 0. Can it be made into a bool?

45

ptext isn't very descriptive.

46

Why the braces here?

47

I suspect this type annotation isn't required.

47

It might be best to use a std::io::Cursor and let it drack dst_off for your.

52
while !ptext.is_empty()
54

I would prefer the name chunk or even accum is a lot mode obvious to me than acc.

56
for i in &[24, 16, 8, 0]
58

I would just combine these into one line as the name ch isn't adding much.

63

Tracking len manually is a smell. Why not drop it and use ptest.is_empty().

91

This is probably worth a comment that this is safe because D85DEC is required to be initialized before this function is called.

152

Let rust do the overflow checking.

acc = acc.checked_mul(85)
    .ok_or_else(|| {
        PyErr::new::<exc::ValueError, _>(
            py,
            format!("bad base85 character at position {}", i))
    })?;
rust/hgstorage/src/changelog.rs
24

Passing a message as a third argument is really useful.

31

If you aren't using the value I would prefer truncate(NodeId::hex_len())

33

Just put msg: content in the struct construction.

rust/hgstorage/src/config.rs
49

Is this used yet? It probably also needs some documentation because I don't really understand the fields (but I do have little domain knowledge).

78

If you are just going to convert to String I would recommend taking a String argument.

Also prefer .to_owned() over .to_string().

rust/hgstorage/src/dirstate.rs
5

I recommend not renaming this. It is confusing.

48

This could have a better name.

87

This should Probably return a Result<Self> and pass the error to the caller.

90

I would skip this check and rely on p.metadata(). Just switch .unwrap() to .expect() with a nicer message. This also handles race conditions more nicely.

108
  1. Does this function need to be public? It seems internal to the constructor.
  2. If it doesn't need to be I would prefer it return the Map so that you don't have a partial-constructed DirState.
125

Is ignoring duplicate entries desired? It might be worth a comment explaining why.

130

Don't use _ prefix for privates. Rely on rust viability.

Also is_bad isn't very informative.

141

s/mtc/matcher/

146
let mut grey = Set::new();
grey.extend(self.dmap.keys().map(|s| s.as_path()));

Also I would pick a name like undiscovered_paths or something. grey is cryptic.

152

I would prefer doing the filter before the loop and storing it in a variable.

155

This is probably worth a helper function.

161

Please explain why you are ignoring the error condition.

162

I would just call this path or pathbuf.

167

I would move this filter beside the filter in the loop.

169

I would also put this filter above. But more importantly all _is_bad() does is check for file types. So it seems like the former filter is redundant with this one.

170

You could do the following for a slight performance win and save a line.

if let Occupied(entry) = self.dmap.entry(relpath) {
   ...
}
175

Use an else if.

183

s/rem/path/ or remaining_path.

184

You can use the entry api here.

199

Please use a better name for sent.

206

In rust we generally avoid brackets around as as it is very tightly binding.

rust/hgstorage/src/lib.rs
54

You can add a later .arg(dst) to support non-utf8 paths instead of converting to a str here.

rust/hgstorage/src/local_repo.rs
50

I would replace the condition with.

assert!(dot_hg_path.exists(), ".hg folder not found for the path given by -R argument: {:?}", p);
66
while !root.join(".hg").exists() {
    root = root.parent().expect(".hg folder not found");
}
121

s/fp/path/

127
assert!(abspath.exists(), "path not exists: {:?}", abspath);
129

gd is cryptic.

kevincox added inline comments.Mar 8 2018, 12:30 PM
rust/hgstorage/src/local_repo.rs
136

Why does it need to be mutable to clone?

155

This test has no assetions. Consider calling it test_create_... or something to indicate that you are just checking for panics.

rust/hgstorage/src/manifest.rs
33

What are these magic numbers?

42

s/ent/entry/

49

What are these numbers?

rust/hgstorage/src/matcher.rs
9

s/pat/glob/

10

Might be worth calling String::with_capacity(pat.len()) since it will be at least that long.

14

Can you manage a &[u8] rather then pointer arithmetic for the whole string. It will make me feel better and will probably be easier to read.

108

If you are going to call String.as_str() just take a &str.

108

s/relglob/relative_glob_re/

111

If you are just doing one call just return the result.

111

You should be able to do &string rather then string.as_str() as it coerces.

131

Better name please.

143

s/ln/line/

159

Is this a warning or error? You might want to switch to panic!.

160

I would move this into the following match because it dedupes the starts_with check and puts the logic closer together.

195

s/rp/path/

200

I would do self.inner.map(|m| m.is_match(rp)).unwrap_or(false) but this is fine.

rust/hgstorage/src/mpatch.rs
13

Spell these out please.

24
struct Fragment {
    len: u32,
    offset: u32,
}
35

Maybe it's just me but I think it is more common to put the source before the destination.

35

pull is very generic.

39
assert!(!src.is_empty())
40

If you are unwrapping the pop there is no need for the prior check.

40

s/f/fragment/

51

mov is overly shortened and generic.

51

It seems weird to take a cursor to a vec if you are just going to do an absolute seek. Can it work with &mut [u8]?

54

vec![0; count] works. (The arguments might be the other way around).

68
for &Fragment{frag_len, frag_ofs} in list.iter().rev()
86

Make this one line and don't bother renaming.

120

Please explain.

137
assert!(!frags.is_empty());
rust/hgstorage/src/path_encoding.rs
5

const HEX_DIGIT: [u8; 16] = *b"0123456789abcdef";

7

c should be a u8.

11

Vec<char> is odd. Is there any reason not to use a String or Vec<u8>

11

Don't pass a char by reference. Also it seems your function wants a u8.

22

I don't think you need this.

23

This isn't necessary.

25
p.ends_with(".i") || p.ends_with(".d")
34

Take a &str

34

encode_file_name?

35

Use a String.

57
fn escape(out: &mut String, b: char) {
    unimplemented!()
}

pub fn encode_path(path: &str) -> String {
    let mut out = String::with_capacity(path.len());

    for c in path.bytes() {
        let c = c as char;
        match c {
            'A'...'Z' => {
                out.push('_');
                out.push(c.to_ascii_lowercase());
            }
            '\\' | ':' | '*' | '?' | '"' | '<' | '>' | '|' => {
                escape(&mut out, c);
            }
            // The rest of the printable range.
            ' '...'~' => {
                out.push(c);
            }
            _ => {
                escape(&mut out, c);
            }
        }
    }

    out
}

https://godbolt.org/g/3WCQs3

62

Take a &str.

This revision now requires changes to proceed.Mar 8 2018, 12:30 PM

On a higher level, all of these code appears to be treating file names as strings. This isn't really true and will disallow some valid file names. Maybe we should stick with bytes throughout. Of course this makes windows filepaths difficult because they are actually (utf16) strings.

Mercurial tries to be principled about always treating filenames as bytes. AIUI https://www.mercurial-scm.org/wiki/WindowsUTF8Plan is still the plan of record there?

Mercurial tries to be principled about always treating filenames as bytes. AIUI https://www.mercurial-scm.org/wiki/WindowsUTF8Plan is still the plan of record there?

Reading that page it seems to claim that filenames should be utf8, not bytes. If utf8, this is what the code does, but if it is bytes that definitely won't work.

Mercurial tries to be principled about always treating filenames as bytes. AIUI https://www.mercurial-scm.org/wiki/WindowsUTF8Plan is still the plan of record there?

Reading that page it seems to claim that filenames should be utf8, not bytes. If utf8, this is what the code does, but if it is bytes that definitely won't work.

IIRC it's bytes everyplace except Windows, where we pretend utf8 is real?

We may have to make adjustments to this plan on macOS with APFS, but I'm not sure about that yet.

Mercurial tries to be principled about always treating filenames as bytes. AIUI https://www.mercurial-scm.org/wiki/WindowsUTF8Plan is still the plan of record there?

Reading that page it seems to claim that filenames should be utf8, not bytes. If utf8, this is what the code does, but if it is bytes that definitely won't work.

IIRC it's bytes everyplace except Windows, where we pretend utf8 is real?
We may have to make adjustments to this plan on macOS with APFS, but I'm not sure about that yet.

I think we want to express a path as a dedicated type which has different underlying storage depending on the platform (bytes on Linux, UTF-16 on Windows). All filesystem operations should take a Path instance to operate on. This is the only way to cleanly round trip filenames between the OS, the application, and back to the OS. That leaves us with the hard problem of normalizing Mercurial's storage representation of paths (bytes) with the operating system's. But this world is strictly better than today, where we lose path data from the OS because we use POSIX APIs.

FWIW, Python 3 rewrote the I/O layer to use Win32 APIs everywhere. Combined with the pathlib types, I'm pretty sure Python 3 can round trip paths on Windows. I also think Rust's path type(s) have OS-dependent functionality.

Rust has platform independent types PathBuf and &Path for paths and OsString and &OsStr for strings (owned and references respectively. They do have os-specific extensions but as long as you don't use them it should be cross platform. That being said, if you are serializing and deserializing them you may need to write some platform dependant code.

Ivzhh added a comment.Mar 8 2018, 8:17 PM

Hi everyone,

Thank you for your encouragements and comments! I will follow up with all comments and update the code soon.

@indygreg It is a great idea to test on Mozilla repo, actually I found several things interesting:

  1. I found a bug in my code (shame on me): because I did not use byte literal, and I made a typo. This triggers problem in Mozilla unified repo
  2. A regexp pattern in hgignore in Mozilla unified repo is not supported by rust's regex crate, a.k.a. "(?!)". I choose to ignore these unsupported patterns.
  3. My version is slower in this repo: 70s (hg) and 90s (mine). CodeXL reveals that the mpatch::collect() function uses 63% of the running time. I think I need to optimize it somehow.

I totally agree with @kevincox that I did not sort well on char/u8/str/String/Path/PathBuf. The first bug is caused by this. I need to improve them.

Thank you everyone!

yuja added a subscriber: yuja.Mar 9 2018, 7:33 AM

Reading that page it seems to claim that filenames should be utf8, not bytes. If utf8, this is what the code does, but if it is bytes that definitely won't work.

IIRC it's bytes everyplace except Windows, where we pretend utf8 is real?

It's MBCS (i.e. ANSI multi-byte characters) on Windows. The plain was to support
both MBCS and UTF-8-variant on Windows, but that isn't a thing yet.

Perhaps we'll have to write a platform compatibility layer (or serialization/deserialization
layer) on top of the Rust's file API, something like vfs.py we have in Python code.

Ivzhh added a comment.Mar 9 2018, 12:59 PM
In D2057#44269, @yuja wrote:

Reading that page it seems to claim that filenames should be utf8, not bytes. If utf8, this is what the code does, but if it is bytes that definitely won't work.

IIRC it's bytes everyplace except Windows, where we pretend utf8 is real?

It's MBCS (i.e. ANSI multi-byte characters) on Windows. The plain was to support
both MBCS and UTF-8-variant on Windows, but that isn't a thing yet.
Perhaps we'll have to write a platform compatibility layer (or serialization/deserialization
layer) on top of the Rust's file API, something like vfs.py we have in Python code.

Thank you for confirming that, I am a bit confusing when I read Encoding Plan wiki page. I am looking at Mozilla's rust winapi bindings, let me see if I can directly wrap around winapi::um::fileapi::FindFirstFileA

yuja added a comment.Mar 10 2018, 6:20 AM

I am looking at Mozilla's rust winapi bindings, let me see if I can directly wrap around winapi::um::fileapi::FindFirstFileA

That's probably a hard way. I was thinking of something converting
between OsStr (i.e. Path) and MBCS bytes by using Win32 API, instead
of calling out the "A" API.

https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx

We don't do that in Python, but Rust's type system will help making it right.

I'm not a windows expert but it seems like the rust OsStr, Path and filesystem APIs should handle these conversions for you. I think the only place where you would need to do os-specific code is when doing serialization and serialization which I think should be handled by https://doc.rust-lang.org/std/os/unix/ffi/trait.OsStringExt.html and https://doc.rust-lang.org/std/os/windows/ffi/trait.OsStringExt.html.

yuja added a comment.Mar 21 2018, 12:44 AM

I think the only place where you would need to do os-specific code is when
doing serialization and serialization

Yes, that will be feasible in strictly typed language like Rust.

which I think should be handled by https://doc.rust-lang.org/std/os/unix/ffi/trait.OsStringExt.html
and https://doc.rust-lang.org/std/os/windows/ffi/trait.OsStringExt.html.

Not true for Windows because Rust uses Unicode (UTF-16-ish) API, whereas
Python 2 does ANSI. We need to convert a "wide" string to a locale-dependent string.

Maybe the local-encoding crate will do that for us?

yuja added a comment.Mar 21 2018, 2:30 AM

I think the only place where you would need to do os-specific code is when
doing serialization and serialization

Yes, that will be feasible in strictly typed language like Rust.

To be clear, I meant serialization/deserialization between filesystem path and
internal dirstate/manifest path, not between dirstate storage and in-memory
dirstate object.

Ah, I forgot to consider the python interop. Now the need for that crate makes sense. Thanks for explaining.

Ivzhh updated this revision to Diff 7188.Mar 21 2018, 2:23 PM
  • add revlog and mpatch facilities
  • add changelog parsing
  • add manifest parsing
  • path encoding for data store
  • add dirstate and matcher facilities
  • add local repository and the supporting modules
  • use cargo fmt to format code
  • add hg r-status command
  • bincode 1.0.0 is a bit slow in my test
  • delay pattern matching during dir walk
  • optimize out trie and enable CoreXL profiling
  • use hashmap
  • remove thread pool
  • rust default read is not buf-ed, this is the key of slowness
  • change to globset
  • convert glob to regex
  • hg ignore patterns are all converted to regex (as hg does), and now it is faster
  • filter dir early to prevent walking
  • Update matcher mod after testing Mozilla unified repo
  • bug fix: use byte literals instead of numbers
  • hg store path encoding is per byte style, update code according to Kevin Cox's comments
  • update matcher testing according to Match interface change
  • If clap fails to recognize r-* subcommands, then run python-version hg
  • changelog coding style revised
  • remove legacy revlog v0 and unfinished v2.
  • partially revise the dirstate reviews
  • remove duplicated build.rs, let the executable module guarantee the python
  • use cursor in base85 encoding, reducing raw index-math
  • use cursor in base85 decoding, reducing raw index-math
  • dirstate update according to review comments
  • config update according to review comments
  • mpatch rename to more meaningful names
  • simplify matcher as when there is no syntax named in the beginning, use regexp
  • local repo coding style update
  • dirstate coding style update
  • manifest coding style update
Ivzhh marked 45 inline comments as done.Mar 21 2018, 2:55 PM
Ivzhh added inline comments.
rust/hgbase85/src/base85.rs
23

It should be any &[u8], but the current cpython crate doesn't wrap for &[u8]. I think I need to fork and add that part. I put it in my checklist now.

23

This crate is my previous try to integrate rust into hg. Right now I guess mine main pursue is to add hg r-* commands for rust. I will follow your suggestion when I am implementing the wire protocol and reuse the code for pure rust crate.

23

pad is a bool, however when I checked it in hg-python, int are passed to the function. I guess I need to update cpython wrapper for this, a more broad logic conversion.

46

I guess it is because NLL. When I started the work, rust compiler reported borrow check error on this part. I later read an article talking about NLL update in rust. But before that, I use the braces to avoid the error.

91

when I removed the unsafe, I got error: error[E0133]: use of mutable static requires unsafe function or block

rust/hgcli/src/main.rs
233–261

Thank you for the suggestion! I guess I need to extend clap later to support hg style command line. Right now whenever clap cannot handle the argument parsing, I will redirect the arguments to hg directly.

rust/hgstorage/src/changelog.rs
31

I guess I will use the rest info later. hg seems put some meta data in the commit comments. I will keep it for now. Thank you!

rust/hgstorage/src/config.rs
78

I like to_owned(), I will them in later occasions. Thank you!

95

Sure, I will use v1 only for now. In the beginning I kinda over designed this part.

rust/hgstorage/src/dirstate.rs
48

I remember the python hg uses the name, in the beginning, I tried to replicate py-hg's behaviour. But I think it needs to be renamed. I agree with you.

108

I think dir state needs to 1. read existing one; 2. create one if not exits; maybe private for now.

152

For the filter, I follow the example in the walkdir doc. I guess what I want is to skip the dir for later recursive visiting.

161

I add the error handling back

170

I kind of get borrow check compile error here. Later I use Occupied() when possible.

rust/hgstorage/src/local_repo.rs
136

I think LRU will update reference count (or timestamp) when the data is accessed.

rust/hgstorage/src/matcher.rs
14

I borrow this logic as whole from python code. It will need sometime to re-translate to non-pointer-arithmetic way.

rust/hgstorage/src/mpatch.rs
51

This part, including the stream-style, is from python part. I will update later with xi-rope.

rust/hgstorage/src/revlog_v1.rs
279

I think it explains why in mercurial repo, rust version is significantly faster. I am working on cpu future, but I did not finalize design style yet. I will keep working on that.

290–293

I guess I did this because I met some empty change delta in the beginning. I think I won't try to parallelize unzip for now.

Ivzhh marked 4 inline comments as done.EditedMar 21 2018, 4:00 PM
In D2057#46726, @yuja wrote:

I think the only place where you would need to do os-specific code is when
doing serialization and serialization

Yes, that will be feasible in strictly typed language like Rust.

To be clear, I meant serialization/deserialization between filesystem path and
internal dirstate/manifest path, not between dirstate storage and in-memory
dirstate object.

I guess your suggestion is like this: @yuja

  1. if it is windows and the code page is MBCS, try to decode the paths read from manifest and dirstate into unicode equivalent
  2. use utf internally and with rust IO api
  3. when writing back to dirstate and manifest, encode utf to MBCS

Please let me know if I have misunderstanding. Thank you!

yuja added a comment.Mar 22 2018, 11:00 AM
In D2057#46980, @Ivzhh wrote:
In D2057#46726, @yuja wrote:

I think the only place where you would need to do os-specific code is when
doing serialization and serialization

Yes, that will be feasible in strictly typed language like Rust.

To be clear, I meant serialization/deserialization between filesystem path and
internal dirstate/manifest path, not between dirstate storage and in-memory
dirstate object.

I guess your suggestion is like this: @yuja

  1. if it is windows and the code page is MBCS, try to decode the paths read from manifest and dirstate into unicode equivalent
  2. use utf internally and with rust IO api
  3. when writing back to dirstate and manifest, encode utf to MBCS

No. My suggestion is:

  1. keep manifest/dirstate paths as bytes (which are probably wrapped by some type, say HgPath)
  2. but we want to use Rust's standard library for I/O
  3. so, add utility function/trait to convert HgPath to Path/PathBuf, where MBCS-Wide conversion will occur.

I think raw byte paths will be needed to build store paths (e.g. .hg/store/data/~2eclang-format.i).

https://www.mercurial-scm.org/repo/hg/file/4.5.2/mercurial/store.py

The latest changes are looking really good. I have a couple more comments but I didn't have time for a full review. I'll try to get more reviewed tomorrow. It seems that you still have a lot of stuff still in-flight so I'll try to slowly review the changes as I have time. If you want input/feedback on any particular part just ask and I will prioritize it.

This change is very large so it might be worth splitting off a smaller component and getting that submitted first. However I do realize that for starting out it is often helpful to get some actual use cases implemented before committing the base structures.

rust/hgbase85/src/base85.rs
23

I get that, but I still think it makes the code easier to read when the python-interop and the logic as separated where it is easy to do so.

91

I meant safe not as it it didn't need the unsafe keyword, but in that the use of the unsafe block is safe.

It should really be called the trust_me,_I_know_this_is_safe block. But since you are not getting the compiler checking it is often useful to add a comment why the action you are performing is correct. In this case it is correct because the caller initializes this variable before the function is called.

106

If this computation only depends on len it would be nice to put it in a helper function.

rust/hgcli/src/main.rs
250

These are HashSet's which don't have a defined iterator order. IIRC the python implementation sorts the results which is probably desirable.

rust/hgstorage/src/config.rs
51

A link to the mentioned wiki page would be very helpful to readers.

rust/hgstorage/src/dirstate.rs
104

Switch the return type to std::io::Result and then you can have

let metadata = p.metadata()?;
let mtime = metadata.modified()?;
// ...
170

Sorry, I misunderstood the logic. You can do this:

diff -r ccc683587fdb rust/hgstorage/src/dirstate.rs
--- a/rust/hgstorage/src/dirstate.rs	Sat Mar 24 10:05:53 2018 +0000
+++ b/rust/hgstorage/src/dirstate.rs	Sat Mar 24 10:14:58 2018 +0000
@@ -184,8 +184,7 @@
                         continue;
                     }
 
-                    if self.dir_state_map.contains_key(rel_path) {
-                        let dir_entry = &self.dir_state_map[rel_path];
+                    if let Some(dir_entry) = self.dir_state_map.get(rel_path) {
                         files_not_in_walkdir.remove(rel_path);
                         DirState::check_status(&mut res, abs_path, rel_path, dir_entry);
                     } else if !matcher.check_path_ignored(rel_path.to_str().unwrap()) {
264

Does it make sense to make DirStateEntry.mtime be a std::time::SystemTime and convert upon reading the structure in?

If not I would prefer doing the conversion here:

else if mtd.modified().unwrap() == UNIX_EPOCH + Duration::from_secs(dir_entry.mtime as u64) {

(Maybe extract the system time to higher up, or even a helper function on dir_entry)

rust/hgstorage/src/local_repo.rs
136

Actually I didn't realize that RwLock doesn't get a regular get() since it is doing a compile time borrow check. https://doc.rust-lang.org/std/sync/struct.RwLock.html#method.get_mut. My mistake, the code is fine.

baymax requested changes to this revision.Jan 24 2020, 12:33 PM

There seems to have been no activities on this Diff for the past 3 Months.

By policy, we are automatically moving it out of the need-review state.

Please, move it back to need-review without hesitation if this diff should still be discussed.

:baymax:need-review-idle:

This revision now requires changes to proceed.Jan 24 2020, 12:33 PM