rust implementation of hg status
Needs ReviewPublic

Authored by Ivzhh on Feb 5 2018, 5:31 PM.

Details

Reviewers
kevincox
Group Reviewers
hg-reviewers
Summary
  • implementation of revlog v1
  • parsing changelog, manifest, dirstate
  • use .hgignore in repo root
  • comparable performance with current hg status (Linux & Mac: slightly faster, Windows: slightly slower)
  • use hg r-status as subcommand, in this case, bypass python engine

Diff Detail

Repository
rHG Mercurial
Branch
rust-hg-optimize (bookmark) on default (branch)
Lint
No Linters Available
Unit
No Unit Test Coverage
There are a very large number of changes, so older changes are hidden. Show Older Changes
kevincox added inline comments.Mar 8 2018, 12:30 PM
rust/hgstorage/src/local_repo.rs
122

s/fp/path/

128
assert!(abspath.exists(), "path not exists: {:?}", abspath);
130

gd is cryptic.

137

Why does it need to be mutable to clone?

156

This test has no assetions. Consider calling it test_create_... or something to indicate that you are just checking for panics.

rust/hgstorage/src/manifest.rs
34

What are these magic numbers?

43

s/ent/entry/

50

What are these numbers?

rust/hgstorage/src/matcher.rs
10

s/pat/glob/

11

Might be worth calling String::with_capacity(pat.len()) since it will be at least that long.

15

Can you manage a &[u8] rather then pointer arithmetic for the whole string. It will make me feel better and will probably be easier to read.

109

If you are going to call String.as_str() just take a &str.

109

s/relglob/relative_glob_re/

112

If you are just doing one call just return the result.

112

You should be able to do &string rather then string.as_str() as it coerces.

132

Better name please.

144

s/ln/line/

160

Is this a warning or error? You might want to switch to panic!.

161

I would move this into the following match because it dedupes the starts_with check and puts the logic closer together.

196

s/rp/path/

201

I would do self.inner.map(|m| m.is_match(rp)).unwrap_or(false) but this is fine.

rust/hgstorage/src/mpatch.rs
14

Spell these out please.

25
struct Fragment {
    len: u32,
    offset: u32,
}
36

Maybe it's just me but I think it is more common to put the source before the destination.

36

pull is very generic.

40
assert!(!src.is_empty())
41

If you are unwrapping the pop there is no need for the prior check.

41

s/f/fragment/

52

mov is overly shortened and generic.

52

It seems weird to take a cursor to a vec if you are just going to do an absolute seek. Can it work with &mut [u8]?

55

vec![0; count] works. (The arguments might be the other way around).

69
for &Fragment{frag_len, frag_ofs} in list.iter().rev()
87

Make this one line and don't bother renaming.

121

Please explain.

138
assert!(!frags.is_empty());
rust/hgstorage/src/path_encoding.rs
6

const HEX_DIGIT: [u8; 16] = *b"0123456789abcdef";

8

c should be a u8.

12

Vec<char> is odd. Is there any reason not to use a String or Vec<u8>

12

Don't pass a char by reference. Also it seems your function wants a u8.

23

I don't think you need this.

24

This isn't necessary.

26
p.ends_with(".i") || p.ends_with(".d")
35

Take a &str

35

encode_file_name?

36

Use a String.

58
fn escape(out: &mut String, b: char) {
    unimplemented!()
}

pub fn encode_path(path: &str) -> String {
    let mut out = String::with_capacity(path.len());

    for c in path.bytes() {
        let c = c as char;
        match c {
            'A'...'Z' => {
                out.push('_');
                out.push(c.to_ascii_lowercase());
            }
            '\\' | ':' | '*' | '?' | '"' | '<' | '>' | '|' => {
                escape(&mut out, c);
            }
            // The rest of the printable range.
            ' '...'~' => {
                out.push(c);
            }
            _ => {
                escape(&mut out, c);
            }
        }
    }

    out
}

https://godbolt.org/g/3WCQs3

63

Take a &str.

This revision now requires changes to proceed.Mar 8 2018, 12:30 PM

On a higher level, all of these code appears to be treating file names as strings. This isn't really true and will disallow some valid file names. Maybe we should stick with bytes throughout. Of course this makes windows filepaths difficult because they are actually (utf16) strings.

Mercurial tries to be principled about always treating filenames as bytes. AIUI https://www.mercurial-scm.org/wiki/WindowsUTF8Plan is still the plan of record there?

Mercurial tries to be principled about always treating filenames as bytes. AIUI https://www.mercurial-scm.org/wiki/WindowsUTF8Plan is still the plan of record there?

Reading that page it seems to claim that filenames should be utf8, not bytes. If utf8, this is what the code does, but if it is bytes that definitely won't work.

Mercurial tries to be principled about always treating filenames as bytes. AIUI https://www.mercurial-scm.org/wiki/WindowsUTF8Plan is still the plan of record there?

Reading that page it seems to claim that filenames should be utf8, not bytes. If utf8, this is what the code does, but if it is bytes that definitely won't work.

IIRC it's bytes everyplace except Windows, where we pretend utf8 is real?

We may have to make adjustments to this plan on macOS with APFS, but I'm not sure about that yet.

Mercurial tries to be principled about always treating filenames as bytes. AIUI https://www.mercurial-scm.org/wiki/WindowsUTF8Plan is still the plan of record there?

Reading that page it seems to claim that filenames should be utf8, not bytes. If utf8, this is what the code does, but if it is bytes that definitely won't work.

IIRC it's bytes everyplace except Windows, where we pretend utf8 is real?

We may have to make adjustments to this plan on macOS with APFS, but I'm not sure about that yet.

I think we want to express a path as a dedicated type which has different underlying storage depending on the platform (bytes on Linux, UTF-16 on Windows). All filesystem operations should take a Path instance to operate on. This is the only way to cleanly round trip filenames between the OS, the application, and back to the OS. That leaves us with the hard problem of normalizing Mercurial's storage representation of paths (bytes) with the operating system's. But this world is strictly better than today, where we lose path data from the OS because we use POSIX APIs.

FWIW, Python 3 rewrote the I/O layer to use Win32 APIs everywhere. Combined with the pathlib types, I'm pretty sure Python 3 can round trip paths on Windows. I also think Rust's path type(s) have OS-dependent functionality.

Rust has platform independent types PathBuf and &Path for paths and OsString and &OsStr for strings (owned and references respectively. They do have os-specific extensions but as long as you don't use them it should be cross platform. That being said, if you are serializing and deserializing them you may need to write some platform dependant code.

Ivzhh added a comment.Mar 8 2018, 8:17 PM

Hi everyone,

Thank you for your encouragements and comments! I will follow up with all comments and update the code soon.

@indygreg It is a great idea to test on Mozilla repo, actually I found several things interesting:

  1. I found a bug in my code (shame on me): because I did not use byte literal, and I made a typo. This triggers problem in Mozilla unified repo
  2. A regexp pattern in hgignore in Mozilla unified repo is not supported by rust's regex crate, a.k.a. "(?!)". I choose to ignore these unsupported patterns.
  3. My version is slower in this repo: 70s (hg) and 90s (mine). CodeXL reveals that the mpatch::collect() function uses 63% of the running time. I think I need to optimize it somehow.

I totally agree with @kevincox that I did not sort well on char/u8/str/String/Path/PathBuf. The first bug is caused by this. I need to improve them.

Thank you everyone!

yuja added a subscriber: yuja.Mar 9 2018, 7:33 AM

Reading that page it seems to claim that filenames should be utf8, not bytes. If utf8, this is what the code does, but if it is bytes that definitely won't work.

IIRC it's bytes everyplace except Windows, where we pretend utf8 is real?

It's MBCS (i.e. ANSI multi-byte characters) on Windows. The plain was to support
both MBCS and UTF-8-variant on Windows, but that isn't a thing yet.

Perhaps we'll have to write a platform compatibility layer (or serialization/deserialization
layer) on top of the Rust's file API, something like vfs.py we have in Python code.

Ivzhh added a comment.Mar 9 2018, 12:59 PM
In D2057#44269, @yuja wrote:

Reading that page it seems to claim that filenames should be utf8, not bytes. If utf8, this is what the code does, but if it is bytes that definitely won't work.

IIRC it's bytes everyplace except Windows, where we pretend utf8 is real?

It's MBCS (i.e. ANSI multi-byte characters) on Windows. The plain was to support
both MBCS and UTF-8-variant on Windows, but that isn't a thing yet.

Perhaps we'll have to write a platform compatibility layer (or serialization/deserialization
layer) on top of the Rust's file API, something like vfs.py we have in Python code.

Thank you for confirming that, I am a bit confusing when I read Encoding Plan wiki page. I am looking at Mozilla's rust winapi bindings, let me see if I can directly wrap around winapi::um::fileapi::FindFirstFileA

yuja added a comment.Mar 10 2018, 6:20 AM

I am looking at Mozilla's rust winapi bindings, let me see if I can directly wrap around winapi::um::fileapi::FindFirstFileA

That's probably a hard way. I was thinking of something converting
between OsStr (i.e. Path) and MBCS bytes by using Win32 API, instead
of calling out the "A" API.

https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx

We don't do that in Python, but Rust's type system will help making it right.

I'm not a windows expert but it seems like the rust OsStr, Path and filesystem APIs should handle these conversions for you. I think the only place where you would need to do os-specific code is when doing serialization and serialization which I think should be handled by https://doc.rust-lang.org/std/os/unix/ffi/trait.OsStringExt.html and https://doc.rust-lang.org/std/os/windows/ffi/trait.OsStringExt.html.

yuja added a comment.Mar 21 2018, 12:44 AM

I think the only place where you would need to do os-specific code is when
doing serialization and serialization

Yes, that will be feasible in strictly typed language like Rust.

which I think should be handled by https://doc.rust-lang.org/std/os/unix/ffi/trait.OsStringExt.html
and https://doc.rust-lang.org/std/os/windows/ffi/trait.OsStringExt.html.

Not true for Windows because Rust uses Unicode (UTF-16-ish) API, whereas
Python 2 does ANSI. We need to convert a "wide" string to a locale-dependent string.

Maybe the local-encoding crate will do that for us?

yuja added a comment.Mar 21 2018, 2:30 AM

I think the only place where you would need to do os-specific code is when
doing serialization and serialization

Yes, that will be feasible in strictly typed language like Rust.

To be clear, I meant serialization/deserialization between filesystem path and
internal dirstate/manifest path, not between dirstate storage and in-memory
dirstate object.

Ah, I forgot to consider the python interop. Now the need for that crate makes sense. Thanks for explaining.

Ivzhh updated this revision to Diff 7188.Mar 21 2018, 2:23 PM
  • add revlog and mpatch facilities
  • add changelog parsing
  • add manifest parsing
  • path encoding for data store
  • add dirstate and matcher facilities
  • add local repository and the supporting modules
  • use cargo fmt to format code
  • add hg r-status command
  • bincode 1.0.0 is a bit slow in my test
  • delay pattern matching during dir walk
  • optimize out trie and enable CoreXL profiling
  • use hashmap
  • remove thread pool
  • rust default read is not buf-ed, this is the key of slowness
  • change to globset
  • convert glob to regex
  • hg ignore patterns are all converted to regex (as hg does), and now it is faster
  • filter dir early to prevent walking
  • Update matcher mod after testing Mozilla unified repo
  • bug fix: use byte literals instead of numbers
  • hg store path encoding is per byte style, update code according to Kevin Cox's comments
  • update matcher testing according to Match interface change
  • If clap fails to recognize r-* subcommands, then run python-version hg
  • changelog coding style revised
  • remove legacy revlog v0 and unfinished v2.
  • partially revise the dirstate reviews
  • remove duplicated build.rs, let the executable module guarantee the python
  • use cursor in base85 encoding, reducing raw index-math
  • use cursor in base85 decoding, reducing raw index-math
  • dirstate update according to review comments
  • config update according to review comments
  • mpatch rename to more meaningful names
  • simplify matcher as when there is no syntax named in the beginning, use regexp
  • local repo coding style update
  • dirstate coding style update
  • manifest coding style update
Ivzhh marked 45 inline comments as done.Mar 21 2018, 2:55 PM
Ivzhh added inline comments.
rust/hgbase85/src/base85.rs
24

It should be any &[u8], but the current cpython crate doesn't wrap for &[u8]. I think I need to fork and add that part. I put it in my checklist now.

24

This crate is my previous try to integrate rust into hg. Right now I guess mine main pursue is to add hg r-* commands for rust. I will follow your suggestion when I am implementing the wire protocol and reuse the code for pure rust crate.

24

pad is a bool, however when I checked it in hg-python, int are passed to the function. I guess I need to update cpython wrapper for this, a more broad logic conversion.

47

I guess it is because NLL. When I started the work, rust compiler reported borrow check error on this part. I later read an article talking about NLL update in rust. But before that, I use the braces to avoid the error.

92

when I removed the unsafe, I got error: error[E0133]: use of mutable static requires unsafe function or block

rust/hgcli/src/main.rs
233–273

Thank you for the suggestion! I guess I need to extend clap later to support hg style command line. Right now whenever clap cannot handle the argument parsing, I will redirect the arguments to hg directly.

rust/hgstorage/src/changelog.rs
32

I guess I will use the rest info later. hg seems put some meta data in the commit comments. I will keep it for now. Thank you!

rust/hgstorage/src/config.rs
79

I like to_owned(), I will them in later occasions. Thank you!

96

Sure, I will use v1 only for now. In the beginning I kinda over designed this part.

rust/hgstorage/src/dirstate.rs
49

I remember the python hg uses the name, in the beginning, I tried to replicate py-hg's behaviour. But I think it needs to be renamed. I agree with you.

109

I think dir state needs to 1. read existing one; 2. create one if not exits; maybe private for now.

153

For the filter, I follow the example in the walkdir doc. I guess what I want is to skip the dir for later recursive visiting.

162

I add the error handling back

171

I kind of get borrow check compile error here. Later I use Occupied() when possible.

rust/hgstorage/src/local_repo.rs
137

I think LRU will update reference count (or timestamp) when the data is accessed.

rust/hgstorage/src/matcher.rs
15

I borrow this logic as whole from python code. It will need sometime to re-translate to non-pointer-arithmetic way.

rust/hgstorage/src/mpatch.rs
52

This part, including the stream-style, is from python part. I will update later with xi-rope.

rust/hgstorage/src/revlog_v1.rs
280

I think it explains why in mercurial repo, rust version is significantly faster. I am working on cpu future, but I did not finalize design style yet. I will keep working on that.

291–294

I guess I did this because I met some empty change delta in the beginning. I think I won't try to parallelize unzip for now.

Ivzhh marked 4 inline comments as done.EditedMar 21 2018, 4:00 PM
In D2057#46726, @yuja wrote:

I think the only place where you would need to do os-specific code is when
doing serialization and serialization

Yes, that will be feasible in strictly typed language like Rust.

To be clear, I meant serialization/deserialization between filesystem path and
internal dirstate/manifest path, not between dirstate storage and in-memory
dirstate object.

I guess your suggestion is like this: @yuja

  1. if it is windows and the code page is MBCS, try to decode the paths read from manifest and dirstate into unicode equivalent
  2. use utf internally and with rust IO api
  3. when writing back to dirstate and manifest, encode utf to MBCS

Please let me know if I have misunderstanding. Thank you!

yuja added a comment.Mar 22 2018, 11:00 AM
In D2057#46980, @Ivzhh wrote:
In D2057#46726, @yuja wrote:

I think the only place where you would need to do os-specific code is when
doing serialization and serialization

Yes, that will be feasible in strictly typed language like Rust.

To be clear, I meant serialization/deserialization between filesystem path and
internal dirstate/manifest path, not between dirstate storage and in-memory
dirstate object.

I guess your suggestion is like this: @yuja

  1. if it is windows and the code page is MBCS, try to decode the paths read from manifest and dirstate into unicode equivalent
  2. use utf internally and with rust IO api
  3. when writing back to dirstate and manifest, encode utf to MBCS

No. My suggestion is:

  1. keep manifest/dirstate paths as bytes (which are probably wrapped by some type, say HgPath)
  2. but we want to use Rust's standard library for I/O
  3. so, add utility function/trait to convert HgPath to Path/PathBuf, where MBCS-Wide conversion will occur.

I think raw byte paths will be needed to build store paths (e.g. .hg/store/data/~2eclang-format.i).

https://www.mercurial-scm.org/repo/hg/file/4.5.2/mercurial/store.py

The latest changes are looking really good. I have a couple more comments but I didn't have time for a full review. I'll try to get more reviewed tomorrow. It seems that you still have a lot of stuff still in-flight so I'll try to slowly review the changes as I have time. If you want input/feedback on any particular part just ask and I will prioritize it.

This change is very large so it might be worth splitting off a smaller component and getting that submitted first. However I do realize that for starting out it is often helpful to get some actual use cases implemented before committing the base structures.

rust/hgbase85/src/base85.rs
24

I get that, but I still think it makes the code easier to read when the python-interop and the logic as separated where it is easy to do so.

92

I meant safe not as it it didn't need the unsafe keyword, but in that the use of the unsafe block is safe.

It should really be called the trust_me,_I_know_this_is_safe block. But since you are not getting the compiler checking it is often useful to add a comment why the action you are performing is correct. In this case it is correct because the caller initializes this variable before the function is called.

105

If this computation only depends on len it would be nice to put it in a helper function.

rust/hgcli/src/main.rs
260

These are HashSet's which don't have a defined iterator order. IIRC the python implementation sorts the results which is probably desirable.

rust/hgstorage/src/config.rs
50

A link to the mentioned wiki page would be very helpful to readers.

rust/hgstorage/src/dirstate.rs
103

Switch the return type to std::io::Result and then you can have

let metadata = p.metadata()?;
let mtime = metadata.modified()?;
// ...
171

Sorry, I misunderstood the logic. You can do this:

diff -r ccc683587fdb rust/hgstorage/src/dirstate.rs
--- a/rust/hgstorage/src/dirstate.rs	Sat Mar 24 10:05:53 2018 +0000
+++ b/rust/hgstorage/src/dirstate.rs	Sat Mar 24 10:14:58 2018 +0000
@@ -184,8 +184,7 @@
                         continue;
                     }
 
-                    if self.dir_state_map.contains_key(rel_path) {
-                        let dir_entry = &self.dir_state_map[rel_path];
+                    if let Some(dir_entry) = self.dir_state_map.get(rel_path) {
                         files_not_in_walkdir.remove(rel_path);
                         DirState::check_status(&mut res, abs_path, rel_path, dir_entry);
                     } else if !matcher.check_path_ignored(rel_path.to_str().unwrap()) {
263

Does it make sense to make DirStateEntry.mtime be a std::time::SystemTime and convert upon reading the structure in?

If not I would prefer doing the conversion here:

else if mtd.modified().unwrap() == UNIX_EPOCH + Duration::from_secs(dir_entry.mtime as u64) {

(Maybe extract the system time to higher up, or even a helper function on dir_entry)

rust/hgstorage/src/local_repo.rs
137

Actually I didn't realize that RwLock doesn't get a regular get() since it is doing a compile time borrow check. https://doc.rust-lang.org/std/sync/struct.RwLock.html#method.get_mut. My mistake, the code is fine.