This is an archive of the discontinued Mercurial Phabricator instance.

hg-core: Add a limited read only `revlog` implementation
ClosedPublic

Authored by acezar on Aug 26 2020, 9:23 AM.

Details

Summary

Only covers the needs of the upcoming rhg debugdata command.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

acezar created this revision.Aug 26 2020, 9:23 AM
acezar retitled this revision from hg-core: Add a limited readonly revlog implementation to hg-core: Add a limited read only `revlog` implementation.Aug 26 2020, 9:44 AM
acezar updated this revision to Diff 22468.Aug 27 2020, 4:40 AM
Alphare requested changes to this revision.Aug 28 2020, 6:11 AM
Alphare added a subscriber: Alphare.

This is my partial review, as indicated by my last comment, I'll be resuming it later today.

rust/hg-core/Cargo.toml
27

You could add a comment saying that we don't use the miniz-oxide backend because its minimum Rust version is 1.36. However, this PR (https://github.com/Frommi/miniz_oxide/pull/84/files) introduces a flag no_extern_crate_alloc to bring the requirement back down to 1.34, so maybe leave a comment and we'll benchmark this later? (There also is the zlib implem from Cloudflare IIRC)

rust/hg-core/src/revlog/index.rs
11

nit: "Offsets of starts of index blocks"

47

nit: s/then/the/

106

nit: corresponds

107

nit: "overridden", "after"

112

nit: overridden

113

Is it right to call it offset_override if it's always overridden? Not that I care *that* much, but offset would be shorter.

117

nit: overridden

122

This needlessly allocates a Vec<u8>, this version doesn't:

let mut arr = [0;8];
arr.copy_from_slice(&self.bytes[0..=5]);

You can compare the assembly output here: https://godbolt.org/z/rdbb4P. (I've also tried a very dumb version with manual indexing and no ByteOrder, but index checking doesn't kick in and is less readable anyway.

rust/hg-core/src/revlog/patch.rs
7

Please use backticks (`) with inline code

27

If it's the responsibility of an already-present method, why not delete it now?

55

nit: revisions'

70

We could potentially have a heuristic of approximately how many entries there are given a the length of data, but that's probably for later.

90

This should probably be self.frags.fold(initial_size, |acc, m| acc + m.len_diff()), since iterators usually allow for better optimization (also it's less code, but the readability is debatable, I like it personally).

Alphare added inline comments.Aug 28 2020, 6:11 AM
rust/hg-core/src/revlog/patch.rs
115

nit: s/sequences/sequence/

149

Should this have been done?

172

nit: missing backticks

241

I'm stopping my review here, will resume later, this was a long read so far.

This revision now requires changes to proceed.Aug 28 2020, 6:11 AM

Overall this looks good. I have obviously left many comments, but a lot of them are nits, and it's a huge patch.

rust/hg-core/src/revlog/patch.rs
249

Why the lifetime?

262

nit: probably too many parenthesis.

rust/hg-core/src/revlog/revlog.rs
39

I think this should probably be an error of sorts that falls back to Mercurial and not just a panic?

96

It can probably be a &[RevlogEntry], but that doesn't matter much here.

98

It's more efficient to do .rev on the iterator, this also removes the need for mut.

100

You should only collect() once

102

nit: .map(patch::PatchList::new)

169

Should that be a fallback?

186

nit: s/zst/zstd/

196

assert_eq! would be clearer and have a more useful error message

Alphare added inline comments.Aug 28 2020, 9:35 AM
rust/hg-core/src/revlog/index.rs
122

Oops, should be

let mut arr = [0;8];
arr[2..8].copy_from_slice(&bytes[0..=5]);
acezar marked 20 inline comments as done.Aug 28 2020, 12:52 PM
acezar updated this revision to Diff 22482.
acezar added inline comments.Aug 28 2020, 12:54 PM
rust/hg-core/src/revlog/index.rs
113

fn offset() does not always return the overridden offset.

rust/hg-core/src/revlog/patch.rs
249

It was part of old code using &[u8] instead of Vec<u8>

rust/hg-core/src/revlog/revlog.rs
96

The actual RevlogEntry purpose is to get the partial entry data. It would requires two distinct structs to do what I think you suggest.

Anyway, I see little interest in wrapping the final data right now in a generic struct as it is specific to the revlog kind.

In a set of patches I'll submit latter, I have a ChanglogEntry that make sense of the data in the context of the changlog and a ManifestEntry that make sense of the data in the context of the manifest. ChangelogEntry having a manifest_node method while ManifestEntry as a files method.

But not knowing what is in all the revlogs of all kinds, I don't know if there is some common pattern in their organization. Are they always lines organized? Is \0 always the field separator?
I change this if I see an emerging pattern as I'll implement more usecases.

100

Leads to borrowing issue.

102

Some compilation issue here too. I will investigate.

Alphare added inline comments.Aug 28 2020, 1:14 PM
rust/hg-core/src/revlog/revlog.rs
94

I missed it, but use of self is unnecessary, it should be an associated method.

96

I was merely commenting on the fact that you used a Vec<_> and not a &[_], which is really nitpicking since it's a private function that forced no cloning or allocation in the first place, but it's always a good idea to take by reference if that's not more effort.

Regarding your idea about generic RevlogEntry vs *log-specific structures, I too don't have enough perspective to be sure how the implementation should go. Maybe it should be a trait, for example. But let's stick with your idea of figuring things out along the way, that sounds good.

102

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3b54c1e0470590b54e89d6d347bba6f2

This should work, if not we can talk about it on IRC so as to not pollute the review with debugging.

acezar marked 4 inline comments as done.Aug 31 2020, 4:38 AM
acezar updated this revision to Diff 22492.
acezar added inline comments.Aug 31 2020, 4:39 AM
rust/hg-core/src/revlog/revlog.rs
96

I have misunderstood your comment. I thought it was about -> Vec<u8>.

102

It does not work because of the std::borrow::Cow.

Alphare added inline comments.Aug 31 2020, 5:10 AM
rust/hg-core/src/revlog/revlog.rs
102

I suppose so. I feel like there's something to be improved here, but if it doesn't show up in profiling I suppose it'll be good enough.

acezar updated this revision to Diff 22496.Aug 31 2020, 11:04 AM
acezar updated this revision to Diff 22528.Sep 3 2020, 4:50 AM
acezar updated this revision to Diff 22532.Sep 3 2020, 8:49 AM
acezar updated this revision to Diff 22536.Sep 3 2020, 11:35 AM
acezar marked 2 inline comments as done.Sep 4 2020, 5:11 AM
acezar marked 3 inline comments as done.Sep 4 2020, 5:36 AM
acezar updated this revision to Diff 22537.
acezar marked 4 inline comments as done.Sep 4 2020, 5:48 AM
acezar updated this revision to Diff 22538.
acezar added inline comments.Sep 4 2020, 5:49 AM
rust/hg-core/src/revlog/patch.rs
27

Be cause it require quite some work and I want to advance rhg files --rev which is far more usefull.

70

Maybe some day

149

It should have been done and documented. But I've documented enough undocumented stuff for now.

acezar updated this revision to Diff 22539.Sep 4 2020, 5:51 AM
Alphare accepted this revision.Sep 6 2020, 3:51 AM
Alphare added inline comments.
rust/hg-core/Cargo.toml
27

This is marked as done, but I don't see it done.

acezar marked an inline comment as done.Sep 9 2020, 5:06 AM
acezar added inline comments.
rust/hg-core/Cargo.toml
27

It is done below

martinvonz added inline comments.
rust/hg-core/src/revlog/index.rs
24

I find offset + INDEX_ENTRY_SIZE <= bytes.len() more natural

34

might be useful to return an error here if offset != bytes.len()

87

It doesn't seem to explain the Some(0) value. I *think* I've read somewhere that the offset for the first entry is known to be 0 and used for storing some other data instead. Could you explain that here (or whatever is the right explanation if I got it wrong)?

88–91

nit: if rev == 0 { Some(0) } else { None } is a little shorter and at least as clear (IMO)

117

if not overridden by offset_override.

It still returns the offset, doesn't it?

122–124

How does this compare to ((BigEndian::read_u16(&self.bytes[0..2]) as usize) << 32) + (BigEndian::read_u32(&self.bytes[2..6]) as usize)?

123

mixing of inclusive and exclusive upper slice bounds for unclear reason

124

nit: unnecessary [..]?

rust/hg-core/src/revlog/patch.rs
257

unnecessary [..] here too

martinvonz requested changes to this revision.Sep 22 2020, 4:27 PM
This revision now requires changes to proceed.Sep 22 2020, 4:27 PM

I didn't review the tests. But this seems like a decent first implementation of a revlog reader. The Python code for revlog reading is substantially more complex. Although a lot of that complexity has to do with extracting maximum performance out of Python. I am very curious how many of those optimizations we'll need in Rust to obtain a similar level of performance! Decompression often dominates revlog read times and I suspect Rust will enable us to leverage multithreading more easily, eventually allowing us to easily surpass Python's speed.

My quality bar for this code is pretty low since it is an initial implementation and I don't want to block more Rust efforts. The remaining review comments are mostly nits. I'll happily approve once those minor changes are made. (I don't want to step on Martin's review!)

rust/hg-core/src/revlog/index.rs
87

Yeah, the initial revision's index bytes are used for storing the revlog header. See https://www.mercurial-scm.org/repo/hg/file/73a5aa5e1857/mercurial/helptext/internals/revlogs.txt#l96.

rust/hg-core/src/revlog/revlog.rs
61

Yeah, unconditionally mmap'ing the data file could lead to issues with very large revlogs. But for an initial implementation, it's probably fine. We can add the complexity later once it is needed.

194

We should return a Result instead of calling expect(). But this can be iterated on after this lands.

martinvonz added inline comments.Sep 22 2020, 10:04 PM
rust/hg-core/src/revlog/patch.rs
12

Can we just call it Chunk or maybe PatchChunk? I think it's a good rule of thumb that if a type is called Foo and the description is "A bar", then maybe the type should be called Bar instead.

14–16

Why signed? Do we need to match what the Python or C code does? Hmm, but it looks like the C version relies on getbe32() in bitmanipulation.h, and that uses uint32_t. That's based on my very cursory look, though.

21

nit: can use anonymous lifetime (unless that's against our coding style or something)

24

Maybe replace "Offset allow to take" by "The offset, taking"

26

drop the "ed" in "offseted"

My quality bar for this code is pretty low since it is an initial implementation and I don't want to block more Rust efforts. The remaining review comments are mostly nits. I'll happily approve once those minor changes are made. (I don't want to step on Martin's review!)

Sure, we can also just queue it (after just reviewing for malicious code). I'll probably queue it later tonight.

My quality bar for this code is pretty low since it is an initial implementation and I don't want to block more Rust efforts. The remaining review comments are mostly nits. I'll happily approve once those minor changes are made. (I don't want to step on Martin's review!)

Sure, we can also just queue it (after just reviewing for malicious code). I'll probably queue it later tonight.

I was about to queue the whole series, but it fails on this first patch :P Can you rebase the series? Thanks.

acezar marked an inline comment as done.Sep 23 2020, 8:07 AM
acezar updated this revision to Diff 22779.
This revision was not accepted when it landed; it landed in state Needs Review.
This revision was automatically updated to reflect the committed changes.

Followups:

Followups:

Thanks!