Only covers the needs of the upcoming rhg debugdata command.
Details
- Reviewers
Alphare martinvonz - Group Reviewers
hg-reviewers - Commits
- rHG26c53ee51c68: hg-core: Add a limited read only `revlog` implementation
Diff Detail
- Repository
- rHG Mercurial
- Lint
Automatic diff as part of commit; lint not applicable. - Unit
Automatic diff as part of commit; unit tests not applicable.
Event Timeline
This is my partial review, as indicated by my last comment, I'll be resuming it later today.
rust/hg-core/Cargo.toml | ||
---|---|---|
27 | You could add a comment saying that we don't use the miniz-oxide backend because its minimum Rust version is 1.36. However, this PR (https://github.com/Frommi/miniz_oxide/pull/84/files) introduces a flag no_extern_crate_alloc to bring the requirement back down to 1.34, so maybe leave a comment and we'll benchmark this later? (There also is the zlib implem from Cloudflare IIRC) | |
rust/hg-core/src/revlog/index.rs | ||
11 | nit: "Offsets of starts of index blocks" | |
47 | nit: s/then/the/ | |
106 | nit: corresponds | |
107 | nit: "overridden", "after" | |
112 | nit: overridden | |
113 | Is it right to call it offset_override if it's always overridden? Not that I care *that* much, but offset would be shorter. | |
117 | nit: overridden | |
122 | This needlessly allocates a Vec<u8>, this version doesn't: let mut arr = [0;8]; arr.copy_from_slice(&self.bytes[0..=5]); You can compare the assembly output here: https://godbolt.org/z/rdbb4P. (I've also tried a very dumb version with manual indexing and no ByteOrder, but index checking doesn't kick in and is less readable anyway. | |
rust/hg-core/src/revlog/patch.rs | ||
7 | Please use backticks (`) with inline code | |
27 | If it's the responsibility of an already-present method, why not delete it now? | |
55 | nit: revisions' | |
70 | We could potentially have a heuristic of approximately how many entries there are given a the length of data, but that's probably for later. | |
90 | This should probably be self.frags.fold(initial_size, |acc, m| acc + m.len_diff()), since iterators usually allow for better optimization (also it's less code, but the readability is debatable, I like it personally). |
Overall this looks good. I have obviously left many comments, but a lot of them are nits, and it's a huge patch.
rust/hg-core/src/revlog/patch.rs | ||
---|---|---|
249 | Why the lifetime? | |
262 | nit: probably too many parenthesis. | |
rust/hg-core/src/revlog/revlog.rs | ||
39 | I think this should probably be an error of sorts that falls back to Mercurial and not just a panic? | |
96 | It can probably be a &[RevlogEntry], but that doesn't matter much here. | |
98 | It's more efficient to do .rev on the iterator, this also removes the need for mut. | |
100 | You should only collect() once | |
102 | nit: .map(patch::PatchList::new) | |
169 | Should that be a fallback? | |
186 | nit: s/zst/zstd/ | |
196 | assert_eq! would be clearer and have a more useful error message |
rust/hg-core/src/revlog/index.rs | ||
---|---|---|
122 | Oops, should be let mut arr = [0;8]; arr[2..8].copy_from_slice(&bytes[0..=5]); |
rust/hg-core/src/revlog/index.rs | ||
---|---|---|
113 | fn offset() does not always return the overridden offset. | |
rust/hg-core/src/revlog/patch.rs | ||
249 | It was part of old code using &[u8] instead of Vec<u8> | |
rust/hg-core/src/revlog/revlog.rs | ||
96 | The actual RevlogEntry purpose is to get the partial entry data. It would requires two distinct structs to do what I think you suggest. Anyway, I see little interest in wrapping the final data right now in a generic struct as it is specific to the revlog kind. In a set of patches I'll submit latter, I have a ChanglogEntry that make sense of the data in the context of the changlog and a ManifestEntry that make sense of the data in the context of the manifest. ChangelogEntry having a manifest_node method while ManifestEntry as a files method. But not knowing what is in all the revlogs of all kinds, I don't know if there is some common pattern in their organization. Are they always lines organized? Is \0 always the field separator? | |
100 | Leads to borrowing issue. | |
102 | Some compilation issue here too. I will investigate. |
rust/hg-core/src/revlog/revlog.rs | ||
---|---|---|
94 | I missed it, but use of self is unnecessary, it should be an associated method. | |
96 | I was merely commenting on the fact that you used a Vec<_> and not a &[_], which is really nitpicking since it's a private function that forced no cloning or allocation in the first place, but it's always a good idea to take by reference if that's not more effort. Regarding your idea about generic RevlogEntry vs *log-specific structures, I too don't have enough perspective to be sure how the implementation should go. Maybe it should be a trait, for example. But let's stick with your idea of figuring things out along the way, that sounds good. | |
102 | This should work, if not we can talk about it on IRC so as to not pollute the review with debugging. |
rust/hg-core/src/revlog/revlog.rs | ||
---|---|---|
102 | I suppose so. I feel like there's something to be improved here, but if it doesn't show up in profiling I suppose it'll be good enough. |
rust/hg-core/Cargo.toml | ||
---|---|---|
27 | This is marked as done, but I don't see it done. |
rust/hg-core/Cargo.toml | ||
---|---|---|
27 | It is done below |
rust/hg-core/src/revlog/index.rs | ||
---|---|---|
24 | I find offset + INDEX_ENTRY_SIZE <= bytes.len() more natural | |
34 | might be useful to return an error here if offset != bytes.len() | |
87 | It doesn't seem to explain the Some(0) value. I *think* I've read somewhere that the offset for the first entry is known to be 0 and used for storing some other data instead. Could you explain that here (or whatever is the right explanation if I got it wrong)? | |
88–91 | nit: if rev == 0 { Some(0) } else { None } is a little shorter and at least as clear (IMO) | |
117 |
It still returns the offset, doesn't it? | |
122–124 | How does this compare to ((BigEndian::read_u16(&self.bytes[0..2]) as usize) << 32) + (BigEndian::read_u32(&self.bytes[2..6]) as usize)? | |
123 | mixing of inclusive and exclusive upper slice bounds for unclear reason | |
124 | nit: unnecessary [..]? | |
rust/hg-core/src/revlog/patch.rs | ||
257 | unnecessary [..] here too |
I didn't review the tests. But this seems like a decent first implementation of a revlog reader. The Python code for revlog reading is substantially more complex. Although a lot of that complexity has to do with extracting maximum performance out of Python. I am very curious how many of those optimizations we'll need in Rust to obtain a similar level of performance! Decompression often dominates revlog read times and I suspect Rust will enable us to leverage multithreading more easily, eventually allowing us to easily surpass Python's speed.
My quality bar for this code is pretty low since it is an initial implementation and I don't want to block more Rust efforts. The remaining review comments are mostly nits. I'll happily approve once those minor changes are made. (I don't want to step on Martin's review!)
rust/hg-core/src/revlog/index.rs | ||
---|---|---|
87 | Yeah, the initial revision's index bytes are used for storing the revlog header. See https://www.mercurial-scm.org/repo/hg/file/73a5aa5e1857/mercurial/helptext/internals/revlogs.txt#l96. | |
rust/hg-core/src/revlog/revlog.rs | ||
61 | Yeah, unconditionally mmap'ing the data file could lead to issues with very large revlogs. But for an initial implementation, it's probably fine. We can add the complexity later once it is needed. | |
194 | We should return a Result instead of calling expect(). But this can be iterated on after this lands. |
rust/hg-core/src/revlog/patch.rs | ||
---|---|---|
12 | Can we just call it Chunk or maybe PatchChunk? I think it's a good rule of thumb that if a type is called Foo and the description is "A bar", then maybe the type should be called Bar instead. | |
14–16 | Why signed? Do we need to match what the Python or C code does? Hmm, but it looks like the C version relies on getbe32() in bitmanipulation.h, and that uses uint32_t. That's based on my very cursory look, though. | |
21 | nit: can use anonymous lifetime (unless that's against our coding style or something) | |
24 | Maybe replace "Offset allow to take" by "The offset, taking" | |
26 | drop the "ed" in "offseted" |
My quality bar for this code is pretty low since it is an initial implementation and I don't want to block more Rust efforts. The remaining review comments are mostly nits. I'll happily approve once those minor changes are made. (I don't want to step on Martin's review!)
Sure, we can also just queue it (after just reviewing for malicious code). I'll probably queue it later tonight.
I was about to queue the whole series, but it fails on this first patch :P Can you rebase the series? Thanks.
Followups:
- D9100 hg-core: return Err on decompression error (D8958#inline-15004 followup)
- D9107 hg-core: return Err if offset != bytes.len() (D8958#inline-14994 followup 2/2)
- D9106 hg-core: make Index owner of its bytes (D8958#inline-14994 followup 1/2)
- D9105 hg-core: renaming of Chunk offset methods (D8958#inline-15002 followup)
- D9104 hg-core: minor rewording in docstring (D8958#inline-15005 followup)
- D9103 hg-core: use anonymous lifetime for impl Chunk (D8958#inline-15003 followup)
- D9102 hg-core: use u32 instead of i32 in Chunk (D8958#inline-15001 followup)
- D9101 hg-core: use the term chunk instead of frag (D8958#inline-15000 followup)
- D9099 hg-core: remove useless code (D8958#inline-14988 followup)
- D9098 hg-core: avoid memory allocation (D8958#inline-14990 followup)
- D9097 hg-core: minor docstring update (D8958#inline-14991 followup)
- D9096 hg-core: minor code style change (D8958#inline-14986 followup)
- D9095 hg-core: Explain offset override of first revision (D8958#inline-14992 followup)
- D9083 hg-core: minor code style change (D8958#inline-14993 followup)
You could add a comment saying that we don't use the miniz-oxide backend because its minimum Rust version is 1.36. However, this PR (https://github.com/Frommi/miniz_oxide/pull/84/files) introduces a flag no_extern_crate_alloc to bring the requirement back down to 1.34, so maybe leave a comment and we'll benchmark this later? (There also is the zlib implem from Cloudflare IIRC)