This is an archive of the discontinued Mercurial Phabricator instance.

rhg: Don’t compare ambiguous files one byte at a time
ClosedPublic

Authored by SimonSapin on Sep 13 2021, 2:14 PM.

Details

Summary

Even though the use of BufReader reduces the number of syscalls to read
the file from disk, .bytes() yields a separate Result for every byte.
Creating those results and dispatching on them is most likely costly.

Instead, this commit opts for simplicity by reading the entire file into memory
and comparing a single pair of byte strings. Note that memory already needs to
contain the entire previous contents of the file, as read from the filelog.
So with an extremely large file this doubles memory use but does not make it
grow by orders of magnitude.

At first I wrote code that still avoids reading the entire file into memory
and compares one buffer at a time with BufReader. Find this code below for
posterity. However its correctness is subtle. I ended up preferring the
simplicity of the obviously-correct single comparison.

rust
let mut reader = BufReader::new(fobj);
let mut expected = &contents_in_p1[..];
loop {
    let buf = reader.fill_buf().when_reading_file(&fs_path)?;
    if buf.is_empty() {
        // Found EOF
        return Ok(expected.is_empty());
    } else if let Some(rest) = expected.drop_prefix(buf) {
        // What we read so far matches the expected content, continue reading
        let buf_len = buf.len();
        reader.consume(buf_len);
        expected = rest
    } else {
        // Found different content
        return Ok(false);
    }
}

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

SimonSapin created this revision.Sep 13 2021, 2:14 PM

I don't think the buffered one is much worse? I'm fine with the general case being that we don't care about reading the entire file into memory since it should be fast enough, but maybe this will be worth revisiting some day (with a few benchmarks).

rust/Cargo.lock
3 ↗(On Diff #30230)

I think this shouldn't have been included. What is the reason for this, I'm curious?

SimonSapin added inline comments.Sep 14 2021, 12:12 AM
rust/Cargo.lock
3 ↗(On Diff #30230)

Reverted.

What happened is that at first CI failed with:

error[E0277]: can't compare `&[u8]` with `std::vec::Vec<u8>`
   --> rhg/src/commands/status.rs:280:30
    |
280 |     return Ok(contents_in_p1 == fs_contents);
    |                              ^^ no implementation for `&[u8] == std::vec::Vec<u8>`
    |
    = help: the trait `std::cmp::PartialEq<std::vec::Vec<u8>>` is not implemented for `&[u8]`

… even though the same code compiled on my machine. It looks like that PartialEq impl was added at some point between 1.41 and 1.55. I added &* to compare two &[u8] values instead, and ran cargo +1.41.1 test to double-check. I assume the older Cargo removed the version line it doesn’t know about, or something like that. Then I didn’t look at the diff again when amending and pushing.

Alphare accepted this revision.Sep 14 2021, 3:50 AM
This revision is now accepted and ready to land.Sep 14 2021, 3:50 AM