This is an archive of the discontinued Mercurial Phabricator instance.

Differential D11412

rhg: Don’t compare ambiguous files one byte at a time
ClosedPublic

Authored by SimonSapin on Sep 13 2021, 2:14 PM.


Details

Reviewers

Alphare

Group Reviewers

hg-reviewers

Commits

rHGf9e6f2bb721d: rhg: Don’t compare ambiguous files one byte at a time

Summary

Even though the use of BufReader reduces the number of syscalls to read
the file from disk, .bytes() yields a separate Result for every byte.
Creating those results and dispatching on them is most likely costly.

Instead, this commit opts for simplicity by reading the entire file into memory
and comparing a single pair of byte strings. Note that memory already needs to
contain the entire previous contents of the file, as read from the filelog.
So with an extremely large file this doubles memory use but does not make it
grow by orders of magnitude.
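
As an illustration of the trade-off described above, here is a minimal sketch of both strategies using only the standard library. The names `equal_bytewise` and `equal_whole` are hypothetical, and `std::fs::read` stands in for the repository VFS that rhg actually goes through. Both helpers return `true` when the contents are equal, which is a convention chosen for this example only.

```rust
use std::fs::{self, File};
use std::io::{self, BufReader, Read};
use std::path::Path;

/// Byte-at-a-time comparison (hypothetical helper).
/// `.bytes()` yields one `io::Result<u8>` per byte, so every single byte
/// costs a `Result` construction plus an error check.
fn equal_bytewise(path: &Path, expected: &[u8]) -> io::Result<bool> {
    let mut expected_bytes = expected.iter();
    for fs_byte in BufReader::new(File::open(path)?).bytes() {
        let fs_byte = fs_byte?;
        if expected_bytes.next() != Some(&fs_byte) {
            // Either a mismatch, or the file is longer than `expected`.
            return Ok(false);
        }
    }
    // The file is exhausted: equal only if `expected` is exhausted too.
    Ok(expected_bytes.next().is_none())
}

/// Whole-file comparison (hypothetical helper).
/// Reads the entire file into memory, then does one slice comparison.
fn equal_whole(path: &Path, expected: &[u8]) -> io::Result<bool> {
    let fs_contents = fs::read(path)?;
    Ok(expected == &*fs_contents)
}
```

Both helpers can be called as `equal_whole(Path::new("some/file"), &expected_bytes)`. The second one holds the whole file in memory at once, which is the doubling of memory use mentioned in the summary.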

At first I wrote code that still avoids reading the entire file into memory
and compares one buffer at a time with BufReader. Find this code below for
posterity. However its correctness is subtle. I ended up preferring the
simplicity of the obviously-correct single comparison.

```rust
let mut reader = BufReader::new(fobj);
let mut expected = &contents_in_p1[..];
loop {
    let buf = reader.fill_buf().when_reading_file(&fs_path)?;
    if buf.is_empty() {
        // Found EOF
        return Ok(expected.is_empty());
    } else if let Some(rest) = expected.drop_prefix(buf) {
        // What we read so far matches the expected content, continue reading
        let buf_len = buf.len();
        reader.consume(buf_len);
        expected = rest
    } else {
        // Found different content
        return Ok(false);
    }
}
```
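
For readers who want to try the buffered variant outside of the Mercurial code base, the following is a self-contained sketch of the same idea. It is an adaptation, not the code from this revision: the function name `reader_matches` is hypothetical, plain `io::Result` replaces the `when_reading_file` error wrapper, and the standard library's `slice::strip_prefix` (stable since Rust 1.51, so newer than the 1.41 toolchain discussed on this page) is assumed as a stand-in for the `drop_prefix` helper.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Returns `Ok(true)` if `reader` yields exactly the bytes of `expected`
/// (hypothetical helper, standard library only).
fn reader_matches<R: Read>(reader: R, mut expected: &[u8]) -> io::Result<bool> {
    let mut reader = BufReader::new(reader);
    loop {
        // Borrow whatever is currently buffered, refilling if needed.
        let buf = reader.fill_buf()?;
        if buf.is_empty() {
            // EOF: the contents match only if nothing is left to expect.
            return Ok(expected.is_empty());
        }
        match expected.strip_prefix(buf) {
            Some(rest) => {
                // This chunk matches the start of what is still expected.
                let buf_len = buf.len();
                reader.consume(buf_len);
                expected = rest;
            }
            // Either the bytes differ, or the chunk is longer than what
            // remains of `expected`.
            None => return Ok(false),
        }
    }
}
```

Any `Read` implementation works as input, for example `reader_matches(File::open(path)?, &expected_bytes)`. The subtle parts are that an empty buffer from `fill_buf` is the only EOF signal, and that a chunk longer than the remaining expected bytes must count as a mismatch.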

Diff Detail

Repository

rHG Mercurial

Branch

default

Lint

No Linters Available

Unit

No Unit Test Coverage

Event Timeline

SimonSapin created this revision. Sep 13 2021, 2:14 PM

Herald added a reviewer: hg-reviewers. Sep 13 2021, 2:14 PM

Herald added a subscriber: mercurial-patches.

I don't think the buffered one is much worse? I'm fine with the general case being that we don't care about reading the entire file into memory since it should be fast enough, but maybe this will be worth revisiting some day (with a few benchmarks).

rust/Cargo.lock
3	I think this shouldn't have been included. What is the reason for this, I'm curious?

SimonSapin updated this revision to Diff 30231. Sep 14 2021, 12:07 AM

SimonSapin added inline comments. Sep 14 2021, 12:12 AM

rust/Cargo.lock
3	Reverted. What happened is that at first CI failed with:

```
error[E0277]: can't compare `&[u8]` with `std::vec::Vec<u8>`
   --> rhg/src/commands/status.rs:280:30
    |
280 |     return Ok(contents_in_p1 == fs_contents);
    |                              ^^ no implementation for `&[u8] == std::vec::Vec<u8>`
    |
    = help: the trait `std::cmp::PartialEq<std::vec::Vec<u8>>` is not implemented for `&[u8]`
```

… even though the same code compiled on my machine. It looks like that `PartialEq` impl was added at some point between 1.41 and 1.55. I added `&*` to compare two `&[u8]` values instead, and ran `cargo +1.41.1 test` to double-check.

I assume the older Cargo removed the version line it doesn’t know about, or something like that. Then I didn’t look at the diff again when amending and pushing.
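
The `&*` workaround mentioned in this comment can be shown in isolation. This is a minimal sketch with a hypothetical function name, `same_bytes`. It compiles regardless of whether the toolchain has a `PartialEq<Vec<u8>>` impl for `&[u8]`, because both operands end up with the type `&[u8]`.

```rust
/// Compares a borrowed slice with an owned buffer (hypothetical helper).
fn same_bytes(expected: &[u8], actual: Vec<u8>) -> bool {
    // `*actual` dereferences the `Vec<u8>` to the unsized `[u8]`, and the
    // leading `&` re-borrows it, so this is a `&[u8] == &[u8]` comparison.
    expected == &*actual
}
```

Writing `expected == actual.as_slice()` would be an equivalent spelling. The `&*` form is the one used in the diff on this page.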

Alphare accepted this revision. Sep 14 2021, 3:50 AM

This revision is now accepted and ready to land. Sep 14 2021, 3:50 AM

SimonSapin added a commit: rHGf9e6f2bb721d: rhg: Don’t compare ambiguous files one byte at a time. Sep 14 2021, 4:20 AM

Closed by commit rHGf9e6f2bb721d: rhg: Don’t compare ambiguous files one byte at a time (authored by SimonSapin).

This revision was automatically updated to reflect the committed changes.

Revision Contents
Changeset List

			Path	Packages
M			rust/Cargo.lock (2 lines)
M			rust/rhg/src/commands/status.rs (31 lines)

Diff	ID	Description	Created	Lint	Unit
Base		Base
Diff 1	30230		Sep 13 2021, 2:14 PM	★	★
Diff 2	30231		Sep 14 2021, 12:07 AM	★	★
Diff 3	30240	rHGf9e6f2bb721dccbd56b0b3f56c08fccb9b509fb3	Sep 13 2021, 12:48 PM	★	★

Commit	Parents	Author	Summary	Date
fe28bb3bfa45	50bda76239dd	Simon Sapin		Sep 13 2021, 12:48 PM

Status	Author	Revision
Closed	SimonSapin	D11412 rhg: Don’t compare ambiguous files one byte at a time
Closed	SimonSapin	D11411 rhg: Reuse manifest when checking status of multiple ambiguous files
Closed	SimonSapin	D11410 rust: Return HgError instead of RevlogError in revlog constructors
Closed	SimonSapin	D11409 rhg: Align with Python on some revset parsing corner cases
Closed	SimonSapin	D11408 rust: Add a Filelog struct that wraps Revlog
Closed	SimonSapin	D11407 rust: Add Repo::manifest(revision)
Closed	SimonSapin	D11406 rust: Keep lazily-initialized Changelog and Manifest log on the Repo object
Closed	SimonSapin	D11405 rust: Move lazy initialization of `Repo::dirstate_map` into a generic struct
Closed	SimonSapin	D11404 rust: Rename Manifest to Manifestlog, ManifestEntry to Manifest

Diff 30230

rust/Cargo.lock

 # This file is automatically @generated by Cargo.
 # It is not intended for manual editing.
-version = 3

 [[package]]
 name = "adler"
 version = "0.2.3"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "ee2a4ec343196209d6594e19543ae87a39f96d5534d7174822a3ad825dd6ed7e"

 [[package]]
 name = "aho-corasick"

rust/rhg/src/commands/status.rs

 // status.rs
 //
 // Copyright 2020, Georges Racinet <georges.racinets@octobus.net>
 //
 // This software may be used and distributed according to the terms of the
 // GNU General Public License version 2 or any later version.

 use crate::error::CommandError;
 use crate::ui::Ui;
 use clap::{Arg, SubCommand};
 use hg;
 use hg::dirstate_tree::dispatch::DirstateMapMethods;
-use hg::errors::{HgError, IoResultExt};
+use hg::errors::HgError;
 use hg::manifest::Manifest;
 use hg::matchers::AlwaysMatcher;
 use hg::repo::Repo;
 use hg::utils::hg_path::{hg_path_to_os_string, HgPath};
 use hg::{HgPathCow, StatusOptions};
 use log::{info, warn};
-use std::convert::TryInto;
-use std::fs;
-use std::io::BufReader;
-use std::io::Read;

 pub const HELP_TEXT: &str = "
 Show changed files in the working directory

 This is a pure Rust version of `hg status`.

 Some options might be missing, check the list below.
 ";
 ...
         .find_file(hg_path)?
         .expect("ambgious file not in p1");
     let filelog = repo.filelog(hg_path)?;
     let filelog_entry = filelog.get_node(file_node).map_err(|_| {
         HgError::corrupted("filelog missing node from manifest")
     })?;
     let contents_in_p1 = filelog_entry.data()?;

-    let fs_path = repo
-        .working_directory_vfs()
-        .join(hg_path_to_os_string(hg_path).expect("HgPath conversion"));
-    let hg_data_len: u64 = match contents_in_p1.len().try_into() {
-        Ok(v) => v,
-        Err(_) => {
-            // conversion of data length to u64 failed,
-            // good luck for any file to have this content
-            return Ok(true);
-        }
-    };
-    let fobj = fs::File::open(&fs_path).when_reading_file(&fs_path)?;
-    if fobj.metadata().when_reading_file(&fs_path)?.len() != hg_data_len {
-        return Ok(true);
-    }
-    for (fs_byte, &hg_byte) in BufReader::new(fobj).bytes().zip(contents_in_p1)
-    {
-        if fs_byte.when_reading_file(&fs_path)? != hg_byte {
-            return Ok(true);
-        }
-    }
-    Ok(false)
+    let fs_path = hg_path_to_os_string(hg_path).expect("HgPath conversion");
+    let fs_contents = repo.working_directory_vfs().read(fs_path)?;
+    return Ok(contents_in_p1 == &*fs_contents);
 }