This is an archive of the discontinued Mercurial Phabricator instance.

commit: improve the files field of changelog for merges
ClosedPublic

Authored by valentin.gatienbaron on Jul 7 2019, 12:18 PM.

Details

Summary

Currently, the files list of merge commits repeats all the deletions
(either actual deletions, or files that got renamed) that happened
between base and p2 of the merge. If p2 is the main branch, the list
can easily be much bigger than the change being merged.

This results in various problems worth improving:

  • changelog is bigger than necessary
  • hg log directory lists many unrelated merge commits, and hg log -v -r commit frequently fills multiple screens worth of files
  • it possibly slows down adjustlinkrev, by forcing it to read more manifests, and that function can certainly be a bottleneck
  • the server side of pulls can waste a lot of time simply opening the filelogs for pointless files (the constant factors for opening even a tiny filelog is apparently pretty bad)

So stop listing such files as described in the code. Impacted merge
commits and their descendants get a different hash than they would
have without this. This doesn't seem problematic, except for
convert. The previous commit helped with that in the hg->hg case (but
if you do svn->hg twice from scratch, hashes can still change).

The rest of the description is numbers. I don't have much to report,
because recreating the files list of existing repositories is not
easy:

  • debugupgradeformat and bundle/unbundle don't recreate the list
  • export/import tends to choke quickly applying patches or on description that contain diffs,
  • merge commits from the convert extension don't have the right files list for reasons orthogonal to the current commit
  • replaying the merge with hg update/hg merge/hg revert --all/hg commit can end up failing in hg revert
  • I wasn't sure that using debugsetparents + debugrebuilddirstate would really build the right thing

I measured commit time before and after this change, in a case with no
files filtered out, several files filtered out (no difference) and 5k
files filtered out (+1% time).

Recreating the 100 more recent merges in a private repo, the
concatenated uncompressed files lists goes from 1.12MB to
0.52MB. Excluding 3 merges that are not representative, then the size
goes from 570k to 15k.
I converted part of mozilla-central, and observed file list shrinking
quite a bit too, starting at the very first merge, 733641d9feaf, going
from 550 files to 10 files (although they have relatively few merges,
so they probably wouldn't care).

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

See some thoughts from me and others on https://bz.mercurial-scm.org/show_bug.cgi?id=4292. Is your goal to shrink the changelog? Or that you feel that the current behavior is incorrect? Or something else?

valentin.gatienbaron added a comment.EditedJul 8 2019, 11:03 PM

I hadn't seen the bug you linked to, or your recent change, though I saw ctx.filesadded() and friends (when writing this) and suspected they were written the way they are to work nicely on merges.

I didn't have one single goal when writing this. Hg's behavior is the source of several problems.

One problem is the size of the changelog and bundles. In part because I assume this creates slowness (hg log -u .. for instance), but also because things look broken as a result: the changelog is bigger than the manifestlog, some bundles consist almost purely of this list of files. I can't really analyze these parts of hg until this is fixed because this problem would shadow everything else.
I estimate that the changelog would be half the size with this (75% of the changelog taken by files list, 65% of uncompressed files list size come from merges, lists for merge get 60x shorter, and I assume compression ratio wouldn't get too much worse when shortening these files lists), if I could rewrite the history without changing hashes.

Then a couple of weeks ago, I looked at the speed of hg pull. I noticed that server-side hg was opening 35k filelogs when it really needed 12k (~50% server time doing that IIRC). Adding the filtering from this commit into the server-side of pull made it open 13k filelogs, which sounds like a reasonable number, and should reduce this ~50% of server time (I can't measure whether pull time would be improved given the cost of filtering on the fly, but it certainly wouldn't be any worse).

Finally, I realized that hg log can show seemingly random merge commits because of this, which makes me think this is probably causing other problems that I haven't identified yet.

valentin.gatienbaron edited the summary of this revision. (Show Details)Jul 9 2019, 6:43 AM
valentin.gatienbaron updated this revision to Diff 15830.
mharbison72 requested changes to this revision.Jul 9 2019, 12:45 PM
mharbison72 added a subscriber: mharbison72.

If this goes forward, can this be gated by a config option (even if it is on by default)? IIUC, this will change the hash for merges, and then cascade. That seems problematic for convert operations.

Most converts will do things that change hashes anyway, but the default options don't change anything. This is a convenient behavior for easily migrating to/from LFS, for example. (In theory, anyway. In practice, I've seen repos where a convert alters the changelog.) If there is a config option, convert should probably disable it by default to maintain the current behavior. I can see a use case for enabling it though, to clean up the repo.

This revision now requires changes to proceed.Jul 9 2019, 12:45 PM

If this goes forward, can this be gated by a config option (even if it is on by default)? IIUC, this will change the hash for merges, and then cascade. That seems problematic for convert operations.
Most converts will do things that change hashes anyway, but the default options don't change anything. This is a convenient behavior for easily migrating to/from LFS, for example. (In theory, anyway. In practice, I've seen repos where a convert alters the changelog.) If there is a config option, convert should probably disable it by default to maintain the current behavior. I can see a use case for enabling it though, to clean up the repo.

Given the implementation, it's easy to add a knob to disable the change. But that doesn't seem enough: If you have a repository where commits have been created with a mix of hg before and after this change, there is no single value of the knob that would make convert be the identity.

If this goes forward, can this be gated by a config option (even if it is on by default)? IIUC, this will change the hash for merges, and then cascade. That seems problematic for convert operations.
Most converts will do things that change hashes anyway, but the default options don't change anything. This is a convenient behavior for easily migrating to/from LFS, for example. (In theory, anyway. In practice, I've seen repos where a convert alters the changelog.) If there is a config option, convert should probably disable it by default to maintain the current behavior. I can see a use case for enabling it though, to clean up the repo.

Given the implementation, it's easy to add a knob to disable the change. But that doesn't seem enough: If you have a repository where commits have been created with a mix of hg before and after this change, there is no single value of the knob that would make convert be the identity.

It's not great, but at least if you know that about the repository, you can convert with it off up to the point where it switches over. Then run convert again with it on, and it will continue where it left off. Not user friendly at all, but at least there's a mechanism without having to use a hacked up copy of hg. Of course if there's a better way (maybe convert can recognize the source commit used the old way, and explicitly pass a parameter to localrepo.commitctx() to bypass this code), that would be nice.

Ok. Maybe it would be simpler or more robust to do the direct thing: optionally treat the files list in the input commit as input and reuse them blindly in the resulting commit, when doing a hg->hg conversion without filemap.
(btw, a tweak to test-merge-combinations shows the cases where convert is not the identity: 12-- and -1--, so cases where a file is absent in p1, that p2 added or modified, and the merge redeletes the file. Commit shows such files are modified (rightly) but convert doesn't)

Ok. Maybe it would be simpler or more robust to do the direct thing: optionally treat the files list in the input commit as input and reuse them blindly in the resulting commit, when doing a hg->hg conversion without filemap.

Maybe. I've been wondering if it's possible to just pass along the manifest and changelog instead of recalculating it. It seems there have been other issues over the years[1]. @yuja fixed something manifest related around the time of that thread IIRC. Even if a filemap is in use, the map may not modify early commits. And when those are changed unexpectedly, it makes me wonder what got lost/mangled. I looked through the repos I converted last year without any file mapping, and there were manifest node changes, but also differences in files and files+ in the changelog.

(btw, a tweak to test-merge-combinations shows the cases where convert is not the identity: 12-- and -1--, so cases where a file is absent in p1, that p2 added or modified, and the merge redeletes the file. Commit shows such files are modified (rightly) but convert doesn't)

Is that a non-tracked test? Can you throw a patch up somewhere showing this? One of the things I noticed last year was that I'd get better results if cleanp2[2] was forced to be empty. I was unable to come up with a simple test case showing a different hash with shipping code and a forcibly emptied cleanp2, circa 4.6. What you're describing seems to be the opposite of 216fa1ba9993, but something is wrong with that calculation, so maybe it's worth a try.

[1] https://www.mercurial-scm.org/pipermail/mercurial/2018-June/050921.html
[2] https://www.mercurial-scm.org/repo/hg/file/5.0.2/hgext/convert/hg.py#l552

JordiGH added inline comments.
mercurial/localrepo.py
2664

Not all merges can be detected by checking for p2 being non-null; p1 being null but p2 being non-null is perfectly acceptable everywhere else, although normally hg won't produce commits like this in most circumstances.

Ok. Maybe it would be simpler or more robust to do the direct thing: optionally treat the files list in the input commit as input and reuse them blindly in the resulting commit, when doing a hg->hg conversion without filemap.

Maybe. I've been wondering if it's possible to just pass along the manifest and changelog instead of recalculating it. It seems there have been other issues over the years[1]. @yuja fixed something manifest related around the time of that thread IIRC. Even if a filemap is in use, the map may not modify early commits. And when those are changed unexpectedly, it makes me wonder what got lost/mangled. I looked through the repos I converted last year without any file mapping, and there were manifest node changes, but also differences in files and files+ in the changelog.

Ideally, the whole file list would move out the changelog. Then changing the list of modified files wouldn't change hashes anymore, but of course at the cost of changing everything once.
Given that the filelist in the commit is (ignoring changes to mercurial) a function of the manifest, it may be reasonable to make convert reuse the filelist if the manifestnode gets reused. Or to put it differently, if the conversion is the identity so far for a commit, keep behaving as the identity.

(btw, a tweak to test-merge-combinations shows the cases where convert is not the identity: 12-- and -1--, so cases where a file is absent in p1, that p2 added or modified, and the merge redeletes the file. Commit shows such files are modified (rightly) but convert doesn't)

Is that a non-tracked test? Can you throw a patch up somewhere showing this? One of the things I noticed last year was that I'd get better results if cleanp2[2] was forced to be empty. I was unable to come up with a simple test case showing a different hash with shipping code and a forcibly emptied cleanp2, circa 4.6. What you're describing seems to be the opposite of 216fa1ba9993, but something is wrong with that calculation, so maybe it's worth a try.

Saying "the cases where convert misbehaves" is probably optimistic. It shows some cases where convert misbehaves. test-merge-combination.t is in the previous differential. cleanp2 seems orthogonal, as fiddling with it doesn't change what I see. This is the tweak I was talking about (based on this differential):

diff --git a/tests/test-merge-combination.t b/tests/test-merge-combination.t
--- a/tests/test-merge-combination.t
+++ b/tests/test-merge-combination.t
@@ -58,9 +58,16 @@ revision. "C" indicates that hg merge ha
   >           else expected=a
   >           fi
   >           got=`hg log -r 3 --template '{files}\n' | tr --delete 'e '`
+  >           hg --config extensions.convert= convert . repo-convert -s hg -d hg --sourcesort -q
+  >           got2=`hg --cwd repo-convert log -r 3 --template '{files}\n' | tr --delete 'e '`
   >           if [ "$got" = "$expected" ]
-  >           then echo "$line$conflicts: agree on \"$got\""
-  >           else echo "$line$conflicts: hg said \"$got\", expected \"$expected\""
+  >           then echo -n "$line$conflicts: agree on \"$got\""
+  >           else echo -n "$line$conflicts: hg said \"$got\", expected \"$expected\""
+  >           fi
+  >           if [ "$got" = "$got2" ]; then
+  >             echo
+  >           else
+  >             echo ", commit:\"$got\" vs convert:\"$got2\""
   >           fi
   >           cd ../
   >           rm -rf repo
@@ -115,7 +122,7 @@ All the merges of various file contents.
   12-1 C: agree on "a"
   12-2 C: hg said "", expected "a"
   12-3 C: agree on "a"
-  12-- C: agree on "a"
+  12-- C: agree on "a", commit:"a" vs convert:""
   1-11  : hg said "", expected "a"
   1-12  : agree on "a"
   1-1-  : agree on ""
@@ -135,7 +142,7 @@ All the merges of various file contents.
   -12- C: agree on "a"
   -1-1  : agree on ""
   -1-2  : agree on "a"
-  -1--  : agree on "a"
+  -1--  : agree on "a", commit:"a" vs convert:""
   --11  : agree on ""
   --12  : agree on "a"
   --1-  : agree on "a"
mercurial/localrepo.py
2664
JordiGH added inline comments.Jul 11 2019, 8:51 AM
mercurial/localrepo.py
2664

Hm, yes, you're right. I suppose it's a coincidence that a lot of things still work if p1 is null but p2 isn't. It's certainly not something that hg verify nor hg log complains about.

valentin.gatienbaron retitled this revision from commit: improve the files field of changelog for merges (RFC) to commit: improve the files field of changelog for merges.
valentin.gatienbaron edited the summary of this revision. (Show Details)
valentin.gatienbaron updated this revision to Diff 15905.

@mharbison72 I dealt with the problem of convert in the previous commit. I didn't add a config option to disable the new code path, because it's really awkward to document, and I think it would be annoying to use if I made more fixes (the other case where files are spuriously listed, involving exec bits, probably has little impact on repo size/pull performance but it may be worth fixing anyway because it causes confusion). But if there are downsides to what I did, the config option is still on the table.

mharbison72 accepted this revision.Jul 16 2019, 9:46 PM

@mharbison72 I dealt with the problem of convert in the previous commit. I didn't add a config option to disable the new code path, because it's really awkward to document, and I think it would be annoying to use if I made more fixes (the other case where files are spuriously listed, involving exec bits, probably has little impact on repo size/pull performance but it may be worth fixing anyway because it causes confusion). But if there are downsides to what I did, the config option is still on the table.

I'm still fuzzy on how all of the manifest stuff works in general. But it looks like the previous patch addresses my convert concerns, so I'm OK with not having a config option to disable this.

durin42 accepted this revision as: durin42.Jul 17 2019, 6:25 PM
This revision was not accepted when it landed; it landed in state Needs Review.
This revision was automatically updated to reflect the committed changes.