This is an archive of the discontinued Mercurial Phabricator instance.

copies: calculate mergecopies() based on pathcopies()
ClosedPublic

Authored by martinvonz on Apr 16 2019, 1:16 PM.

Details

Summary

When copies are stored in changesets, we need a changeset-centric
version of mergecopies() just like we have a changeset-centric version
of pathcopies(). I think the natural way of thinking about
mergecopies() is in terms of pathcopies() from the base to each of the
commits. So if we can rewrite mergecopies() based on two such
pathcopies() calls, we'll get the changeset-centric version for
free. That's what this patch does.

A nice bonus is that it ends up being a lot simpler. mergecopies() has
accumulated a lot of technical debt over time. One good example is the
code for dealing with grafts (the "partial/incomplete/dirty"
stuff). Since pathcopies() already deals with backwards renames and
ping-pong renames, we get that for free.

I've run tests with hard-coded debug logging for "fullcopy" and while
I haven't looked at every difference it produces, all the ones I have
looked at seemed reasonable to me. I'm a little surprised that no more
tests fail when run with '--extra-config-opt
experimental.copies.read-from=compatibility' compared to before this
patch. This patch also fixes the broken cases in test-annotate.t and
test-fastannotate.t. It also enables the part of test-copies.t that
was previously disabled exactly because mergecopies() needed to get a
changeset-centric version.

One drawback of the rewritten code is that we may now make
remotefilelog prefetch more files. We used to prefetch files that were
unique to either side of the merge compared to the other. We now
prefetch files that are unique to either side of the merge compared to
the base. This means that if you added the same file to each side, we
would not prefetch it before, but we would now. Such cases are
probably quite rare, but one likely scenario where they happen is when
moving from a commit to its successor (or the other way around). The
user will probably already have the files in the cache in such cases,
so it's probably not a big deal.

Some timings for calculating mergecopies between two revisions
(revisions shown on each line, all using the common ancestor as base):

In the hg repo:
4.8 4.9: 0.21s -> 0.21s
4.0 4.8: 0.35s -> 0.63s

In and old copy of the mozilla-unified repo:
FIREFOX_BETA_60_BASE^ FIREFOX_BETA_60_BASE: 0.82s -> 0.82s
FIREFOX_NIGHTLY_59_END FIREFOX_BETA_60_BASE: 2.5s -> 2.6s
FIREFOX_BETA_59_END FIREFOX_BETA_60_BASE: 3.9s -> 4.1s
FIREFOX_AURORA_50_BASE FIREFOX_BETA_60_BASE: 31s -> 33s

So it's measurably slower in most cases. The most significant
difference is in the hg repo between revisions 4.0 and 4.8. In that
case it seems to come from the fact that pathcopies() uses
fctx.isintroducedafter() (in _tracefile), while the old mergecopies()
used fctx.linkrev() (in _checkcopies()). That results in a single call
to filectx._adjustlinkrev(), which is responsible for the entire
difference in time (in my repo). So we pay a performance penalty but
we get more correct code (see change in
test-mv-cp-st-diff.t). Deleting the "== f.filenode()" in _tracefile()
recovers the lost performance in the hg repo.

There were are few other optimizations in _checkcopies() that I could
not measure any impact from. One was from the "seen" set. Another was
from a "continue" when the file was not in the destination manifest
(corresponding to "am" in _tracefile).

Also note that merge copies are not calculated when updating with a
clean working copy, which is probably the most common case. I
therefore think the much simpler code is worth the slowdown.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

martinvonz created this revision.Apr 16 2019, 1:16 PM
martinvonz edited the summary of this revision. (Show Details)Apr 16 2019, 1:59 PM

I did a first path through it, the new code seems reasonable and easier
to read than the previous one. Some comments and questions below.

I did a first path through it, the new code seems reasonable and easier
to read than the previous one. Some comments and questions below.

The rest somehow didn't make it here, so I'll copy from the email (i.e. the below is from Pierre-Yves, not from me):

I did a first path through it, the new code seems reasonable and easier
to read than the previous one. Some comments and questions below.

On 4/16/19 7:19 PM, martinvonz (Martin von Zweigbergk) wrote:

martinvonz created this revision.
Herald added subscribers: mercurial-devel, mjpieters.
Herald added a reviewer: hg-reviewers.
REVISION SUMMARY
    When copies are stored in changesets, we need a changeset-centric
    version of mergecopies() just like we have a changeset-centric version
    of pathcopies(). I think the natural way of thinking about
    mergecopies() is in terms of pathcopies() from the base to each of the
    commits. So if we can rewrite mergecopies() based two such
    pathcopies() calls, we'll get the changeset-centric version for
    free. That's what this patch does.

I like the approach, if I understand it correctly, we'll have less
duplicated code in the end.

   
    A nice bonus is that it ends up being a lot simpler. mergecopies() has
    accumulated a lot of technical debt over time. One good example is the
    code for dealing with grafts (the "partial/incomplete/dirty"
    stuff). Since pathcopies() already deals with backwards renames and
    ping-pong renames, we get that for free.
   
    I've run tests with hard-coded debug logging for "fullcopy" and while
    I haven't looked at every difference it produces, all the ones I have
    looked at seemed reasonable to me.

How many difference did you had? can you share some example of them with
their explanation?

on related topic, you seems to be fixing a case in
test-fastannotate-hg.t and test-annotate-hg.t that is probably worth
mentioning.

    One drawback of the rewritten code is that we may now make
    remotefilelog prefetch more files. We used to prefetch files that were
    unique to either side of the merge compared to the other. We now
    prefetch files that are unique to either sise of the merge compared to

sise → side

    the base. This means that if you added the same file to each side, we
    would not prefetch it before, but we would now. Such cases are
    probably quite rare, but one likely scenario where they happen is when
    moving from a commit to its successor (or the other way around). The
    user will probably already have the files in the cache in such cases,
    so it's probably not a big deal.
   
    Some timings for calculating mergecopies between two revisions (all
    using the common ancestor as base):

Which revisions did you pick?

(for the record, the benchmark suite use 1daa622bbe42 and 76caed42cf7c)

    In the hg repo:
    4.8 4.8: 0.21s -> 0.21s
    4.0 4.8: 0.35s -> 0.63s
   
    In and old copy of the mozilla-unified repo:
    FIREFOX_BETA_60_BASE^ FIREFOX_BETA_60_BASE: 0.51s -> 0.60s
    FIREFOX_NIGHTLY_59_END FIREFOX_BETA_60_BASE: 2.1s -> 2.3s
    FIREFOX_BETA_59_END FIREFOX_BETA_60_BASE: 3.1s -> 3.3s
    FIREFOX_AURORA_50_BASE FIREFOX_BETA_60_BASE: 30s -> 35s
   
    So it's measurably slower in most cases. Note that merge copies are
    not calculated when updating with a clean working copy, which is
    probably the most common case. I therefore think the much simpler code
    is worth the slowdown.

Do you know where the slowdown comes from ?

REPOSITORY
    rHG Mercurial
REVISION DETAIL
    https://phab.mercurial-scm.org/D6255
AFFECTED FILES
    mercurial/copies.py
    tests/test-annotate.t
    tests/test-fastannotate-hg.t
    tests/test-graft.t
    tests/test-rename-merge2.t
CHANGE DETAILS
diff --git a/tests/test-rename-merge2.t b/tests/test-rename-merge2.t

  • a/tests/test-rename-merge2.t +++ b/tests/test-rename-merge2.t @@ -433,6 +433,9 @@

     --------------
     test L:nc a b R:up b   W:       - 12 merge b no ancestor
     --------------
+    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
+     src: 'a' -> dst: 'b'
+    checking for directory renames
     resolving manifests
      branchmerge: True, force: False, partial: False
      ancestor: 924404dff337, local: 86a2aa42fc76+, remote: af30c7647fc7
@@ -469,6 +472,9 @@
     --------------
     test L:up b   R:nm a b W:       - 13 merge b no ancestor
     --------------
+    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
+     src: 'a' -> dst: 'b'
+    checking for directory renames
     resolving manifests
      branchmerge: True, force: False, partial: False
      ancestor: 924404dff337, local: 59318016310c+, remote: bdb19105162a
@@ -506,6 +512,9 @@
     --------------
     test L:nc a b R:up a b W:       - 14 merge b no ancestor
     --------------
+    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
+     src: 'a' -> dst: 'b'
+    checking for directory renames
     resolving manifests
      branchmerge: True, force: False, partial: False
      ancestor: 924404dff337, local: 86a2aa42fc76+, remote: 8dbce441892a
@@ -543,6 +552,9 @@
     --------------
     test L:up b   R:nm a b W:       - 15 merge b no ancestor, remove a
     --------------
+    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
+     src: 'a' -> dst: 'b'
+    checking for directory renames
     resolving manifests
      branchmerge: True, force: False, partial: False
      ancestor: 924404dff337, local: 59318016310c+, remote: bdb19105162a
@@ -580,6 +592,9 @@
     --------------
     test L:nc a b R:up a b W:       - 16 get a, merge b no ancestor
     --------------
+    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
+     src: 'a' -> dst: 'b'
+    checking for directory renames
     resolving manifests
      branchmerge: True, force: False, partial: False
      ancestor: 924404dff337, local: 86a2aa42fc76+, remote: 8dbce441892a
@@ -617,6 +632,9 @@
     --------------
     test L:up a b R:nc a b W:       - 17 keep a, merge b no ancestor
     --------------
+    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
+     src: 'a' -> dst: 'b'
+    checking for directory renames
     resolving manifests
      branchmerge: True, force: False, partial: False
      ancestor: 924404dff337, local: 0b76e65c8289+, remote: 4ce40f5aca24
@@ -653,6 +671,9 @@
     --------------
     test L:nm a b R:up a b W:       - 18 merge b no ancestor
     --------------
+    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
+     src: 'a' -> dst: 'b'
+    checking for directory renames
     resolving manifests
      branchmerge: True, force: False, partial: False
      ancestor: 924404dff337, local: 02963e448370+, remote: 8dbce441892a
@@ -695,6 +716,9 @@
     --------------
     test L:up a b R:nm a b W:       - 19 merge b no ancestor, prompt remove a
     --------------
+    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
+     src: 'a' -> dst: 'b'
+    checking for directory renames
     resolving manifests
      branchmerge: True, force: False, partial: False
      ancestor: 924404dff337, local: 0b76e65c8289+, remote: bdb19105162a
diff --git a/tests/test-graft.t b/tests/test-graft.t

  • a/tests/test-graft.t +++ b/tests/test-graft.t @@ -75,6 +75,8 @@

   
     $ hg graft -r 2 --base 3
     grafting 2:5c095ad7e90f "2"
+  note: possible conflict - c was deleted and renamed to:
+   a
     note: graft of 2:5c095ad7e90f created no changes to commit
   
   Can't continue without starting:
@@ -220,6 +222,9 @@
     committing changelog
     updating the branch cache
     grafting 5:97f8bfe72746 "5"
+    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
+     src: 'c' -> dst: 'b'
+    checking for directory renames
     resolving manifests
      branchmerge: True, force: True, partial: False
      ancestor: 4c60f11aa304, local: 6b9e5368ca4e+, remote: 97f8bfe72746
@@ -233,6 +238,9 @@
     $ HGEDITOR=cat hg graft 4 3 --log --debug
     scanning for duplicate grafts
     grafting 4:9c233e8e184d "4"
+    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
+     src: 'c' -> dst: 'b'
+    checking for directory renames
     resolving manifests
      branchmerge: True, force: True, partial: False
      ancestor: 4c60f11aa304, local: 1905859650ec+, remote: 9c233e8e184d
@@ -1129,7 +1137,6 @@
     grafting 2:f58c7e2b28fa "C0"
     merging f1e and f1b to f1e
     merging f2a and f2c to f2c
-  merging f5b and f5a to f5a
   
   Test the cases A.1 (f4x) and A.7 (f3x).
   
diff --git a/tests/test-fastannotate-hg.t b/tests/test-fastannotate-hg.t

  • a/tests/test-fastannotate-hg.t +++ b/tests/test-fastannotate-hg.t @@ -273,37 +273,10 @@

     > EOF
     $ hg ci -mc -d '3 0'
     created new head
-BROKEN: 'a' was copied to 'b' on both sides. We should not get a merge conflict here
     $ hg merge
     merging b
-  warning: conflicts while merging b! (edit, then use 'hg resolve --mark')
-  0 files updated, 0 files merged, 0 files removed, 1 files unresolved
-  use 'hg resolve' to retry unresolved file merges or 'hg merge --abort' to abandon
-  [1]
-  $ cat b
-  <<<<<<< working copy: b80e3e32f75a - test: c
-  a
-  z
-  a
-  ||||||| base
-  =======
-  a
-  a
-  a
-  b4
-  c
-  b5
-  >>>>>>> merge rev:    64afcdf8e29e - test: mergeb
-  $ cat <<EOF > b
-  > a
-  > z
-  > a
-  > b4
-  > c
-  > b5
-  > EOF
-  $ hg resolve --mark -q
-  $ rm b.orig
+  0 files updated, 1 files merged, 0 files removed, 0 files unresolved
+  (branch merge, don't forget to commit)
     $ echo d >> b
     $ hg ci -mmerge2 -d '4 0'
   
diff --git a/tests/test-annotate.t b/tests/test-annotate.t

  • a/tests/test-annotate.t +++ b/tests/test-annotate.t @@ -273,37 +273,10 @@

     > EOF
     $ hg ci -mc -d '3 0'
     created new head
-BROKEN: 'a' was copied to 'b' on both sides. We should not get a merge conflict here
     $ hg merge
     merging b
-  warning: conflicts while merging b! (edit, then use 'hg resolve --mark')
-  0 files updated, 0 files merged, 0 files removed, 1 files unresolved
-  use 'hg resolve' to retry unresolved file merges or 'hg merge --abort' to abandon
-  [1]
-  $ cat b
-  <<<<<<< working copy: b80e3e32f75a - test: c
-  a
-  z
-  a
-  ||||||| base
-  =======
-  a
-  a
-  a
-  b4
-  c
-  b5
-  >>>>>>> merge rev:    64afcdf8e29e - test: mergeb
-  $ cat <<EOF > b
-  > a
-  > z
-  > a
-  > b4
-  > c
-  > b5
-  > EOF
-  $ hg resolve --mark -q
-  $ rm b.orig
+  0 files updated, 1 files merged, 0 files removed, 0 files unresolved
+  (branch merge, don't forget to commit)
     $ echo d >> b
     $ hg ci -mmerge2 -d '4 0'
   
diff --git a/mercurial/copies.py b/mercurial/copies.py

  • a/mercurial/copies.py +++ b/mercurial/copies.py @@ -373,57 +373,6 @@

   
       return u1, u2
   
-def _makegetfctx(ctx):
-    """return a 'getfctx' function suitable for _checkcopies usage

  • -    We have to re-setup the function building 'filectx' for each -    '_checkcopies' to ensure the linkrev adjustment is properly setup for -    each. Linkrev adjustment is important to avoid bug in rename -    detection. Moreover, having a proper '_ancestrycontext' setup ensures -    the performance impact of this adjustment is kept limited. Without it, -    each file could do a full dag traversal making the time complexity of -    the operation explode (see issue4537). - -    This function exists here mostly to limit the impact on stable. Feel -    free to refactor on default. -    """ -    rev = ctx.rev() -    repo = ctx._repo -    ac = getattr(ctx, '_ancestrycontext', None) -    if ac is None: -        revs = [rev] -        if rev is None: -            revs = [p.rev() for p in ctx.parents()] -        ac = repo.changelog.ancestors(revs, inclusive=True) -        ctx._ancestrycontext = ac -    def makectx(f, n): -        if n in node.wdirfilenodeids:  # in a working context? -            if ctx.rev() is None: -                return ctx.filectx(f) -            return repo[None][f] -        fctx = repo.filectx(f, fileid=n) -        # setup only needed for filectx not create from a changectx -        fctx._ancestrycontext = ac -        fctx._descendantrev = rev -        return fctx -    return util.lrucachefunc(makectx) - -def _combinecopies(copyfrom, copyto, finalcopy, diverge, incompletediverge): -    """combine partial copy paths""" -    remainder = {} -    for f in copyfrom: -        if f in copyto: -            finalcopy[copyto[f]] = copyfrom[f] -            del copyto[f] -    for f in incompletediverge: -        assert f not in diverge -        ic = incompletediverge[f] -        if ic[0] in copyto: -            diverge[f] = [copyto[ic[0]], ic[1]] -        else: -            remainder[f] = ic -    return remainder -

   def mergecopies(repo, c1, c2, base):
       """
       Finds moves and copies between context c1 and c2 that are relevant for
@@ -526,168 +475,93 @@
       This is pretty slow when a lot of changesets are involved but will track all
       the copies.
       """
-    # In certain scenarios (e.g. graft, update or rebase), base can be
-    # overridden We still need to know a real common ancestor in this case We
-    # can't just compute _c1.ancestor(_c2) and compare it to ca, because there
-    # can be multiple common ancestors, e.g. in case of bidmerge.  Because our
-    # caller may not know if the revision passed in lieu of the CA is a genuine
-    # common ancestor or not without explicitly checking it, it's better to
-    # determine that here.
-    #
-    # base.isancestorof(wc) is False, work around that
-    _c1 = c1.p1() if c1.rev() is None else c1
-    _c2 = c2.p1() if c2.rev() is None else c2
-    # an endpoint is "dirty" if it isn't a descendant of the merge base
-    # if we have a dirty endpoint, we need to trigger graft logic, and also
-    # keep track of which endpoint is dirty
-    dirtyc1 = not base.isancestorof(_c1)
-    dirtyc2 = not base.isancestorof(_c2)
-    graft = dirtyc1 or dirtyc2
-    tca = base
-    if graft:
-        tca = _c1.ancestor(_c2)

  • -    limit = _findlimit(repo, c1, c2) -

       m1 = c1.manifest()
       m2 = c2.manifest()
       mb = base.manifest()
   
-    # gather data from _checkcopies:
-    # - diverge = record all diverges in this dict
-    # - copy = record all non-divergent copies in this dict
-    # - fullcopy = record all copies in this dict
-    # - incomplete = record non-divergent partial copies here
-    # - incompletediverge = record divergent partial copies here
-    diverge = {} # divergence data is shared
-    incompletediverge  = {}
-    data1 = {'copy': {},
-             'fullcopy': {},
-             'incomplete': {},
-             'diverge': diverge,
-             'incompletediverge': incompletediverge,
-            }
-    data2 = {'copy': {},
-             'fullcopy': {},
-             'incomplete': {},
-             'diverge': diverge,
-             'incompletediverge': incompletediverge,
-            }
+    copies1 = pathcopies(base, c1)
+    copies2 = pathcopies(base, c2)
+
+    inversecopies1 = {}
+    inversecopies2 = {}
+    for dst, src in copies1.items():
+        inversecopies1.setdefault(src, []).append(dst)
+    for dst, src in copies2.items():
+        inversecopies2.setdefault(src, []).append(dst)
+
+    copy = {}
+    diverge = {}
+    renamedelete = {}
+    allsources = set(inversecopies1) | set(inversecopies2)
+    for src in allsources:
+        dsts1 = inversecopies1.get(src)
+        dsts2 = inversecopies2.get(src)
+        if dsts1 and dsts2:
+            # copied/renamed on both sides
+            if src not in m1 and src not in m2:
+                # renamed on both sides
+                dsts1 = set(dsts1)
+                dsts2 = set(dsts2)
+                # If there's some overlap in the rename destinations, we
+                # consider it not divergent. For example, if side 1 copies 'a'
+                # to 'b' and 'c' and deletes 'a', and side 2 copies 'a' to 'c'
+                # and 'd' and deletes 'a'.

Formatting could help this comment a bit. something like

side 1: a -> b -> c
side 2: a   ->    c -> d

+                if dsts1 & dsts2:
+                    for dst in (dsts1 & dsts2):
+                        copy[dst] = src
+                else:
+                    diverge[src] = sorted(dsts1 | dsts2)
+            elif src in m1 and src in m2:
+                # copied on both sides
+                dsts1 = set(dsts1)
+                dsts2 = set(dsts2)
+                for dst in (dsts1 & dsts2):
+                    copy[dst] = src
+            # TODO: Handle cases where it was renamed on one side and copied
+            # on the other side

Is this TODO a regression or some ported limitation.

+        elif dsts1:
+            # copied/renamed only on side 1
+            if src not in m2:
+                # deleted on side 2
+                if src not in m1:
+                    # renamed on side 1, deleted on side 2
+                    renamedelete[src] = dsts1
+            elif m2[src] != mb[src]:
+                # modified on side 2
+                for dst in dsts1:
+                    if dst not in m2:
+                        # dst not added on side 2 (handle as regular
+                        # "both created" case in manifestmerge in that case)

Can you elaborate a bit on what this case means (and how we deal with
it) especially, what happens if dst is indeed in m2 ?

+                        copy[dst] = src
+        elif dsts2:
+            # copied/renamed only on side 2
+            if src not in m1:
+                # deleted on side 1
+                if src not in m2:
+                    # renamed on side 2, deleted on side 1
+                    renamedelete[src] = dsts2
+            elif m1[src] != mb[src]:
+                # modified on side 1
+                for dst in dsts2:
+                    if dst not in m1:
+                        # dst not added on side 1 (handle as regular
+                        # "both created" case in manifestmerge in that case)
+                        copy[dst] = src

Since this seems the very same code as the previous clause, I wonder if
we could factor it out. This would help to prevent subtle bug when we
update it. (the answer might be "no because python is slow).

+    renamedeleteset = set()
+    divergeset = set()
+    for src, dsts in diverge.items():
+        divergeset.update(dsts)
+    for src, dsts in renamedelete.items():
+        renamedeleteset.update(dsts)

small nits: Is there any reason not to use diverge.values and
renamedelete.values here ?

I did a first path through it, the new code seems reasonable and easier
to read than the previous one. Some comments and questions below.

The rest somehow didn't make it here, so I'll copy from the email (i.e. the below is from Pierre-Yves, not from me):

Ah, the reason for that might be that Phabricator was down at the time of the email.

Replying to just a few things now. Will reply to the rest later.

I did a first path through it, the new code seems reasonable and easier
to read than the previous one. Some comments and questions below.

The rest somehow didn't make it here, so I'll copy from the email (i.e. the below is from Pierre-Yves, not from me):
I did a first path through it, the new code seems reasonable and easier
to read than the previous one. Some comments and questions below.

Thanks for reviewing!

    I've run tests with hard-coded debug logging for "fullcopy" and while
    I haven't looked at every difference it produces, all the ones I have
    looked at seemed reasonable to me.

How many difference did you had? can you share some example of them with
their explanation?

Without explanation :), see http://paste.debian.net/1077862/

It just seemed to long to include in the commit message.

martinvonz edited the summary of this revision. (Show Details)Apr 16 2019, 5:06 PM

I did a first path through it, the new code seems reasonable and easier
to read than the previous one. Some comments and questions below.

The rest somehow didn't make it here, so I'll copy from the email (i.e. the below is from Pierre-Yves, not from me):

    Some timings for calculating mergecopies between two revisions (all
    using the common ancestor as base):

Which revisions did you pick?

The revisions are those before the colon, so e.g. from hg version 4.0 to 4.8. Oops, the first line there should say "4.8 4.9", not "4.8 4.8". I've fixed that now.

    In the hg repo:
    4.8 4.8: 0.21s -> 0.21s
    4.0 4.8: 0.35s -> 0.63s
   
    In and old copy of the mozilla-unified repo:
    FIREFOX_BETA_60_BASE^ FIREFOX_BETA_60_BASE: 0.51s -> 0.60s
    FIREFOX_NIGHTLY_59_END FIREFOX_BETA_60_BASE: 2.1s -> 2.3s
    FIREFOX_BETA_59_END FIREFOX_BETA_60_BASE: 3.1s -> 3.3s
    FIREFOX_AURORA_50_BASE FIREFOX_BETA_60_BASE: 30s -> 35s

martinvonz edited the summary of this revision. (Show Details)Apr 16 2019, 5:08 PM

From Pierre-Yves:

Can you give that a shot with the two revisions we use in the benchmark
suite, this is a pair expensive with the current algorithm.
1daa622bbe42 76caed42cf7c

Sure, it takes about 29 seconds with or without this patch. It seems the difference is smaller than the noise level.

The rest somehow didn't make it here, so I'll copy from the email (i.e. the below is from Pierre-Yves, not from me):

   

on related topic, you seems to be fixing a case in
test-fastannotate-hg.t and test-annotate-hg.t that is probably worth
mentioning.

Sure, I'll add that.

    One drawback of the rewritten code is that we may now make
    remotefilelog prefetch more files. We used to prefetch files that were
    unique to either side of the merge compared to the other. We now
    prefetch files that are unique to either sise of the merge compared to

sise → side

Thanks, done.

    Some timings for calculating mergecopies between two revisions (all
    using the common ancestor as base):

Which revisions did you pick?

I've tried to clarify this in the commit message.

    So it's measurably slower in most cases. Note that merge copies are
    not calculated when updating with a clean working copy, which is
    probably the most common case. I therefore think the much simpler code
    is worth the slowdown.

Do you know where the slowdown comes from ?

One difference is that the new code ends up calling adjustlinkrev() (from the isintroducedafter() call in _tracefile()), which the old code somehow avoids. That's your area of expertise, so maybe you can figure it out? I've already spent several hours on it without understanding much more than I did before.

+            # TODO: Handle cases where it was renamed on one side and copied
+            # on the other side

Is this TODO a regression or some ported limitation.

It's ported (i.e. it existed before this patch). See commit message of D6242.

+        elif dsts1:
+            # copied/renamed only on side 1
+            if src not in m2:
+                # deleted on side 2
+                if src not in m1:
+                    # renamed on side 1, deleted on side 2
+                    renamedelete[src] = dsts1
+            elif m2[src] != mb[src]:
+                # modified on side 2
+                for dst in dsts1:
+                    if dst not in m2:
+                        # dst not added on side 2 (handle as regular
+                        # "both created" case in manifestmerge in that case)

Can you elaborate a bit on what this case means (and how we deal with
it) especially, what happens if dst is indeed in m2 ?

Oops, I think that was supposed say "otherwise" instead of "in that case". I'll fix.

+                        copy[dst] = src
+        elif dsts2:
+            # copied/renamed only on side 2
+            if src not in m1:
+                # deleted on side 1
+                if src not in m2:
+                    # renamed on side 2, deleted on side 1
+                    renamedelete[src] = dsts2
+            elif m1[src] != mb[src]:
+                # modified on side 1
+                for dst in dsts2:
+                    if dst not in m1:
+                        # dst not added on side 1 (handle as regular
+                        # "both created" case in manifestmerge in that case)
+                        copy[dst] = src

Since this seems the very same code as the previous clause, I wonder if
we could factor it out. This would help to prevent subtle bug when we
update it. (the answer might be "no because python is slow).

Yes, I also considered that. I wasn't sure what a good name for the method would be and I gave up. I'll try again.

+    renamedeleteset = set()
+    divergeset = set()
+    for src, dsts in diverge.items():
+        divergeset.update(dsts)
+    for src, dsts in renamedelete.items():
+        renamedeleteset.update(dsts)

small nits: Is there any reason not to use diverge.values and
renamedelete.values here ?

This is existing code, but I can send a separate patch for that.

martinvonz edited the summary of this revision. (Show Details)Apr 17 2019, 5:12 PM
martinvonz updated this revision to Diff 14804.
martinvonz updated this revision to Diff 14809.Apr 17 2019, 6:06 PM

Since this seems the very same code as the previous clause, I wonder if
we could factor it out. This would help to prevent subtle bug when we
update it. (the answer might be "no because python is slow).

Yes, I also considered that. I wasn't sure what a good name for the method would be and I gave up. I'll try again.

I've picked a name and done the refactoring now.

martinvonz planned changes to this revision.Apr 27 2019, 3:21 AM

I'll spend a bit more time to see if I can figure out why pathcopies() and mergecopies() walk file ancestor differently. The way mergecopies() does it is faster, so I'l see if I can use that for pathcopies() too. D6274:: can still be queued if anyone has time.

martinvonz edited the summary of this revision. (Show Details)Apr 28 2019, 3:00 AM
martinvonz updated this revision to Diff 14943.

I'll spend a bit more time to see if I can figure out why pathcopies() and mergecopies() walk file ancestor differently. The way mergecopies() does it is faster, so I'l see if I can use that for pathcopies() too. D6274:: can still be queued if anyone has time.

I thought I was done with that after finding some bugs in mergecopies(). I thought fixing those would make mergecopies() as slow as pathcopies(), but that still doesn't seem to explain it :( Maybe I'll spend even more time on this tomorrow.

martinvonz edited the summary of this revision. (Show Details)Apr 29 2019, 6:52 PM
martinvonz updated this revision to Diff 14964.

I'll spend a bit more time to see if I can figure out why pathcopies() and mergecopies() walk file ancestor differently. The way mergecopies() does it is faster, so I'l see if I can use that for pathcopies() too. D6274:: can still be queued if anyone has time.

I thought I was done with that after finding some bugs in mergecopies(). I thought fixing those would make mergecopies() as slow as pathcopies(), but that still doesn't seem to explain it :( Maybe I'll spend even more time on this tomorrow.

The biggest difference turned out to come from the isintruducedafter() that I mentioned earlier. I'd be fine with removing that, but we can discuss that after this patch is landed. I think it's an improvement to make pathcopies() and mergecopies() more consistent anyway.

While investigating differences between pathcopies() and mergecopies(), I noticed some other differences and I've added tests for them. As you can see in this patch, some of them are now fixed.

This revision was automatically updated to reflect the committed changes.

This might break --pure without --local in the annotate tests. No idea if that's a valid combination, but the buildbots (mostly) use that. In fairness, it seems that this combination had an error where _filecommit() was given too many arguments in the direct ancestors, so maybe the real breakage occurred in there. But there seems to be extra output here.

https://buildbot.mercurial-scm.org/builders/macOS%2010.12%20hg%20tests/builds/828/steps/pure/logs/stdio

It seems similar to --pure without --local failing in test-repo-compengines.t recently.

https://www.mercurial-scm.org/pipermail/mercurial-devel/2019-April/130762.html

This might break --pure without --local in the annotate tests. No idea if that's a valid combination, but the buildbots (mostly) use that. In fairness, it seems that this combination had an error where _filecommit() was given too many arguments in the direct ancestors, so maybe the real breakage occurred in there. But there seems to be extra output here.
https://buildbot.mercurial-scm.org/builders/macOS%2010.12%20hg%20tests/builds/828/steps/pure/logs/stdio
It seems similar to --pure without --local failing in test-repo-compengines.t recently.
https://www.mercurial-scm.org/pipermail/mercurial-devel/2019-April/130762.html

It's apparently the pure file merge code (simplemerge.py, I think) that works differently. The common ancestor's content is:

a
a
a

The local side is:

a
z
a

The other side is:

a
a
a
b4
c
b6

The pure version thinks they conflict and gives result:

a
z
a
<<<<<<< working copy: b80e3e32f75a - test: c
||||||| base
a
=======
a
b4
c
b5
>>>>>>> merge rev:    64afcdf8e29e - test: mergeb

I'll see if I can work around it by changing the contents of the file a bit.