This is an archive of the discontinued Mercurial Phabricator instance.

copytrace: move fast heuristic copytracing algorithm to core
ClosedPublic

Authored by pulkit on Sep 4 2017, 6:56 PM.

Details

Summary

copytrace extension in fb-hgext has a heuristic implementation of copy tracing
which is faster than the current copy tracing. The heuristic limits the search
of copies to just files that are either:

  1. Renames in the same directory
  2. Moved to other directory with same name

The default copytrace implementation is very slow as it finds all the new files
that were added from merge base up to the head commit and for each file it
checks whether it this was copied or moved version of a different file.

Stash@fb did analysis for the above heuristics on the fb repo and found that
among 2,443,768 moves/copies there are only 32,234 moves/copies which does not
fall under the above heuristics which is approx. 0.013 of total copies.

This patch moves the heuristics algorithm under config
experimental.copytrace=heuristics.

While moving fbext to core, this patch removes couple of less useful config
options named sourcecommitlimit and maxmovescandidatestocheck.

Tests are also added for the heuristics algorithm, which are basically copied
from fbext/tests/test-copytrace.t. The tests follow a pattern creating a server
repo and then cloning to a local repo to create public and draft changesets, the
distinction which will be useful in upcoming patches.

After this patch experimental.copytrace has the following behaviour:

  1. off: turns off copytracing
  2. heuristics: use the heuristic algorithm added in this patch.
  3. everything else: use the full copytracing algorithm

.. feature::

A new fast heuristic algorithm for copytracing which assumes that the files
moves are either::
1) Renames in the same directory
2) Moves in other directories with same names
You can use this algorithm by setting `experimental.copytrace=heuristics`.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

pulkit created this revision.Sep 4 2017, 6:56 PM
yuja requested changes to this revision.Sep 6 2017, 10:18 AM
yuja added a subscriber: yuja.

Generally looks good.

We might need something to handle copies between anc (real common
ancestor) and base (pseudo merge base) as we do in the "full" copy tracing,
but I'm not pretty sure. Since this is an experimental feature, and it may miss
some copies by design, I think this is mostly good to go.

mercurial/copies.py
22

Nit: there are only two use sites, so let's remove the module-level
alias.

606

This cdst/csrc naming is confusing because c1 is actually the
source revision (= the original wctx) in "update" scenario. And
IIUC, we are searching for copies from c1 to c2.

Can you rename them?

636

Perhaps this wouldn't stop if the base were in the other side.
I don't think that would happen thanks to how mergecopies()
are used currently, but it's probably better to error out early.

639

Typo: "switching", and missing "\n".

This revision now requires changes to proceed.Sep 6 2017, 10:18 AM
pulkit planned changes to this revision.Sep 7 2017, 5:59 AM
pulkit added inline comments.
mercurial/copies.py
606

Yes, sure. :)

636

I am sorry but I didn't understand the base were in other side thing. Did you mean base is a child of ctx?

yuja added inline comments.Sep 7 2017, 10:09 AM
mercurial/copies.py
636

In theory, base could be anywhere between c1 and c2. If it belonged to the c1 branch, c2.p1().p1()... would never reach
to it.

o c1 (wctx)
|
o base
|
:  o c2
| /
o anc

This won't happen in production because base is set to c2.p1(),
but it's better to break a possible infinite loop.

pulkit edited edge metadata.Sep 9 2017, 6:51 PM
pulkit edited the summary of this revision. (Show Details)
pulkit updated this revision to Diff 1688.
yuja requested changes to this revision.Sep 13 2017, 9:54 AM

I don't think we can just swap c1 and c2 because what we're calculating
is the copy from c1 to c2, which is directed. Instead, maybe we can simply
fall back to the "full" algorithm if base isn't in the c2 branch.

while ctx.rev() > base.rev():
    ...
if ctx != base:
    switch to full copytracing
mercurial/copies.py
634

Nit: s/mdst/m1/

639
  • len(smartset) might be expensive, just evaluate it as boolean
  • repo.revs('%d::%d', c2rev, baserev) should be safer
646

I don't think we can do that, but if we could, we should swap
m1 and m2, too.

This revision now requires changes to proceed.Sep 13 2017, 9:54 AM
pulkit updated this revision to Diff 1791.Sep 13 2017, 2:54 PM
yuja accepted this revision.Sep 14 2017, 10:34 AM

Queued, thanks.

I don't think we can just swap c1 and c2 because what we're calculating
is the copy from c1 to c2,

For the record, this statement appears to be wrong, sorry. Perhaps the idea
of swapping c1 and c2 would be valid, though there was a bug. However, I
still think the new code is better because the untested part is simpler and
less error-prone.

mercurial/copies.py
637

Folded these variables which are used only once.

This revision is now accepted and ready to land.Sep 14 2017, 10:34 AM
This revision was automatically updated to reflect the committed changes.