This is an archive of the discontinued Mercurial Phabricator instance.

dirstate: when calling rebuild(), avoid some N^2 codepaths
ClosedPublic

Authored by spectral on Dec 13 2019, 6:57 PM.

Details

Summary

I had a user repo with 200k files in it. Calling hg debugrebuilddirstate took
tens of minutes (I didn't wait for it). In that situation,
changedfiles==allfiles, and both are lists. This meant that we had to run an
average of 100k comparisons, for each of 200k files, just to check whether a
file needed to have normallookup called (it always did), or drop.

While it's probably not a huge issue, in my very awkward synthetic benchmark I
wrote (not using a benchmark library or anything), I was seeing some slowdowns
for small-changedfiles and very-large-allfiles invocations, with an inflection
somewhere around 10 items in changedfiles (regardless of the size of allfiles);
above 10 items in changedfiles, the new code appears to always be faster. For
the case of 50k files in changedfiles and the same items in allfiles, I'm seeing
differences of 15s of just running comparisons vs. 0.003793s. I haven't bothered
to run a comparison of 200k items in changedfiles and allfiles. :)

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

spectral created this revision.Dec 13 2019, 6:57 PM
martinvonz added inline comments.
mercurial/dirstate.py
611–620

How slow? Specifically, how much slower (in percent, or dB, or whatever) is it compared to not converting to a set? I wonder if it's worth the code. hg debugrebuilddirstate should be a very rare operation.

spectral added inline comments.Dec 13 2019, 8:27 PM
mercurial/dirstate.py
611–620

With 1 file in changedfiles and 200k files in allfiles, approximately 15x slower. That said, it's a little over 1.5ms in the old version, so in absolute terms this isn't a big difference until you get really huge repos. I haven't really investigated when changedfiles is not None, if it's just for hg debugrebuilddirstate --minimal, I agree this isn't worth the complexity, but it looks like mq, strip, and absorb all call rebuild in some cases, and strip and absorb will pass in a list of files. Unsure if that changes things materially, though.

martinvonz added inline comments.Dec 16 2019, 7:01 PM
mercurial/dirstate.py
611–620

Okay, that's at least a measurable difference, so I'm fine with the extra complexity.

623

We typically use the & operator instead of .intersection(). I'll change it in flight.

This revision was not accepted when it landed; it landed in state Needs Review.
This revision was automatically updated to reflect the committed changes.