Page MenuHomePhabricator

setdiscovery: make progress on most connected groups each roundtrip
ClosedPublic

Authored by martinvonz on Mar 4 2018, 11:58 AM.

Details

Summary

Consider history like this:

o
| o
| |
| o
| |
| o
|/
o
| o
| |
| o
| |
| o
|/
o
| o
| |
| o
| |
| o
|/
o
~

Assume the left mainline is available in the remote repo and the other
commits are only in the local repo. Also imagine that instead of 3
local branches with 3 commits on each, there are 1000 branches (the
number of commits on each doesn't matter much here). In such a
scenario, the current setdiscovery code will pick a sample size of 200
among these branches and ask the remote which of them it has. However,
the discovery for each such branch is completely independent of the
discovery for the others -- knowing whether the remote has a commit in
one branch doesn't give us any information about the other
branches. The discovery will therefore take at least 5 roundtrips
(maybe more depending on which commit in each linear chain was
sampled). Since the discovery for each branch is independent, there is
no reason to let one branch wait for another, so this patch makes it
so we sample at least as many commits as there are branches. It may
still happen (it's very likely, even) that we get multiple samples
from one branch and none from another, but that will even out over a
few rounds and I think this is still a big improvement.

Because of http header size limits, we still use the old behavior
unless experimental.httppostargs=true.

I've timed this by running hg debugdiscovery mozilla-unified --debug in the
mozilla-try repo. Both repos were local. Before this patch, last part
of the output was:

2249 total queries in 5276.4859s
elapsed time:  5276.652634 seconds
heads summary:
  total common heads:         13
    also local heads:          4
    also remote heads:         8
    both:                      4
  local heads:             28317
    common:                    4
    missing:               28313
  remote heads:               12
    common:                    8
    unknown:                   4
local changesets:        2014901
  common:                 530373
  missing:               1484528
common heads: 1dad417c28ad 4a108e94d3e2 4d7ef530fffb 5350524bb654 777e60ca8853 7d97fafba271 9cd2ab4d0029 a55ce37217da d38398e5144e dcc6d7a0dc00 e09297892ada e24ec6070d7b fd559328eaf3

After this patch, the output was (including all the samples, since
there were so few now):

taking initial sample
query 2; still undecided: 1599476, sample size is: 108195
sampling from both directions
query 3; still undecided: 810922, sample size is: 194158
sampling from both directions
query 4; still undecided: 325882, sample size is: 137302
sampling from both directions
query 5; still undecided: 111459, sample size is: 74586
sampling from both directions
query 6; still undecided: 26805, sample size is: 23960
sampling from both directions
query 7; still undecided: 2549, sample size is: 2528
sampling from both directions
query 8; still undecided: 21, sample size is: 21
8 total queries in 24.5064s
elapsed time:  24.670051 seconds
heads summary:
  total common heads:         13
    also local heads:          4
    also remote heads:         8
    both:                      4
  local heads:             28317
    common:                    4
    missing:               28313
  remote heads:               12
    common:                    8
    unknown:                   4
local changesets:        2014901
  common:                 530373
  missing:               1484528
common heads: 1dad417c28ad 4a108e94d3e2 4d7ef530fffb 5350524bb654 777e60ca8853 7d97fafba271 9cd2ab4d0029 a55ce37217da d38398e5144e dcc6d7a0dc00 e09297892ada e24ec6070d7b fd559328eaf3

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

martinvonz created this revision.Mar 4 2018, 11:58 AM
indygreg requested changes to this revision.Mar 4 2018, 2:28 PM
indygreg added a subscriber: indygreg.

The commit message looks incomplete?

We really want a change like this to be documented. Could you please write more in the commit message and inline? The docstring of _takequicksample() is also now inaccurate.

Also, I'm not convinced this change is correct. Should we connect IRL?

This revision now requires changes to proceed.Mar 4 2018, 2:28 PM

The commit message looks incomplete?
We really want a change like this to be documented. Could you please write more in the commit message and inline? The docstring of _takequicksample() is also now inaccurate.
Also, I'm not convinced this change is correct. Should we connect IRL?

Heh, I sent this by mistake. I think it should be functionally correct, but we can optimize better. I'll drop this patch.

martinvonz abandoned this revision.Mar 4 2018, 2:44 PM
martinvonz retitled this revision from setdiscovery: include all local heads in second "known" request (issue5809) to setdiscovery: make progress on each connected group each roundtrip.Apr 17 2019, 1:19 PM
martinvonz edited the summary of this revision. (Show Details)
martinvonz edited the summary of this revision. (Show Details)
martinvonz updated this revision to Diff 14801.
martinvonz abandoned this revision.May 15 2019, 2:20 PM
martinvonz reclaimed this revision.Tue, May 21, 2:08 PM
martinvonz edited the summary of this revision. (Show Details)Tue, May 21, 2:08 PM
martinvonz updated this revision to Diff 15208.
martinvonz retitled this revision from setdiscovery: make progress on each connected group each roundtrip to setdiscovery: make progress on most connected groups each roundtrip.Tue, May 21, 3:49 PM
martinvonz edited the summary of this revision. (Show Details)
martinvonz edited the summary of this revision. (Show Details)

I feel like I am missing something. Your commit message seems to be talking using at least as many item in the sameple than there is independant connected set. However your code seems to use "heads(undecided)" that is a quite different. Using independant connected set seems like a good trade off (but might be expensive to compute). Using all heads can significantly bloat the discovery without giving it a significant edge in many cases.

Can you clarify this ?

mercurial/setdiscovery.py
98–99

Calling this "limited argument" seems like wireprotocol details leaking into more abstract discovery logic. Can we give it a different name ? (maybe something like "maxsize=200" or "extensiblesample=False") ?

I feel like I am missing something. Your commit message seems to be talking using at least as many item in the sameple than there is independant connected set. However your code seems to use "heads(undecided)" that is a quite different. Using independant connected set seems like a good trade off (but might be expensive to compute). Using all heads can significantly bloat the discovery without giving it a significant edge in many cases.

Good point. The case I can think of is when you have a tree of commits on the local side. Something like this:

o
| o
| | o
| | | o
| | |/
| |/
| o
| |
| o
| |
| o
|/
o
~

If we have a long sequence of commits there and many heads, we would increase the sampling of the (mostly-)linear part unnecessarily. I'll see if there's an easy way to improve that.

Can you clarify this ?

mercurial/setdiscovery.py
98–99

Good point. Done.

martinvonz edited the summary of this revision. (Show Details)Tue, May 21, 6:01 PM
martinvonz updated this revision to Diff 15210.

I feel like I am missing something. Your commit message seems to be talking using at least as many item in the sameple than there is independant connected set. However your code seems to use "heads(undecided)" that is a quite different. Using independant connected set seems like a good trade off (but might be expensive to compute). Using all heads can significantly bloat the discovery without giving it a significant edge in many cases.

Good point. The case I can think of is when you have a tree of commits on the local side. Something like this:

o
| o
| | o
| | | o
| | |/
| |/
| o
| |
| o
| |
| o
|/
o
~

If we have a long sequence of commits there and many heads, we would increase the sampling of the (mostly-)linear part unnecessarily. I'll see if there's an easy way to improve that.

I think it was as easy as changing from using number of heads to using number of roots. Do you think there are still cases we'd handle poorly? I think there are things we can improve in the size-limited case too (it seems like it should be better to include certain points, like the mid-point, than to pick randomly), but that's a bigger task that I'm not willing to start working on now.

We could maybe make it a function of both the number of heads and roots. That is not strictly the number of connected set, but that would provide a more conservative approach. That could over-sample for hour-glass shape, but they are probably less common.

We could maybe make it a function of both the number of heads and roots. That is not strictly the number of connected set, but that would provide a more conservative approach. That could over-sample for hour-glass shape, but they are probably less common.

The current way (only consider roots) should not over-sample, right? It still seems very effective in practice.

We could maybe make it a function of both the number of heads and roots. That is not strictly the number of connected set, but that would provide a more conservative approach. That could over-sample for hour-glass shape, but they are probably less common.

The current way (only consider roots) should not over-sample, right? It still seems very effective in practice.

I suppose it will if you have thousands of roots converging to one point, then that point diverging again towards thousands of heads, all of that actually missing from the remote, but anyway, the current random wouldn't catch that either.
Anyway, basing on the number of roots is biased towards cases where lots of independents branches are missing from the remote, like in the use case from the commit message, whereas basing on the number of heads would be biased towards lots of independent branches being common (while not being remote heads, since we know them already, hence, cases where they have been merged in the remote). I imagine the latter being quite common as well.

I would be in favor of something based on both. Still I start feeling like we should be a bit cautious with the impact on all of this on network time and remote servers CPU time in practical cases (although the 'known' question is quite fast). At which threshold would an additional roundtrip be actually faster for everyone ?

Something only based on the number of root can also over sample. For a "simple" example, imagine a undecideded set with many roots that eventually all merge into a few heads.
If most of that set is common between local and remote, few question about the part of the history at the heads will quickly "decide" many changesets. Numberous questions at the roots part of the history won't.

Also I'd like to mention that I have changesets that add more precise timing info, both for the Python variant (this one) and the Rust variant. These are the ones I used to put timing info in D6430 and D6428. I can share them, but I wouldn't submit them.

We could maybe make it a function of both the number of heads and roots. That is not strictly the number of connected set, but that would provide a more conservative approach. That could over-sample for hour-glass shape, but they are probably less common.

The current way (only consider roots) should not over-sample, right? It still seems very effective in practice.

I suppose it will if you have thousands of roots converging to one point, then that point diverging again towards thousands of heads, all of that actually missing from the remote, but anyway, the current random wouldn't catch that either.

Right, I also thought of that case, but it just seemed too obscure to worry about.

Anyway, basing on the number of roots is biased towards cases where lots of independents branches are missing from the remote, like in the use case from the commit message, whereas basing on the number of heads would be biased towards lots of independent branches being common (while not being remote heads, since we know them already, hence, cases where they have been merged in the remote). I imagine the latter being quite common as well.
I would be in favor of something based on both.

I'm happy with what we get for this simple change, but feel free to send a followup.

Still I start feeling like we should be a bit cautious with the impact on all of this on network time and remote servers CPU time in practical cases (although the 'known' question is quite fast). At which threshold would an additional roundtrip be actually faster for everyone ?

I agree with you in principle. Roundtrips also cost a bit in terms of CPU and bandwidth, so it's probably not a big concern?

This revision was automatically updated to reflect the committed changes.