This is an archive of the discontinued Mercurial Phabricator instance.

Differential D6430

rust-discovery: using from Python code
ClosedPublic

Authored by gracinet on May 22 2019, 1:00 PM.

Download Raw Diff

Details

Reviewers

None

Group Reviewers

hg-reviewers

Commits

rHG4d20b1fe8a72: rust-discovery: using from Python code

Summary

As previously done in other topics, the Rust version is used if it's been
built.

The version fully in Rust of the partialdiscovery class has the performance
advantage over the Python version (actually using the Rust MissingAncestor) if
the undecided set is big enough. Otherwise no sampling occurs, and the
discovery is reasonably fast anyway.

Note: it's hard to predict the size of the initial undecided set, it can
depend on the kind of topological changes between the local and remote graphs.
The point of the Rust version is to make the bad cases acceptable.

More specifically, the performance advantages are:

faster sampling, especially takefullsample()
much faster addmissings() in almost all cases (see commit message in grandparent of the present changeset)
no conversion cost of the undecided set at the interface between Rust and Python

Measurements with big undecided sets

For an extreme example, discovery between mozilla-try and mozilla-unified
(over one million undecided revisions, same case as in dbd0fcca6dfc), we
get roughly a x2.5/x3 better performance:

Growing sample size (5% starting with 200): time goes down from
210 to 72 seconds.
Constant sample size of 200: time down from 1853 to 659 seconds.

With a sample size computed from number of roots and heads of the
undecided set (respectsize is False), here are perfdiscovery results:

Before ! wall 9.358729 comb 9.360000 user 9.310000 sys 0.050000 (median of 50)
After ! wall 3.793819 comb 3.790000 user 3.750000 sys 0.040000 (median of 50)

In that later case, the sample sizes are routinely in the hundreds of
thousands of revisions. While still faster, the Rust iteration in
addmissings has less of an advantage than with smaller sample sizes, but
one sees addcommons becoming faster, probably a consequence of not having
to copy big sets back and forth.

This example is not a goal in itself, but it showcases several different
areas in which the process can become slow, due to different factors, and
how this full Rust version can help.

Measurements with small undecided sets

In cases the undecided set is small enough than no sampling occurs,
the Rust version has a disadvantage at init if targetheads is really big
(some time is lost in the translation to Rust data structures),
and that is compensated by the faster addmissings().

On a private repository with over one million commits, we still get a minor
improvement, of 6.8%:

Before ! wall 0.593585 comb 0.590000 user 0.550000 sys 0.040000 (median of 50)
After  ! wall 0.553035 comb 0.550000 user 0.520000 sys 0.030000 (median of 50)

What's interesting in that case is the first addinfo() at 180ms for Rust and
233ms for Python+C, mostly due to add_missings and the children cache
computation being done in less than 0.2ms on the Rust side vs over 40ms on the
Python side.

The worst case we have on hand is with mozilla-try, prepared with
discovery-helper.sh for 10 heads and depth 10, time goes up 2.2% on the median.
In this case targetheads is really huge with 165842 server heads.

Before ! wall 0.823884 comb 0.810000 user 0.790000 sys 0.020000 (median of 50)
After  ! wall 0.842607 comb 0.840000 user 0.800000 sys 0.040000 (median of 50)

If that would be considered a problem, more adjustments can be made, which are
prematurate at this stage: cooking special variants of methods of the inner
MissingAncestors object, retrieving local heads directly from Rust to avoid
the cost of conversion. Effort would probably be better spent at this point
improving the surroundings if needed.

Here's another data point with a smaller repository, pypy, where performance
is almost identical

Before ! wall 0.015121 comb 0.030000 user 0.020000 sys 0.010000 (median of 186)
After ! wall 0.015009 comb 0.010000 user 0.010000 sys 0.000000 (median of 184)

Diff Detail

Repository

rHG Mercurial

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

gracinet created this revision.May 22 2019, 1:00 PM

Herald added a reviewer: hg-reviewers. · View Herald TranscriptMay 22 2019, 1:00 PM

Herald added a subscriber: mercurial-devel. · View Herald Transcript

gracinet mentioned this in D2647: setdiscovery: make progress on most connected groups each roundtrip.May 22 2019, 2:15 PM

I think this series needs to be updated for D2647. I'm curious to see the timing numbers after that patch (I can get those myself once the series is updated if you're tired of running the perf commands).

gracinet edited the summary of this revision. (Show Details)Jun 12 2019, 2:17 PM

gracinet updated this revision to Diff 15473.

Update of the whole series done but, ended up in an inconsistent state, and I have to go. Please don't act on it until I give the signal it's ready (sorry for inconvenience)

gracinet edited the summary of this revision. (Show Details)Jun 13 2019, 9:33 AM

gracinet updated this revision to Diff 15487.

Instability fixed, and I took the opportunity to rebase again to leverage policy.importrust, that's been queued yesterday (rHGf7385ed775a8)

I added new performance measurements for the big pathological case (mozilla-try / mozilla-unified) with respectsize=False. Speedup, compared to parent commit is still about x2.5 for me, not exactly for the same reasons, that's interesting.

This example is not a goal in itself,

Hmm, I would actually consider that the strongest selling point of this series. The other case (the 6.8% improvement) doesn't seem enough to justify all this code.

but it showcases several different
areas in which the process can become slow, due to different factors, and
how this full Rust version can help.

I don't see this as much of an argument for discovery in Rust. I only care about practically measurable differences. The only real improvement was that "pull into mozilla-try repo from mozilla-unified" case.

I'm fine with taking this series if we think that that use case is important enough to justify the extra maintenance of having a second implementation of the discovery code. I wouldn't think that use case is very common, but I'm guessing that the fact that you spent time on this series means that you have customers with repos with similar characteristics.

What do others think?

Nobody has reacted yet to @martinvonz call for other opinions, can someone look at it?

Let me restate that the series makes a good difference when the sampling part of the discovery process actually kicks in, and it does that roughly three times faster. In other cases, none of the new Rust code is actually executed, the main difference lies in data interchange at the Rust - Python interface. (recall that all performance comparisons are against a Rust-enhanced Mercurial, in particular with the Rust MissingAncestors object)

The size of the undecided set can behave in a chaotic manner, especially in discoveries that involve continuous integration repositories with lots of merging and branching. The mozilla-unified / mozilla-try is but an example of that. Yes, we've seen discoveries go havoc at our customer's, with the size of the undecided set exploding. What the series does performance-wise is help keep these cases to an acceptable time.

@martinvonz: I understand your concern about double maintenance. Let me reassure you that we are willing to maintain that, and it will cost us less than tracking down the perfect real-life example. About the 6.8%, that's a case where the undecided set is small, it's there to demonstrate that despite the focus on the sampling is still a bit faster on other code paths.

This series has also potential for :

further performance improvements if that's needed in the future; the obvious candidate would be a more aggressive integration with the MissingAncestors object, especially in addcommons()
pure Rust push/pull machinery, such as fastpaths doing the discovery entirely in Rust and bailing out if there's nothing to actually send or receive, smart status able to give discovery information in near realtime etc.

The ×2.5 speedup is the kind of things that motivate this series. Even if the mozilla-unified/mozilla-try is just and example it triggers the kind of pathological case we encounter in real life: large undecided set. We keep finding such pathological case from time to time, and will keep finding them. In addition there are case with legitimate large undedicated set (the mozilla example for one).
Having this faster code significantly reduce the impact of these pathological cases.

In addition as Georges pointed out, having this logic in Rust is a good step toward having a fast path detecting no-op pull. Something else we are interrested in.

In D6430#96225, @marmoute wrote:

The ×2.5 speedup is the kind of things that motivate this series. Even if the mozilla-unified/mozilla-try is just and example it triggers the kind of pathological case we encounter in real life: large undecided set. We keep finding such pathological case from time to time, and will keep finding them. In addition there are case with legitimate large undedicated set (the mozilla example for one).
Having this faster code significantly reduce the impact of these pathological cases.

I still don't see how those cases are CPU-bound.

I won't object if someone else who believes in this queues the series, though. I'll even add accept stamps. But I want to see that some other reviewer believes in it first.

In the couple of last version, we already saved minutes in real life use case simply by improving pure CPU processing time client side. Can you elaborate on what other types of evidences you need to convince you there exist CPU bound case for discovery?

I tested this series on our repository and found the following speedup in a case when there are 6k heads and 15k csets missing:

Before:

! wall 3.281062 comb 3.280000 user 3.200000 sys 0.080000 (best of 25)
! wall 4.941540 comb 4.940000 user 4.860000 sys 0.080000 (max of 25)
! wall 4.181741 comb 4.180800 user 4.103600 sys 0.077200 (avg of 25)
! wall 4.281518 comb 4.290000 user 4.180000 sys 0.110000 (median of 25)

After:

! wall 2.589386 comb 2.590000 user 2.490000 sys 0.100000 (best of 25)
! wall 4.658467 comb 4.660000 user 4.510000 sys 0.150000 (max of 25)
! wall 3.624166 comb 3.625200 user 3.544400 sys 0.080800 (avg of 25)
! wall 3.633663 comb 3.640000 user 3.540000 sys 0.100000 (median of 25)

I'm not thrilled with this - it's a lot of code, and I think there are some simpler options that haven't been explored. For example, clients could use the logexchange infrastructure to prime the sampling process and do a lot better.

That said, it seems "fine" and I'm okay to land it with the understanding that if it becomes a maintenance burden we'll almost certainly rip it out in favor of the pure-Python codepath. I'm queueing it and will send a follow-up for one of my review comments.

gracinet added a commit: rHG4d20b1fe8a72: rust-discovery: using from Python code.Aug 14 2019, 4:17 PM

This revision was not accepted when it landed; it landed in state Needs Review.

Closed by commit rHG4d20b1fe8a72: rust-discovery: using from Python code (authored by gracinet). · Explain Why

This revision was automatically updated to reflect the committed changes.

Revision Contents
Changeset List

			Path	Packages
M			mercurial/setdiscovery.py (5 lines)

Diff	ID	Description	Created	Lint	Unit
Base		Base
Diff 1	15230		May 22 2019, 1:00 PM	★	★
Diff 2	15473		Jun 12 2019, 2:17 PM	★	★
Diff 3	15487		Jun 13 2019, 9:33 AM	★	★
Diff 4	16192	rHG4d20b1fe8a72fde190018a608b6b26f9e2d2cd45	Feb 20 2019, 3:04 AM	★	★

Status	Author	Revision
Closed	gracinet	D6430 rust-discovery: using from Python code
Closed	gracinet	D6429 rust-discovery: optimization of add commons/missings for empty arguments
Closed	gracinet	D6428 rust-discovery: using the children cache in add_missing
Closed	gracinet	D6427 discovery: new devel.discovery.randomize option
Closed	gracinet	D6426 rust-discovery: optionally don't randomize at all, for tests
Closed	gracinet	D6425 rust-discovery: exposing sampling to python
Closed	gracinet	D6424 rust-discovery: takefullsample() core implementation
Closed	gracinet	D6423 rust-discovery: core implementation for take_quick_sample()
Closed	gracinet	D6517 rust-discovery: read the index from a repo passed at init
Closed	gracinet	D6516 rust-discovery: accept the new 'respectsize' init arg

Diff 16192

mercurial/setdiscovery.py


	from .i18n import _			from .i18n import _
	from .node import (			from .node import (
	nullid,			nullid,
	nullrev,			nullrev,
	)			)
	from . import (			from . import (
	error,			error,
				policy,
	util,			util,
	)			)

	def _updatesample(revs, heads, sample, parentfn, quicksamplesize=0):			def _updatesample(revs, heads, sample, parentfn, quicksamplesize=0):
	"""update an existing sample to match the expected size			"""update an existing sample to match the expected size

	The sample is updated with revs exponentially distant from each head of the			The sample is updated with revs exponentially distant from each head of the
	<revs> set. (H~1, H~2, H~4, H~8, etc).			<revs> set. (H~1, H~2, H~4, H~8, etc).
	takefrom = list(revs - sample)			takefrom = list(revs - sample)
	if self.randomize:			if self.randomize:
	sample.update(random.sample(takefrom, more))			sample.update(random.sample(takefrom, more))
	else:			else:
	takefrom.sort()			takefrom.sort()
	sample.update(takefrom[:more])			sample.update(takefrom[:more])
	return sample			return sample

				partialdiscovery = policy.importrust('discovery',
				member='PartialDiscovery',
				default=partialdiscovery)

	def findcommonheads(ui, local, remote,			def findcommonheads(ui, local, remote,
	initialsamplesize=100,			initialsamplesize=100,
	fullsamplesize=200,			fullsamplesize=200,
	abortwhenunrelated=True,			abortwhenunrelated=True,
	ancestorsof=None,			ancestorsof=None,
	samplegrowth=1.05):			samplegrowth=1.05):
	'''Return a tuple (common, anyincoming, remoteheads) used to identify			'''Return a tuple (common, anyincoming, remoteheads) used to identify
	missing nodes from or in remote.			missing nodes from or in remote.