D6430 rust-discovery: using from Python code

This is an archive of the discontinued Mercurial Phabricator instance.

rust-discovery: using from Python code
ClosedPublic

Authored by gracinet on May 22 2019, 1:00 PM.

Details

Summary

As previously done in other topics, the Rust version is used if it's been
built.
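
Concretely, this relies on the conditional-import pattern discussed later in
this thread (policy.importrust, queued as rHGf7385ed775a8). Here is a minimal
sketch of that pattern; policy.importrust is the real helper, but the
member/default wiring shown is our assumption of how the pieces fit, not a
quote of the patch:

  from mercurial import policy

  class partialdiscovery(object):
      """Pure-Python implementation, kept as the fallback."""
      # ... implementation elided ...

  # If the Rust extension was built, rebind the name to the Rust-backed
  # class; otherwise importrust() returns `default` unchanged, so callers
  # never need to know which implementation they got.
  partialdiscovery = policy.importrust(
      'discovery', member='PartialDiscovery', default=partialdiscovery
  )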

The fully-Rust version of the partialdiscovery class has a performance
advantage over the Python version (which itself already uses the Rust
MissingAncestors) if the undecided set is big enough. Otherwise no sampling
occurs, and the discovery is reasonably fast anyway.

Note: it's hard to predict the size of the initial undecided set; it can
depend on the kind of topological changes between the local and remote graphs.
The point of the Rust version is to make the bad cases acceptable.

More specifically, the performance advantages are:

  • faster sampling, especially takefullsample()
  • much faster addmissings() in almost all cases (see commit message in grandparent of the present changeset)
  • no conversion cost of the undecided set at the interface between Rust and Python (see the interface sketch below)
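
As a rough sketch of where those methods sit in the discovery loop (schematic
only: the method names follow the partialdiscovery interface discussed in this
thread, but the signatures are assumptions, and query_remote is a hypothetical
stand-in for the wire-protocol round trip):

  # Schematic discovery driver loop, not Mercurial's actual code.
  def discover(disco, headrevs, query_remote, samplesize=200):
      while not disco.iscomplete():
          # Sampling is where the Rust version shines on big undecided sets.
          sample = disco.takefullsample(headrevs, samplesize)
          yesno = query_remote(sample)  # one known/unknown flag per rev
          commons = [r for r, known in zip(sample, yesno) if known]
          missings = [r for r, known in zip(sample, yesno) if not known]
          disco.addcommons(commons)
          disco.addmissings(missings)  # much faster in Rust
      return disco.commonheads()

With the full-Rust class, the undecided set stays on the Rust side across all
of these calls instead of being converted at every round.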

Measurements with big undecided sets

For an extreme example, discovery between mozilla-try and mozilla-unified
(over one million undecided revisions, same case as in dbd0fcca6dfc), we
get roughly 2.5x to 3x better performance:

  • Growing sample size (5% starting with 200): time goes down from 210 to 72 seconds.
  • Constant sample size of 200: time goes down from 1853 to 659 seconds.
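
For reference, a hedged reading of that growing-sample schedule (we assume
"5% starting with 200" means the sample size starts at 200 and grows by 5%
per discovery round; the helper below is illustrative, not Mercurial's code):

  import itertools

  def growing_sample_sizes(initial=200, rate=1.05):
      """Yield the assumed per-round sample sizes: 200, 210, 220, 231, ..."""
      size = float(initial)
      while True:
          yield int(size)
          size *= rate

  print(list(itertools.islice(growing_sample_sizes(), 4)))  # [200, 210, 220, 231]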

With a sample size computed from the number of roots and heads of the
undecided set (respectsize is False), here are perfdiscovery results:

Before ! wall 9.358729 comb 9.360000 user 9.310000 sys 0.050000 (median of 50)
After ! wall 3.793819 comb 3.790000 user 3.750000 sys 0.040000 (median of 50)

In that latter case, the sample sizes are routinely in the hundreds of
thousands of revisions. While still faster, the Rust iteration in
addmissings has less of an advantage than with smaller sample sizes, but
addcommons becomes noticeably faster, probably a consequence of not having
to copy big sets back and forth.

This example is not a goal in itself, but it showcases several different
areas in which the process can become slow, due to different factors, and
how this full Rust version can help.

Measurements with small undecided sets

In cases where the undecided set is small enough that no sampling occurs,
the Rust version has a disadvantage at init time if targetheads is really big
(some time is lost translating it into Rust data structures), but that cost
is compensated by the faster addmissings().

On a private repository with over one million commits, we still get a minor
improvement of 6.8%:

Before ! wall 0.593585 comb 0.590000 user 0.550000 sys 0.040000 (median of 50)
After  ! wall 0.553035 comb 0.550000 user 0.520000 sys 0.030000 (median of 50)

What's interesting in that case is the first addinfo() call: 180ms for Rust
versus 233ms for Python+C, mostly because add_missings and the children cache
computation take less than 0.2ms on the Rust side versus over 40ms on the
Python side.
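
The children cache mentioned here can be pictured as inverting the parent
relation over the undecided revisions. A hedged illustration (a generic
sketch, not the actual Mercurial or Rust implementation; parents is assumed
to be a callable returning the parent revisions of a rev):

  def childrenmap(parents, revs):
      """Map each rev in `revs` to its children within `revs`."""
      children = {r: [] for r in revs}
      for r in revs:
          for p in parents(r):
              if p in children:
                  children[p].append(r)
      return children

Recomputing this kind of inversion is exactly the tight loop that benefits
from moving to Rust.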

The worst case we have on hand is with mozilla-try, prepared with
discovery-helper.sh for 10 heads and depth 10: time goes up 2.2% on the median.
In this case targetheads is really huge, with 165842 server heads.

Before ! wall 0.823884 comb 0.810000 user 0.790000 sys 0.020000 (median of 50)
After  ! wall 0.842607 comb 0.840000 user 0.800000 sys 0.040000 (median of 50)

If that were considered a problem, more adjustments could be made, but they
would be premature at this stage: cooking special variants of methods of the
inner MissingAncestors object, or retrieving local heads directly from Rust to
avoid the conversion cost. Effort would probably be better spent at this point
improving the surroundings if needed.

Here's another data point with a smaller repository, pypy, where performance
is almost identical:

Before ! wall 0.015121 comb 0.030000 user 0.020000 sys 0.010000 (median of 186)
After ! wall 0.015009 comb 0.010000 user 0.010000 sys 0.000000 (median of 184)

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

gracinet created this revision. May 22 2019, 1:00 PM

I think this series needs to be updated for D2647. I'm curious to see the timing numbers after that patch (I can get those myself once the series is updated if you're tired of running the perf commands).

gracinet edited the summary of this revision. Jun 12 2019, 2:17 PM
gracinet updated this revision to Diff 15473.

Update of the whole series done, but it ended up in an inconsistent state, and I have to go. Please don't act on it until I give the signal that it's ready (sorry for the inconvenience).

gracinet edited the summary of this revision. Jun 13 2019, 9:33 AM
gracinet updated this revision to Diff 15487.

Instability fixed, and I took the opportunity to rebase again to leverage policy.importrust, which was queued yesterday (rHGf7385ed775a8).

I added new performance measurements for the big pathological case (mozilla-try / mozilla-unified) with respectsize=False. The speedup compared to the parent commit is still about 2.5x for me, though not exactly for the same reasons, which is interesting.

> This example is not a goal in itself,

Hmm, I would actually consider that the strongest selling point of this series. The other case (the 6.8% improvement) doesn't seem enough to justify all this code.

> but it showcases several different
> areas in which the process can become slow, due to different factors, and
> how this full Rust version can help.

I don't see this as much of an argument for discovery in Rust. I only care about practically measurable differences. The only real improvement was that "pull into mozilla-try repo from mozilla-unified" case.

I'm fine with taking this series if we think that that use case is important enough to justify the extra maintenance of having a second implementation of the discovery code. I wouldn't think that use case is very common, but I'm guessing that the fact that you spent time on this series means that you have customers with repos with similar characteristics.

What do others think?

Nobody has reacted yet to @martinvonz's call for other opinions; can someone look at it?

Let me restate that the series makes a good difference when the sampling part of the discovery process actually kicks in, and it does that roughly three times faster. In other cases, none of the new Rust code is actually executed; the main difference lies in data interchange at the Rust/Python interface (recall that all performance comparisons are against a Rust-enhanced Mercurial, in particular one with the Rust MissingAncestors object).

The size of the undecided set can behave in a chaotic manner, especially in discoveries that involve continuous integration repositories with lots of merging and branching. The mozilla-unified / mozilla-try case is but one example of that. Yes, we've seen discoveries go haywire at our customers', with the size of the undecided set exploding. What the series does, performance-wise, is help keep these cases within an acceptable time.

@martinvonz: I understand your concern about double maintenance. Let me reassure you that we are willing to maintain this, and it will cost us less than tracking down the perfect real-life example. About the 6.8%: that's a case where the undecided set is small; it's there to demonstrate that, despite the focus on sampling, the Rust version is still a bit faster on other code paths.

This series also has potential for:

  • further performance improvements if they're needed in the future; the obvious candidate would be a more aggressive integration with the MissingAncestors object, especially in addcommons()
  • pure Rust push/pull machinery, such as fast paths doing the discovery entirely in Rust and bailing out if there's nothing to actually send or receive, a smart status able to give discovery information in near real time, etc.

The 2.5x speedup is the kind of thing that motivates this series. Even if the mozilla-unified/mozilla-try case is just an example, it triggers the kind of pathological case we encounter in real life: a large undecided set. We keep finding such pathological cases from time to time, and will keep finding them. In addition, there are cases with a legitimately large undecided set (the mozilla example for one).
Having this faster code significantly reduces the impact of these pathological cases.

In addition, as Georges pointed out, having this logic in Rust is a good step toward having a fast path detecting no-op pulls. Something else we are interested in.

> The 2.5x speedup is the kind of thing that motivates this series. Even if the mozilla-unified/mozilla-try case is just an example, it triggers the kind of pathological case we encounter in real life: a large undecided set. We keep finding such pathological cases from time to time, and will keep finding them. In addition, there are cases with a legitimately large undecided set (the mozilla example for one).
> Having this faster code significantly reduces the impact of these pathological cases.

I still don't see how those cases are CPU-bound.

I won't object if someone else who believes in this queues the series, though. I'll even add accept stamps. But I want to see that some other reviewer believes in it first.

In the last couple of versions, we already saved minutes in real-life use cases simply by improving pure CPU processing time client-side. Can you elaborate on what other kinds of evidence you would need to be convinced that CPU-bound discovery cases exist?

pulkit added a subscriber: pulkit. Jul 17 2019, 6:33 AM

I tested this series on our repository and found the following speedup in a case where there are 6k heads and 15k csets missing:

Before:

! wall 3.281062 comb 3.280000 user 3.200000 sys 0.080000 (best of 25)
! wall 4.941540 comb 4.940000 user 4.860000 sys 0.080000 (max of 25)
! wall 4.181741 comb 4.180800 user 4.103600 sys 0.077200 (avg of 25)
! wall 4.281518 comb 4.290000 user 4.180000 sys 0.110000 (median of 25)

After:

! wall 2.589386 comb 2.590000 user 2.490000 sys 0.100000 (best of 25)
! wall 4.658467 comb 4.660000 user 4.510000 sys 0.150000 (max of 25)
! wall 3.624166 comb 3.625200 user 3.544400 sys 0.080800 (avg of 25)
! wall 3.633663 comb 3.640000 user 3.540000 sys 0.100000 (median of 25)

I'm not thrilled with this - it's a lot of code, and I think there are some simpler options that haven't been explored. For example, clients could use the logexchange infrastructure to prime the sampling process and do a lot better.

That said, it seems "fine" and I'm okay to land it with the understanding that if it becomes a maintenance burden we'll almost certainly rip it out in favor of the pure-Python codepath. I'm queueing it and will send a follow-up for one of my review comments.

This revision was not accepted when it landed; it landed in state Needs Review.
This revision was automatically updated to reflect the committed changes.