rust-cpython: bindings for MissingAncestors
ClosedPublic

Authored by gracinet on Jan 10 2019, 4:56 AM.

Details

Summary

The exposition is rather straightforward, except for the
remove_ancestors_from() method, which forces us to an inefficient
conversion between Python sets and Rust HashSets.

Two alternatives are proposed in comments:

  • changing the inner API to "emit" the revision numbers to discard this would be a substantial change, and it would be better only in the cases where there are more to retain than to discard
  • mutating the Python set directly: this would force us to define an abstract RevisionSet trait, and implement it both for plain HashSet and for a struct enclosing a Python set with the GIL marker Python<'p>, also a non trivial effort.

The main (and seemingly only) caller of this method being
mercurial.setdiscovery, which is currently undergoing serious refactoring,
it's not clear whether these improvements would be worth the effort right now,
so we're leaving it as-is.

Also, in get_bases() (will also be used by setdiscovery), we'd prefer
to build a Python set directly, but we resort to returning a tuple, waiting
to hear back from our PR onto rust-cpython about that

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
gracinet created this revision.Jan 10 2019, 4:56 AM
yuja added a subscriber: yuja.Jan 11 2019, 9:10 AM

Queued up to this patch, thanks.

+ def new(_cls, index: PyObject, bases: PyObject) -> PyResult<MissingAncestors> {
+ let bases_vec: Vec<Revision> = rev_pyiter_collect(py, &bases)?;
+ let inner = CoreMissing::new(Index::new(py, index)?, bases_vec);

We might want to directly build HashSet<Revision> here if that matters.

+ def missingancestors(&self, revs: PyObject) -> PyResult<PyList> {
+ let mut inner = self.inner(py).borrow_mut();
+ let revs_vec: Vec<Revision> = rev_pyiter_collect(py, &revs)?;
+ let missing_vec = match inner.missing_ancestors(revs_vec) {
+ Ok(missing) => missing,
+ Err(e) => {
+ return Err(GraphError::pynew(py, e));
+ }
+ };

+ // convert as Python list
+ let mut missing_pyint_vec: Vec<PyObject> = Vec::with_capacity(
+ missing_vec.len());
+ for rev in missing_vec {
+ missing_pyint_vec.push(rev.to_py_object(py).into_object());
+ }
+ Ok(PyList::new(py, missing_pyint_vec.as_slice()))

Maybe this can be extracted to a helper function so that we can .map()
the result.

This revision was automatically updated to reflect the committed changes.
kevincox added inline comments.Jan 12 2019, 6:30 AM
rust/hg-cpython/src/ancestors.rs
170

This could be a .collect() call. Something like:

let bases_vec: Vec<PyObject> = bases_set.into_iter()
    .map(|rev| rev.to_py_object(py).into_object())
    .collect();
193

This could also be a collect().

204

This match can become a .map_error(|e| GraphError::pynew(py, e))?

212

This can also be a .collect()

gracinet added inline comments.Jan 14 2019, 1:16 PM
rust/hg-cpython/src/ancestors.rs
170

Hi @kevincox. I have a general question for these, please correct me if I'm making wrong assumptions.

Given that we know in advance exactly the number of elements required, and that it can be large, doesn't it add overhead to convert everything to collection of iterators? I suppose the Vec would have to grow several times, and that's as many calls to malloc() inernally. I'm not sure at this point it would be very costly, but in this case where we consume the Vec immediately, is there a reason to believe that going the collect() way could have their own performance benefits?

Note: this code landed, but I'm about to submit some related refactors, and in this specific case, it's going to be replace I hope very soon with a call to PySet, but I'm asking in general.

Cheers,

kevincox added inline comments.Jan 14 2019, 6:16 PM
rust/hg-cpython/src/ancestors.rs
170

Iterator has a size_hint method which can provide a clue as to the size of the iterator if know. There is also the ExactSizeIterator trait but it is sufficient to say that most simple operations on slice (or Vec) iterators will maintain the size hint and that collecting into a vector will be the most efficient way to construct a vector.

In theory the performance could be slightly better for the collect approach as you could avoid some bounds checking and incrementing the size each time but in practice I would expect similar performance.

So the TL;DR is don't worry about collect performance especially for the simple situations. If there are more allocations then necessary then the bug is in the rust std crate.

gracinet added inline comments.Jan 14 2019, 7:45 PM
rust/hg-cpython/src/ancestors.rs
170

Thanks for the detailed answer.

Then the problem, in case of iterators coming directly from Python, is that they don't currently implement size_hint(), so there's an improvement to be done at this level (many Pythoin iterators do have a __length_hint() method, so it shouldn't be a problem to reuse it)

I have currently some suspicions that the conversions between Rust and Python are the bottleneck in some important cases, so I'll have to measure this things anyway. I won't touch that immediately, but I'll get back to it with real soon.

gracinet marked 2 inline comments as done.Jan 14 2019, 8:57 PM
gracinet added inline comments.
rust/hg-cpython/src/ancestors.rs
170

Got this one in the wrong direction (sorry about that!) : in this case, this is a conversion from Rust, so we de have proper size_hint()
Still, I'll keep in mind the question for the other direction (in which we are using collect() already.

kevincox added inline comments.Jan 15 2019, 3:02 AM
rust/hg-cpython/src/ancestors.rs
170

That makes sense. Also if you still want to use iterators without wiring up size hints properly you could do something like:

let v = Vector::with_capacity(hint);
v.extend(iterators);