( )⚙ D6594 [ABANDONED] RFC dirstatemap

This is an archive of the discontinued Mercurial Phabricator instance.

[ABANDONED] RFC dirstatemap
AbandonedPublic

Authored by Alphare on Jul 1 2019, 11:50 AM.

Details

Reviewers
None
Group Reviewers
hg-reviewers
Summary

This is a Rust implementation of the dirstatemap class.

The Python implementation uses propertycache for lazy initialization, some of
which have side effects, all for performance reasons. While meant for
encapsulation, callers from different parts of the code break said
encapsulation. Lastly, most of its inner datastructures are used as iterators
from the rest of the code base.
All of the above proved to be a real challenge and the reason why this is an
RFC patch. While the code works (read: the tests pass), it is slower and
harder to maintain than it should be.

I will direct your attention to two files to start:

  • rust/hg-cpython/src/dirstate/dirstate_map.rs
  • rust/hg-cpython/src/dirstate/macros.rs

These files contain documentation and TODO information that contextualize the
related code. I hope there is enough information to help you get an idea of
the issue.

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

Alphare created this revision.Jul 1 2019, 11:50 AM

If that helps, I just wrote an article on sharing references between Rust and Python: https://raphaelgomes.dev/blog/articles/2019-07-01-sharing-references-between-python-and-rust.html.

kevincox added inline comments.Jul 4 2019, 11:19 AM
rust/hg-core/src/dirstate/dirstate_map.rs
48

Can you just #[derive(Default)]?

96

This should probably be replaced with an enum. You can implement TryFrom<u8> for ease of conversion from python.

228

It would probably be good to move these sential values into well-named constants. That or add comments explaining what they mean in each case.

238

Drop the -> ()

246

This function seems quite odd. Why isn't dirs always set? It seems like you do some initialization then set it. However does this mean that this class has a different set of allowed methods before and after dirs is set?

If this is the case you should either A) Add assertions at the top of every method to ensure they are only called at allowed times or B) split out the initialization into a builder type.

Otherwise why can't dirs be set at initialization.

yuja added a subscriber: yuja.Jul 7 2019, 9:52 AM

sharing references between Rust and Python:
https://raphaelgomes.dev/blog/articles/2019-07-01-sharing-references-between-python-and-rust.html.

Do you have some benchmark number compared to simpler (and dumb) approaches
such as caching PyList/Dict representation?

I heard boxing a PyObject has measurable cost, so we might even want to keep
the entire data backed by PyObjects depending on how frequently the data will
be exposed to Python world.

(I don't have expertise to comment on the soundness of the proposed ref handling
business, and this patch is too big for me to review, sorry.)

Alphare marked 4 inline comments as done.Jul 8 2019, 9:34 AM

Do you have some benchmark number compared to simpler (and dumb) approaches
such as caching PyList/Dict representation?

I don't have any benchmarks as of yet, sorry.

I heard boxing a PyObject has measurable cost,

Do you have a link to any benchmarks? I'd be interested.

so we might even want to keep the entire data backed by PyObjects
depending on how frequently the data will be exposed to Python world.

How would we decouple hg-core and hg-cpython then? If Python expects references to big objects (within propertycache for instance), we still need to exchange a fair bit of data from Rust if Rust does the heavy lifting.

(I don't have expertise to comment on the soundness of the proposed ref handling
business, and this patch is too big for me to review, sorry.)

I understand. I am planning to split the change into smaller bits, but I was hoping to get some advice beforehand to reduce noise and history management work. You already have helped me a lot, so has @kevincox.

rust/hg-core/src/dirstate/dirstate_map.rs
246

I agree that this pattern is quite unsatisfactory. The Python implementation uses a lot of lazy properties to omit some sub data structures until (or unless) they are really needed. This proved quite awkward to replicate, but there has to be a better solution to this problem.
Your idea of using a builder pattern is tempting, but I don't know how it'll play out with Python, I will have to try.

Alphare updated this revision to Diff 15791.Jul 8 2019, 9:53 AM
yuja added a comment.Jul 8 2019, 10:16 AM
> I heard boxing a PyObject has measurable cost,
Do you have a link to any benchmarks? I'd be interested.

No number. IIRC, Augie tried to implement a parser by using nom, which
was breezing fast, until converting parsed result back to PyObjects.
I think we also have some cached PyObjects in our C extension.

> so we might even want to keep the entire data backed by PyObjects 
> depending on how frequently the data will be exposed to Python world.
How would we decouple hg-core and hg-cpython then?

By defining data-layer interface as trait?

It might be less safe because of 'py' handle of rust-cpython, but if we
plan to switch to PyO3, maybe we can ignore the issue for the time being.

If Python expects references to big objects (within propertycache for
instance), we still need to exchange a fair bit of data from Rust if Rust
does the heavy lifting.

(I read "big" as "many".)

Yes, so we'll probably want to minimize the data exchanged between Rust
and Python. This means we'll eventually move more logic to Rust side, which
hopefully reduce the number of elements to be exposed through the iterator
interface. Until then, it might be better to move iterator-heavy objects
to CPython/PyO3 layer. Just my guess.

In D6594#96518, @yuja wrote:

Yes, so we'll probably want to minimize the data exchanged between Rust
and Python. This means we'll eventually move more logic to Rust side, which
hopefully reduce the number of elements to be exposed through the iterator
interface. Until then, it might be better to move iterator-heavy objects
to CPython/PyO3 layer. Just my guess.

This is likely true. But I suspect we don't need to worry too much unless performance is shown to be an issue. Then it makes sense to make spot-optimizations.

Especially if there are plans to move more code into Rust over time it will be very easy to spend a lot of time optimizing the interface only to find that the interface has moved a layer up in 4 months.

rust/hg-core/src/dirstate/dirstate_map.rs
276

Give this magic number a name?

I have updated and split this RFC into a new series (https://phab.mercurial-scm.org/D6625, see the Stack tab).
This differential can be considered outdated.

@kevincox: I will try my best to take into account the comments you've already left here after the new series was published.

Alphare retitled this revision from RFC dirstatemap to [ABANDONED] RFC dirstatemap.Jul 10 2019, 4:40 PM
Alphare abandoned this revision.Jul 10 2019, 4:43 PM

(+ infra list, some folks that I think volunteered in the past)

rust/hg-core/src/utils/files.rs