Page MenuHomePhabricator

treedirstate: add Python linkage

Authored by mbthomas on Nov 14 2017, 12:39 PM.


Group Reviewers
Restricted Project
rFBHGX4afa6d8392af: treedirstate: add Python linkage

Adds a python module that uses the Rust treedirstate to replace the dirstate

Diff Detail

rFBHGX Facebook Mercurial Extensions
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

mbthomas created this revision.Nov 14 2017, 12:39 PM
Herald added a reviewer: Restricted Project. · View Herald TranscriptNov 14 2017, 12:39 PM
mbthomas updated this revision to Diff 3525.Nov 15 2017, 4:15 AM
mbthomas updated this revision to Diff 3598.Nov 17 2017, 9:45 AM
quark added a subscriber: quark.Nov 17 2017, 8:01 PM

Haven't looked into details yet.

The "modern" style is put extensions under hgext3rd.treedirstate. It is less likely to conflict with other modules and avoids one lookup when being loaded.


We usually keep the GPL header here.


This affects hg help output. Use lower-case description to be more consistent with other extensions.


Put from __future__ import absolute_import here to make test-check-py3-compat-hg.t happy.

mbthomas updated this revision to Diff 3653.Nov 20 2017, 12:20 PM
durham added a subscriber: durham.Nov 21 2017, 12:34 AM
durham added inline comments.

This pattern will require searching down into the tree at each step right? Versus maintaining the state of the last search? Maybe add a comment explaining that room for future optimization?


Why do we not need to look at the 'now' argument when serializing? The pure/ implementation of pack_dirstate says it's there to handle the case of a dirstate write happening in the same second as a file modification, so it's ambiguous if the file is clean or not.

mbthomas added inline comments.Nov 21 2017, 6:32 AM

Will do. I contemplated maintaining the state, but it's actually pretty tricky. This seems good enough for now, given that our goal is to minimise the number of full-tree traversals anyway.


That's coming in D1404.

mbthomas updated this revision to Diff 3728.Nov 21 2017, 1:35 PM

The "modern" style is put extensions under hgext3rd.treedirstate.

I thought it was the other way round, so thanks for the pointer. I also took a look at your stack, and I think for consistency I should rename treedirstate/__init__/py to hgext3rd/ and move the rust code to the package hgext3rd.rust.treedirstate. Does that sound right?

Also I see you used non-camel-case for your python interop layer (with #[allow(non_camel_case_types)]). Again, for consistency, I should probably do the same.

quark added a comment.EditedNov 22 2017, 12:33 PM

I thought it was the other way round, so thanks for the pointer. I also took a look at your stack, and I think for consistency I should rename treedirstate/__init__/py to hgext3rd/ and move the rust code to the package hgext3rd.rust.treedirstate. Does that sound right?

I think treedirstate as a whole could be an extension with a Rust submodule.

The reason I put Rust module separately is because I plan to eventually use it not only for changelog indexes, but also obsstore indexes (reused by inhibit or fbamend), or packfile indexes. So it's more like a shared Rust module with some low-level components used by multiple extensions.

mbthomas updated this revision to Diff 3769.Nov 22 2017, 1:14 PM
jsgf requested changes to this revision.Nov 22 2017, 5:45 PM
jsgf added a subscriber: jsgf.
jsgf added inline comments.

Does this need to be a macro? What about something like:

fn map_pyerr<F, T, E>(py: &Python, expr: F) -> PyResult<T>
    where F: FnOnce() -> Result<T, E>,
          E: Error,
    expr().map_err(|e| PyErr::new(py, e.description())

Or is that that too cumbersome if the type parameters can't be inferred?


Might be a bit cleaner to move the PyIterator::from_object() call to its own line (bind to a new var). Or maybe map the iterator rather than using a for loop.


I assume all these ? will return failure if there's a tuple size or element type mismatch. Are the errors that end up on the Python side reasonably descriptive about what actually went wrong?


This this only accept UTF-8 encoded Python strings, or does it do some kind of conversion for Rust strings?


I'd prefer to flip this around: if !cond { cond not true } else { cond true } reads awkwardly.

I assume the 'r' in &FileState::new('r', 0, size, 0) means the same thing as in state != 'r'.


Lift out the Ok(...)s:

let res = if let Ok(filename) ... {
} else {


Lift out the Ok(...)s


The hastrackeddir / has_tracked_dir is bugging me. Is there a reason to make them different?


Does this mean anything non-bytes will be treated as the initial case? Should it explicitly check for however None gets represented? Otherwise I could imagine confusing bugs if you accidentally pass something else.

Could you use something like filename: Option<PyBytes> to get the macro to handle this automatically?


Lift Ok

This revision now requires changes to proceed.Nov 22 2017, 5:45 PM
quark added inline comments.Nov 22 2017, 11:30 PM

Do you need features in the git version? I added D1493 to remove 'static lifetime. A side effect of that is the version can be used.


Maybe it could be Into<PyErr> like D1470.


The Mercurial coding style is to not use underscores. Either here or .py has to be inconsistent. I slightly prefer making .py consistent.


This seems unnecessary. There is a check-code rule that forbids .next():

(r'\.next\(\)', "don't use .next(), use next(...)"),

Is list() necessary?


nit: use a tuple to make it immutable.


nit: Prefer with self._opener(...) as st:


nit: Use with repo.wlock(): instead.




Maybe move if useinnewrepos to wrapnewreporequirements as if repo.ui.configbool('format', 'usetreedirstate', True):. This is consistent with format.uselz4.


repo.dirstate is a filecache property. Use extensions.wrapfilecache instead.


What is this used for? Could it be removed?

setconfig in *setup usually makes chg sad (since the config is a flat map, chg will pick whatever being rewritten by the extension when creating a new ui object. That means the value stored in a config file will be ignored).

Thanks for the comments. Will take me a little while to rework to address them.


I need this fix from the git version, as I've hit the crash that results without it a couple of times.


@jsgf: The reason for the macro is to pass in the type of python exception to raise; PyErr::new is ambiguous without it.

An alternative without macros is to write code like:

    .map_err(|e| PyErr::new::<exc::IOError, _>(py, e.description())?

@quark: Is that safe to do? Also, it makes all your errors RuntimeErrors. I was hoping to give more meaningful exception types, but maybe that's not useful.


Yes, they should be proper Python exceptions (like IndexError), which is the same you'd get elsewhere in Mercurial in that scenario. (In practice these are all C-based dirstatetuple objects, which have exactly four elements and this is the only way to access them).


I should probably be using a PyBytes here. In practice it's always a string containing a single ASCII character. For a single byte value, it's surprisingly hard to extract safely.


Interesting, I see 'r' as the abnormal case and prefer that second. state != 'r' is somehow qualitatively different to !(state == 'r') in my head, even though they're the same. I'll flip them round and see how it looks to me.


It's needed for Python 2.7's iterator interface.


I think so, to match the API for dict.keys(). Importantly you're allowed to modify the dict whilst iterating over keys(). I'm not sure if anything actually uses it.


I add the config option in a later diff (this one was large enough already), but I'll move where the test happens.


It's used for telemetry.

quark added inline comments.Nov 23 2017, 12:50 PM

That's sad. I guess that explained some of the weird segfaults I have seen. Thanks for locating the root cause!


My understanding is ? does an implicit into() so it looks safe?
You can have other logic in Into to decide which Python error type to use.


I notice there is tweakdefaults._trackdirstatesizes, since it also has a repo object, could you use repo.requirements instead?

mbthomas added inline comments.Nov 23 2017, 1:10 PM

There are so many long methods that end up wrapping lines anyway, I think I will just go with the explicit map approach.

quark added inline comments.Nov 23 2017, 1:24 PM

For Python::acquire_gil(), Python itself allows nested locks from a same thread. With the py_class! macro expanded, all instance methods seem to have py: Python as its input parameter, which indicates lock acquired. That means other foreign threads are not taking the lock. So I think it's safe in instance methods. It adds some overhead calling PyGILState_Ensure and PyGILState_Release though.

mbthomas added inline comments.Nov 24 2017, 9:05 AM

The mercurial tests also notice these methods because their signatures are sufficiently Pythonish for their regexs to match. We'd either need to mark these files as excluded from the tests, or accept that these are "python functions implemented in Rust" and adopt the python standards.


Option<PyBytes> works. I'll use that. I'll also fix up a bunch other places in this file where I could use PyBytes in the signature.

mbthomas updated this revision to Diff 3822.Nov 24 2017, 11:53 AM
stash added a subscriber: stash.Nov 27 2017, 4:48 AM
stash added inline comments.

Is it possible to avoid using RefCell<...> and use just make some methods accept &mut self?

mbthomas added inline comments.Nov 27 2017, 7:58 AM

The rust-cpython library requires all methods accept &self, so you have to use interior mutability to have mutable state. (See the docs for instance methods for the requirements).

mbthomas updated this revision to Diff 3864.Nov 27 2017, 8:00 AM
durham accepted this revision.Nov 27 2017, 12:37 PM
This revision was automatically updated to reflect the committed changes.