( )⚙ D7834 nodemap: have some python code writing a nodemap in persistent binary form

This is an archive of the discontinued Mercurial Phabricator instance.

nodemap: have some python code writing a nodemap in persistent binary form
ClosedPublic

Authored by marmoute on Jan 11 2020, 12:02 PM.

Details

Summary

This python code aims to be as "simple" as possible. It is a reference
implementation of the data we are going to write on disk (and possibly,
later a way for pure python install to make sure the on disk data are up to
date).

It is not optimized for performance and rebuild the full data structure from
the index every time.

This is a stepping stone toward a persistent nodemap on disk.

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

marmoute created this revision.Jan 11 2020, 12:02 PM
marmoute updated this revision to Diff 19171.Jan 13 2020, 10:06 AM
martinvonz requested changes to this revision.Jan 16 2020, 4:44 PM
martinvonz added a subscriber: martinvonz.
martinvonz added inline comments.
mercurial/revlogutils/nodemap.py
48–50

nit: call it serialize() to match the description? or maybe serialize_index() and rename _dump_trie() to _serialize_trie()

58

nit: replace "rev-0" by "rev 0" so it's not interpreted as subtraction

67

nit: maybe "... involution (is its own inverse)" because i don't think most people know what an involution is

Also, you could rename the function to encode_rev(). You could then do:

# encode_rev is its own inverse
decode_rev = encode_rev

Or not even define that since it's not used

80

s/store/stores/

82

s/nn/n/
s/key/keys/
s/Value/Values/

86

nit: will rev, entry in enumerate(index) help? (i know we don't care about speed)

113–119

Would it work to replace these lines by the following?

new = {}
block[_to_int(current_hex[level])] = new
_insert_into_block(index, level + 1, new, other_rev, other_hex)
_insert_into_block(index, level + 1, new, current_rev, current_hex)
122

as mentioned above, i think _serialize_tree() would be a clearer name (it matches how you described it in the docstring)

146

and this could be _serialize_block()

156

serialize_value()?

This revision now requires changes to proceed.Jan 16 2020, 4:44 PM
marmoute added inline comments.Jan 16 2020, 4:52 PM
mercurial/revlogutils/nodemap.py
113–119

no, we need the value of block to be updated after each iteration otherwise the next iteration would not descend to the level it just created.

martinvonz added inline comments.Jan 16 2020, 5:10 PM
mercurial/revlogutils/nodemap.py
113–119

The test case still passes with that change, though....

martinvonz added inline comments.Jan 17 2020, 12:32 AM
mercurial/debugcommands.py
2090

could you also rename args to opts? or is there a reason to not follow the same convention we follow everywhere (?) else?

marmoute added inline comments.Jan 17 2020, 5:18 AM
mercurial/debugcommands.py
2090

No good reason for this to not be opts. I'll rename it.

Can you clarify which change you are actually requesting (I see mostly nits).

So far I see the **args**opts change but I might be missing others.

Can you clarify which change you are actually requesting (I see mostly nits).
So far I see the **args**opts change but I might be missing others.

I'd prefer if you go through them all.

As stated yesterday, I'll have limited screen exposure time available for the next couple of day. I am going to ignore any minor nits during that time. (Am I happy to look at them afterward and follow up).

Right now, I have only identified the s/**args/**opts/ as non nits. I am sending an updated version that update that part. If any remains, please highlight any other major complains about this patch that is worth blocking the rest of the series.

marmoute updated this revision to Diff 19424.Jan 17 2020, 1:05 PM
marmoute added inline comments.Jan 31 2020, 4:51 AM
mercurial/revlogutils/nodemap.py
48–50

I don't really like serialize because we won't have a deserialize step. The data should be usable directly with a mmap from disk. So if we standardize function name on something I would rather go toward "persist" (and "parse" for the python function that checks it).

This can apply to the "dump" function and docstring.

58

subtraction by zero would be odd, but why not.

67

I am not convinced by the encode_rev/decode_rev form. I don't think it very important to define involution, people can look for it if they need to. The reste of the docstring already explain the functionworks both way.

86

We need to index the entry anyway, so that won't change the line much. I find the current form clearer.

113–119

With more eyes, I am seeing what you are doing now. Yeah I think if would work. And it is more consistent with the recursivity used int he previous conditional block. Updating the patch,

122

I would rather move to _persist_trie. What do you think?

146

so it would be _persist_block()

156

and _persist_value()

marmoute retitled this revision from nodemap: have some python code serializing a nodemap to nodemap: have some python code writing a nodemap in persistent binary form.Jan 31 2020, 10:27 AM
marmoute edited the summary of this revision. (Show Details)
marmoute updated this revision to Diff 19750.
martinvonz added inline comments.Jan 31 2020, 2:20 PM
mercurial/revlogutils/nodemap.py
48–50

Unless the Python code is going to mmap the file contents directly into Python objects (which I don't think is going to happen), there will be serialization and deserialization involved.

67

I don't think it very important to define involution, people can look for it if they need to.

But it's helpful to be helpful, no? Is there a good reason *not* to try to help? Maybe I'm in a small minority who didn't know what an involution is?

86

I don't see why you'd be against enumerate() here. Are you simply against enumerate() in general?

marmoute added inline comments.Jan 31 2020, 2:34 PM
mercurial/revlogutils/nodemap.py
48–50

The Python code is not meant to be use in production, especially not the parsing one (because it bring no advantage).

So I want the API to be named after the actual semantic, not after this annecdotical python code.

86

I am not against enumerate in general, I just feel like direct indexing is clearer here (and the different is not that important here anyway).

marmoute updated this revision to Diff 19779.Jan 31 2020, 2:43 PM
marmoute added inline comments.Feb 2 2020, 3:15 AM
mercurial/revlogutils/nodemap.py
86

Also, using enumerate in the incremental version of this code would make things sinificantly messier, so it won't. And keeping the two code similar has value.

marmoute updated this revision to Diff 19824.Feb 2 2020, 3:17 AM

small doc update on .#s[1]

marmoute updated this revision to Diff 19882.Feb 4 2020, 7:23 PM

rebase to latest default

martinvonz added inline comments.Feb 4 2020, 7:40 PM
mercurial/revlogutils/nodemap.py
48–50

The actual semantic of this code (the Python code) is to serialize the data. To me, "persist" implies writing it to disk, which is not what this code does. I think it's just misleading to call it that even if the Rust code will actually be writing directly to disk.

marmoute added inline comments.Feb 5 2020, 3:43 AM
mercurial/revlogutils/nodemap.py
48–50

On my side, I feel like serialize is misleading because this is not meant to have a "deserialization" phase taht reconstruct the data in memory.

Can we wait for the bulk of the code to be in before we decide on a name and do a (simpler) changeset to align things to the naming scheme?

marmoute updated this revision to Diff 20015.Feb 7 2020, 5:28 PM
This revision was not accepted when it landed; it landed in state Needs Review.
This revision was automatically updated to reflect the committed changes.