This is an archive of the discontinued Mercurial Phabricator instance.

state: add logic to parse the state file in old way if cbor fails
AbandonedPublic

Authored by pulkit on Mar 3 2018, 2:19 PM.

Details

Reviewers
None
Group Reviewers
hg-reviewers
Summary

With updating the format of statefiles to use cbor, there can be cases when a
user updates hg in between a unresolved graft, rebase, histedit, cbor load
function in new hg won't be able to parse that properly. So let's fallback to
reading the file old way if cbor fails or does not returns a dictionary as cbor
may read that data without any error.

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

pulkit created this revision.Mar 3 2018, 2:19 PM
pulkit added inline comments.Mar 3 2018, 2:21 PM
mercurial/state.py
90

I am not confident about this part. Like to have some suggestions here.

92

Should I change thirdparty/cbor/ to raise specific errors?

durin42 added inline comments.
mercurial/state.py
90

Probably treat not-a-dict as corrupt and fall back to the other format?

92

Ugh. Yeah, maybe see if they'd take a patch upstream to raise a more explicit exception type.

pulkit updated this revision to Diff 6618.Mar 4 2018, 4:29 PM
pulkit added inline comments.Mar 4 2018, 4:31 PM
mercurial/state.py
92

@durin42 I feel sorry about adding a TODO but I did that because I won't be able to work for a week or so dur to laptop issue. If you are not okay with that, I am fine with these lying here.

pulkit updated this revision to Diff 6759.Mar 9 2018, 5:55 AM

Not sure where to record this comment in this series. So I'll pick this commit.

I think we want an explicit version header in the state files so clients know when they may be reading a file in an old format. For example, if I start a merge in one terminal window, I sometimes move to another terminal window to resolve parts of it. The multiple windows may be running different Mercurial versions. For example, sometimes one of the shells has a virtualenv activated and that virtualenv is running an older Mercurial. We don't want the older Mercurial trampling on state needed by the new Mercurial.

mercurial/state.py
88

bad ident

90

How about not requiring it to be a dict? I imagine practically all callers will want to pass a dict, but why does this class have to enforce it? I think the API would be simpler if it was an opaque object. In the simple case of graft, the state is simply a list of nodes. However, for more complex states, we could have nested structures. I don't see why cmdstate should be involved in lookups into the top-level structure. The difference is subtle, but here's an example:

# With dict-aware cmdstate
cmdstate = ...
cmdstate.load()
version = cmdstate['version']
for car in cmdstate['cars']:
   for wheel in car['wheels']:
       # whatever

# With agnostic cmdstate
cmdstate = ...
parking = cmdstate.load()
version = parking['version']
for car in parking['cars']:
   for wheel in car['wheels']:
       # whatever

Not sure where to record this comment in this series. So I'll pick this commit.
I think we want an explicit version header in the state files so clients know when they may be reading a file in an old format. For example, if I start a merge in one terminal window, I sometimes move to another terminal window to resolve parts of it. The multiple windows may be running different Mercurial versions. For example, sometimes one of the shells has a virtualenv activated and that virtualenv is running an older Mercurial. We don't want the older Mercurial trampling on state needed by the new Mercurial.

Also, it doesn't seem like CBOR defines any magic bytes to start the top-level object with, so it's not obvious to me that an old state file (e.g. on containing just a nodeid) could not be parsed as a valid CBOR file. Perhaps cmdstate should help with that? We could make it always add a first item that's just "CBOR" or something (it seem very unlikely that we'd have that in an old state file), and we could fail when reading a state file that doesn't have that. Or would could require any new state files to have a new name than the old ones?

pulkit added inline comments.Mar 26 2018, 11:08 AM
mercurial/state.py
90

I don't have a strong reason. I just found this an easy way. I was going to add setitem() too so that we can store more new values or change existing values.

For graft case, it won't be just a list of nodes, we are going to store info about --no-commit flag in graftstate. Moreover we need graft --abort too which needs to store more things.

The one benefit is that we can pass the state object everywhere, lookup for values and update values on fly.

Not sure where to record this comment in this series. So I'll pick this commit.
I think we want an explicit version header in the state files so clients know when they may be reading a file in an old format. For example, if I start a merge in one terminal window, I sometimes move to another terminal window to resolve parts of it. The multiple windows may be running different Mercurial versions. For example, sometimes one of the shells has a virtualenv activated and that virtualenv is running an older Mercurial. We don't want the older Mercurial trampling on state needed by the new Mercurial.

Also, it doesn't seem like CBOR defines any magic bytes to start the top-level object with, so it's not obvious to me that an old state file (e.g. on containing just a nodeid) could not be parsed as a valid CBOR file. Perhaps cmdstate should help with that? We could make it always add a first item that's just "CBOR" or something (it seem very unlikely that we'd have that in an old state file), and we could fail when reading a state file that doesn't have that. Or would could require any new state files to have a new name than the old ones?

I can add another key value pair 'magicstring: CBOR' which should be present in the cbor format state files, or we can start storing statefiles in .hg/state/ folder.

pulkit updated this revision to Diff 7300.Mar 26 2018, 1:09 PM
pulkit updated this revision to Diff 7306.Mar 26 2018, 1:17 PM
pulkit marked an inline comment as done.Mar 26 2018, 1:18 PM
martinvonz added inline comments.Mar 26 2018, 1:29 PM
mercurial/state.py
90

I don't have a strong reason. I just found this an easy way.

It seems the other way is still slightly easier :) You'd replace this:

def __getitem__(self, key):
    return self.opts[key]

def __setitem__(self, key, value):
    updates = {key: value}
    self.opts.update(updates)

by this:

def get(self):
    return self.state

def set(self, state):
    self.state = state

I'm not even sure we'd need those two methods. It might be enough to make load() return the state and make save() accept (and require) a new state.

The constructor would change from:

if not opts:
    self.opts = {}
else:
    self.opts = opts

to:

self.state = state

The graft code for reading would change from:

cmdstate.load()
if cmdstate['version'] < 2:
    nodes = cmdstate['nodes']
    revs = [repo[node].rev() for node in nodes]
else:
    # state-file is written by a newer mercurial than what we are
    # using
    raise error.Abort(_("unable to read to read the state file"))

to:

state = cmdstate.load()
if state['version'] < 2:
    nodes = state['nodes']
    revs = [repo[node].rev() for node in nodes]
else:
    # state-file is written by a newer mercurial than what we are
    # using
    raise error.Abort(_("unable to read to read the state file"))

The graft code for writing would change from:

cmdstate.addopts({'version': 1, 'nodes': nodelines})
cmdstate.save()

to:

cmdstate.set({'version': 1, 'nodes': nodelines})
cmdstate.save()

The one benefit is that we can pass the state object everywhere, lookup for values and update values on fly.

We can still pass the "cmdstate" instance around if we wanted to do that. I think it's a small plus that we can pass *just* the state around, though. That means we can pass the state into a function and we can look at the call site and know that it won't be calling .save() on it.

pulkit added inline comments.Mar 27 2018, 4:11 AM
mercurial/state.py
90

I don't feel convinced on what can be the benefit of it and my opinion can be influenced by how I used it in evolve extension. But I can change it the other way around too, let me know if you want me to change.

martinvonz added inline comments.Mar 27 2018, 10:58 AM
mercurial/state.py
90

If it's up to me, I'd definitely say that you should change it. I don't see any benefits of the current proposal. I'd be happy to hear what other reviewers think.

martinvonz added inline comments.Mar 27 2018, 11:03 AM
mercurial/state.py
90

Oh, and I should clarify that I think the root of the problem is that you're mixing the class for reading with state itself (but only the top-level items of it). Let's say someone realizes they want to iterate over the items in the state, would we also add a .items() then? (I don't think that's a likely scenario, but it shows what I mean.)

yuja added a subscriber: yuja.Mar 27 2018, 11:43 AM
yuja added inline comments.
mercurial/state.py
90

Sorry if it's already discussed, but what's the benefit of cmdstate
not being a plain dict?

I think plain load/save functions are much simpler than a dict-like object
backed by file, i.e.

state = statemod.load(repo, fname)
state = {'version': 1, 'nodes': nodelines}
statemod.save(repo, fname, state)

(statemod.load/save could be a class for serialization/deserialization)

The state dict would be wrapped anyway depending on application logic.
If we can find out a common behavior between "state"s, we can extract it to a state class,
but that will still be separated from the storage format.

pulkit abandoned this revision.Jun 14 2018, 3:58 PM

Thanks for all the reviews and helping me improving this. The improved version of state file which is already pushed to core should raise error.CorruptedState() if an old version of state file is found and caller should catch it and parse the old way. I don't think having a registrar is required anymore. Abandoning this differential.