Page MenuHomePhabricator

state: add a magicheader to each state file
Needs ReviewPublic

Authored by pulkit on Mar 26 2018, 1:09 PM.

Details

Reviewers
None
Group Reviewers
hg-reviewers
Summary

This patch adds a magic header to each state file written using CBOR format so
that we can distinguish between old state files and new one even if old state
files get successfully parsed by cbor.

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

pulkit created this revision.Mar 26 2018, 1:09 PM
yuja added a subscriber: yuja.Mar 27 2018, 8:58 AM

I think the magic has to vary depending on the current state file format.
The state file must be backward/forward compatible with the older/newer formats.

If the current magic is 2\n (and is parsed as int(f.readline())) for example, the
CBOR preamble would have to be 3\n. Otherwise, old hg client cant determine
whether the state file is corrupted or written by newer client.

FWIW, last time I reviewed simplekeyvaluefile, I insisted that a parser should
take a file stream instead of a filename, so the caller can handle compatibility thingy.

with open("statefile", "rb") as fp:
    try:
        version = int(fp.readline())
    except ValueError:
        raise some nicer exception
    if version == 2:
        return readold(fp)
    elif version == 3:
        return readcbor(fp)
In D2945#47831, @yuja wrote:

I think the magic has to vary depending on the current state file format.
The state file must be backward/forward compatible with the older/newer formats.
If the current magic is 2\n (and is parsed as int(f.readline())) for example, the
CBOR preamble would have to be 3\n. Otherwise, old hg client cant determine
whether the state file is corrupted or written by newer client.

There are some old state files, which don't have a version header on top of them like graftstate. I need suggestion on how to handle them.

yuja added a comment.Mar 28 2018, 9:39 AM

There are some old state files, which don't have a version header on top
of them like graftstate. I need suggestion on how to handle them.

In that case, we would have to either do some heuristic (like histedit), or maintain
old/new state files (like mergestate.) Perhaps graft can do the latter?

pulkit added a subscriber: jeffpc.Jun 14 2018, 4:10 PM

There is a suggestion from @jeffpc to have a plain text header in statefiles like HGStateFile so that user can grep for them or look into files and know that they are state files. How does that sound?
(@jeffpc I am not sure I explained this correctly, it will be great if you can chime in.)

jeffpc added a comment.EditedJun 14 2018, 8:59 PM
In D2945#58586, @pulkit wrote:

There is a suggestion from @jeffpc to have a plain text header in statefiles like HGStateFile so that user can grep for them or look into files and know that they are state files. How does that sound?
(@jeffpc I am not sure I explained this correctly, it will be great if you can chime in.)

I don't really care about the actual string value, I just think it is a good storage design practice to have every on-disk structure contain a magic value of some sort. (Beware, I have a storage background and therefore opinions about these things... ;) ) This makes it easier to do all sort of data recovery should the need arise. E.g.,

  • if you know that a random file (e.g., from lost+found) is a hg state file, you can ignore/delete it and just reclone your repos
  • hg can detect some corruption better, and depending on the state file regenerate it or error out
  • hg developers can more easily sort out what's what when looking at buffers/files/etc.
  • as an added bonus, the magic can be used for format versioning

FWIW, the three major things that I consider essential to good on-disk structures:

  1. magic numbers - see above
  2. checksums - possibly two checksums: one for the header and one for the payload
  3. back-pointers - see below

Back-pointers allow you to tie the file/buffer into the bigger context - be it which state file it is, or which repository it belongs to. In this case a state file's back-pointer could be something like the repo-id + state-file-name stored in the state file's header. (I realize that there is no repo-id, but in general some approximation of it tends to be better than nothing.) This should make it painfully obvious what data one is dealing with and how to parse/interpret it just by looking at the byte-stream without any other context.

By the way, I think that the magic should *not* be part of the CBOR payload but rather be at a fixed location in the file. The easier it is to find, the easier it is to detect issues (and to avoid trying to decode bad/malicious input).