This patch adds a magic header to each state file written using CBOR format so
that we can distinguish between old state files and new one even if old state
files get successfully parsed by cbor.
I think the magic has to vary depending on the current state file format.
The state file must be backward/forward compatible with the older/newer formats.
If the current magic is 2\n (and is parsed as int(f.readline())) for example, the
CBOR preamble would have to be 3\n. Otherwise, old hg client cant determine
whether the state file is corrupted or written by newer client.
FWIW, last time I reviewed simplekeyvaluefile, I insisted that a parser should
take a file stream instead of a filename, so the caller can handle compatibility thingy.
with open("statefile", "rb") as fp: try: version = int(fp.readline()) except ValueError: raise some nicer exception if version == 2: return readold(fp) elif version == 3: return readcbor(fp)
There are some old state files, which don't have a version header on top
of them like graftstate. I need suggestion on how to handle them.
In that case, we would have to either do some heuristic (like histedit), or maintain
old/new state files (like mergestate.) Perhaps graft can do the latter?
I don't really care about the actual string value, I just think it is a good storage design practice to have every on-disk structure contain a magic value of some sort. (Beware, I have a storage background and therefore opinions about these things... ;) ) This makes it easier to do all sort of data recovery should the need arise. E.g.,
- if you know that a random file (e.g., from lost+found) is a hg state file, you can ignore/delete it and just reclone your repos
- hg can detect some corruption better, and depending on the state file regenerate it or error out
- hg developers can more easily sort out what's what when looking at buffers/files/etc.
- as an added bonus, the magic can be used for format versioning
FWIW, the three major things that I consider essential to good on-disk structures:
- magic numbers - see above
- checksums - possibly two checksums: one for the header and one for the payload
- back-pointers - see below
Back-pointers allow you to tie the file/buffer into the bigger context - be it which state file it is, or which repository it belongs to. In this case a state file's back-pointer could be something like the repo-id + state-file-name stored in the state file's header. (I realize that there is no repo-id, but in general some approximation of it tends to be better than nothing.) This should make it painfully obvious what data one is dealing with and how to parse/interpret it just by looking at the byte-stream without any other context.
By the way, I think that the magic should *not* be part of the CBOR payload but rather be at a fixed location in the file. The easier it is to find, the easier it is to detect issues (and to avoid trying to decode bad/malicious input).
There seems to have been no activities on this Diff for the past 3 Months.
By policy, we are automatically moving it out of the need-review state.
Please, move it back to need-review without hesitation if this diff should still be discussed.