Page MenuHomePhabricator

changing-files: rework the way we store changed files in side-data

Authored by marmoute on Sep 26 2020, 8:10 AM.



We need to store new data so this is a good opportunity to rework this fully.

  1. We directly store the list of affected file in the side data:
  • This avoid having to fetch and parse the files list in the revision in addition to the sidedata. Making the data more self sufficient.
  • This work around situation where that files field contains wrong information, and open the way to other bug fixing (eg: issue6219)
  • The format (fixed initial index, sorted files) allow for fast lookup of filename within the structure.
  • This unify the storage of affected files and copies sources and destination, limiting the number filename stored redundantly.
  • This prepare for the fact we should drop the files as soon as we do any change affecting the revision schema.
  • This rely on compression to avoid a significant increase of the changelog.d. More testing on this will be done before we freeze the final format.
  1. We can store additional data:
  • The new "merged" field,
  • A future "salvaged" set recording files that might have been deleted but have were still present in the final result.

Diff Detail

rHG Mercurial
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

marmoute created this revision.Sep 26 2020, 8:10 AM
pulkit added a subscriber: pulkit.Oct 1 2020, 10:33 AM
pulkit added inline comments.

nit: I guess we can directly use files.* attributes.


nit: we can preserve this line of documentation.


This diff can ideally be in a different patch as it's not related to storage rework.


Coming from documenting mergestate constants, I think it will be nice to have documentation for all these keys as a followup.

Alphare requested changes to this revision.Oct 1 2020, 11:31 AM
Alphare added a subscriber: Alphare.

Pretty straightforward and clear code, thanks, I have a few remarks that block me from accepting, but other than that it looks good.


Is there a reason to use big-endian instead of little-endian (as x86 and x86-64 are little-endian)? I had the same question asked when I proposed the first draft of the dirstate cache.


unsigned, I assume?


Same as above, and below


Please fix the typos in this sentence, it's kind of hard to read.


The in + get seems wasteful, but I'm guessing you don't expect people to actually use the pure version?


I think we could benefit from some sort of asserts when reading sizes or in case of overflow.

This revision now requires changes to proceed.Oct 1 2020, 11:31 AM
marmoute added inline comments.Oct 1 2020, 12:01 PM

My three main reason to go for big-endian

Big endian is more standard (network encoding) and everything else in Mercurial use big endian. So sticking with the current practice is better for consistency.

x86 process are usually very good at dealing with big endian data, and I have never seen any significant slowdown for using BE data.

Using Big Endian for storage usually make sure people though about Endianness for storage before it is too late. Avoiding releasing inconsistent endianess in the wild, corrupting user data.


sure :-/


Will do. I think it just need a:

If now copyIf no copy/ and the value or this fieldthe value of this field.

do you see anything else ?


We could, but we do not have caching in the object yet. But it is coming right after. I'll follow up with a cleanup.


Indeed, we should. This probably got dropped while manipulating patches.


I don't expect this to be a major bottleneck. I can do some performance measurement once the dust settle. but I do not expect to find anything special.


Good point.


Which diff are you thinking about ? the specific filesmerged = computechangesetfilesmerged(ctx) line ? If so the value would not be used anywhere without this change.

Alphare added inline comments.Oct 1 2020, 12:06 PM

Sure, makes sense, thanks!


s/irrevant/irrelevant/ also

Alphare accepted this revision.Oct 2 2020, 3:32 AM
marmoute updated this revision to Diff 23007.Oct 2 2020, 12:39 PM
pulkit accepted this revision.Oct 6 2020, 4:18 AM
This revision is now accepted and ready to land.Oct 6 2020, 4:18 AM