( )⚙ D10025 revlogv2: don't assume that the sidedata of the last rev is right after data

This is an archive of the discontinued Mercurial Phabricator instance.

revlogv2: don't assume that the sidedata of the last rev is right after data
ClosedPublic

Authored by Alphare on Feb 19 2021, 6:16 AM.

Details

Summary

We are going to be rewriting sidedata soon, it's going to be appended to the
revlog data file, meaning that the data and the sidedata might not be
contiguous.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Alphare created this revision.Feb 19 2021, 6:16 AM
baymax updated this revision to Diff 25992.Mar 1 2021, 11:51 AM

✅ refresh by Heptapod after a successful CI run (🐙 💚)

marmoute requested changes to this revision.Mar 2 2021, 1:48 PM
This revision now requires changes to proceed.Mar 2 2021, 1:48 PM
Alphare updated this revision to Diff 26088.Mar 4 2021, 10:18 AM
marmoute requested changes to this revision.Mar 9 2021, 11:16 AM
marmoute added inline comments.
mercurial/revlog.py
2338–2341

Why don't we simply store the absolute offset for the sidedata instead of needing arithmetic ?

This revision now requires changes to proceed.Mar 9 2021, 11:16 AM
Alphare added inline comments.Mar 10 2021, 8:57 AM
mercurial/revlog.py
2338–2341

We do store the absolute offset, the point here is to find the index after the last entry's (potential) sidedata. Perhaps this variable should be named prev_sidedata_offset to make that clearer?

marmoute added inline comments.Mar 11 2021, 1:07 PM
mercurial/revlog.py
2338–2341

So we are looking to the offset of the first byte that does not contains data?

If so why are we looking for side data in particular and not just actual data too? If not, what is going on here ?

Alphare added inline comments.Mar 12 2021, 3:56 AM
mercurial/revlog.py
2338–2341

Revlog data files are still append-only (even with the re-writing mechanism of the next patches in the stack), and the sidedata is the last thing that is written for any given entry. So if you want to find the offset of all data (in the general sense of the term), then you look for the end of the sidedata of the last entry. Is there something I missed about there being a simpler solution, or should I add a comment explaining this?

marmoute added inline comments.Mar 12 2021, 4:04 AM
mercurial/revlog.py
2338–2341

Is it ? Because if the last entry did not had any sidedata. We need to take look at its actual data instead, isn't it? and actually if the previous entry had sidedata that we needed to compute after the fact, it sidedata will be written after the last entry data… etc.

So the current approach seems quite fragile to me. It need a least a clear comment explaining what is it trying to do. And probably jus to be moved within a function with a clear semantic and docstring. The "last offset" information could be stored in the docket to make all this simpler soonish.

Alphare added inline comments.Mar 12 2021, 4:11 AM
mercurial/revlog.py
2338–2341

if the last entry did not have any sidedata, we need to take look at its actual data instead

Right, hence the check for the sidedata offset being 0, which falls back to the regular end.

if the previous entry had sidedata that we needed to compute after the fact, it sidedata will be written after the last entry data

If the sidedata needs to be computer after the fact, it means - for now - that it doesn't currently have sidedata, so we'll write (correctly) after the end of the data, do so for however many revisions in the group, then compute their potential respective sidedata and append that to the data file one right after the other (so, separate from their data, but always after).

Is that clearer? If so I'll explain better in code.

marmoute added inline comments.Mar 12 2021, 4:28 AM
mercurial/revlog.py
2338–2341

What is the case I described (last revision with no sidedata, previous one with sidedata post-computed) happens for data committed before we start this write.

Alphare added inline comments.Mar 12 2021, 5:34 AM
mercurial/revlog.py
2338–2341

Right, sorry, I misread your previous message. I'm sending a change with a linear scan of the index as a first step, pending the use of a docket file, so at least it's correct in all cases.

Alphare updated this revision to Diff 26253.Mar 12 2021, 6:33 AM
marmoute requested changes to this revision.Mar 12 2021, 1:09 PM
marmoute added inline comments.
mercurial/revlog.py
2429

before getting out of experimental actually. We should add the TODO next the associated config in config item too.

2441

We should take the max of both no matter what would we not ? offset = max(offset, self.end(rev), self.end_sidedata(rev)) or something similar.

This revision now requires changes to proceed.Mar 12 2021, 1:09 PM
Alphare marked 2 inline comments as done.Mar 15 2021, 6:24 AM
Alphare updated this revision to Diff 26331.
marmoute accepted this revision.Mar 15 2021, 6:59 AM
This revision was not accepted when it landed; it landed in state Needs Review.
This revision was automatically updated to reflect the committed changes.