( )⚙ D11239 debugcommands: introduce a debug command to repair repos affected by issue6528

This is an archive of the discontinued Mercurial Phabricator instance.

debugcommands: introduce a debug command to repair repos affected by issue6528
ClosedPublic

Authored by Alphare on Aug 2 2021, 9:36 AM.

Details

Summary

This command is quite basic and slow, it will loop over the entirety of the
filelogs in the repository and check each revision for corruption, then fixes
the affected filelogs. It takes under 25 minutes for Mozilla-Central on my
not-top-of-the-line laptop, using the --to-report and --from-report options
will make this pretty tolerable to use, I think.

This change also introduces a test for the fix.

Diff Detail

Repository
rHG Mercurial
Branch
stable
Lint
No Linters Available
Unit
No Unit Test Coverage

Event Timeline

Alphare created this revision.Aug 2 2021, 9:36 AM
marmoute requested changes to this revision.Aug 2 2021, 3:03 PM
marmoute added a subscriber: marmoute.

We are moving in the right direction, but I made multiple comment about adjustement I feel necessary before we can consider this function safe.

mercurial/debugcommands.py
1457

small nits: maybe "debug-repair-issue6528" instead of "debug-fix-issue6528" ?

1475

idea for improvement (might be useful t write them down in comment:

allow to specify filelog pattern and revision pattern to restrict the search. Useful for faster scan when user know what they are looking for.

1482

The following format would seems more suitable:

<ascii-hex of the corrupted file revision> <unencoded-filename>

Putting the hex first is simpler, since we then know the rest of the line if about the filename.

Putting the filename instead of encoded index path means we are more independant of the underlying format variante (with or without dot-encode for example).

1484

(not sure this is really unlikely, I would drop that remark)

1493

Why do we need the bool here ?

1497

Same feedback about fixrepair

mercurial/revlogutils/rewrite.py
479

note: this seems "fine" to me for stable. However this seems a good indication that we need some store/repo level API for "give me all the filelog"

482

It is probably worth adding a programming error if the repository is not locked.

492

This seems the only line that check and enforce the version.

We need:

  1. a programming error to catch the issue on Windows/Optimized setting
  2. a higher level check of the repository format in the top level command
508–512

If there is a crash/interrupt during the operation, we will leave temporary file behind. We should add a try:, finally: clause here, and possibly a no-interrupt context.

522–526

nits: this is the sensitive part, it is probably worth factoring it out in a small dedicated function since most of it is common, but for the offset computation.

What do you think ?

I addition it seems like a good idea to mention the "speedup" idea from Joerg (about only checking the delta when possible)

543

Since this is a sensitive issue, we should probably add a docstring about the detection method here.

562

Not sure what "unfortunate cyclic sitatuation" means here ‽

584

I don't think we capitalize error output in general.

586

tiny nits: this should probably be a msg %= on the previous line

596

You should probably sort the resulting to_fix set to make sure we process things in a predictable order (and start to end seems like a good idea in general)

607

adding uninterruptible for the full command seems quite overkill, especially since the command can apparently run for half an hours.

See my earlier comment about only using ui.uninterruptible() while doing the actual fixing.

653

small nits: if to_fix: seem more pythonic to me

654–657

We should sort to_fix to ensure growing order.

660

I don't think we capitalize output.

tests/test-issue6528.t
170–171

We should make this is ===== level title and use ----- level title (and associated description) for the individual tests within this section.

181–183

We should add debug index call here to show they are swapped

196–197

small nits: the output says we "found" the corruption, but does not state that we fixed it.

We should also check the index content after the command run.

219–220

Testing a multi-lines report seems important too.

This revision now requires changes to proceed.Aug 2 2021, 3:03 PM
Alphare marked 23 inline comments as done.Aug 4 2021, 9:16 AM

Pending CI refresh

mercurial/debugcommands.py
1482

I've updated the patch to a new format that uses unencoded filenames but still groups the revisions per filelog (when exporting, the format still allows for duplicate filelogs, it's just less efficient since it's just looping over lines).

This is done following an IRC discussion where @marmoute and I disagreed on the exact format, and this is the version I feel is simpler. If more people disagree with my choice, I'll change it, but it's probably not worth bikeshedding too hard on this.

tests/test-issue6528.t
219–220

I've created a multiple revisions + multiple files repo

Alphare marked an inline comment as done.Aug 4 2021, 10:22 AM
Alphare retitled this revision from debugcommands: introduce a debug command to fix repos affected by issue6528 to debugcommands: introduce a debug command to repair repos affected by issue6528.
Alphare updated this revision to Diff 29790.
Alphare updated this revision to Diff 29791.Aug 4 2021, 11:07 AM
marmoute requested changes to this revision.Aug 4 2021, 3:55 PM

Added a handful of new comments.

mercurial/debugcommands.py
1482

I am fine with using multiple nodes per lines but I could probably use , for separation between node and still for separation with the filename.

I would also gather all the lines before processing since that isn't very costly. However this is a quite minor changes

mercurial/revlogutils/rewrite.py
482

note: this is not the right location for this comment :-/ the lock checking should be done when we actually rewrite revlog.

We do not usually needs it, but the rewrite code is "exceptionnal" enough to add some double checking.

496–498

This should go in _reorder_filelog_parents (I am starting to suspect phabricator simply shifted all comments a couple of line too low)

534

We need a nointerrupt context here, don't we ?

579–580

It would seem cleaner to have this function only determine the status of the revision and have the warning message only issued by the called of this method.

The "corrupted" is a bit scary too. maybe "affected by issue6528"

This will be useful if we want to detect this in more situation (e.g. during exchange)

619–620

Here it is. It still feel like too wide and in the wrong place:

  • we want it as narrow as possible since it is "a bit hacky"
  • we want it in the function that actually setup the dangerous bit so that it (1) is always in place (2) stay next to the dangerous implementation.
640–644

Looks like you want to:

  1. gather the list of filelog
  2. use its len(xxx) for total
  3. use that list for the iteration
This revision now requires changes to proceed.Aug 4 2021, 3:55 PM
Alphare marked 7 inline comments as done.Aug 5 2021, 4:09 AM
Alphare added inline comments.
mercurial/revlogutils/rewrite.py
640–644

I guess the memory footprint wouldn't get too bad even on larger repos.

baymax updated this revision to Diff 29800.Aug 5 2021, 5:11 AM

✅ refresh by Heptapod after a successful CI run (🐙 💚)
⚠ This patch is intended for stable ⚠

marmoute added inline comments.Aug 5 2021, 5:22 AM
mercurial/revlogutils/rewrite.py
640–644

Yeah, I dont expect thousand of these, and if so, the associated repository probably have large machine using it. (and worst case the report file can be feed in multiple step).

marmoute accepted this revision.Aug 6 2021, 6:03 AM
marmoute added inline comments.
mercurial/revlogutils/rewrite.py
482

My two past self eventually talked to each other and we probably want is in both. I'll prepare you a patch you can fold into this one.

617

This is probably a bit too verbose for the common case. I'll prepare something to control that.

So we have this part covered (slowly). We can now (in other Diffs) move to the next points.

  • implement a faster method
  • fix this issue on the fly during exchange
  • ensure things are fixed on upgrade (cf D11252)
mercurial/revlogutils/rewrite.py
482

Actually, I amd now triple confused about what I meant, so lets leave it that way,

617

Actually, this is only issued when processing report, so this seems fine.

Alphare marked an inline comment as done.Aug 6 2021, 6:41 AM
Alphare updated this revision to Diff 29829.
Alphare updated this revision to Diff 29835.Aug 6 2021, 1:50 PM
marmoute accepted this revision.Aug 6 2021, 4:47 PM
baymax updated this revision to Diff 29843.Aug 7 2021, 9:47 AM

✅ refresh by Heptapod after a successful CI run (🐙 💚)
⚠ This patch is intended for stable ⚠

baymax updated this revision to Diff 29855.Aug 7 2021, 6:09 PM

✅ refresh by Heptapod after a successful CI run (🐙 💚)
⚠ This patch is intended for stable ⚠

pulkit accepted this revision.Aug 10 2021, 8:16 AM
This revision is now accepted and ready to land.Aug 10 2021, 8:16 AM