( )⚙ D10712 tests: show that hg fails to rollback transaction on revlog split

This is an archive of the discontinued Mercurial Phabricator instance.

tests: show that hg fails to rollback transaction on revlog split
AbandonedPublic

Authored by valentin.gatienbaron on May 16 2021, 3:31 PM.

Details

Reviewers
joerg.sonnenberger
Group Reviewers
hg-reviewers

Diff Detail

Repository
rHG Mercurial
Branch
default
Lint
No Linters Available
Unit
No Unit Test Coverage

Event Timeline

joerg.sonnenberger requested changes to this revision.May 16 2021, 9:03 PM

This whole stack exposes a real problem and one quite a bit older, but it is not correct in a number of ways as is.

First of all, the check added in 8502f76dbfd7 is broken for the revlog inline to non-inline migration.

Second, the old revision computation for the inline case was indeed incorrect. That needs to be fixed and tested appropriately. I'm not convinced that the test is the right (or best way) to do it. I have to think about it more.

Third, the recover case needs to be fixed properly and the fix as it stands is wrong. We don't actually want to ever migrate back to inline revlogs so the goal is to drop the older entries and also fix the journal entry. There is a small race condition in that case indeed, but the stack handles it completely in the wrong direction.

This revision now requires changes to proceed.May 16 2021, 9:03 PM

This whole stack exposes a real problem and one quite a bit older, but it is not correct in a number of ways as is.
First of all, the check added in 8502f76dbfd7 is broken for the revlog inline to non-inline migration.

I didn't realize that was a recent-ish check. That check strikes me as fairly reasonable. Before the code could have "truncate"d by extending the file, then truncated to the final length, which may have resulted in the right final state, but is the intermediate state full of nul bytes actually handled properly by concurrent readers?

Third, the recover case needs to be fixed properly and the fix as it stands is wrong. We don't actually want to ever migrate back to inline revlogs so the goal is to drop the older entries and also fix the journal entry. There is a small race condition in that case indeed, but the stack handles it completely in the wrong direction.

Not sure what your concern is. You have to be go back to inline revlog when the transaction was interrupted before both new .d and .i file were written, short of pushing some logic for splitting revlog into the transaction rollback code. Once they are both written my change never rolls that back, it just drops the older journal entries.

There are three stages for the revlog and I think we need to teach the transaction playback code at least a bit about this to handle it correctly:

(1) We write the old index offset to the journal for the index file and record zero for the data file. After this the data is copied from the index to the data file.
(2) We write the temporary index file. We should now also write the end-of-file offset for the pre-transaction state to disk now, so that recover doesn't lose data.
(3) We rename the index file and record the new index offset for the pre-transaction state.

As I said, the current state is broken in a number of ways. Before the introduction of transaction.readjournal, the on-disk and in-memory journal ended up being inconsistent. That is part of why the problem wasn't seen before. But that change made the error consistently apply for both transaction rollback and "recover".

The check in 8502f76dbfd7 fails, if the abort happens after all three stages are completed. Your change in this stack addresses that, but I think it misses the important case of failing when the rename itself finished, but the recording of the transaction journal afterwards is not successful. We will need some fault injection logic to properly test this. We also need to fix the case of abort before the rename happened, e.g. remove the data file. I think we can also try to detect the current bad journal entry and attempt a fix up (by recomputing the expected data entry), but I have to think about that a bit more.

@joerg.sonnenberger does your recent series means this one isn't needed anymore?

Correct, this stack should have beeen superseded by those.