This is an archive of the discontinued Mercurial Phabricator instance.

transaction: only keep file names in-memory for journal [WIP]
AbandonedPublic

Authored by joerg.sonnenberger on Oct 21 2020, 5:45 PM.

Details

Reviewers
indygreg
Group Reviewers
hg-reviewers
Summary

The offsets are normally only used during rollback and can be read back
from disk in that case. The exception is currently the migration from
inline to non-inline revlog. The current iteration scans the on-disk
journal and computes the last revision based on that.

Diff Detail

Repository
rHG Mercurial
Branch
default
Lint
No Linters Available
Unit
No Unit Test Coverage

Event Timeline

I admit that I don't know this code well at all. Maybe the offsets are there to protect against concurrent operations? I suppose that can only happen with bad locking (such as NFS), but maybe that's still what they're for?

My understanding of the transaction processing is that we keep track of all files modified. For each file, we remember the old size. When we have to rollback the transaction, we iterate over that list and truncate it at that point. Now this part works as before, just that we don't keep the file name -> offset table in memory but read back the on-disk representation. I don't think this makes a performance difference either compared to everything else going on. There is one more user of the offsets and that's the logic in revlog for moving from inline data storage to separated data files. With the current patch, it would be O(n^2) worst case as it reparses the whole file.

joerg.sonnenberger abandoned this revision.Nov 7 2020, 1:09 PM

I'll cut this into smaller parts that are easier to review and decide on which steps are too far.

I was just attempting to read the transactions code as part of D9274.

I was skeptical of the need to maintain the in-memory copy of entries to the transaction journal files. It feels like something we shouldn't need to do and my kneejerk reaction is we should rip out this complexity and read from the file handle (as this patch does) instead. Although I would like to go diving into the history to find out why the in-memory copy exists as it may have been introduced for a good reason (performance?). Note that we do unlink the journal files a bit early in transaction commit/rollback code - earlier than I think is reasonable. And our low-level testing around these transaction primitives may be lacking. There be many dragons in this code...

Please proceed with rewriting this as smaller patches. And don't be scared to rename existing methods along the way to make things more clear. e.g. add() should probably be something like record_file_append(). Yes, we can use _ in function names now. And if we are changing the API of a function, I would prefer it be renamed at the same time to improve readability. You are already making an API break, so you might as well improve the name...