The transaction object used to keep a mapping table of path names to
journal entries and a list of journal entries consisting of path and
file offset to truncate on rollback. The offsets are used in three
cases. repair.strip and rollback process all of them in one go, but they
care about the order. For them, it is perfectly reasonable to read the
journal back from disk as both operations already involve at least one
system call per journal entry. The other consumer is the revlog logic
for moving from inline to external data storage. It doesn't care about
the order of the journal and just needs to original offset stored.
Further optimisations are possible here to move the in-memory journal to
a set(), but without memoisation of the original revlog size this could
turn it into O(n^2) behavior in worst case when many revlogs need to
migrated.
Details
Diff Detail
- Repository
- rHG Mercurial
- Lint
Automatic diff as part of commit; lint not applicable. - Unit
Automatic diff as part of commit; unit tests not applicable.
Event Timeline
I want a 2nd set of eyes on this.
I _think_ this may introduce a subtle bug regarding split revlogs. Before, we always recorded the offset of .i files before .d. After, because we sort, we'll truncate .d then .i. This leaves a window where a .i index references entries in a .d that don't exist. If a new reader comes along during that window, they'll read an index with no corresponding data. Unless we handle this case in the revlog reader code, I'd feel much better if we preserved insertion order and truncated .i before .d files so this race condition didn't exist.
I understand the concern and agree. I see two options of dealing with this without breaking the intention of the change here.
(1) Reverse the order. This would fix the immediate problem, but is more risky for similar classes of issues in other places, e.g. tree manifest chains.
(2) Adopt the logic from D9237 to read back the journal to recover the original order. Since both _playback and repair.strip will only ever be called once per transaction, it doesn't alter the performance much.
I'm inclined to go with (2).
Let's refactor the playback code to read from the journal file [handle]. That will preserve the order. There should be no meaningful performance implications here.
Once we do that, that frees you up to refactor the in-memory data structures, possibly even deleting them completely. I strongly prefer we do as much as possible in separate commits so we can bisect any performance regressions that may occur. I'm most worried about removal of the in-memory data structure regressing performance of large clones.
The last update implemented the option to read to journal from disk, in case that wasn't clear from the description change.
mercurial/transaction.py | ||
---|---|---|
403 | This line crashes on Windows, using 60523483897c. I'm not sure why, since it implements __getattr__ and delegates to the internal file handle. Here's an example using hg split, though I have seen it in the test failures too. hg split -r tip (leaving bookmark @) switching to topic tag-cache-display 0 files updated, 0 files merged, 0 files removed, 0 files unresolved reverting hg reverting hgext\phabricator.py reverting mercurial\dispatch.py reverting setup.py starting interactive selection ** Unknown exception encountered with possibly-broken third-party extension "mercurial_keyring" (version N/A) ** which supports versions unknown of Mercurial. ** Please disable "mercurial_keyring" and try your action again. ** If that fixes the bug please report it to https://bitbucket.org/Mekk/mercurial_keyring/issues ** Python 2.7.17 (v2.7.17:c2f86d86e6, Oct 19 2019, 21:01:17) [MSC v.1500 64 bit (AMD64)] ** Mercurial Distributed SCM (version 5.6+222-60523483897c+20201201) ** Extensions loaded: absorb, blackbox, evolve 10.2.0.dev, extdiff, fix, mercurial_keyring, mq, phabblocker, phabricator, rebase, show, strip, topic 0.21.0.dev Traceback (most recent call last): File "C:\Users\Matt\hg\hg", line 43, in <module> File "C:\Users\Matt\hg\mercurial\dispatch.py", line 115, in run def run(): File "C:\Users\Matt\hg\mercurial\dispatch.py", line 266, in dispatch except error.Abort as inst: File "C:\Users\Matt\hg\mercurial\dispatch.py", line 442, in _runcatch ui.flush() File "C:\Users\Matt\hg\mercurial\dispatch.py", line 451, in _callcatch File "C:\Users\Matt\hg\mercurial\scmutil.py", line 155, in callcatch return func() File "C:\Users\Matt\hg\mercurial\dispatch.py", line 432, in _runcatchfunc b"%s debugger specified " File "C:\Users\Matt\hg\mercurial\dispatch.py", line 1229, in _dispatch ui.warn(_(b"warning: --repository ignored\n")) File "C:\Users\Matt\hg\mercurial\dispatch.py", line 883, in runcommand repo, File "C:\Users\Matt\hg\mercurial\dispatch.py", line 1240, in _runcommand if repo and repo != req.repo: File "C:\Users\Matt\hg\mercurial\dispatch.py", line 1226, in <lambda> repo = repo.unfiltered() File "C:\Users\Matt\hg\mercurial\util.py", line 1867, in check return func(*args, **kwargs) File "C:\Users\Matt\hg\mercurial\util.py", line 1867, in check return func(*args, **kwargs) File "C:\Users\Matt\hg\hgext\mq.py", line 4237, in mqcommand return orig(ui, repo, *args, **kwargs) File "C:\Users\Matt\hg\mercurial\util.py", line 1867, in check return func(*args, **kwargs) File "C:\Users\Matt\AppData\Roaming\Python\Python27\site-packages\hgext3rd\evolve\cmdrewrite.py", line 1300, in cmdsplit lockmod.release(tr, lock, wlock) File "C:\Users\Matt\hg\mercurial\lock.py", line 397, in release lock.release() File "C:\Users\Matt\hg\mercurial\transaction.py", line 462, in release self._abort() File "C:\Users\Matt\hg\mercurial\transaction.py", line 661, in _abort entries = self.readjournal() File "C:\Users\Matt\hg\mercurial\transaction.py", line 44, in _active return func(self, *args, **kwds) File "C:\Users\Matt\hg\mercurial\transaction.py", line 421, in readjournal for l in self._file: TypeError: 'mixedfilemodewrapper' object is not iterable |
mercurial/transaction.py | ||
---|---|---|
403 | Right, so the Windows wrapper doesn't implement line iteration. Can you try using for l in self._file.readlines(): instead? It's a bit wasteful, but as this isn't the fast path anyway, I don't think we care too much. |
This line crashes on Windows, using 60523483897c. I'm not sure why, since it implements __getattr__ and delegates to the internal file handle. Here's an example using hg split, though I have seen it in the test failures too.