Page MenuHomePhabricator

transaction: change list of journal entries into a dictionary
ClosedPublic

Authored by joerg.sonnenberger on Nov 7 2020, 4:32 PM.

Details

Summary

The transaction object used to keep a mapping table of path names to
journal entries and a list of journal entries consisting of path and
file offset to truncate on rollback. The offsets are used in three
cases. repair.strip and rollback process all of them in one go, but they
care about the order. For them, it is perfectly reasonable to read the
journal back from disk as both operations already involve at least one
system call per journal entry. The other consumer is the revlog logic
for moving from inline to external data storage. It doesn't care about
the order of the journal and just needs to original offset stored.
Further optimisations are possible here to move the in-memory journal to
a set(), but without memoisation of the original revlog size this could
turn it into O(n^2) behavior in worst case when many revlogs need to
migrated.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

I want a 2nd set of eyes on this.

I _think_ this may introduce a subtle bug regarding split revlogs. Before, we always recorded the offset of .i files before .d. After, because we sort, we'll truncate .d then .i. This leaves a window where a .i index references entries in a .d that don't exist. If a new reader comes along during that window, they'll read an index with no corresponding data. Unless we handle this case in the revlog reader code, I'd feel much better if we preserved insertion order and truncated .i before .d files so this race condition didn't exist.

I understand the concern and agree. I see two options of dealing with this without breaking the intention of the change here.
(1) Reverse the order. This would fix the immediate problem, but is more risky for similar classes of issues in other places, e.g. tree manifest chains.
(2) Adopt the logic from D9237 to read back the journal to recover the original order. Since both _playback and repair.strip will only ever be called once per transaction, it doesn't alter the performance much.

I'm inclined to go with (2).

Let's refactor the playback code to read from the journal file [handle]. That will preserve the order. There should be no meaningful performance implications here.

Once we do that, that frees you up to refactor the in-memory data structures, possibly even deleting them completely. I strongly prefer we do as much as possible in separate commits so we can bisect any performance regressions that may occur. I'm most worried about removal of the in-memory data structure regressing performance of large clones.

joerg.sonnenberger edited the summary of this revision. (Show Details)Nov 8 2020, 9:29 AM
joerg.sonnenberger updated this revision to Diff 23432.

The last update implemented the option to read to journal from disk, in case that wasn't clear from the description change.

This revision was not accepted when it landed; it landed in state Needs Review.
This revision was automatically updated to reflect the committed changes.
mharbison72 added inline comments.
mercurial/transaction.py
403

This line crashes on Windows, using 60523483897c. I'm not sure why, since it implements __getattr__ and delegates to the internal file handle. Here's an example using hg split, though I have seen it in the test failures too.

hg split -r tip
(leaving bookmark @)
switching to topic tag-cache-display
0 files updated, 0 files merged, 0 files removed, 0 files unresolved
reverting hg
reverting hgext\phabricator.py
reverting mercurial\dispatch.py
reverting setup.py
starting interactive selection
** Unknown exception encountered with possibly-broken third-party extension "mercurial_keyring" (version N/A)
** which supports versions unknown of Mercurial.
** Please disable "mercurial_keyring" and try your action again.
** If that fixes the bug please report it to https://bitbucket.org/Mekk/mercurial_keyring/issues
** Python 2.7.17 (v2.7.17:c2f86d86e6, Oct 19 2019, 21:01:17) [MSC v.1500 64 bit (AMD64)]
** Mercurial Distributed SCM (version 5.6+222-60523483897c+20201201)
** Extensions loaded: absorb, blackbox, evolve 10.2.0.dev, extdiff, fix, mercurial_keyring, mq, phabblocker, phabricator, rebase, show, strip, topic 0.21.0.dev
Traceback (most recent call last):
  File "C:\Users\Matt\hg\hg", line 43, in <module>

  File "C:\Users\Matt\hg\mercurial\dispatch.py", line 115, in run
    def run():
  File "C:\Users\Matt\hg\mercurial\dispatch.py", line 266, in dispatch
    except error.Abort as inst:
  File "C:\Users\Matt\hg\mercurial\dispatch.py", line 442, in _runcatch
    ui.flush()
  File "C:\Users\Matt\hg\mercurial\dispatch.py", line 451, in _callcatch

  File "C:\Users\Matt\hg\mercurial\scmutil.py", line 155, in callcatch
    return func()
  File "C:\Users\Matt\hg\mercurial\dispatch.py", line 432, in _runcatchfunc
    b"%s debugger specified "
  File "C:\Users\Matt\hg\mercurial\dispatch.py", line 1229, in _dispatch
    ui.warn(_(b"warning: --repository ignored\n"))
  File "C:\Users\Matt\hg\mercurial\dispatch.py", line 883, in runcommand
    repo,
  File "C:\Users\Matt\hg\mercurial\dispatch.py", line 1240, in _runcommand
    if repo and repo != req.repo:
  File "C:\Users\Matt\hg\mercurial\dispatch.py", line 1226, in <lambda>
    repo = repo.unfiltered()
  File "C:\Users\Matt\hg\mercurial\util.py", line 1867, in check
    return func(*args, **kwargs)
  File "C:\Users\Matt\hg\mercurial\util.py", line 1867, in check
    return func(*args, **kwargs)
  File "C:\Users\Matt\hg\hgext\mq.py", line 4237, in mqcommand
    return orig(ui, repo, *args, **kwargs)
  File "C:\Users\Matt\hg\mercurial\util.py", line 1867, in check
    return func(*args, **kwargs)
  File "C:\Users\Matt\AppData\Roaming\Python\Python27\site-packages\hgext3rd\evolve\cmdrewrite.py", line 1300, in cmdsplit
    lockmod.release(tr, lock, wlock)
  File "C:\Users\Matt\hg\mercurial\lock.py", line 397, in release
    lock.release()
  File "C:\Users\Matt\hg\mercurial\transaction.py", line 462, in release
    self._abort()
  File "C:\Users\Matt\hg\mercurial\transaction.py", line 661, in _abort
    entries = self.readjournal()
  File "C:\Users\Matt\hg\mercurial\transaction.py", line 44, in _active
    return func(self, *args, **kwds)
  File "C:\Users\Matt\hg\mercurial\transaction.py", line 421, in readjournal
    for l in self._file:
TypeError: 'mixedfilemodewrapper' object is not iterable
mercurial/transaction.py
403

Right, so the Windows wrapper doesn't implement line iteration. Can you try using for l in self._file.readlines(): instead? It's a bit wasteful, but as this isn't the fast path anyway, I don't think we care too much.