This is an archive of the discontinued Mercurial Phabricator instance.

perftweaks: store rev to node mapping for preloading
Needs RevisionPublic

Authored by mbthomas on Aug 22 2017, 8:46 AM.
Tags
None
Subscribers

Details

Reviewers
durham
Group Reviewers
Restricted Project
Summary

Change perftweaks to store the node-to-rev mapping for interesting
revisions, rather than just the rev numbers.

Previously this was used to prepopulate the changelog's node-to-rev
lookup trie, however since we only stored the revision numbers this
still required looking at the revlog index contents in order to
read out the node hash.

With these changes, the trie can be prepopulated without touching
the index.

On its own, this gives only a modest performance increase, however
when combined with using mmap for the revlog index (coming in
another diff), it gives a noticable improvment on repos with very
large revlogs, as we no longer need to touch large portions of the
revlog index on every invocation.

When loading the cache data, we check that the rev numbers are
in range for the revlog, and drop any entries that are out of
range. The assumption is that a node hash won't be reassigned to
another rev index. The other way round ought to be ok, as the
old hash ought to be abandoned if it's no longer part of the
revlog. If there's any more checking needed, please suggest in
the review comments.

Test Plan

Re-run unit tests.
Generate cache data on a large repo and verify indexes can be
preloaded without touching the index contents.
Run with mmap change to verify perf improvement.

Diff Detail

Repository
rFBHGX Facebook Mercurial Extensions
Branch
default
Lint
Lint OK
Unit
Unit Tests OK

Event Timeline

mbthomas created this revision.Aug 22 2017, 8:46 AM
durham requested changes to this revision.Aug 24 2017, 12:58 PM

I think we want to be pretty conservative about cache invalidation initially. If we get the map wrong, we could end up reading entirely incorrect data out of the changelog and associating it with the incorrect node. A few ideas:

  • We could initially treat the cache as risky, and instead of using the data we could just confirm that the data is always correct, by comparing the getnode(rev) result with our stored node->rev. If our cache doesn't match the truth, print a ui.develwarn so we see it. Once we have confidence, we can actually use the map result instead of calling getnode(rev).
  • We should invalidate the entire cache when repo.destroyed() is called. That is the function called at the end of a strip, to let extensions perform invalidation like this.
  • Maybe we could only use the cached mappings for commits older than tip - 10000. Any node that claims to be newer than tip-10000 we could do getnode(rev) and put the actual node value in the nodemap instead. Since the The top of the changelog is where the most churn happens and where the most risk will be, this should help prevent most bad cache issues and shouldn't add much IO slow down (10k revs is only 620KB of .i file)
hgext3rd/perftweaks.py
223

Is this better than revstr, node = line.split(':')?

Also, mercurial generally avoids using _ for unused variables, because _ is already the function name for the string internationalization function. I usually use x or some such

This revision now requires changes to proceed.Aug 24 2017, 12:58 PM
quark added a subscriber: quark.Nov 21 2017, 10:21 PM

This is no longer needed after D1472.