This is an archive of the discontinued Mercurial Phabricator instance.

revlog: add option to mmap revlog index
ClosedPublic

Authored by mbthomas on Aug 22 2017, 10:37 AM.

Details

Summary

Following on from Jun Wu's patch last October[1], we have found that using mmap
for the revlog index in repos with large revlogs gives a noticable performance
improvment (~110ms on each hg invocation), particularly for commands that don't
touch the index very much.

This changeset adds this as an option, activated by a new experimental config
option so that it can be enabled on a per-repo basis. The configuration option
specifies an index size threshold at which Mercurial will switch to using mmap
to access the index.

If the configuration option is not specified, the default remains to load the
full file, which seems to be the best option for smaller repos.

Some initial performance numbers for average of 5 invocations of hg log -l 5
for different cache states:

Repo:HGFB
Index size:2.3MBmuch bigger
read (warm):237ms432ms
mmap (warm):227ms321ms
(-3%)(-26%)
read (cold):397ms696ms
mmap (cold):410ms888ms
(+3%)(+28%)

[1] https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-October/088737.html

Test Plan

hg log --config experimental.mmapindex=true

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

mbthomas created this revision.Aug 22 2017, 10:37 AM
simonfar added inline comments.
mercurial/revlog.py
347

Is there any reason for mmapread to not immediately wrap the data in a util.buffer for you?

indygreg requested changes to this revision.Aug 23 2017, 11:31 AM

This seems reasonable as an experimental feature.

I worry about memory mapping all files without regards to size. That seems unnecessary.

Do you know what happens with the underlying file descriptor when it is memory mapped? If an fd stays open, that could lead to fd exhaustion. This can be mitigated by only using mmap for certain files (minimize size, changelog, etc). It's something I'd like an answer to before this is queued.

mercurial/localrepo.py
607–608

As you wrote in the commit message, the benefits of memory mapping likely kick in at some threshold and we'll want to make this behavior conditional on index size. It is tempting to implement that today to facilitate experimentation. If you can do that without introducing extra system calls for the default file open case, do it.

mercurial/revlog.py
332

We only support 2.7 now. So this version sniffing can be removed.

This revision now requires changes to proceed.Aug 23 2017, 11:31 AM
durin42 requested changes to this revision.Aug 23 2017, 11:58 AM
durin42 added a subscriber: durin42.

Also, please include at least one log invocation or similar in a test with the config knob set? And maybe some rough benchmark results in the commit message to justify the feature. :)

quark added a subscriber: quark.Aug 23 2017, 12:05 PM

I think it might make sense to only apply it to manifest and changelog. Filelogs are usually short and probably do not need it.

In D477#7739, @indygreg wrote:

Do you know what happens with the underlying file descriptor when it is memory mapped? If an fd stays open, that could lead to fd exhaustion. This can be mitigated by only using mmap for certain files (minimize size, changelog, etc). It's something I'd like an answer to before this is queued.

The fd remains open. (Actually, the fd is dup'd and the original one is closed. The dup remains open.) For this reason I'm going to restrict it to changelog and manifest indexes only, as those are the ones that are known to grow big.

mercurial/revlog.py
347

I took it from Jun's original patch, but I think it's simply separation of responsibilities. util.mmapread's purpose is to wrap mmap(... ACCESS_READ) and handle the case where the file is empty, which normally fails.

mbthomas edited edge metadata.Aug 24 2017, 10:00 AM
mbthomas updated this revision to Diff 1245.

Switch to index size threshold for mmap, and only do it for changelog and manifest.
Rebase and add test.

I've switched to a configurable threshold for mmap as suggested. This does require an extra stat call for the changelog and manifest indexes, however python already calls stat each time you open a file, so we should just hit the filesystem cache.

simonfar added inline comments.Aug 24 2017, 10:26 AM
tests/test-revlog-mmapindex.t
13–14

Two things:

  1. Can you add a ui.debug() or similar so that you can show that mmap() is in use here?
  2. Can you repeat the test without mmap() to show that mmap() and not-mmap() both still work?
mbthomas added inline comments.Aug 24 2017, 10:35 AM
tests/test-revlog-mmapindex.t
13–14

For (1) There's no ui object available in the revlog.revlog constructor. Is there a way to get one?
For (2), the rest of the test suite will be loading the revlog without mmap, so I didn't think it necessary to test it again here.

indygreg added inline comments.Aug 24 2017, 10:18 PM
tests/test-revlog-mmapindex.t
13–14

I'd do it with inline Python. Something like:

from mercurial import hg, ui as uimod
myui = uimod.ui.load()
repo = hg.repository(myui, path='.')

  1. Test repo.changelog here.

Also, excessive hg commands makes tests slow. Unfortunately, doing this in Python is probably a bit of work :/

I think revlog is decoupled from ui intentionally. I would avoid using ui.debug to test it, but just have a extension in .t that wraps the mmap method and print something to indicate mmap is being used.

mbthomas updated this revision to Diff 1294.Aug 25 2017, 11:58 AM

Make mmap test check for mmapping by making util.mmapread verbose

quark added inline comments.Aug 25 2017, 12:03 PM
tests/test-revlog-mmapindex.t
16

In this case verbosemmap.py is not very useful in other tests. The common pattern is to make it inside the .t file:

$ cat > verbosemmap.py <<EOF
> ...
> EOF
$ cat >> $HGRCPATH.py << EOF
> ...
> verbosemmap.py=$TESTTMP/verbosemmap.py
> EOF
mbthomas updated this revision to Diff 1297.Aug 25 2017, 2:46 PM

Move verbosemmap.py into the test

quark accepted this revision as: quark.Sep 11 2017, 6:34 PM

Generally look good to me.

mercurial/revlog.py
308

This variable is only used in __init__ so it is unnecessary to attach it to revlog object. In theory it could make a little difference if there are many revlog objects being created.

1564

I have checked this checkinlinesize code path and believe this patch isn't making things worse.

That said, I feel checkinlinesize could be troublesome in some cases. And maybe it's safer to move it to a post-close transaction hook.

mbthomas updated this revision to Diff 1786.Sep 13 2017, 1:27 PM

Rebase and remove mmapindexthreshold from revlog object

durin42 requested changes to this revision.Sep 15 2017, 11:25 AM

Please include benchmark numbers in your commit message.

This revision now requires changes to proceed.Sep 15 2017, 11:25 AM
mbthomas edited the summary of this revision. (Show Details)Sep 15 2017, 12:23 PM

I had put the numbers in the commit message, but that doesn't update the Phabricator summary, so I've done that manually.

It should be clear from the numbers that the improvement only really surfaces when the filesystem cache is warm, and there is a penalty for using this option when the filesystem cache is cold. Since the cache should be warm more often than not, the expectation is an overall improvement.

mbthomas requested review of this revision.Sep 15 2017, 12:28 PM
durham added a subscriber: durham.Sep 15 2017, 12:30 PM

Also, the improvements should become more pronounced if we fix some algorithms (like phases) that access really old commits.

quark added a comment.Sep 15 2017, 4:33 PM

I had put the numbers in the commit message, but that doesn't update the Phabricator summary, so I've done that manually.

I think you were using arc and can use hg phabsend to get the commit message updated. See https://www.mercurial-scm.org/wiki/Phabricator for instructions.

durham edited the summary of this revision. (Show Details)Sep 15 2017, 7:52 PM
quark added a comment.Sep 18 2017, 7:49 PM

237ms for HG seems there is something wrong.

durin42 requested changes to this revision.Sep 20 2017, 12:03 PM
durin42 added inline comments.
mercurial/revlog.py
348

This makes me very uncomfortable, because util.mmapread() returns the empty string if it doesn't like the file descriptor. I'd be less worried if util.mmapread() crashed, or if this was more cautious about the mmapread failing for some reason.

This revision now requires changes to proceed.Sep 20 2017, 12:03 PM
durin42 accepted this revision.Sep 21 2017, 11:36 AM

I'm landing this, but we should plan to figure out the right tradeoff values and get it on by default or discard the extra complexity at some point. That doesn't have to be before 4.4, but it'd be nice to have that sorted out by 4.5.

This revision was automatically updated to reflect the committed changes.