This is an archive of the discontinued Mercurial Phabricator instance.

branchmap: update rev-branch-cache incrementally
AbandonedPublic

Authored by joerg.sonnenberger on Dec 13 2020, 1:49 PM.

Details

Reviewers
marmoute
Group Reviewers
hg-reviewers
Summary

Historically, the revision to branch mapping cache was updated on demand
and shared via bundle2 to avoid the cost of rebuilding on first use.

Introduce an extension point register_changeset in the repository to
register new changesets. Extend commit, unbundling and exchange to invoke
this for each new changelog.

Ensure that the rev-branch-cache is updated using this extension
point. This makes the whole bundle part redundant. The change exposed
a performance issue in the cache implementation since repeated
extending of a bytearray has quadratic runtime complexity. Avoid this by
using the normal amortized constant time method for resizing dynamic
arrays.

The additional processing slows "hg unbundle" of the whole hg repository
down by around 1.8% and less for larger repositories.

The bundle capability is still advertised and will be unset separately
to avoid the tests churn.

Diff Detail

Repository
rHG Mercurial
Branch
default
Lint
No Linters Available
Unit
No Unit Test Coverage

Event Timeline

joerg.sonnenberger retitled this revision from branchmap: update rev-branch-cache automatically [POC] to branchmap: update rev-branch-cache incrementally.Dec 14 2020, 2:01 PM
joerg.sonnenberger edited the summary of this revision. (Show Details)
joerg.sonnenberger updated this revision to Diff 24264.
pulkit added a subscriber: pulkit.Dec 17 2020, 6:02 AM
pulkit added inline comments.
mercurial/changelog.py
609–610

Hm, is this unrelated change?

mercurial/changelog.py
609–610

Partially, I discovered in the next change that hgext.git doesn't expose read() and it was easier to do it earlier. I can pull it out into a separate commit as it is also a micro-optimisation.

pulkit added inline comments.Dec 17 2020, 7:17 AM
mercurial/changelog.py
609–610

Yes please do.

joerg.sonnenberger marked an inline comment as done.Dec 17 2020, 7:32 AM
joerg.sonnenberger added inline comments.
mercurial/changelog.py
609–610

Done in D9626.

joerg.sonnenberger marked an inline comment as done.Dec 17 2020, 8:17 AM
joerg.sonnenberger updated this revision to Diff 24341.
joerg.sonnenberger edited the summary of this revision. (Show Details)Dec 17 2020, 8:21 AM

I am big +1 to the idea. This is something which I personally felt a need for in past and discussed it with @marmoute.

mercurial/branchmap.py
739

Hm, shouldn't this be b'\0' * (max(...) - rbccur)?

mercurial/bundle2.py
2481–2484

Better to mention upto which mercurial version the part was used and what recent versions do.

mercurial/localrepo.py
1992
  1. We can get the node from changsetrevision, hence no need to pass it separately.
  1. minor name nit: we can call it regitser_new_changeset, the extra new will make it more obvious. Else we will require some documentation for this function.
  1. How about not making this a method on localrepo object? Having a normal function will help extensions wrap it easily.
joerg.sonnenberger marked an inline comment as done.Dec 18 2020, 2:56 PM
joerg.sonnenberger updated this revision to Diff 24420.
mercurial/branchmap.py
739

The goal is to have room to grow for the next change to avoid quadratic resizing. If we need to resize, double the total size. The max size is to cheaply avoid a few iterations for the first few revisions. I've added a comment to make that clear.

mercurial/localrepo.py
1992
  1. It's passed in to avoid the extra lookup and parsing for the important case of bundle operations. This means a minor penalty for commit, but it makes a measurable difference for the other cases.
  1. IMO "register" already implies that it is something new. I can add a few words in the interface definition though.
  1. Both work for me, no opinion yet on what is less effort for extensions.

The overall idea of warming this as we add thing is good. However I have reservation about the details on the new API (the idea of the new API itself seems good).

mercurial/branchmap.py
709–721

That signature change is weird.

  1. we end up doing an extra lookup for nodeid
  2. we end up exposing implementation details about extra to code that did not needed it before.

I don't understand why we need to shift responsability here and have the branch cache care about this.

740–742

In the comment, I seems mention of resizing double the size. But I don't see this happening in final code. Am I missing something ?

mercurial/changegroup.py
328

Why not simply pass the revision number to the API and let lower lever create the changelogrevision if they need it ?

mercurial/interfaces/repository.py
1644–1650

new naming allow this to be register_changeset, we should probably use that.

WE should probably introduce this API in a different changeset to simplify the discussion around it.

1647–1648

What about the cache the latest changelogrevision accessed to make this free and help in other cases ?

marmoute requested changes to this revision.Jan 13 2021, 8:52 AM
This revision now requires changes to proceed.Jan 13 2021, 8:52 AM
mercurial/branchmap.py
709–721

I think redundant keys are an interface bug. The node was available for all, so at least onchangelog in changegroup.py would have to do a lookup anyway. For the commit caller it would mean shuffling things around a bit.

I'll check the extra logic here again, it's been inconsistent across the code base. This is a bit of a left over from earlier iterations, but it can likely be made tighter and simpler now.

740–742

Extending by requiredsize will normally double the size, except when it was sparse before. In that case the gap will count toward to extension, which seems a reasonable compromise.

mercurial/changegroup.py
328

readfiles does the expensive part of creating the changelogrevision already. Not having to process the text twice makes a noticable difference. Note that we don't have the revision number here.

mercurial/interfaces/repository.py
1644–1650

I'm not attached to the name. The other concern was if it should be function on the module scope to allow easier monkey patching. Not sure if that question warrants splitting the change.

1647–1648

I tried adding a cache for the last changelogrevision, but it turned out to be more expensive than shuffling things around a bit here. See the remark about readfiles earlier, which is the primarily time sink right now.

joerg.sonnenberger marked 5 inline comments as not done.Jan 13 2021, 10:32 AM
joerg.sonnenberger added inline comments.
mercurial/branchmap.py
709–721

The extra logic is essentially what changelog.branchinfo does. D9603 refactors that to put it into revbranchcache as the only consumer in core. I've kept it for the moment since topics interacts with this code and would break.

joerg.sonnenberger marked an inline comment as not done.Jan 13 2021, 3:55 PM
joerg.sonnenberger updated this revision to Diff 24818.
joerg.sonnenberger marked 2 inline comments as done.Jan 13 2021, 4:45 PM
joerg.sonnenberger edited the summary of this revision. (Show Details)
joerg.sonnenberger updated this revision to Diff 24821.
marmoute added inline comments.Jan 14 2021, 5:06 AM
mercurial/branchmap.py
709–721

My main concerns is responsibility and implementation details leaking into a simple cache. Could we have the extra-extraction done in the register_changeset method itself ? or probably better, putting them on changelogrevision since it is the one responsible for the details of the data access.

709–721

Both changegroup and commit application should have a revision number handy (that just the size of the revlog before creating the changeset). Revision number are much more efficient that nodeid overall. I would much prefer this kind of low level API to standardize on using them.

740–742

I get it now. I don't know if it me being tired of the code being a bit too sneaky. but it might be worth having some explicit arithmetic or a clarification comment to highlight that the operation result in a array that is "current_size + currently needed size" leading to (about) a ×2 size increase.

mercurial/interfaces/repository.py
1644–1650

Wrapping of repository level method is a common operation in extension. That is probably a good place to put it. (And also a place because they are too many method on localrepo… but these one is not a too bad fit in my opinion.)

The motivation to split the change is also about making the first half easier to land sooner, and to avoid mixing discussion on two different topic. That is not a requirement, but it might help.

1647–1648

could we cache the result of readfiles ? to reuse it at that time ?

joerg.sonnenberger marked an inline comment as done.Jan 14 2021, 5:40 PM
joerg.sonnenberger added inline comments.
mercurial/branchmap.py
709–721

I've moved the extra logic into changelogrevision, so it is again in one place only. Shouldn't make a performance difference and is cleaner. I would really prefer to not add assumptions about the revisions of new nodes, that seems to be error prone and also not what we do right now. Otherwise phases.registernew wouldn't lookup the revision either. I do expect this to be optimized in the revlog one way or the other.

740–742

Right, the adjusted comment should explain that.

joerg.sonnenberger marked an inline comment as done.EditedJan 14 2021, 8:20 PM
joerg.sonnenberger abandoned this revision.

Superseded by D9778, D9779, D9780 and D9781.