This is an archive of the discontinued Mercurial Phabricator instance.

repository: introduce register_changeset callback
ClosedPublic

Authored by joerg.sonnenberger on Jan 14 2021, 8:19 PM.

Details

Summary

The new callback is called whenever a changeset is added to the repository
(commit, unbundle or exchange). Since the bulk operations already parse
the changeset (readfiles or full changesetrevision), always use the
latter to avoid redundant lookups. The first consumer of the new
interface needs to look at extra.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Still not convinced with the API, so poking at what else would be possible.
I nothing obvious emerge we should move forward with that series before the freeze, the change final change is quite valuable.

marmoute requested changes to this revision.Jan 18 2021, 1:15 PM

So I am still not fan of this API:

  • it pass around quite high level object where I think rev would be more appropriate,
  • it rely on revision adder to call the callback themself in "high level" layer, make this odd of this being forgotten high and the API more fragile,
  • it trigger the method even for revision "not actually added because we know them already"

I poked around at an alternative approach that is lower level (so less fragile), is only called for revision actually added, and pass revision number around, relying on some simple cache mechanism to keep things performant.

See D9826, D9827, and D9828

This revision now requires changes to proceed.Jan 18 2021, 1:15 PM
  • it pass around quite high level object where I think rev would be more appropriate,

I don't disagree, but that's a more general change IMO. I really dislike D9826 as it introduces new hidden assumptions about how a revlog backend works. I was looking at whether the higher layers actually care about the new node much, but haven't pushed into that direction yet to either give the callbacks always both node and rev or just rev. So any change in that area should start by _addrevision returning either both node,rev or just rev. But that's quite a different scope all by itself.

  • it rely on revision adder to call the callback themself in "high level" layer, make this odd of this being forgotten high and the API more fragile,

I did original versions by hooking into _addrevision, but ultimately decided against it because there are very few such high-level places and they already fail badly if changes slip through. Given that it completely avoids the need for another caching layer for changelogrevisions to perform well, but seems much better to do it on the higher layer. I've also tried such a layer, but it also came at a measurable price in my tests since it is actually not that hot right now.

  • it trigger the method even for revision "not actually added because we know them already"

This is only possible for commitctx and I'm not sure that is expected to trigger a duplicate either. The other cases all have to deal with the difference between duplicate and newly added anyway.

  • it pass around quite high level object where I think rev would be more appropriate,

I don't disagree, but that's a more general change IMO. I really dislike D9826 as it introduces new hidden assumptions about how a revlog backend works.

How so ?

I was looking at whether the higher layers actually care about the new node much, but haven't pushed into that direction yet to either give the callbacks always both node and rev or just rev. So any change in that area should start by _addrevision returning either both node,rev or just rev. But that's quite a different scope all by itself.

I am not sure what you mean, but that is probably related to the fact I did not get the first part :-)

  • it rely on revision adder to call the callback themself in "high level" layer, make this odd of this being forgotten high and the API more fragile,

I did original versions by hooking into _addrevision, but ultimately decided against it because there are very few such high-level places and they already fail badly if changes slip through. Given that it completely avoids the need for another caching layer for changelogrevisions to perform well, but seems much better to do it on the higher layer. I've also tried such a layer, but it also came at a measurable price in my tests since it is actually not that hot right now.

Do you have a way to get number of my proposal ? The caching layer is really simple and I don't expect a large overhead. Even if they are few high level place and really prefer contained interface.

  • it trigger the method even for revision "not actually added because we know them already"

This is only possible for commitctx and I'm not sure that is expected to trigger a duplicate either. The other cases all have to deal with the difference between duplicate and newly added anyway.

If if possible for bundle too, is it not ? You just have to apply a bundle you already have, don't you ? Having the callback call only for new rev seems simpler as code in such callback don't have to deal with the de duplication each.

@joerg.sonnenberger and I had a discussion on IRC about this API and the result is that Joerg prefers the higher level API for reason I now understand better without necessarly finding them decisive while I still prefer the lower level API for reason that (hopefully) Joerg understand better without finding them more decisive. So we are doing to need a bit more time to thing about that (and probably about the broader picture of the full API of the involved object) with probably more people.

So lets delay this for later during/after the freeze.

Don't forget that we both like the two previous Diff (D9778 and D9779) and we would like to see them in 5.7 if possible.

As discussed, move to using revision for the new function. Most of the prep work is factored out into smaller changesets, except changegroup.py, since that would create a small penalty by itself for no good reason. The goal for follow-up changes is to provide the revision from _addrevision directly (either instead or in addition to the node).

I like the fact we now take a rev as argument. However two discussion remains:

  • could/should we stop passing the changelogrevision as argument ?
  • should we move this within _add_revision or keep it at an higher level ?

I like the fact we now take a rev as argument. However two discussion remains:

  • could/should we stop passing the changelogrevision as argument ?

The performance critical path is IMO in during changegroup application (unbundle) and that one is already doing most of the work.
There are some good chances that consumers of the API will look at older revisions in at least some use cases, so the simple
last-use-cache won't work as well in that case. As such I think it is both simpler and more predictable to do the work once.

  • should we move this within _add_revision or keep it at an higher level ?

I think the behavior is still more consistent on the higher level. I am looking at a follow up change for the tags cache and that needs a secondary extension point for when manifests are added after changesets. I'm not sure yet if there is a use case for notification on all manifests.

I like the fact we now take a rev as argument. However two discussion remains:

  • could/should we stop passing the changelogrevision as argument ?

The performance critical path is IMO in during changegroup application (unbundle) and that one is already doing most of the work.
There are some good chances that consumers of the API will look at older revisions in at least some use cases, so the simple
last-use-cache won't work as well in that case. As such I think it is both simpler and more predictable to do the work once.

They are enough good argument here (for example the caching being fragile if callback do back things) to convince me here. Can you elaborate the method documentation a bit to highlight that?

  • should we move this within _add_revision or keep it at an higher level ?

I think the behavior is still more consistent on the higher level. I am looking at a follow up change for the tags cache and that needs a secondary extension point for when manifests are added after changesets. I'm not sure yet if there is a use case for notification on all manifests.

I am not convinced yet, but this is kind of a minor point. I am not sure about what is needed for the manifest, but for ChangedFiles related thing, we will definitely requires to do a second pass on changesets once the manifest and filelog are in. So they might be things common here.

This revision was not accepted when it landed; it landed in state Needs Review.
This revision was automatically updated to reflect the committed changes.