Page MenuHomePhabricator

tags: redo .hgtags file node cache to work more like the revbranchcache
Needs ReviewPublic

Authored by joerg.sonnenberger on Mar 1 2021, 12:18 PM.

Details

Reviewers
Alphare
Group Reviewers
hg-reviewers
Summary

Computing tags requires parsing .hgtags for all heads. Mercurial
therefore keeps a cache to efficiently find the .hgtags version of a
revision without having to parse the manifest, but this cache is
computed lazily and often incomplete.

The new implementation of the test works a lot more like the
revbranchcache and updates the cache in two stages:
(1) When a new changeset is added, check if .hgtags is touched. The
common case is that it didn't change and it is therefore inherited from
the parent. Now the parent might not be fully resolved yet (missing
manifest), so just keep a dictionary mapping to the parent revision that
potentially touched it.
(2) At the end of the transaction, resolve entries before writing the
cache to disk. At this point, manifests are all known, so they can be
parsed as necessary. The fast path here is reading just the delta, but
this doesn't necessarily answer the question, since the delta could have
been to a non-parent.

If the cache logic hits an invalid or missing node, it will recheck all
nodes. This is a bit more expensive, but simplifies the logic and avoids
recursions. This penalty is normally hit only once, but read-only
clients should run debugupdatecaches once and after strips. The
rewritten version no longer uses a separate missing item. This matters
only if a node with 32bit leading nulls exist other than the nullid, but
that is as likely as 32bit leading ones with the old code.

Extend verification has the .hgtags file node at least exists. It
doesn't create a measurable difference for the rebuild time for a tag
heavy repository.

The code structure of the tag prepares for mmap access to the cache
file. The only in-place operation is append and that verifies the old
file size and is only used if nothing else changed.

Diff Detail

Repository
rHG Mercurial
Branch
default
Lint
No Linters Available
Unit
No Unit Test Coverage

Event Timeline

joerg.sonnenberger edited the summary of this revision. (Show Details)Mar 1 2021, 8:35 PM
joerg.sonnenberger updated this revision to Diff 26010.
joerg.sonnenberger retitled this revision from tags: redo .hgtags file node cache to work more like the revbranchcache [WIP] to tags: redo .hgtags file node cache to work more like the revbranchcache.Mar 5 2021, 10:17 PM
joerg.sonnenberger edited the summary of this revision. (Show Details)
joerg.sonnenberger updated this revision to Diff 26117.

There are still a few asserts to ensure internal consistency for now, but this should be mostly in shape now.

I am +1 on the idea. @marmoute you might want to have a look.

The overall logic looks good (and simple to boot!) to me, but I have some comments about readability/maintainability.

mercurial/tags.py
689

Could you please use a more descriptive variable name? To me, a small comment would go a long way on clarifying the semantics of some of the less obvious ones (all the more for when the code evolves).

714

Could you add some comments about what we're comparing here? I understand we can't use a namedtuple (or anything readable) because Python, but this makes it hard to reason about the structure of record and why we're always splitting around 4.

839

Doesn't that result in two lookups?

884

Possibly have a _reset method for this?

joerg.sonnenberger marked 3 inline comments as done.Thu, Apr 1, 4:43 PM
joerg.sonnenberger updated this revision to Diff 26642.
mercurial/tags.py
714

The record format itself is described in the docstring of the class. Not sure if an explicit reference to that here really helps?

Alphare added inline comments.Thu, Apr 8, 5:42 AM
mercurial/tags.py
714

I'm not sure which docstring you're reffering to, so maybe it would help.
Sorry for the delay in review btw. :)

mercurial/tags.py
714

Line 657ff in the new version.

Alphare accepted this revision.Thu, Apr 8, 8:06 AM
Alphare added inline comments.
mercurial/tags.py
714

Right, I'm blind.

What happens in the stream clone case ? Or during the transition from older to newer client ? streaming content from an older clone will not contains a fully up to date cache, and existing repository neither. How are we dealing with this ?

You mention the need to update the cache after stripping, but I don't see such change in the code. am I missing something ?

mercurial/tags.py
763–765

Question for a possible follow up: Should we make this revnum only ? I find low level function that accept both weird.

790

note that the files files can be wrong, and they are recurrent instance of it in the wild. The new sidedata storage that we use for copy tracing can fix this, but I don't think we can transparently expose it from the changelog revision abstraction.

We should at least get a XXX comment about it.

816

Can we get some more information about the error here ? liek what kind of invalid topology and for which revision ?

837

Same feedback about .files here.

Do you have any performance number for that update ? for example the time we need to do that full some known large repositories, and the impact on a clone ?

How do you deal with repository corruption ? (for example if some old revision reference some missing hgtags ?)

What happens in the stream clone case ? Or during the transition from older to newer client ? streaming content from an older clone will not contains a fully up to date cache, and existing repository neither. How are we dealing with this ?
You mention the need to update the cache after stripping, but I don't see such change in the code. am I missing something ?

The cache will fix up itself when it detects missing entries. This can take a moment and won't persist for true read-only repositories (since they are read-only). That's why there is a note in the release notes. It is note worthy as the case can be more expensive than before for (very) large repositories. I have plans to revisit the strip handling at least for modern clients to do the right thing in first place, but will be separate and also cover rev-branch-cache.

mercurial/tags.py
763–765

The function is used internally where we have the choice and externally, where it is a mix of box. fnoderevs would easily be changed to operate on nodes, _readtagcache not so much, I think.

790

In what way? If the basic "files (potentially) touched" is wrong, a lot of places in the tree are going to be failing. I'm aware that there are false positives, but those don't matter.

816

This is basically just an explicit sanity check at this point. It can be triggered only by broken revlogs, e.g. missing parent or parents not sorted before their children. I'd expect that to fall badly in other places, too. So I'm somewhat reluctant to add anything more here.

Do you have any performance number for that update ? for example the time we need to do that full some known large repositories, and the impact on a clone ?

A full cache rebuild clocks at around 45s for my NetBSD src test repo with 950k revisions. That cost is dominated by the changelog parsing, so number of hgtags revisions doesn't matter for that part. There would be an additional pass reading those manifest (deltas), but I found that one to be negligible at least for the tag heavy repos I found. I didn't specifically measure clone, because I don't expect it to be significant. The changelog is already parsed at that point and the computations are otherwise fairly cheap.

How do you deal with repository corruption ? (for example if some old revision reference some missing hgtags ?)

There is a fast path for adding missing records beyond the end of the index. When asking about a revision beyond the current end of the cache, it just tries the forward computation. As long as the existing entries are all valid, only new revisions are touched. That is essentially the fix-up missed updates from older hg versions.
When looking up an existing entry in the cache, _validaterecord checks that the truncated node match and that the filenode exists. If either is wrong, the cache is completely re-validated.

What happens in the stream clone case ? Or during the transition from older to newer client ? streaming content from an older clone will not contains a fully up to date cache, and existing repository neither. How are we dealing with this ?
You mention the need to update the cache after stripping, but I don't see such change in the code. am I missing something ?

The cache will fix up itself when it detects missing entries.

What does "fix up itself" involve here ? fixing the one missing entry ? or triggering a full computation, taking possibly minutes ?

This can take a moment and won't persist for true read-only repositories (since they are read-only). That's why there is a note in the release notes. It is note worthy as the case can be more expensive than before for (very) large repositories.

I am a bit worried about that, because people don't read the release note and upgrade has a good chance to introduce some cripplying performance for unsuspecting user of larger repositories :-/

I have plans to revisit the strip handling at least for modern clients to do the right thing in first place, but will be separate and also cover rev-branch-cache.

That sounds fine, strip is already a slow operation.

mercurial/tags.py
763–765

What about getting explicite getfnode_node(…) in addition to a revnum only getfnode(…)

(not too important)

790

They can be files that should be in the files field and are not. See https://bz.mercurial-scm.org/show_bug.cgi?id=6219 for details.

816

Maybe a ProgrammingError then ? or a dedicated ConsistencyError ? The current message in a high level exception class intended for the user (Abort), but is neither understandable by normal user nor actionnable by anyone. That is not great.

What if we use a different file ? This would allow us to use a different semantic an adjusted plan for a smoother transition.