Page MenuHomePhabricator

fastannotate: initial import from Facebook's hg-experimental

Authored by durin42 on Aug 1 2018, 11:17 AM.



I made as few changes as I could to get the tests to pass, but this
was a bit involved due to some churn in the blame code since someone
last gave fastannotate any TLC.

There's still follow-up work here to rip out support for old versions
of hg and to integrate the protocol with modern standards.

Some performance numbers (all on my 2016 MacBook Pro with a 2.6Ghz i7):

Mercurial mercurial/
traditional blame
time: real 1.050 secs (user 0.990+0.000 sys 0.060+0.000)
build cache
time: real 5.900 secs (user 5.720+0.000 sys 0.110+0.000)
time: real 0.120 secs (user 0.100+0.000 sys 0.020+0.000)

Mercurial mercurial/
traditional blame
time: real 3.330 secs (user 3.220+0.000 sys 0.070+0.000)
build cache
time: real 30.610 secs (user 30.190+0.000 sys 0.230+0.000)
time: real 0.180 secs (user 0.160+0.000 sys 0.020+0.000)

mozilla-central dom/ipc/ContentParent.cpp
traditional blame
time: real 7.640 secs (user 7.210+0.000 sys 0.380+0.000)
build cache
time: real 98.650 secs (user 97.000+0.000 sys 0.950+0.000)
time: real 1.580 secs (user 1.340+0.000 sys 0.240+0.000)

mozilla-central dom/base/nsDocument.cpp
traditional blame
time: real 17.110 secs (user 16.490+0.000 sys 0.500+0.000)
build cache
time: real 399.750 secs (user 394.520+0.000 sys 2.610+0.000)
time: real 1.780 secs (user 1.530+0.000 sys 0.240+0.000)

So building the cache is expensive (but might be faster with xdiff
enabled), but the blame results are *way* faster.

Diff Detail

rHG Mercurial
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

durin42 created this revision.Aug 1 2018, 11:17 AM
quark added a subscriber: quark.Aug 1 2018, 11:45 AM

I would mention in the commit message that building cache is much faster with linkrevcache prebuilt.

durin42 updated this revision to Diff 9767.Aug 1 2018, 6:47 PM
quark added a comment.Aug 2 2018, 4:17 PM

I'd also like to see C linelog benchmark data mentioned. The current commit message implies diff algorithm is the bottleneck. That's misleading.

In D3994#62892, @quark wrote:

I'd also like to see C linelog benchmark data mentioned. The current commit message implies diff algorithm is the bottleneck. That's misleading.

Please feel encouraged to contribute your own numbers here. I probably can't be bothered.

quark edited the summary of this revision. (Show Details)Aug 2 2018, 6:43 PM
quark edited the summary of this revision. (Show Details)Aug 2 2018, 6:56 PM
quark edited the summary of this revision. (Show Details)Aug 2 2018, 8:52 PM

I only looked at some of this code.

Overall, my threshold for adding to core is low. I think we should land then iterate.

I would really like to see functionality related to modifying the behavior of annotate live in core. I think it will be easier to comprehend when implemented that way. I envision someday having arguments passed to annotate() that control whether linelog is used, etc. We may end up inventing something like diffopts but for annotate operations.

It might help to think of this extension as two features: a cache for linelog data and user-facing UI to use it. The former is the most important to move to core. Especially since I imagine we don't want an hg fastannotate command lingering around forever (we would want to consolidate with hg annotate I think).

The most concerning part of this extension is the wire protocol functionality. It is not suitable to ship to end-users because of forward compatibility concerns.

The wire protocol functionality is literally transferring files off disk with little to no regard for the format of those files. The wire protocol doesn't expose the version of the files on disk. And that's obviously bad for future compatibility. We need some kind of compatibility checking that the client is capable of understanding the data the server will send it. This could be as simple as the server capabilities string advertising the on-disk format and clients ensuring they can speak the server's format before attempting to fetch that data.

Of course, there could be a mix of data versions on disk given the current simple storage mechanism. We probably want to use separate directories for each version of the cached data so we don't mix streams. Or at the very least the server needs to validate that on-disk data it is serving matches the expected version from the client. Otherwise e.g. a server could serve up an old file version. Or a downgraded server could serve up a new file version.

I think the most robust way forward is to have version-specific capabilities, wire protocol commands, and on-disk locations. This leaves the least margin for error and yields the greatest flexibility for changing formats and exchange semantics later without breaking BC.


branch here is probably not the correct terminology.


I would prefer if there were a separate boolean option for each mode. That way, individual .hg/hgrc files can override specific behavior without overriding global defaults. It is a more flexible config model. We can, of course, have some options imply others.


Should this be a template?


This should be a more description option, like exchange-cache or cache-pull.


Since remotefilelog is not part of core, we probably want to nuke this feature.


I think this wants to be in extsetup() since it mutates global module state.

Ideally it would be part of reposetup() since it only matters on repos that have this extension enabled.

(This comes into play in e.g. hgweb environments, where some repos may not have the extension enabled.)


Won't repo.requirements always be defined?


This wants to move to a utils module somewhere.


I'm sure @yuja has opinions about this file.


The use of \0 means we're a binary format. I'd prefer that we not e.g. format integers to strings if we're using binary exchange. Maybe throw some CBOR at the problem?


A lot of this function can be deleted since we don't support remotefilelog.


Ideally this would call a method directly instead of going through callcommand(). (I would like to remove callcommand() in the future.)


Perhaps the default value for mainbranch should be default?


Nuke it. Then inline the lambda so it isn't abstracted.


The 1st and 3rd setup items here should be in extsetup() since they modify global state (I think).

Ideally we'd have a registrar for these things so unloaded extensions don't have wire protocol modifications affect other repos in-process. But, yeah, not easy.


Nit: I'd name this magic instead of header since that appears to be what it is (REVMAP1\0).


I generally like packing all the constant-length fields first then putting variable-length fields afterwards. That way you can do one read to get all the known data then do another read to get the variable-length data.

Also, \0 terminated paths mean the reader has to scan. It is better to encode the length so that scan can be avoided.

durin42 edited the summary of this revision. (Show Details)Aug 9 2018, 3:13 PM

Addressed comments as follow-ups, mostly via recording a big TODO.


This actually runs inside a reposetup() method, so I think it's already sufficiently paranoid.

indygreg accepted this revision.Aug 20 2018, 3:52 PM

I looked this over again. I still have a number of concerns around this functionality. It really needs to be cleaned up. Especially the wire protocol and UI bits. But I suppose that's what the EXPERIMENTAL label is for. As long as we're not committed to backwards compatibility, I think it is appropriate to land in core then iterate. But let's try to fix as much before the end of 4.8 as possible to minimize impact for people who deploy this. The wire protocol and on-disk format compatibility issues should be top of our list.

This revision is now accepted and ready to land.Aug 20 2018, 3:52 PM
This revision was automatically updated to reflect the committed changes.