This is an archive of the discontinued Mercurial Phabricator instance.

changegroup: add functionality to skip adding changelog data to changegroup
ClosedPublic

Authored by pulkit on Sep 18 2018, 6:43 AM.

Details

Summary

In narrow extension, when we have a non-ellipses narrow working copy and we
extend it, we pull all the changelog data again and the client tries to reapply
all that changelog data.

While downloading millions of changeset data is still not very expensive but
applying them on the client side is very expensive and takes ~10 minutes. These
10 minutes are added to every hg tracked --addinclude <> call and extending
a narrow copy becomes very slow.

This patch adds a new changelog argument to cgpacker.generate() fn. If the
changelog argument is set to False, we won't yield the changelog data. We still
have to iterate over the deltas returned by _generatechangelog() because that's
a generator and builds the data for clstate variable which is required for
calculating manifests and filelogs.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

pulkit created this revision.Sep 18 2018, 6:43 AM

I agree this is dirty and I will be very happy to improve this. I can think of folllowing improvements

First is this:

@@ -837,10 +838,14 @@ class cgpacker(object):
         size = 0
 
         clstate, deltas = self._generatechangelog(cl, clnodes)
-        for delta in deltas:
-            for chunk in _revisiondeltatochunks(delta, self._builddeltaheader):
-                size += len(chunk)
-                yield chunk
+        if changelog:
+            for delta in deltas:
+                for chunk in _revisiondeltatochunks(delta, self._builddeltaheader):
+                    size += len(chunk)
+                    yield chunk
+        else:
+            for delta in deltas:
+                pass
 
         close = closechunk()
         size += len(close)

The second improvement is to add a part argument instead of changelog where by default, part = ('changelog', 'manifests', 'treemanifests', 'filelog') and caller can pass a value of parts. The call in narrow will pass ('manifests', 'treemanifests', 'filelog') and in future we can reduce that to ('treemanifests', 'filelogs').

indygreg accepted this revision.Sep 18 2018, 1:53 PM
indygreg added a subscriber: indygreg.

I think the code as written makes sense: I'd rather have 1 loop over deltas than 2.

Regarding the other potential improvement, the reality of changegroups is that both they and bundle2 are container formats. In the context of bundle2, I think it makes more sense to send "delta groups" as individual bundle2 parts and to do away with changegroups completely. My guess is we haven't done this because the changegroup code has always existed, is non-trivial, and "just works." In other words, there hasn't been a pressing need to split up the changegroup code, so nobody has done it.

This revision is now accepted and ready to land.Sep 18 2018, 1:53 PM
This revision was automatically updated to reflect the committed changes.

I think the code as written makes sense: I'd rather have 1 loop over deltas than 2.
Regarding the other potential improvement, the reality of changegroups is that both they and bundle2 are container formats. In the context of bundle2, I think it makes more sense to send "delta groups" as individual bundle2 parts and to do away with changegroups completely. My guess is we haven't done this because the changegroup code has always existed, is non-trivial, and "just works." In other words, there hasn't been a pressing need to split up the changegroup code, so nobody has done it.

Yep, just because changegroup exists, and it was not very much hard to prevent sending the changegroup data, I prevented myself working on the "delta groups" idea. However, in future, if we want to come up with a new format due to some reasons, individual "delta groups" sounds a promising idea.