( )⚙ D1272 repack: sort orphan nodes by size

This is an archive of the discontinued Mercurial Phabricator instance.

repack: sort orphan nodes by size
ClosedPublic

Authored by phillco on Oct 31 2017, 2:02 AM.
Tags
None
Subscribers
None

Details

Reviewers
durham
Group Reviewers
Restricted Project
Commits
rFBHGX7ac80b88cb99: repack: sort orphan nodes by size
Summary

Add repack.chainorphansbysize (default True).

When enabled, we take all orphaned nodes (nodes that are not part of a chain),
and put them into a new chain at the end, so we can get some minimal
compression out of them. Right now, they default to each being stored as
fulltexts, which is wasteful.

We sort the orphan chain by size, descending, to make the largest version
quickest to access, on the assumption that it is probably the newest. (This is
what Git does for its packed data, and it is a decent fallback if ancestry is
not available)

Example chain output, before:

A->B C D->E->F G H

After:

A->B D->E->F G->C->H
(assuming len(G)>=C=>H)

(I'm still adding a test case, but the code itself could be reviewed.)

Diff Detail

Repository
rFBHGX Facebook Mercurial Extensions
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

phillco created this revision.Oct 31 2017, 2:02 AM
Herald added a reviewer: Restricted Project. · View Herald TranscriptOct 31 2017, 2:02 AM
phillco edited the summary of this revision. (Show Details)Oct 31 2017, 2:06 AM
phillco edited the summary of this revision. (Show Details)
phillco edited the summary of this revision. (Show Details)
phillco edited the summary of this revision. (Show Details)Oct 31 2017, 2:08 AM
phillco retitled this revision from repack: sort orphan nodes by size to Add `repack.chainorphansbysize` (default True)..
phillco updated this revision to Diff 3174.
phillco retitled this revision from Add `repack.chainorphansbysize` (default True). to repack: sort orphans by size and chain together.
phillco edited the summary of this revision. (Show Details)
durham requested changes to this revision.Oct 31 2017, 6:42 PM

Any idea what the perf impact of this change is?

remotefilelog/repack.py
16

Missing trailing comma

521

A filter having side effects seems a bit funky. I'm not really convinced this is a cleaner pattern than just building a second list in a loop.

525

If we post-pone the listification until here, we could avoid creating two lists. i.e. the first generator would feed into the second directly.

579

Instead of depending on ordering (which I don't think guarantees that delta chain entries are directly next to each other), can we instead just keep track in the deltabase loop above of any commit that A) doesn't have a delta base, and B) isn't used as a delta base?

597

We probably should. This can result in flakey tests otherwise. Maybe just sort by node as a secondary key. That said, the python docs say sorted is guaranteed to be stable for things with the same key, assuming the input list has the same order each time. So maybe we just sort orphans by node first.

598

Instead of reversing the sorted output, you can pass reverse=True to sorted.

602

Maybe put this behind a 'if self.debugflag:' check. This loop can be a bit of a critical path, and doing a bunch of unnecessary string formats could add up.

613

This makes the function quite larger. Could we refactor this to be a separate function?

This revision now requires changes to proceed.Oct 31 2017, 6:42 PM
phillco retitled this revision from repack: sort orphans by size and chain together to repack: sort orphan nodes by size.Nov 6 2017, 12:51 PM
phillco updated this revision to Diff 3291.
phillco updated this revision to Diff 3292.
phillco planned changes to this revision.Nov 6 2017, 12:52 PM
phillco updated this revision to Diff 3307.Nov 7 2017, 11:55 AM
phillco updated this revision to Diff 3308.Nov 7 2017, 11:58 AM
durham requested changes to this revision.Nov 7 2017, 1:27 PM

Looks like this is still in progress? Putting back on your queue

This revision now requires changes to proceed.Nov 7 2017, 1:27 PM

Yes, sorry about that. Submitted by mistake.

phillco marked 7 inline comments as done.Nov 10 2017, 1:00 AM
phillco edited the summary of this revision. (Show Details)
phillco updated this revision to Diff 3386.
phillco updated this revision to Diff 3387.Nov 10 2017, 1:09 AM
phillco added inline comments.Nov 10 2017, 1:10 AM
tests/test-remotefilelog-lfs.t
238–239 ↗(On Diff #3387)

Should probably add a hg debugdatapack $TESTTMP/hgcache/master/packs/*.datapack here on @ so we can see the type of change happening.

durham requested changes to this revision.Nov 14 2017, 11:57 AM

How much does this affect performance? Requesting changes for that information and for the potential KeyError I commented on.

remotefilelog/repack.py
458

Indentation isn't quite right.

538

Since this is a hotpath, we might want to avoid the double dictionary lookup. Might be better to do:

deltatuple = deltabases.get(node)
if deltatuple is None:
   nobase.add(node)
else:
  deltabase, chainlen = deltatuple
540–542

If node is not in deltabases, as the condition above checks, won't this line throw a KeyError?

This revision now requires changes to proceed.Nov 14 2017, 11:57 AM
phillco added inline comments.Nov 14 2017, 12:26 PM
remotefilelog/repack.py
540–542

No, because deltabases is a defaultdict. That's why we have to check first.

phillco added inline comments.Nov 14 2017, 12:29 PM
remotefilelog/repack.py
540–542

Although, now that I think about it, we can just assign the default here and make it a regular dict, which is probably clearer.

@durham locally, I found it added 5s to a 70s full repack.

phillco updated this revision to Diff 3504.Nov 14 2017, 9:48 PM
phillco marked an inline comment as done.Nov 14 2017, 9:48 PM
phillco marked 5 inline comments as done.Nov 15 2017, 12:51 PM
phillco updated this revision to Diff 3531.

(done)

remotefilelog/repack.py
538

Done. The only hitch is we have to write the nullid deltabase here too, otherwise it will throw a KeyError on one of the later loops.

durham accepted this revision.Nov 15 2017, 6:21 PM
durham added inline comments.
tests/test-remotefilelog-repack-fast.t
265

Kinda weird that the delta length is 70 and the blob size is 8 here? When it's not actually a delta? In the old version of the lines above, delta length and blob size matched when delta base was 0

Not related to this diff, but strange.

This revision is now accepted and ready to land.Nov 15 2017, 6:21 PM
phillco added inline comments.Nov 16 2017, 1:22 AM
tests/test-remotefilelog-repack-fast.t
265

Yeah, that is. I'll take a look if I can remember.