
incremental-repack: prefer small packs across generations over largest ones

Authored by singhsrb on Nov 21 2017, 2:59 PM.


Group Reviewers
Restricted Project

The earlier algorithm for picking packs considered repacking the two
largest packs if no generation contained a sufficient number of packs
for repacking. This can lead to issues because the largest packs
can be huge, and repacking them takes a long time.

The new algorithm prefers picking smaller packs across generations when a
single generation does not contain sufficient packs, which avoids this issue.
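The selection strategy described above can be sketched roughly as follows. This is an illustrative sketch only; the function name, signature, and details are assumptions, not the actual remotefilelog code:

```python
def choose_packs_incremental(packs, size_limit, max_packs):
    """Pick packs to repack, preferring small packs across generations.

    `packs` is a list of (name, size) pairs. Illustrative sketch only;
    names and signature are not the actual remotefilelog code.
    """
    # Sort smallest-first so we accumulate many small packs (possibly
    # from different generations) instead of grabbing the largest ones.
    chosen, total = [], 0
    for name, size in sorted(packs, key=lambda p: p[1]):
        if len(chosen) >= max_packs or total + size > size_limit:
            break
        chosen.append(name)
        total += size
    # Repacking fewer than 2 packs accomplishes nothing.
    return chosen if len(chosen) >= 2 else []
```

For example, with a 100MB size limit and packs of 45GB, 2GB, 50MB, and 30MB, only the two small packs are chosen; the huge packs are left alone.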

Test Plan

Ran all the tests.

Diff Detail

rFBHGX Facebook Mercurial Extensions
Lint Skipped
Unit Tests Skipped

Event Timeline

singhsrb created this revision.Nov 21 2017, 2:59 PM
Herald added a reviewer: Restricted Project. · View Herald TranscriptNov 21 2017, 2:59 PM
singhsrb planned changes to this revision.Nov 21 2017, 3:07 PM

Actually, it's best to avoid the highest generation if we are going across generations, so this needs a bit of change.

phillco added inline comments.


singhsrb updated this revision to Diff 3745.Nov 21 2017, 4:18 PM
singhsrb marked an inline comment as done.Nov 21 2017, 4:19 PM
singhsrb updated this revision to Diff 3757.Nov 21 2017, 9:10 PM
durham added a subscriber: durham.Nov 22 2017, 10:15 AM

Can we hold off on landing this until after the holiday? I need to think through the ramifications of this, and it might be better to do it in a room together with a whiteboard.

singhsrb added a comment (edited). Nov 22 2017, 2:15 PM


I think the biggest divergence of this commit from the current behavior is that the incremental repack can leave 2 huge pack files even after running an unbounded number of times with the current repacksizelimit (which is 100MB) and datapack generations (which are 1GB, 100MB, 1MB). For example, say you already had two pack files: A of size 45 GB and B of size 2 GB. These files would never be packed together until there is another pack file C of size > 1GB. When C is eventually generated, it will be repacked with B (because B is the smaller of A and B), resulting in D and leaving A untouched. Now A and D will never be repacked until there is another pack E of size > 1GB, and so on.

This automatically achieves the objective of not dealing with overly huge pack files. A full hg repack can still repack the huge files if needed. Since there may already be users who have huge pack files, not touching those files can yield real benefits. Also, this is not a bug in the algorithm as such, since it can be corrected by picking a large enough repacksizelimit.
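The A/B/C scenario above can be traced with a tiny simulation (pure illustration; sizes in GB, assuming the rule "combine the smaller packs and leave the largest untouched"):

```python
# Illustrative trace of the scenario above (sizes in GB).
# Assumed rule: when repacking kicks in, combine the smaller packs
# and leave the largest one untouched.
packs = {"A": 45, "B": 2}

# Neither A nor B alone triggers a cross-generation repack,
# so they are never combined under the incremental policy.

# A new pack C (> 1 GB) appears:
packs["C"] = 1.5

# Incremental repack combines the two *smaller* large packs (B and C)
# into D, leaving the 45 GB pack A untouched:
small = sorted(packs, key=packs.get)[:2]   # ['C', 'B']
packs["D"] = sum(packs.pop(p) for p in small)
print(sorted(packs))   # ['A', 'D'] -> A (45 GB) is still never repacked
```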

singhsrb updated this revision to Diff 4017.Nov 30 2017, 4:23 PM
singhsrb added inline comments.Nov 30 2017, 4:37 PM

Since after this change we pick as few as 2 packs for repacking instead of the earlier minimum of 3, there can be a convergence problem (the number of packs considered for repacking being almost always less than the number of new packs being generated), which was pointed out by @durham. In particular, since the second-highest generation starts at 100MB, if we have enough packs > 100MB we will pack only 2 of them at once. To take care of this, I am increasing the default repacksizelimit from 100MB to 1GB. This should have the desired effect.

Note that the maxrepackpacks limit ensures that not every smaller pack is considered for repacking, even though they might all fall under the repacksizelimit.
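Expressed as configuration, the change would look roughly like the following hgrc fragment. The section and option names here are assumptions based on the names used in this discussion; check your remotefilelog version for the exact keys:

```
[remotefilelog]
# Raise the incremental-repack size budget from the 100MB default to 1GB
# (section/option names assumed from this discussion, not verified).
repacksizelimit = 1GB
```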

singhsrb updated this revision to Diff 4068.Dec 1 2017, 7:49 PM
singhsrb added inline comments.Dec 1 2017, 7:56 PM

Reverted this to 100MB because the change from a minimum of 2 to 3 files was only required when we had the following behavior:

  1. We packed all the files in a generation if they were collectively under the size limit.
  2. Otherwise, we packed only 2 files from the generation (which we later changed to 3).

This is not required if we are packing across generations until we hit the size limit.

singhsrb abandoned this revision.Dec 1 2017, 9:05 PM

Had an offline discussion with @durham regarding this. He pointed out enough cases where this might not be an improvement over the current algorithm, so I am abandoning it to clear the queue. There will be some follow-up commits that incorporate the good ideas from this commit.

D1588 and D1596 address the major ideas from this commit.