To speed up pack lookups (especially when there are lots of packs), we
should maintain an lru ordering of the packs and perform searches in that
order, since it's likely the next entry we search for will be in the same pack
file as the last entry we searched for. This commit achieves the same.
Details
- Reviewers
durham - Group Reviewers
Restricted Project - Commits
- rFBHGXf5784ad7d9e9: basepack: add a lru cache for the pack files
- Ran all the tests.
- Created ~2k pack files in a large repo.
- Time taken without the cache:
- hg update b while at a: ~18 minutes.
- hg update a while at b: ~23 seconds.
- Time taken with the cache:
- hg update b while at a: ~14 seconds.
- hg update a while at b: ~9 seconds.
- Time taken without the cache:
Diff Detail
- Repository
- rFBHGX Facebook Mercurial Extensions
- Lint
Automatic diff as part of commit; lint not applicable. - Unit
Automatic diff as part of commit; unit tests not applicable.
Event Timeline
I have verified that the cache behaves as expected. I will add more tests for the caching anyways. I wanted to put this out so that there are some eyes on it before I go too far.
Please see D1220 for some idea of how I tested the improvements due to the cache.
remotefilelog/basepack.py | ||
---|---|---|
105 | Why not put the default value on the basepackstore instead of putting the same value on both derived types? | |
129 | Why do we need the notion of a path in this class? I would kinda expect this class to just have an add(pack) and __iter__ functions? No need to implement the rest of the apis if we never use them. Then this could just be named self._packs and the member on basepackstore could go back to being self.packs? | |
130 | If the reason for making _pathtopack a dictionary was because lrucachedict is a dictionary, you can just make lrucachedict a dictionary of pack->True | |
167 | This sort could still save on the initial lookup right? Seems like a reasonable win for such a small amount of code. |
@durham I am assuming you are good with the overall approach. I will make the appropriate fixes.
remotefilelog/basepack.py | ||
---|---|---|
105 | The value happens to be the same but in general, I wanted the child classes to decide on the default size. Or, at least have the option of overriding the default in the child classes. | |
129 | I considered this and then was pretty much lazy because I had already written the code based on the dict. Wanted to get the idea out there more than get it out perfectly :). I will fix this! | |
130 | Sure! | |
167 | I remember I put this TODO here because the caching I added doesn't use the most recent files anyways (and I wanted to bring that to attention of the reviewers) and it seemed to not make too much of a difference. I have a better idea to actually load the cache with the most recent entries instead of a cold start. |
remotefilelog/basepack.py | ||
---|---|---|
243–244 | Hmm, the _getavailablepackfiles returns the most recent packs but conversion to set here can possibly throw away that information. We can probably also refactor/fix this code. |
Sorry for the late review. I think it looks good, but it just needs a test.
remotefilelog/basepack.py | ||
---|---|---|
47 | You can still put this on the basepackstore class, so subclasses can override it if they want. | |
129 | I'd probably avoid nesting the classes. Just make this a top level class. Nested classes introduce some possible confusion around scope and what the inner class has access to. | |
141 | Generally I would check if self._lastpack is not None: instead of relying on the default handler. A pack implementation might implement something like len which python would then use to check if the pack was empty here, which could be expensive. | |
170 | This pattern is pretty clever! |
Ah I see the test is another diff. Probably worth fixing some of the comments, but accepting so you're not blocked after.
The test is still pending but I guess this is in a good shape now.
remotefilelog/basepack.py | ||
---|---|---|
141 | Neat! I wasn't aware so much can happen under the hood. Will keep this in mind for future! |
tests/remotefilelog-datapack.py | ||
---|---|---|
305 | This loop is unnecessary. I might have a better way to test this. |
tests/remotefilelog-datapack.py | ||
---|---|---|
296 | It's a bit confusing to call this packs since it's actually a bunch of revisions. I was pretty confused about why 'pack.index' worked below. Maybe rename this 'deltachains' or something? |
You can still put this on the basepackstore class, so subclasses can override it if they want.