This is an archive of the discontinued Mercurial Phabricator instance.

packs: improve packfile metrics
ClosedPublic

Authored by phillco on Nov 3 2017, 12:12 PM.
Tags
None
Subscribers

Details

Reviewers
singhsrb
durham
Group Reviewers
Restricted Project
Commits
rFBHGX15e22474bcec: packs: improve packfile metrics
Summary

The previous solution was incomplete. This solution logs once per run, with two
separate metrics (filestore_ and treestore_), each logging the number of packs
and bytes. I also did some refactoring.

Diff Detail

Repository
rFBHGX Facebook Mercurial Extensions
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

phillco created this revision.Nov 3 2017, 12:12 PM
Herald added a reviewer: Restricted Project. · View Herald TranscriptNov 3 2017, 12:12 PM
phillco updated this revision to Diff 3261.Nov 3 2017, 12:13 PM
phillco added inline comments.
remotefilelog/basepack.py
92–109

This is basically @singhsrb's refactor to _getavailablepackfiles in D1217 with some nore goodies.

phillco edited the summary of this revision. (Show Details)Nov 3 2017, 12:26 PM
phillco added inline comments.Nov 3 2017, 1:41 PM
remotefilelog/basepack.py
92–109

*more

singhsrb requested changes to this revision.Nov 3 2017, 2:00 PM

Thanks for doing this refactoring and saving developer time in the future :). I just have a few suggestions/comments.

remotefilelog/basepack.py
61

I don't see what the general convention is but I would prefer to see

for filepath, _mtime, _size in self._getavailablepackfilessorted():
105

I removed the id from ids at this point in my refactor. I did not confirm if it actually helps or not but maybe something worth considering.

121

size -> _size

136

file->_file, mtime->_mtime

143

hmm, number of packs in self.packs can be different from ones that gettotalsize perceives because self.packs may not have the most recent packs..

This revision now requires changes to proceed.Nov 3 2017, 2:00 PM
phillco added inline comments.Nov 3 2017, 2:20 PM
remotefilelog/basepack.py
61

Why prefix them?

phillco edited the summary of this revision. (Show Details)Nov 3 2017, 3:16 PM
phillco added inline comments.Nov 3 2017, 3:22 PM
remotefilelog/basepack.py
61

We talked offline -- it's to mark it as unused. I like the idea marking unused variables, but _size looks too much to me like an instance variable (and the next person might not rename it).

Since _ is used by the localization framework, I humbly propose __. It doesn't look _too_ weird, although it's not been done before.

105

My hunch is the combined time of removing the items is greater than the time saved in subsequent lookups -- but I could be wrong. It's probably still a good idea to reduce memory usage.

143

This is a great point, the two could be inconsistent.

singhsrb added inline comments.Nov 3 2017, 8:34 PM
remotefilelog/basepack.py
61

Does it make more sense to actually get the tuple and extract stuff out of it instead of expanding in the loop in this case?

phillco added inline comments.Nov 5 2017, 9:05 PM
remotefilelog/basepack.py
61

Would that be to eliminate the unused variables?

phillco updated this revision to Diff 3288.Nov 5 2017, 9:25 PM
phillco marked an inline comment as done.Nov 5 2017, 9:33 PM
phillco added inline comments.
remotefilelog/basepack.py
143

I ended up combining the calculation of both into a gettotalsizeandcount (ick) to ensutre consistency.

phillco updated this revision to Diff 3290.Nov 5 2017, 9:33 PM
phillco marked 5 inline comments as done.Nov 6 2017, 12:24 PM
singhsrb accepted this revision.Nov 6 2017, 1:41 PM

Looks good to me mostly. But I don't have much context here. So, I will be more comfortable with another pair of eyes on this!

remotefilelog/basepack.py
61

I was thinking of something like:

for pack in self._getavailablepackfilessorted():
  filepath = pack.filepath
  .
  .
143

That's what I was mainly concerned about. Thanks for taking care of this!

remotefilelog/shallowutil.py
98

I am not sure sum is the right word here mainly because of its use in the string case. Perhaps, add or aggregate is better.

This revision is now accepted and ready to land.Nov 6 2017, 1:41 PM
quark added a subscriber: quark.Nov 6 2017, 2:03 PM
quark added inline comments.
remotefilelog/basepack.py
61

I misunderstood the discussion last week (I thought it was choosing between _name vs __). But if it's between _name and name, I don't think it has to be that strict as long as pyflakes does not complain.

The unprefixed version will causes leas code churn if the variables are used in a later patch, which makes the blame output tidier. I remembered that the Mercurial community strongly preferred less code churn when mpm was still there - for example, there was an AST transformer to convert 'str' to b'str' for Py3 compatibility, instead of doing a codemod.

durham accepted this revision.Nov 6 2017, 2:12 PM

The 'count is not incremented' comment is the only important thing. All my other comments are nits or suggestions.

remotefilelog/basepack.py
61

It looks like the 2nd and 3rd parameter from _getavailablepackfilessorted is never used. Can we just not return them?

105

If we're going to remove it, I'd add a comment explaining why. I don't think we really need that optimization, since most of the time this loop will only be over 10's of items.

139

Count is not incremented

remotefilelog/shallowrepo.py
80–81

I'd drop the underscore and have the helper function add it. So the caller doesn't have to be aware of formatting.

I'd also not use *[] in this case and just pass the list as a list, since that's how it's consumed later. I know we use the * pattern in other places in remotefilelog, but I've come to regret it. If you did want to use *stores in the function, then you'd just pass , 'filestore_', packcontentstore, packmetadatastore) here.

Same for sumdicts, where we pass in *metrics, and receive out *dicts. Just drop the *

remotefilelog/shallowutil.py
106

The default of 0 here means it won't work with anything other than numbers.

>>> from collections import defaultdict
>>> foo = defaultdict(lambda: 0)
>>> foo['bar'] += ['a', 'b', 'c']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +=: 'int' and 'list'
singhsrb added inline comments.Nov 6 2017, 2:23 PM
remotefilelog/basepack.py
139

Well definitely needed another pair of eyes :).

phillco marked an inline comment as done.Nov 6 2017, 2:25 PM
phillco added inline comments.
remotefilelog/basepack.py
61

I think it's more confusing if we don't, because that function claims to otherwise be identical to _getavailablepackfilessorted().

139

whoops, thanks

remotefilelog/shallowutil.py
106

hm, derp. I'll just drop the strings array comment for now. We could always set the default conditionally based on the types of the values later on, if we really want that use case.

phillco added inline comments.Nov 6 2017, 2:26 PM
remotefilelog/basepack.py
61

*_getavailablepackfiles()

phillco updated this revision to Diff 3298.Nov 6 2017, 4:31 PM
phillco marked 10 inline comments as done.Nov 6 2017, 9:09 PM
phillco updated this revision to Diff 3301.