( )⚙ D6417 context: get filesadded() and filesremoved() from changeset if configured

This is an archive of the discontinued Mercurial Phabricator instance.

context: get filesadded() and filesremoved() from changeset if configured
ClosedPublic

Authored by martinvonz on May 21 2019, 8:32 PM.

Details

Summary

This adds the read side for getting the sets of added and removed
files from the changeset extras. I timed this command on the hg repo:

hg log -T '{rev}\n {files}\n %:{file_mods}\n +{file_adds}\n -{file_dels}\n'

It took 1m21s before and 6.4s after. I also used that command to check
that the result didn't change compared to calculating the values from
the manifests on the fly (it didn't change).

In the mozilla-unified repo, the same command run on
FIREFOX_BETA_58_END::FIREFOX_BETA_59_END went from 29s to 0.67s.

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

martinvonz created this revision.May 21 2019, 8:32 PM
martinvonz updated this revision to Diff 15232.May 22 2019, 1:02 PM

I can't really comment on the storage format. I'm not keen on using extras
for this kind of stuff (including copies), but that seems be okay for
experiment.

@indygreg Any comments?

+def decodefileindices(files, data):
+ try:
+ subset = []
+ for str in data.split('\0'):
+ i = int(str)

Better to not shadow str() function.

+ if i < 0 or i > len(files):

Off by one?

+ return None
+ subset.append(files[i])
+ return subset
+ except (ValueError, IndexError):
+ # Perhaps someone had chosen the same key name (e.g. "added") and
+ # used different syntax for the value.

In D6417#93707, @yuja wrote:

I can't really comment on the storage format. I'm not keen on using extras
for this kind of stuff (including copies), but that seems be okay for
experiment.

Do we have a better place for it? What's your concern with using extras? Is it that we're storing information that could instead be calculated? I agree, but the same is true about the list of files, of course (and linkrevs, although they're not stored in the changeset). Or that a user could set the values? I agree about that too, but I don't know what to do about that. We could create a cache for this information, but we can't really create a cache for the copy information for Google's use case (serving copy information together with changesets). At least it wouldn't be a cache in the usual sense. It could be a separate storage still, of course. @marmoute has been working on that a bit. We'd need that storage to be exchanged before we could use it. We would also need it to be considered the source of truth for copy information (which probably means that it should live in .hg/store/ rather than .hg/cache/). I don't know exactly how that aligns with @marmoute's plans. I also haven't thought about how a migration would work for us if we eventually decided to switch over from storage in extras to a separate storage, but that will probably not be a huge problem.

@indygreg Any comments?

+def decodefileindices(files, data):
+ try:
+ subset = []
+ for str in data.split('\0'):
+ i = int(str)

Better to not shadow str() function.

Good point. Done.

+ if i < 0 or i > len(files):

Off by one?

Oops, that's embarrassing. Done.

+ return None
+ subset.append(files[i])
+ return subset
+ except (ValueError, IndexError):
+ # Perhaps someone had chosen the same key name (e.g. "added") and
+ # used different syntax for the value.

martinvonz updated this revision to Diff 15275.May 28 2019, 1:04 PM
yuja added a comment.May 28 2019, 7:44 PM
> I can't really comment on the storage format. I'm not keen on using extras
>  for this kind of stuff (including copies), but that seems be okay for
>  experiment.
Do we have a better place for it?

I don't think so.

What's your concern with using extras?
Is it that we're storing information that could instead be calculated?

No, I don't care much about that.

Or that a user could set the values?

Somewhat yes.

I just have a feeling that these copies/added/removed data are first class,
the repo can be somewhat corrupted if these data are wrong, which I don't think
are data meant to be stored in the extras.

Ideally, we can add some repo requirement and bump the revlog format to
store these data properly, but that's a big change. So I said storing in
extras seems okay for the time being.

I'm also not super crazy about abusing extras for this. But it is the best compromise considering the "better" solutions require a lot more effort and thought. At some point, I would like Mercurial's storage and wire protocol to grow official APIs for storing and exchanging arbitrary data outside the current storage primitives. Maybe we can shoehorn storage into revlogs in .hg/store/meta. I dunno. Feels like sprint material to me.

This revision was automatically updated to reflect the committed changes.