This is an archive of the discontinued Mercurial Phabricator instance.

Differential D6417

context: get filesadded() and filesremoved() from changeset if configured
ClosedPublic

Authored by martinvonz on May 21 2019, 8:32 PM.


Details

Reviewers

None

Group Reviewers

hg-reviewers

Commits

rHG602469a91550: context: get filesadded() and filesremoved() from changeset if configured

Summary

This adds the read side for getting the sets of added and removed
files from the changeset extras. I timed this command on the hg repo:

hg log -T '{rev}\n {files}\n %:{file_mods}\n +{file_adds}\n -{file_dels}\n'

It took 1m21s before and 6.4s after. I also used that command to check
that the result didn't change compared to calculating the values from
the manifests on the fly (it didn't change).

In the mozilla-unified repo, the same command run on
FIREFOX_BETA_58_END::FIREFOX_BETA_59_END went from 29s to 0.67s.
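
For readers who want to see what the read side is decoding: the extras hold NUL-separated indices into the changeset's file list (see encodefileindices() and decodefileindices() in the diff further down). The following is only a minimal standalone sketch of that round trip, assuming Python 3 text strings rather than the byte strings Mercurial itself works with; the snake_case function names and the sample file names are made up for illustration.

```python
def encode_file_indices(files, subset):
    """Encode `subset` as NUL-separated indices into the `files` list."""
    subset = set(subset)
    return '\0'.join(str(i) for i, f in enumerate(files) if f in subset)


def decode_file_indices(files, data):
    """Inverse of encode_file_indices(); returns None for malformed data."""
    try:
        subset = []
        for strindex in data.split('\0'):
            i = int(strindex)
            if i < 0 or i >= len(files):
                return None
            subset.append(files[i])
        return subset
    except ValueError:
        # The value was not a list of integers, e.g. someone else used
        # the same extras key with a different syntax.
        return None


files = ['a.txt', 'b.txt', 'c.txt']           # hypothetical file list
encoded = encode_file_indices(files, ['c.txt', 'a.txt'])
decoded = decode_file_indices(files, encoded)
malformed = decode_file_indices(files, 'not-an-index')
```

Here `decoded` comes back in file-list order (`['a.txt', 'c.txt']`) regardless of the order of the input subset, and `malformed` is `None`.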

Diff Detail

Repository

rHG Mercurial

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

martinvonz created this revision. May 21 2019, 8:32 PM

Herald added a reviewer: hg-reviewers. May 21 2019, 8:32 PM

Herald added a subscriber: mercurial-devel.

martinvonz added a child revision: D6418: copies: split up _chain() in naive chaining and filtering steps. May 21 2019, 8:32 PM

martinvonz updated this revision to Diff 15232. May 22 2019, 1:02 PM

pulkit added a child revision: D6419: copies: do full filtering at end of _changesetforwardcopies(). May 25 2019, 6:13 PM

I can't really comment on the storage format. I'm not keen on using extras
for this kind of stuff (including copies), but that seems to be okay for
an experiment.

@indygreg Any comments?

+def decodefileindices(files, data):
+    try:
+        subset = []
+        for str in data.split('\0'):
+            i = int(str)

Better to not shadow str() function.

+            if i < 0 or i > len(files):

Off by one?

+                return None
+            subset.append(files[i])
+        return subset
+    except (ValueError, IndexError):
+        # Perhaps someone had chosen the same key name (e.g. "added") and
+        # used different syntax for the value.

In D6417#93707, @yuja wrote:

I can't really comment on the storage format. I'm not keen on using extras
for this kind of stuff (including copies), but that seems to be okay for
an experiment.

Do we have a better place for it? What's your concern with using extras? Is it that we're storing information that could instead be calculated? I agree, but the same is true about the list of files, of course (and linkrevs, although they're not stored in the changeset). Or that a user could set the values? I agree about that too, but I don't know what to do about that.

We could create a cache for this information, but we can't really create a cache for the copy information for Google's use case (serving copy information together with changesets). At least it wouldn't be a cache in the usual sense. It could be a separate storage still, of course. @marmoute has been working on that a bit.

We'd need that storage to be exchanged before we could use it. We would also need it to be considered the source of truth for copy information (which probably means that it should live in .hg/store/ rather than .hg/cache/). I don't know exactly how that aligns with @marmoute's plans. I also haven't thought about how a migration would work for us if we eventually decided to switch over from storage in extras to a separate storage, but that will probably not be a huge problem.

@indygreg Any comments?

+def decodefileindices(files, data):
+    try:
+        subset = []
+        for str in data.split('\0'):
+            i = int(str)

Better to not shadow str() function.

Good point. Done.

+            if i < 0 or i > len(files):

Off by one?

Oops, that's embarrassing. Done.

+                return None
+            subset.append(files[i])
+        return subset
+    except (ValueError, IndexError):
+        # Perhaps someone had chosen the same key name (e.g. "added") and
+        # used different syntax for the value.
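
To make the off-by-one concrete, here is a small sketch comparing the bounds check as first posted with the corrected one. The helper names and the sample list are hypothetical and exist only for this illustration.

```python
def accepted_as_posted(i, files):
    # Check from the first version of the patch: lets i == len(files) through.
    return not (i < 0 or i > len(files))


def accepted_as_fixed(i, files):
    # Check after the review comment: rejects i == len(files).
    return not (i < 0 or i >= len(files))


files = ['a', 'b', 'c']        # hypothetical file list
boundary = len(files)          # one past the last valid index

posted_accepts = accepted_as_posted(boundary, files)
fixed_accepts = accepted_as_fixed(boundary, files)
```

With the check as posted, the boundary index passes and the later `files[i]` lookup raises IndexError. The quoted function catches that exception, so the visible result would still have been None, but only the corrected check rejects the index explicitly.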

martinvonz updated this revision to Diff 15275. May 28 2019, 1:04 PM

> I can't really comment on the storage format. I'm not keen on using extras
> for this kind of stuff (including copies), but that seems to be okay for
> an experiment.
Do we have a better place for it?

I don't think so.

What's your concern with using extras?
Is it that we're storing information that could instead be calculated?

No, I don't care much about that.

Or that a user could set the values?

Somewhat yes.

I just have a feeling that these copies/added/removed data are first class:
the repo can be somewhat corrupted if they are wrong, and I don't think
that kind of data is meant to be stored in the extras.

Ideally, we can add some repo requirement and bump the revlog format to
store these data properly, but that's a big change. So I said storing in
extras seems okay for the time being.

I'm also not super crazy about abusing extras for this. But it is the best compromise considering the "better" solutions require a lot more effort and thought. At some point, I would like Mercurial's storage and wire protocol to grow official APIs for storing and exchanging arbitrary data outside the current storage primitives. Maybe we can shoehorn storage into revlogs in .hg/store/meta. I dunno. Feels like sprint material to me.

Closed by commit rHG602469a91550: context: get filesadded() and filesremoved() from changeset if configured (authored by martinvonz). Jun 1 2019, 7:36 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents
Changeset List

			Path	Packages
M			mercurial/changelog.py (26 lines)
M			mercurial/context.py (12 lines)

Diff	ID	Description	Created	Lint	Unit
Base		Base
Diff 1	15215		May 21 2019, 8:32 PM	★	★
Diff 2	15232		May 22 2019, 1:02 PM	★	★
Diff 3	15275		May 28 2019, 1:04 PM	★	★
Diff 4	15324	rHG602469a915503e3996f21828ee34cb253922776f	May 15 2019, 1:20 AM	★	★

Status	Author	Revision
Closed	martinvonz	D6422 copies: avoid calling matcher if matcher.always()
Closed	martinvonz	D6421 copies: avoid unnecessary copying of copy dict
Closed	martinvonz	D6420 copies: don't filter out copy targets created on other side of merge commit
Closed	martinvonz	D6419 copies: do full filtering at end of _changesetforwardcopies()
Closed	martinvonz	D6418 copies: split up _chain() in naive chaining and filtering steps
Closed	martinvonz	D6417 context: get filesadded() and filesremoved() from changeset if configured
Closed	martinvonz	D6416 changelog: optionally store added and removed files in changeset extras
Closed	martinvonz	D6369 templatekw: make {file_*} compare to both merge parents (issue4292)
Closed	martinvonz	D6370 templatekw: move showfileadds() close to showfile{mods,dels}()
Closed	martinvonz	D6368 tests: add test for {file_mods}, {file_adds}, {file_dels} on merge commit
Closed	martinvonz	D6367 context: add ctx.files{modified,added,removed}() methods

Diff 15324

mercurial/changelog.py

 def encodefileindices(files, subset):
     subset = set(subset)
     indices = []
     for i, f in enumerate(files):
         if f in subset:
             indices.append('%d' % i)
     return '\0'.join(indices)

+def decodefileindices(files, data):
+    try:
+        subset = []
+        for strindex in data.split('\0'):
+            i = int(strindex)
+            if i < 0 or i >= len(files):
+                return None
+            subset.append(files[i])
+        return subset
+    except (ValueError, IndexError):
+        # Perhaps someone had chosen the same key name (e.g. "added") and
+        # used different syntax for the value.
+        return None
+
 def stripdesc(desc):
     """strip trailing whitespace and leading and trailing empty lines"""
     return '\n'.join([l.rstrip() for l in desc.splitlines()]).strip('\n')

 class appender(object):
     '''the changelog index must be updated last on disk, so we use this class
     to delay writes to it'''
     def __init__(self, vfs, name, mode, buf):

 class _changelogrevision(object):
     # Extensions might modify _defaultextra, so let the constructor below pass
     # it in
     extra = attr.ib()
     manifest = attr.ib(default=nullid)
     user = attr.ib(default='')
     date = attr.ib(default=(0, 0))
     files = attr.ib(default=attr.Factory(list))
+    filesadded = attr.ib(default=None)
+    filesremoved = attr.ib(default=None)
     p1copies = attr.ib(default=None)
     p2copies = attr.ib(default=None)
     description = attr.ib(default='')

 class changelogrevision(object):
     """Holds results of a parsed changelog revision.

     Changelog revisions consist of multiple pieces of data, including

     def files(self):
         off = self._offsets
         if off[2] == off[3]:
             return []

         return self._text[off[2] + 1:off[3]].split('\n')

     @property
+    def filesadded(self):
+        rawindices = self.extra.get('filesadded')
+        return rawindices and decodefileindices(self.files, rawindices)
+
+    @property
+    def filesremoved(self):
+        rawindices = self.extra.get('filesremoved')
+        return rawindices and decodefileindices(self.files, rawindices)
+
+    @property
     def p1copies(self):
         rawcopies = self.extra.get('p1copies')
         return rawcopies and decodecopies(rawcopies)

     @property
     def p2copies(self):
         rawcopies = self.extra.get('p2copies')
         return rawcopies and decodecopies(rawcopies)

mercurial/context.py

     def files(self):
         return self._changeset.files
     def filesmodified(self):
         modified = set(self.files())
         modified.difference_update(self.filesadded())
         modified.difference_update(self.filesremoved())
         return sorted(modified)
     def filesadded(self):
+        source = self._repo.ui.config('experimental', 'copies.read-from')
+        if (source == 'changeset-only' or
+            (source == 'compatibility' and
+             self._changeset.filesadded is not None)):
+            return self._changeset.filesadded or []
+
         added = []
         for f in self.files():
             if not any(f in p for p in self.parents()):
                 added.append(f)
         return added
     def filesremoved(self):
+        source = self._repo.ui.config('experimental', 'copies.read-from')
+        if (source == 'changeset-only' or
+            (source == 'compatibility' and
+             self._changeset.filesremoved is not None)):
+            return self._changeset.filesremoved or []
+
         removed = []
         for f in self.files():
             if f not in self:
                 removed.append(f)
         return removed

     @propertycache
     def _copies(self):
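
The branch added to filesadded() and filesremoved() above can be read as a small decision table on the experimental copies.read-from setting. The sketch below restates it as a standalone Python 3 function so the cases are easy to try out. The names pick_files, changeset_value, and compute_from_manifests are hypothetical, and 'filelog-only' is used only as an example of a value that matches neither branch (it is assumed here, not shown in this diff).

```python
def pick_files(source, changeset_value, compute_from_manifests):
    """Mirror the condition added to filesadded()/filesremoved().

    source: value of experimental.copies.read-from
    changeset_value: what the changeset property returned; a list, or a
        falsy value such as None when the extras key is absent
    compute_from_manifests: hypothetical callback standing in for the
        pre-existing manifest-based computation
    """
    if source == 'changeset-only' or (
            source == 'compatibility' and changeset_value is not None):
        return changeset_value or []
    return compute_from_manifests()


def slow_path():
    # Stand-in for walking the manifests on the fly.
    return ['from-manifests']


only_missing = pick_files('changeset-only', None, slow_path)
compat_missing = pick_files('compatibility', None, slow_path)
compat_present = pick_files('compatibility', ['a.txt'], slow_path)
compat_empty = pick_files('compatibility', '', slow_path)
other_source = pick_files('filelog-only', ['a.txt'], slow_path)
```

In this sketch, 'changeset-only' never falls back (a missing value becomes an empty list), while 'compatibility' falls back to the manifest computation only when the changeset value is None. Because the properties return `rawindices and decodefileindices(...)`, an empty-string extra, if one is ever stored, would come through as '' rather than None and so would yield an empty list instead of triggering the fallback.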