By grouping (path, ctx) pairs according to the inputs they would provide to
fixer tools, we can deduplicate tool executions and significantly reduce the
time spent running slow tools.
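As a rough sketch of the idea (hypothetical helper names; the real code in
the fix extension differs), the deduplication key is the path plus the file's
content identity, so identical inputs collapse into one group and the slow
external tools run only once per group:

    import collections

    def group_fixable_files(items):
        """Group (path, ctx) pairs by the inputs they would feed to fixers.

        Sketch only: keyed on path plus the file's filenode, so identical
        inputs land in the same group.  Filenodes only exist for committed
        files, which ties into the working-copy limitation noted below.
        """
        groups = collections.defaultdict(list)
        for path, ctx in items:
            # filenode() identifies the content without reading it into memory.
            key = (path, ctx[path].filenode())
            groups[key].append((path, ctx))
        return groups

    def fix_groups(groups, runfixers):
        """Run the fixer tools once per group and share the output."""
        results = {}
        for (path, _node), members in groups.items():
            fixeddata = runfixers(path, members[0][1])  # slow tools, run once
            for member in members:
                results[member] = fixeddata
        return results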
This change does not handle clean files in the working copy, which could still
be deduplicated against the files in the checked out commit. It's a little
harder to do that because the filerev is not available in the workingfilectx
(and it doesn't exist for added files).
Anecdotally, this change makes some real use cases at Google 10x faster. I
think we were originally hesitant to do this because the benefits weren't
obvious, and implementing it efficiently is kind of tricky. If we simply
memoized the formatter execution function, we would be keeping tons of file
content in memory.
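For contrast, a minimal sketch of that naive memoization (piping through a
hypothetical external formatter; not what this change does) shows where the
memory goes: every distinct input becomes a cache key and stays resident
until the command exits.

    import functools
    import subprocess

    @functools.lru_cache(maxsize=None)
    def run_fixers_memoized(path, content):
        # The cache key includes the full file content, so fixing many large
        # files keeps all of their bytes in memory at once.  ('path' only
        # distinguishes cache entries in this toy example.)
        return subprocess.run(['clang-format'], input=content,
                              capture_output=True, check=True).stdout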
Also included is a regression test for a corner case that I broke with my first
attempt at optimizing this code.
Maybe update the docstring to reflect the new grouping?