This is an archive of the discontinued Mercurial Phabricator instance.

Differential D5267

revlog: automatically read from opened file handles
ClosedPublic

Authored by indygreg on Nov 13 2018, 3:41 PM.

Details

Reviewers

None

Group Reviewers

hg-reviewers

Commits

rHGe9293c5f8bb9: revlog: automatically read from opened file handles

Summary

The revlog reading code commonly opens a new file handle for
reading on demand. There is support for passing a file handle
to revlog.revision(). But it is marked as an internal argument.

When revlogs are written, we write() data as it is available. But
we don't flush() data until all revisions are written.

Putting these two traits together, it is possible for an in-process
revlog reader during active writes to trigger the opening of a new
file handle on a file with unflushed writes. The reader won't have
access to all "available" revlog data (as it hasn't been flushed).
And with the introduction of the previous patch, this can lead to
the revlog raising an error due to a partial read.
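
The hazard can be reproduced outside Mercurial with plain Python file objects. This is only a minimal sketch: the function name `demo_unflushed_read` and the payload are hypothetical, and it assumes the payload is small enough to stay in the writer's userspace buffer.

```python
import os
import tempfile


def demo_unflushed_read():
    """Return what a second handle reads before and after the writer flushes.

    Hypothetical illustration only; this is not Mercurial code.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, 'ab') as writer:
            # Small write: stays in the writer's buffer, not yet in the file.
            writer.write(b'revision-data')

            # A freshly opened handle cannot see the buffered bytes.
            with open(path, 'rb') as reader:
                before = reader.read()

            writer.flush()

            # After flush(), a new handle sees the complete data.
            with open(path, 'rb') as reader:
                after = reader.read()
        return before, after
    finally:
        os.unlink(path)
```

The first read comes back short, which is the kind of partial read that the previous patch turns into an error.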

I witnessed this behavior when applying changegroup data (via
hg pull) before issue6006 was fixed via different means. Having
this and the previous patch in play would have helped cause errors
earlier rather than manifesting as hash verification failures.

While this has been a long-standing issue, I believe the relatively
new delta computation code has tickled it into being more common.
This is because the new delta computation code will compute deltas
in more scenarios. This can lead to revlog reading. While the delta
computation code is probably supposed to reuse file handles, it
appears it isn't doing so in all circumstances.

But the issue runs deeper than that. Theoretically, any code can
access revision data during revlog writes. It appears we were just
getting lucky that it wasn't. (The "add revision callback" passed to
addgroup() provides an avenue to do this.)

If I changed the revlog's behavior to not cache the full revision
text or to clear caches after revision insertion during addgroup(),
I was able to produce crashes 100% of the time when writing changelog
revisions. This is because changelog's add revision callback attempts
to resolve the revision data to access the changed files list. And
without the revision's fulltext being cached, we performed a revlog
read, which required opening a new file handle. This attempted to read
unflushed data, leading to a partial read and a crash.

This commit teaches the revlog to store the file handles used for
writing multiple revisions during addgroup(). It also teaches the
code for resolving a file handle when reading to use these handles,
if available. This ensures that *any* reads (regardless of their
source) use the active writing file handles, if available. These
file handles have access to the unflushed data because they wrote it.
This allows reads to complete without issue.
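
The approach can be sketched with a toy append-only log. Everything here (`TinyLog`, `readfp`, `writing`, `append`) is a hypothetical stand-in, not the revlog API; it only mirrors the idea of remembering the active write handle and preferring it for reads.

```python
import contextlib
import os


class TinyLog(object):
    """Hypothetical append-only log; not Mercurial's revlog class."""

    def __init__(self, path):
        self.path = path
        # Handle used for active writing, or None when not writing.
        self._writinghandle = None

    @contextlib.contextmanager
    def readfp(self):
        """Yield a handle suitable for reading."""
        if self._writinghandle is not None:
            # Reuse the writer's handle: it can see its own unflushed data.
            yield self._writinghandle
        else:
            # Otherwise open (and later close) a new handle.
            with open(self.path, 'rb') as fp:
                yield fp

    def read(self, offset, length):
        with self.readfp() as fp:
            fp.seek(offset)
            return fp.read(length)

    @contextlib.contextmanager
    def writing(self):
        """Keep one read/append handle open for a batch of appends."""
        if self._writinghandle is not None:
            raise RuntimeError('cannot nest writing() calls')
        fp = open(self.path, 'a+b')
        self._writinghandle = fp
        try:
            yield self
        finally:
            # The handle is unusable after close(), so stop advertising it.
            self._writinghandle = None
            fp.close()

    def append(self, data):
        fp = self._writinghandle
        # Reads may have moved the position; return to the end first.
        fp.seek(0, os.SEEK_END)
        offset = fp.tell()
        fp.write(data)
        return offset
```

Inside `with log.writing():`, a call such as `log.read(offset, length)` made right after `log.append(...)` returns the new bytes even though nothing has been explicitly flushed, because the read goes through the handle that wrote them.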

Diff Detail

Repository

rHG Mercurial

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

indygreg created this revision. Nov 13 2018, 3:41 PM

Herald added a reviewer: hg-reviewers. Nov 13 2018, 3:41 PM

Herald added a subscriber: mercurial-devel.

This looked a bit scary, but it should work as long as both the reader and
writer sides call seek().

Queued, thanks.
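
The seek() requirement mentioned above can be illustrated with a shared handle. This is a sketch only: `io.BytesIO` stands in for a file opened for both reading and writing in a non-append mode, and `append_record` and `run` are hypothetical names.

```python
import io
import os


def append_record(fp, data, seek_first):
    """Write data, optionally seeking to the end of the handle first."""
    if seek_first:
        fp.seek(0, os.SEEK_END)
    fp.write(data)


def run(seek_first):
    # Hypothetical shared handle holding two existing 4-byte records.
    fp = io.BytesIO(b'AAAABBBB')

    # A reader seeks and reads, leaving the position in the middle.
    fp.seek(0)
    fp.read(4)

    # The writer then appends a new record through the same handle.
    append_record(fp, b'CCCC', seek_first)
    return fp.getvalue()
```

With `seek_first=False` the new record overwrites `BBBB`; with `seek_first=True` it lands at the end, which is why the writer side has to reposition before every write once reads share the handle.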

Closed by commit rHGe9293c5f8bb9: revlog: automatically read from opened file handles (authored by indygreg). Nov 14 2018, 7:29 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents
Changeset List

			Path	Packages
M			mercurial/revlog.py (32 lines)

Status	Author	Revision
Closed	indygreg	D5267 revlog: automatically read from opened file handles
Closed	indygreg	D5266 revlog: detect incomplete revlog reads
Closed	indygreg	D5265 revlog: use single file handle when de-inlining revlog

Diff 12534

mercurial/revlog.py

         self._sparserevlog = False
         self._srdensitythreshold = 0.50
         self._srmingapsize = 262144

         # Make copy of flag processors so each revlog instance can support
         # custom flags.
         self._flagprocessors = dict(_flagprocessors)

+        # 2-tuple of file handles being used for active writing.
+        self._writinghandles = None
+
         mmapindexthreshold = None
         v = REVLOG_DEFAULT_VERSION
         opts = getattr(opener, 'options', None)
         if opts is not None:
             if 'revlogv2' in opts:
                 # version 2 revlogs always use generaldelta.
                 v = REVLOGV2 | FLAG_GENERALDELTA | FLAG_INLINE_DATA
             elif 'revlogv1' in opts:

[...]

     def _datafp(self, mode='r'):
         """file object for the revlog's data file"""
         return self.opener(self.datafile, mode=mode)

     @contextlib.contextmanager
     def _datareadfp(self, existingfp=None):
         """file object suitable to read data"""
+        # Use explicit file handle, if given.
         if existingfp is not None:
             yield existingfp
+
+        # Use a file handle being actively used for writes, if available.
+        # There is some danger to doing this because reads will seek the
+        # file. However, _writeentry() performs a SEEK_END before all writes,
+        # so we should be safe.
+        elif self._writinghandles:
+            if self._inline:
+                yield self._writinghandles[0]
+            else:
+                yield self._writinghandles[1]
+
+        # Otherwise open a new file handle.
         else:
             if self._inline:
                 func = self._indexfp
             else:
                 func = self._datafp
             with func() as fp:
                 yield fp

[...]

         trindex = len(self) - 1
         dataoff = self.end(tiprev)

         tr.add(self.datafile, dataoff)

         if fp:
             fp.flush()
             fp.close()
+            # We can't use the cached file handle after close(). So prevent
+            # its usage.
+            self._writinghandles = None

         with self._indexfp('r') as ifh, self._datafp('w') as dfh:
             for r in self:
                 dfh.write(self._getsegmentforrevs(r, r, df=ifh)[1])

         with self._indexfp('w') as fp:
             self.version &= ~FLAG_INLINE_DATA
             self._inline = False

[...]

         # platforms. Windows requires that a file positioning call be made
         # when the file handle transitions between reads and writes. See
         # 3686fa2b8eee and the mixedfilemodewrapper in windows.py. On other
         # platforms, Python or the platform itself can be buggy. Some versions
         # of Solaris have been observed to not append at the end of the file
         # if the file was seeked to before the end. See issue4943 for more.
         #
         # We work around this issue by inserting a seek() before writing.
-        # Note: This is likely not necessary on Python 3.
+        # Note: This is likely not necessary on Python 3. However, because
+        # the file handle is reused for reads and may be seeked there, we need
+        # to be careful before changing this.
         ifh.seek(0, os.SEEK_END)
         if dfh:
             dfh.seek(0, os.SEEK_END)

         curr = len(self) - 1
         if not self._inline:
             transaction.add(self.datafile, offset)
             transaction.add(self.indexfile, curr * len(entry))

[...]

         given a set of deltas, add them to the revision log. the
         first delta is against its parent, which should be in our
         log, the rest are against the previous delta.

         If ``addrevisioncb`` is defined, it will be called with arguments of
         this revlog and the node that was added.
         """

+        if self._writinghandles:
+            raise error.ProgrammingError('cannot nest addgroup() calls')
+
         nodes = []

         r = len(self)
         end = 0
         if r:
             end = self.end(r - 1)
         ifh = self._indexfp("a+")
         isize = r * self._io.size
         if self._inline:
             transaction.add(self.indexfile, end + isize, r)
             dfh = None
         else:
             transaction.add(self.indexfile, isize, r)
             transaction.add(self.datafile, end)
             dfh = self._datafp("a+")
         def flush():
             if dfh:
                 dfh.flush()
             ifh.flush()

+        self._writinghandles = (ifh, dfh)
+
         try:
             deltacomputer = deltautil.deltacomputer(self)
             # loop through our set of deltas
             for data in deltas:
                 node, p1, p2, linknode, deltabase, delta, flags = data
                 link = linkmapper(linknode)
                 flags = flags or REVIDX_DEFAULT_FLAGS

[...]

                     addrevisioncb(self, node)

                 if not dfh and not self._inline:
                     # addrevision switched from inline to conventional
                     # reopen the index
                     ifh.close()
                     dfh = self._datafp("a+")
                     ifh = self._indexfp("a+")
+                    self._writinghandles = (ifh, dfh)
         finally:
+            self._writinghandles = None
+
             if dfh:
                 dfh.close()
             ifh.close()

         return nodes

     def iscensored(self, rev):
         """Check if a file revision is censored."""

Diff	ID	Description	Created	Lint	Unit
Base		Base
Diff 1	12529		Nov 13 2018, 3:41 PM	★	★
Diff 2	12534	rHGe9293c5f8bb9726958c581d967ca7f72f8ca70a4	Nov 13 2018, 3:32 PM	★	★