This is an archive of the discontinued Mercurial Phabricator instance.

Differential D9237

transaction: only keep file names in-memory for journal [WIP]
Abandoned · Public

Authored by joerg.sonnenberger on Oct 21 2020, 5:45 PM.


Details

Reviewers

indygreg

Group Reviewers

hg-reviewers

Summary

The offsets are normally only used during rollback and can be read back
from disk in that case. The exception is currently the migration from
inline to non-inline revlog. The current iteration scans the on-disk
journal and computes the last revision based on that.
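
As a rough illustration of "read back from disk", the sketch below parses journal lines in the `name\0offset\n` layout that `_addentry` writes in this diff. The helper name `read_journal_entries` and the sample file names are made up for the example, and an in-memory `io.BytesIO` stands in for the journal file.

```python
import io


def read_journal_entries(journal):
    """Return (file, offset) pairs from a journal opened in binary mode.

    Hypothetical helper, not part of Mercurial.
    """
    journal.seek(0)
    entries = []
    for line in journal:
        # Each line is b"<file name>\0<offset>\n".
        name, offset = line.rstrip(b"\n").split(b"\0")
        entries.append((name, int(offset)))
    return entries


# An in-memory buffer stands in for the on-disk journal file.
journal = io.BytesIO()
journal.write(b"%s\0%d\n" % (b"00changelog.i", 128))
journal.write(b"%s\0%d\n" % (b"data/foo.i", 0))

entries = read_journal_entries(journal)
```

With entries recoverable this way, only the set of file names would need to stay in memory to answer "has this file already been recorded?".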

Diff Detail

Repository

rHG Mercurial

Branch

default

Lint

No Linters Available

Unit

No Unit Test Coverage

Event Timeline

joerg.sonnenberger created this revision. Oct 21 2020, 5:45 PM

Herald added a reviewer: indygreg. Oct 21 2020, 5:45 PM

Herald added a reviewer: hg-reviewers.

Herald added a subscriber: mercurial-patches.

I admit that I don't know this code well at all. Maybe the offsets are there to protect against concurrent operations? I suppose that can only happen with bad locking (such as NFS), but maybe that's still what they're for?

My understanding of the transaction processing is that we keep track of all modified files. For each file, we remember the old size. When we have to roll back the transaction, we iterate over that list and truncate each file to its old size. This part works as before, except that we no longer keep the file name -> offset table in memory and instead read back the on-disk representation. I don't think this makes a performance difference compared to everything else going on. There is one more user of the offsets: the logic in revlog for moving from inline data storage to separate data files. With the current patch, that would be O(n^2) in the worst case, as it reparses the whole file.
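
The rollback described here can be sketched roughly as follows, assuming the `name\0offset\n` journal layout from the diff. The function `rollback` and the file names are hypothetical, and the sketch leaves out what the real `_playback` also handles (backup entries, unlinking files, ambiguity checks).

```python
import os
import tempfile


def rollback(root, journal_path):
    """Truncate every journaled file back to its recorded size.

    Hypothetical simplification; not Mercurial's _playback.
    """
    with open(journal_path, "rb") as journal:
        for line in journal:
            name, offset = line.rstrip(b"\n").split(b"\0")
            path = os.path.join(root, os.fsdecode(name))
            with open(path, "r+b") as fp:
                fp.truncate(int(offset))


root = tempfile.mkdtemp()
data_path = os.path.join(root, "data.i")
journal_path = os.path.join(root, "journal")

# State before the transaction starts.
with open(data_path, "wb") as fp:
    fp.write(b"committed")

# Record the old size before appending anything.
with open(journal_path, "wb") as journal:
    journal.write(b"%s\0%d\n" % (b"data.i", os.path.getsize(data_path)))

# The transaction appends data, then has to be rolled back.
with open(data_path, "ab") as fp:
    fp.write(b" + uncommitted")

rollback(root, journal_path)

with open(data_path, "rb") as fp:
    restored = fp.read()
```

Nothing in the rollback path needs an in-memory offset table: the journal alone is enough to restore the old sizes.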

I'll cut this into smaller parts that are easier to review and decide on which steps are too far.

I was just attempting to read the transactions code as part of D9274.

I was skeptical of the need to maintain the in-memory copy of entries to the transaction journal files. It feels like something we shouldn't need to do, and my knee-jerk reaction is that we should rip out this complexity and read from the file handle (as this patch does) instead. That said, I would like to dig into the history to find out why the in-memory copy exists, as it may have been introduced for a good reason (performance?). Note that we do unlink the journal files a bit early in the transaction commit/rollback code, earlier than I think is reasonable. Our low-level testing around these transaction primitives may also be lacking. There be many dragons in this code...

Please proceed with rewriting this as smaller patches. And don't be scared to rename existing methods along the way to make things clearer. For example, add() should probably be something like record_file_append(). Yes, we can use _ in function names now. And if we are changing the API of a function, I would prefer it be renamed at the same time to improve readability. You are already making an API break, so you might as well improve the name...
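
To make the naming suggestion concrete, here is a toy sketch of what a journal with a `record_file_append()` method might look like when only file names are kept in memory. The class `JournalSketch` and the method `find_offset` are invented for illustration and are not Mercurial's API; an `io.BytesIO` stands in for the journal file.

```python
import io


class JournalSketch:
    """Hypothetical stand-in for the transaction journal; not Mercurial's API."""

    def __init__(self):
        self._names = set()        # only file names stay in memory
        self._file = io.BytesIO()  # stands in for the on-disk journal

    def record_file_append(self, name, offset):
        """Record the pre-append size of a file, once per transaction."""
        if name in self._names:
            return
        self._names.add(name)
        self._file.write(b"%s\0%d\n" % (name, offset))

    def find_offset(self, name):
        """Scan the journal for the offset recorded for name, or None."""
        if name not in self._names:
            return None
        for line in self._file.getvalue().splitlines():
            f, o = line.split(b"\0")
            if f == name:
                return int(o)
        return None


journal = JournalSketch()
journal.record_file_append(b"00manifest.i", 64)
journal.record_file_append(b"00manifest.i", 999)  # ignored: already recorded
```

Each `find_offset` call is a linear scan of the journal, which is the kind of cost behind the O(n^2) concern raised earlier in this discussion.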

joerg.sonnenberger mentioned this in D9278: transaction: split new files into a separate set. Nov 7 2020, 6:41 PM

joerg.sonnenberger mentioned this in D9277: transaction: change list of journal entries into a dictionary. Nov 7 2020, 6:44 PM

Revision Contents
Changeset List

			Path	Packages
M			mercurial/repair.py (20 lines)
M			mercurial/revlog.py (25 lines)
M			mercurial/transaction.py (56 lines)
M			tests/test-mq-qpush-fail.t (2 lines)

Commit	Parents	Author	Summary	Date
cf85ef6eeec6	2bb2cb29e4f4	Joerg Sonnenberger		Oct 21 2020, 5:44 PM

Diff 23276

mercurial/repair.py


	with ui.uninterruptible():			with ui.uninterruptible():
	try:			try:
	with repo.transaction(b"strip") as tr:			with repo.transaction(b"strip") as tr:
	# TODO this code violates the interface abstraction of the			# TODO this code violates the interface abstraction of the
	# transaction and makes assumptions that file storage is			# transaction and makes assumptions that file storage is
	# using append-only files. We'll need some kind of storage			# using append-only files. We'll need some kind of storage
	# API to handle stripping for us.			# API to handle stripping for us.
	offset = len(tr._entries)			before = tr._map.copy()

	tr.startgroup()			tr.startgroup()
	cl.strip(striprev, tr)			cl.strip(striprev, tr)
	stripmanifest(repo, striprev, tr, files)			stripmanifest(repo, striprev, tr, files)

	for fn in files:			for fn in files:
	repo.file(fn).strip(striprev, tr)			repo.file(fn).strip(striprev, tr)
	tr.endgroup()			tr.endgroup()

	for i in pycompat.xrange(offset, len(tr._entries)):			after = tr._map.difference(before)
	file, troffset, ignore = tr._entries[i]			if after:
				tr._file.seek(0)
				for l in tr._file:
				file, troffset = l.split(b'\0')
				if file not in after:
				continue
				troffset = int(troffset)
	with repo.svfs(file, b'a', checkambig=True) as fp:			with repo.svfs(file, b'a', checkambig=True) as fp:
	fp.truncate(troffset)			fp.truncate(troffset)
	if troffset == 0:			if troffset == 0:
	repo.store.markremoved(file)			repo.store.markremoved(file)

	deleteobsmarkers(repo.obsstore, stripobsidx)			deleteobsmarkers(repo.obsstore, stripobsidx)
	del repo.obsstore			del repo.obsstore
	repo.invalidatevolatilesets()			repo.invalidatevolatilesets()
	repo._phasecache.filterunknown(repo)			repo._phasecache.filterunknown(repo)

	if tmpbundlefile:			if tmpbundlefile:
	ui.note(_(b"adding branch\n"))			ui.note(_(b"adding branch\n"))

mercurial/revlog.py

	to use multiple index and data files.			to use multiple index and data files.
	"""			"""
	tiprev = len(self) - 1			tiprev = len(self) - 1
	if (			if (
	not self._inline			not self._inline
	or (self.start(tiprev) + self.length(tiprev)) < _maxinline			or (self.start(tiprev) + self.length(tiprev)) < _maxinline
	):			):
	return			return
				tr.add(self.datafile, 0)

	trinfo = tr.find(self.indexfile)			trinfo = tr.findjournaloffset(self.indexfile)
	if trinfo is None:			if trinfo is None:
	raise error.RevlogError(			raise error.RevlogError(
	_(b"%s not found in the transaction") % self.indexfile			_(b"%s not found in the transaction") % self.indexfile
	)			)
				troffset = trinfo[1]
	trindex = trinfo[2]			trindex = 0
	if trindex is not None:
	dataoff = self.start(trindex)
	else:
	# revlog was stripped at start of transaction, use all leftover data
	trindex = len(self) - 1
	dataoff = self.end(tiprev)

	tr.add(self.datafile, dataoff)

	if fp:			if fp:
	fp.flush()			fp.flush()
	fp.close()			fp.close()
	# We can't use the cached file handle after close(). So prevent			# We can't use the cached file handle after close(). So prevent
	# its usage.			# its usage.
	self._writinghandles = None			self._writinghandles = None

	with self._indexfp(b'r') as ifh, self._datafp(b'w') as dfh:			with self._indexfp(b'r') as ifh, self._datafp(b'w') as dfh:
	for r in self:			for r in self:
	dfh.write(self._getsegmentforrevs(r, r, df=ifh)[1])			dfh.write(self._getsegmentforrevs(r, r, df=ifh)[1])
				if troffset <= self.start(r):
				trindex = r

	with self._indexfp(b'w') as fp:			with self._indexfp(b'w') as fp:
	self.version &= ~FLAG_INLINE_DATA			self.version &= ~FLAG_INLINE_DATA
	self._inline = False			self._inline = False
	io = self._io			io = self._io
	for i in self:			for i in self:
	e = io.packentry(self.index[i], self.node, self.version, i)			e = io.packentry(self.index[i], self.node, self.version, i)
	fp.write(e)			fp.write(e)
	transaction.add(self.datafile, offset)			transaction.add(self.datafile, offset)
	transaction.add(self.indexfile, curr * len(entry))			transaction.add(self.indexfile, curr * len(entry))
	if data[0]:			if data[0]:
	dfh.write(data[0])			dfh.write(data[0])
	dfh.write(data[1])			dfh.write(data[1])
	ifh.write(entry)			ifh.write(entry)
	else:			else:
	offset += curr * self._io.size			offset += curr * self._io.size
	transaction.add(self.indexfile, offset, curr)			transaction.add(self.indexfile, offset)
	ifh.write(entry)			ifh.write(entry)
	ifh.write(data[0])			ifh.write(data[0])
	ifh.write(data[1])			ifh.write(data[1])
	self._enforceinlinesize(transaction, ifh)			self._enforceinlinesize(transaction, ifh)
	nodemaputil.setup_persistent_nodemap(transaction, self)			nodemaputil.setup_persistent_nodemap(transaction, self)

	def addgroup(			def addgroup(
	self,			self,

	r = len(self)			r = len(self)
	end = 0			end = 0
	if r:			if r:
	end = self.end(r - 1)			end = self.end(r - 1)
	ifh = self._indexfp(b"a+")			ifh = self._indexfp(b"a+")
	isize = r * self._io.size			isize = r * self._io.size
	if self._inline:			if self._inline:
	transaction.add(self.indexfile, end + isize, r)			transaction.add(self.indexfile, end + isize)
	dfh = None			dfh = None
	else:			else:
	transaction.add(self.indexfile, isize, r)			transaction.add(self.indexfile, isize)
	transaction.add(self.datafile, end)			transaction.add(self.datafile, end)
	dfh = self._datafp(b"a+")			dfh = self._datafp(b"a+")

	def flush():			def flush():
	if dfh:			if dfh:
	dfh.flush()			dfh.flush()
	ifh.flush()			ifh.flush()

mercurial/transaction.py

	report,			report,
	opener,			opener,
	vfsmap,			vfsmap,
	entries,			entries,
	backupentries,			backupentries,
	unlink=True,			unlink=True,
	checkambigfiles=None,			checkambigfiles=None,
	):			):
	for f, o, _ignore in entries:			for f, o in entries:
	if o or not unlink:			if o or not unlink:
	checkambig = checkambigfiles and (f, b'') in checkambigfiles			checkambig = checkambigfiles and (f, b'') in checkambigfiles
	try:			try:
	fp = opener(f, b'a', checkambig=checkambig)			fp = opener(f, b'a', checkambig=checkambig)
	if fp.tell() < o:			if fp.tell() < o:
	raise error.Abort(			raise error.Abort(
	_(			_(
	b"attempted to truncate %s to %d bytes, but it was "			b"attempted to truncate %s to %d bytes, but it was "
	self._report = report			self._report = report
	# a vfs to the store content			# a vfs to the store content
	self._opener = opener			self._opener = opener
	# a map to access file in various {location -> vfs}			# a map to access file in various {location -> vfs}
	vfsmap = vfsmap.copy()			vfsmap = vfsmap.copy()
	vfsmap[b''] = opener # set default value			vfsmap[b''] = opener # set default value
	self._vfsmap = vfsmap			self._vfsmap = vfsmap
	self._after = after			self._after = after
	self._entries = []			self._map = set()
	self._map = {}
	self._journal = journalname			self._journal = journalname
	self._undoname = undoname			self._undoname = undoname
	self._queue = []			self._queue = []
	# A callback to do something just after releasing transaction.			# A callback to do something just after releasing transaction.
	if releasefn is None:			if releasefn is None:
	releasefn = lambda tr, success: None			releasefn = lambda tr, success: None
	self._releasefn = releasefn			self._releasefn = releasefn

	self._checkambigfiles = set()			self._checkambigfiles = set()
	if checkambigfiles:			if checkambigfiles:
	self._checkambigfiles.update(checkambigfiles)			self._checkambigfiles.update(checkambigfiles)

	self._names = [name]			self._names = [name]

	# A dict dedicated to precisely tracking the changes introduced in the			# A dict dedicated to precisely tracking the changes introduced in the
	# transaction.			# transaction.
	self.changes = {}			self.changes = {}

	# a dict of arguments to be passed to hooks			# a dict of arguments to be passed to hooks
	self.hookargs = {}			self.hookargs = {}
	self._file = opener.open(self._journal, b"w")			self._file = opener.open(self._journal, b"w+")

	# a list of ('location', 'path', 'backuppath', cache) entries.			# a list of ('location', 'path', 'backuppath', cache) entries.
	# - if 'backuppath' is empty, no file existed at backup time			# - if 'backuppath' is empty, no file existed at backup time
	# - if 'path' is empty, this is a temporary transaction file			# - if 'path' is empty, this is a temporary transaction file
	# - if 'location' is not empty, the path is outside main opener reach.			# - if 'location' is not empty, the path is outside main opener reach.
	# use 'location' value as a key in a vfsmap to find the right 'vfs'			# use 'location' value as a key in a vfsmap to find the right 'vfs'
	# (cache is currently unused)			# (cache is currently unused)
	self._backupentries = []			self._backupentries = []

	@active			@active
	def endgroup(self):			def endgroup(self):
	"""apply delayed registration of file entry.			"""apply delayed registration of file entry.

	This is used by strip to delay vision of strip offset. The transaction			This is used by strip to delay vision of strip offset. The transaction
	sees either none or all of the strip actions to be done."""			sees either none or all of the strip actions to be done."""
	q = self._queue.pop()			q = self._queue.pop()
	for f, o, data in q:			for f, o in q:
	self._addentry(f, o, data)			self._addentry(f, o)

	@active			@active
	def add(self, file, offset, data=None):			def add(self, file, offset):
	"""record the state of an append-only file before update"""			"""record the state of an append-only file before update"""
	if file in self._map or file in self._backupmap:			if file in self._map or file in self._backupmap:
	return			return
	if self._queue:			if self._queue:
	self._queue[-1].append((file, offset, data))			self._queue[-1].append((file, offset))
	return			return

	self._addentry(file, offset, data)			self._addentry(file, offset)

	def _addentry(self, file, offset, data):			def _addentry(self, file, offset):
	"""add a append-only entry to memory and on-disk state"""			"""add a append-only entry to memory and on-disk state"""
	if file in self._map or file in self._backupmap:			if file in self._map or file in self._backupmap:
	return			return
	self._entries.append((file, offset, data))			self._map.add(file)
	self._map[file] = len(self._entries) - 1
	# add enough data to the journal to do the truncate			# add enough data to the journal to do the truncate
	self._file.write(b"%s\0%d\n" % (file, offset))			self._file.write(b"%s\0%d\n" % (file, offset))
	self._file.flush()			self._file.flush()

	@active			@active
	def addbackup(self, file, hardlink=True, location=b''):			def addbackup(self, file, hardlink=True, location=b''):
	"""Adds a backup of the file to the transaction			"""Adds a backup of the file to the transaction

	# skip discard() loop since we're sure no open file remains			# skip discard() loop since we're sure no open file remains
	del files[:]			del files[:]
	finally:			finally:
	for f in files:			for f in files:
	f.discard()			f.discard()
	return any			return any

	@active			@active
	def find(self, file):			def findjournaloffset(self, file):
	if file in self._map:			if file not in self._map:
	return self._entries[self._map[file]]
	if file in self._backupmap:
	return self._backupentries[self._backupmap[file]]
	return None			return None
				self._file.seek(0)
				offset = None
				for l in self._file:
				f, o = l.split(b'\0')
				if f == file:
				offset = o
				return offset

	@active			@active
	def replace(self, file, offset, data=None):			def replace(self, file, offset):
	'''			'''
	replace can only replace already committed entries			replace can only replace already committed entries
	that are not pending in the queue			that are not pending in the queue
	'''			'''

	if file not in self._map:			if file not in self._map:
	raise KeyError(file)			raise KeyError(file)
	index = self._map[file]
	self._entries[index] = (file, offset, data)
	self._file.write(b"%s\0%d\n" % (file, offset))			self._file.write(b"%s\0%d\n" % (file, offset))
	self._file.flush()			self._file.flush()

	@active			@active
	def nest(self, name='<unnamed>'):			def nest(self, name='<unnamed>'):
	self._count += 1			self._count += 1
	self._usages += 1			self._usages += 1
	self._names.append(name)			self._names.append(name)
	vfs.unlink(b)			vfs.unlink(b)
	except (IOError, OSError, error.Abort) as inst:			except (IOError, OSError, error.Abort) as inst:
	if not c:			if not c:
	raise			raise
	# Abort may be raise by read only opener			# Abort may be raise by read only opener
	self._report(			self._report(
	b"couldn't remove %s: %s\n" % (vfs.join(b), inst)			b"couldn't remove %s: %s\n" % (vfs.join(b), inst)
	)			)
	self._entries = []			self._map = set()
	self._writeundo()			self._writeundo()
	if self._after:			if self._after:
	self._after()			self._after()
	self._after = None # Help prevent cycles.			self._after = None # Help prevent cycles.
	if self._opener.isfile(self._backupjournal):			if self._opener.isfile(self._backupjournal):
	self._opener.unlink(self._backupjournal)			self._opener.unlink(self._backupjournal)
	if self._opener.isfile(self._journal):			if self._opener.isfile(self._journal):
	self._opener.unlink(self._journal)			self._opener.unlink(self._journal)
	u = vfs.reljoin(base, uname)			u = vfs.reljoin(base, uname)
	util.copyfile(vfs.join(b), vfs.join(u), hardlink=True)			util.copyfile(vfs.join(b), vfs.join(u), hardlink=True)
	undobackupfile.write(b"%s\0%s\0%s\0%d\n" % (l, f, u, c))			undobackupfile.write(b"%s\0%s\0%s\0%d\n" % (l, f, u, c))
	undobackupfile.close()			undobackupfile.close()

	def _abort(self):			def _abort(self):
	self._count = 0			self._count = 0
	self._usages = 0			self._usages = 0
				mapping = []
				if self._map:
				self._file.seek(0)
				for l in self._file:
				f, o = l.split(b'\0')
				mapping.append((f, int(o)))
	self._file.close()			self._file.close()
	self._backupsfile.close()			self._backupsfile.close()

	try:			try:
	if not self._entries and not self._backupentries:			if not self._map and not self._backupentries:
	if self._backupjournal:			if self._backupjournal:
	self._opener.unlink(self._backupjournal)			self._opener.unlink(self._backupjournal)
	if self._journal:			if self._journal:
	self._opener.unlink(self._journal)			self._opener.unlink(self._journal)
	return			return

	self._report(_(b"transaction abort!\n"))			self._report(_(b"transaction abort!\n"))

	try:			try:
	for cat in sorted(self._abortcallback):			for cat in sorted(self._abortcallback):
	self._abortcallback[cat](self)			self._abortcallback[cat](self)
	# Prevent double usage and help clear cycles.			# Prevent double usage and help clear cycles.
	self._abortcallback = None			self._abortcallback = None
	_playback(			_playback(
	self._journal,			self._journal,
	self._report,			self._report,
	self._opener,			self._opener,
	self._vfsmap,			self._vfsmap,
	self._entries,			mapping,
	self._backupentries,			self._backupentries,
	False,			False,
	checkambigfiles=self._checkambigfiles,			checkambigfiles=self._checkambigfiles,
	)			)
	self._report(_(b"rollback completed\n"))			self._report(_(b"rollback completed\n"))
	except BaseException as exc:			except BaseException as exc:
	self._report(_(b"rollback failed - please run hg recover\n"))			self._report(_(b"rollback failed - please run hg recover\n"))
	self._report(			self._report(
	backupentries = []			backupentries = []

	fp = opener.open(file)			fp = opener.open(file)
	lines = fp.readlines()			lines = fp.readlines()
	fp.close()			fp.close()
	for l in lines:			for l in lines:
	try:			try:
	f, o = l.split(b'\0')			f, o = l.split(b'\0')
	entries.append((f, int(o), None))			entries.append((f, int(o)))
	except ValueError:			except ValueError:
	report(			report(
	_(b"couldn't read journal entry %r!\n") % pycompat.bytestr(l)			_(b"couldn't read journal entry %r!\n") % pycompat.bytestr(l)
	)			)

	backupjournal = b"%s.backupfiles" % file			backupjournal = b"%s.backupfiles" % file
	if opener.exists(backupjournal):			if opener.exists(backupjournal):
	fp = opener.open(backupjournal)			fp = opener.open(backupjournal)

tests/test-mq-qpush-fail.t

	> def wrapplayback(orig,			> def wrapplayback(orig,
	> journal, report, opener, vfsmap, entries, backupentries,			> journal, report, opener, vfsmap, entries, backupentries,
	> unlink=True, checkambigfiles=None):			> unlink=True, checkambigfiles=None):
	> orig(journal, report, opener, vfsmap, entries, backupentries, unlink,			> orig(journal, report, opener, vfsmap, entries, backupentries, unlink,
	> checkambigfiles)			> checkambigfiles)
	> # Touching files truncated at "transaction.abort" causes			> # Touching files truncated at "transaction.abort" causes
	> # forcible re-loading invalidated filecache properties			> # forcible re-loading invalidated filecache properties
	> # (including repo.changelog)			> # (including repo.changelog)
	> for f, o, _ignore in entries:			> for f, o in entries:
	> if o or not unlink:			> if o or not unlink:
	> os.utime(opener.join(f), (0.0, 0.0))			> os.utime(opener.join(f), (0.0, 0.0))
	> def extsetup(ui):			> def extsetup(ui):
	> extensions.wrapfunction(transaction, '_playback', wrapplayback)			> extensions.wrapfunction(transaction, '_playback', wrapplayback)
	> EOF			> EOF
	$ hg qpush -a --config extensions.wrapplayback=$TESTTMP/wrapplayback.py && echo 'qpush succeeded?!'			$ hg qpush -a --config extensions.wrapplayback=$TESTTMP/wrapplayback.py && echo 'qpush succeeded?!'
	applying patch1			applying patch1
	applying patch2			applying patch2

Diff	ID	Base	Description	Created	Lint	Unit
Base			Base
Diff 1	23276			Oct 21 2020, 5:44 PM	★	★