This is an archive of the discontinued Mercurial Phabricator instance.

Differential D7199

lock: fix race in lock-breaking code
ClosedPublic

Authored by valentin.gatienbaron on Nov 1 2019, 9:51 PM.

Download Raw Diff

Details

Reviewers

indygreg

Group Reviewers

hg-reviewers

Commits

rHG039fbd14d4e2: lock: fix race in lock-breaking code
rHG355a6ef12ba9: lock: fix race in lock-breaking code

Summary

With low frequency, I see hg pulls fail with output like:
abort: no such file or directory: .hg/store/lock

I think what happens is, in lock.py, in:

def _testlock(self, locker):
    if not self._lockshouldbebroken(locker):
        return locker

    # if locker dead, break lock.  must do this with another lock
    # held, or can race and break valid lock.
    try:
        with lock(self.vfs, self.f + b'.break', timeout=0):
            self.vfs.unlink(self.f)
    except error.LockError:
        return locker

if a lock is breakable on disk, and two hg processes concurrently get
to the "if locker dead" comment, a possible interleaving is: process1
finishes executing the function and then process2 finishes executing
the function. If that happens, process2 will either get ENOENT in
self.vfs.unlink (resulting in the spurious failure above), or break a
valid lock and potentially cause repository corruption.

The fix is simple enough: make sure the lock is breakable _inside_ the
critical section, because only then can we know that no other process
can invalidate our knowledge on the lock on disk.

I don't think there are tests for this. I've tested this manually
with:

diff --git a/mercurial/lock.py b/mercurial/lock.py

a/mercurial/lock.py

+++ b/mercurial/lock.py
@@ -351,6 +351,8 @@ class lock(object):

if not self._lockshouldbebroken(locker):
    return locker

+ import random
+ time.sleep(1. + random.random())

if locker dead, break lock. must do this with another lock
held, or can race and break valid lock. try:

@@ -358,6 +360,7 @@ class lock(object):

        self.vfs.unlink(self.f)
except error.LockError:
    return locker

+ time.sleep(1)

def testlock(self):
    """return id of locker if lock is valid, else None.

and I see this change of behavior before/after this commit:

$ $hg init repo
$ cd repo
$ ln -s $HOSTNAME/effffffc:987654321 .hg/wlock
$ touch a
$ $hg commit -Am_ & $hg commit -Am _; wait
-abort: No such file or directory: '/tmp/repo/.hg/wlock'
adding a
+warning: ignoring unknown working parent 679a8959a8ca!
+nothing changed

Diff Detail

Repository

rHG Mercurial

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

valentin.gatienbaron created this revision.Nov 1 2019, 9:51 PM

Herald added a reviewer: hg-reviewers. · View Herald TranscriptNov 1 2019, 9:51 PM

Herald added a subscriber: mercurial-devel. · View Herald Transcript

valentin.gatienbaron edited the summary of this revision. (Show Details)Nov 1 2019, 9:56 PM

valentin.gatienbaron edited the summary of this revision. (Show Details)Nov 1 2019, 10:02 PM

valentin.gatienbaron edited the summary of this revision. (Show Details)Nov 1 2019, 10:04 PM

Thank you for tracking down this issue and for the detailed commit message!

I agree there is a race between 2 processes creating the lock and that we need to verify the created lock once we have it to ensure we "won" the file write race.

FWIW there are known issues with Mercurial's repository locking semantics. I believe we will need to break Mercurial's on-disk file layout to fix all the problems. But (hard-to-debug) issues like the one fixed in this patch do exist and can be fixed without breaking backwards compatibility. If this subject interests you, I suggest reading https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf which chronicles how many popular pieces of software fail to get filesystem consistency correct due to incorrect use of locks, fsync(), etc.

This revision is now accepted and ready to land.Nov 18 2019, 11:01 PM

valentin.gatienbaron added a commit: rHG355a6ef12ba9: lock: fix race in lock-breaking code.Nov 18 2019, 11:05 PM

Closed by commit rHG355a6ef12ba9: lock: fix race in lock-breaking code (authored by valentin.gatienbaron). · Explain Why

This revision was automatically updated to reflect the committed changes.

valentin.gatienbaron added a commit: rHG039fbd14d4e2: lock: fix race in lock-breaking code.Nov 18 2019, 11:11 PM

Thanks for the review. Did something go wrong in a rebase? I'm confused as to why there's two self.vfs.unlink calls in the diff displayed by phabricator.

I have read the paper you mention, and I have seen repositories get corrupted for these kinds of reasons (revlog index's length may get written without the corresponding data if the machine reboots, transaction commits but fncache gets truncated when one runs out of disk space). But I don't think it talks of locking problems in the absence of crashes, as is happening here. The ordering of filesystem operations in-memory is stricter.

Revision Contents
Changeset List

			Path	Packages
M			mercurial/lock.py (5 lines)

	Status	Author	Revision
	Closed	valentin.gatienbaron	D7199 lock: fix race in lock-breaking code
	Closed	valentin.gatienbaron	D7198 lock: refactor in preparation for next commit

Diff 18232

mercurial/lock.py

	def _testlock(self, locker):			def _testlock(self, locker):
	if not self._lockshouldbebroken(locker):			if not self._lockshouldbebroken(locker):
	return locker			return locker

	# if locker dead, break lock. must do this with another lock			# if locker dead, break lock. must do this with another lock
	# held, or can race and break valid lock.			# held, or can race and break valid lock.
	try:			try:
	with lock(self.vfs, self.f + b'.break', timeout=0):			with lock(self.vfs, self.f + b'.break', timeout=0):
				locker = self._readlock()
				if not self._lockshouldbebroken(locker):
				return locker
				self.vfs.unlink(self.f)

	self.vfs.unlink(self.f)			self.vfs.unlink(self.f)
	except error.LockError:			except error.LockError:
	return locker			return locker

	def testlock(self):			def testlock(self):
	"""return id of locker if lock is valid, else None.			"""return id of locker if lock is valid, else None.

	If old-style lock, we cannot tell what machine locker is on.			If old-style lock, we cannot tell what machine locker is on.

Diff	ID	Description	Created	Lint	Unit
Base		Base
Diff 1	17496		Nov 1 2019, 9:51 PM	★	★
Diff 2	18232	rHG355a6ef12ba9e9d6e101e8704edbfff4167f9dba	Nov 18 2019, 11:04 PM	★	★