( )⚙ D2720 debugcommands: introduce actions to perform deterministic reads

This is an archive of the discontinued Mercurial Phabricator instance.

debugcommands: introduce actions to perform deterministic reads
ClosedPublic

Authored by indygreg on Mar 7 2018, 8:36 PM.

Details

Summary

"readavailable" is useful as a debugging device to see what data is
available on a pipe. But the mechanism isn't deterministic because
what's available on a pipe is highly conditional on timing, system
load, OS behavior, etc. This makes it not suitable for tests.

We introduce "ereadline," "read," and "eread" for performing
deterministic I/O operations (at least on blocking file descriptors).

We stop short of converting existing consumers of "readavailable"
in tests because we're working out race conditions and deadlocks
on Windows. But the goal is to eventually move tests away from
"readavailable" to these new APIs.

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

indygreg created this revision.Mar 7 2018, 8:36 PM

This patch seems to deadlock on Windows when running test-ssh-proto-unbundle.t. There are two python processes idling:

c:\Python27\python.exe c:/Users/Matt/projects/hg/hg --verbose debugwireproto --localssh --noreadstderr
CWD: C:\Users\Matt\AppData\Local\Temp\hgtests.1mtivg\child1\test-ssh-proto-unbundle.t\no-bundle1\

and

c:\Python27\python.exe c:/Users/Matt/projects/hg/hg -R C:\Users\Matt\AppData\Local\Temp\hgtests.1mtivg\child1\test-ssh-proto-unbundle.t\no-bundle1 debugserve --sshstdio
CWD: C:\Users\Matt\AppData\Local\Temp\hgtests.1mtivg\child1\test-ssh-proto-unbundle.t\no-bundle1\

Past experience says that something needs to be flushed- stdout (and maybe stderr?) are full buffered on Windows IIRC. The following failure with this patch might also point to that:

--- c:/Users/Matt/projects/hg/tests/test-ssh-proto.t
+++ c:/Users/Matt/projects/hg/tests/test-ssh-proto.t.err
@@ -100,12 +100,12 @@
   $ hg debugserve --sshstdio --logiofd 1 << EOF
   > hello
   > EOF
-  o> write(4) -> 4:
-  o>     384\n
-  o> write(384) -> 384:
-  o>     capabilities: lookup branchmap pushkey known getbundle unbundlehash batch changegroupsubset streamreqs=generaldelta,revlogv1 $USUAL_BUNDLE2_CAPS_SERVER$unbundle=HG10GZ,HG10BZ,HG10UN\n
   384
   capabilities: lookup branchmap pushkey known getbundle unbundlehash batch changegroupsubset streamreqs=generaldelta,revlogv1 $USUAL_BUNDLE2_CAPS_SERVER$ unbundle=HG10GZ,HG10BZ,HG10UN
+  o> write(4) -> 4:
+  o>     384\n
+  o> write(384) -> 384:
+  o>     capabilities: lookup branchmap pushkey known getbundle unbundlehash batch changegroupsubset streamreqs=generaldelta,revlogv1 $USUAL_BUNDLE2_CAPS_SERVER$ unbundle=HG10GZ,HG10BZ,HG10UN\n
   o> flush() -> None

   $ hg debugserve --sshstdio --logiofile $TESTTMP/io << EOF

This patch seems to deadlock on Windows when running test-ssh-proto-unbundle.t. There are two python processes idling:
Past experience says that something needs to be flushed- stdout (and maybe stderr?) are full buffered on Windows IIRC. The following failure with this patch might also point to that:

Hmmm.

We explicitly create the pipes between the processes as unbuffered (bufsize=0). That being said, I would not at all be surprised if we need to throw a flush() in here somewhere.

Could you please run the test with run-tests.py -d and see where it is deadlocking?

The test failure due to the o> block is likely due to buffering or a race condition when writing to stdout of the hg debugwireproto process. I'll look into that after we fix the deadlock (since fixing might change order of things and I don't want to debug this in a system that is changing).

This patch seems to deadlock on Windows when running test-ssh-proto-unbundle.t. There are two python processes idling:
Past experience says that something needs to be flushed- stdout (and maybe stderr?) are full buffered on Windows IIRC. The following failure with this patch might also point to that:

Hmmm.
We explicitly create the pipes between the processes as unbuffered (bufsize=0). That being said, I would not at all be surprised if we need to throw a flush() in here somewhere.
Could you please run the test with run-tests.py -d and see where it is deadlocking?
The test failure due to the o> block is likely due to buffering or a race condition when writing to stdout of the hg debugwireproto process. I'll look into that after we fix the deadlock (since fixing might change order of things and I don't want to debug this in a system that is changing).

It looks like the middle of line 84. Here's the relevant output, from the start of the stuck command on 43 (the off-by-one in the SALT line is odd, but the other lines are line that too, so I assume it's a zero based index thing):

+ echo SALT1520615870 42 0
SALT1520615870 42 0
+ debugwireproto
++ cat -
+ commands='command unbundle
# This is "force" in hex.
    heads 666f726365
    PUSHFILE ../initial.v1.hg
eread 115'
+ echo 'testing ssh1'
testing ssh1
++ hg log -r tip -T '{node}'
+ tip=0000000000000000000000000000000000000000
+ echo 'command unbundle
# This is "force" in hex.
    heads 666f726365
    PUSHFILE ../initial.v1.hg
eread 115'
+ hg --verbose debugwireproto --localssh --noreadstderr
creating ssh peer from handshake results
i> write(104) -> 104:
i>     hello\n
i>     between\n
i>     pairs 81\n
i>     0000000000000000000000000000000000000000-0000000000000000000000000000000000000000
i> flush() -> None
o> readline() -> 4:
o>     384\n
o> readline() -> 384:
o>     capabilities: lookup branchmap pushkey known getbundle unbundlehash batch changegroupsubset streamreqs=
generaldelta,revlogv1 bundle2=HG20%0Abookmarks%0Achangegroup%3D01%2C02%0Adigests%3Dmd5%2Csha1%2Csha512%0Aerror
%3Dabort%2Cunsupportedcontent%2Cpushraced%2Cpushkey%0Ahgtagsfnodes%0Alistkeys%0Aphases%3Dheads%0Apushkey%0Arem
ote-changegroup%3Dhttp%2Chttps unbundle=HG10GZ,HG10BZ,HG10UN\n
o> readline() -> 2:
o>     1\n
o> readline() -> 1:
o>     \n
sending unbundle command
i> write(9) -> 9:
i>     unbundle\n
i> write(9) -> 9:
i>     heads 10\n
i> write(10) -> 10: 666f726365
i> flush() -> None
o> readline() -> 2:
o>     0\n
i> write(4) -> 4:
i>     426\n
i> write(426) -> 426:
i>     HG10UN\x00\x00\x00\x9eh\x98b\x13\xbdD\x85\xeaQS55\xe3\xfc\x9ex\x00zq\x1f\x00\x00\x00\x00\x00\x00\x00\x0
0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
x00\x00\x00\x00\x00h\x98b\x13\xbdD\x85\xeaQS55\xe3\xfc\x9ex\x00zq\x1f\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
x00>cba485ca3678256e044428f70f58291196f6e9de\n
i>     test\n
i>     0 0\n
i>     foo\n
i>     \n
i>     initial\x00\x00\x00\x00\x00\x00\x00\x8d\xcb\xa4\x85\xca6x%n\x04D(\xf7\x0fX)\x11\x96\xf6\xe9\xde\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x
00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00h\x98b\x13\xbdD\x85\xeaQS55\xe3\xfc\x9ex\x00zq\x1f\x00\x00\x00\x00\x
00\x00\x00\x00\x00\x00\x00-foo\x00362fef284ce2ca02aecc8de6d5e8a1c3af0556fe\n
i>     \x00\x00\x00\x00\x00\x00\x00\x07foo\x00\x00\x00b6/\xef(L\xe2\xca\x02\xae\xcc\x8d\xe6\xd5\xe8\xa1\xc3\xa
f\x05V\xfe\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\

i> write(4) -> 4:
i> 426\n
i> write(426) -> 426:
i> HG10UN\x00\x00\x00\x9eh\x98b\x13\xbdD\x85\xeaQS55\xe3\xfc\x9ex\x00zq\x1f\x00\x00\x00\x00\x00\x00\x00\x0
0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
x00\x00\x00\x00\x00h\x98b\x13\xbdD\x85\xeaQS55\xe3\xfc\x9ex\x00zq\x1f\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
x00>cba485ca3678256e044428f70f58291196f6e9de\n
i> test\n
i> 0 0\n
i> foo\n
i> \n
i> initial\x00\x00\x00\x00\x00\x00\x00\x8d\xcb\xa4\x85\xca6x%n\x04D(\xf7\x0fX)\x11\x96\xf6\xe9\xde\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x
00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00h\x98b\x13\xbdD\x85\xeaQS55\xe3\xfc\x9ex\x00zq\x1f\x00\x00\x00\x00\x
00\x00\x00\x00\x00\x00\x00-foo\x00362fef284ce2ca02aecc8de6d5e8a1c3af0556fe\n
i> \x00\x00\x00\x00\x00\x00\x00\x07foo\x00\x00\x00b6/\xef(L\xe2\xca\x02\xae\xcc\x8d\xe6\xd5\xe8\xa1\xc3\xa
f\x05V\xfe\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\

This is a wonky place to hang! For context, this is processing a command unbundle line. The lines that follow should be:

i> write(2) -> 2:
i>     0\n
i> flush() -> None
o> readline() -> 2:
o>     0\n
o> readline() -> 2:
o>     1\n
o> read(1) -> 1: 0
result: 0
remote output: 
e> read(115) -> 115:
e>     abort: incompatible Mercurial client; bundle2 required\n
e>     (see https://www.mercurial-scm.org/wiki/IncompatibleClient)\n

What happens under the hood during pushes is we send out chunks containing the bundle followed by an empty chunk. That code is:

for d in iter(lambda: fp.read(4096), ''):
    self._writeframed(d)
self._writeframed("", flush=True)

That empty chunk is apparently not getting sent. Or its logging is not getting written/printed.

This patch shouldn't have changed any behavior with regard to this part of the I/O. So I'm scratching my head over how this caused deadlock. Are you sure you can bisect it to this patch.

i> write(4) -> 4:
i> 426\n
i> write(426) -> 426:
i> HG10UN\x00\x00\x00\x9eh\x98b\x13\xbdD\x85\xeaQS55\xe3\xfc\x9ex\x00zq\x1f\x00\x00\x00\x00\x00\x00\x00\x0
0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
x00\x00\x00\x00\x00h\x98b\x13\xbdD\x85\xeaQS55\xe3\xfc\x9ex\x00zq\x1f\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
x00>cba485ca3678256e044428f70f58291196f6e9de\n
i> test\n
i> 0 0\n
i> foo\n
i> \n
i> initial\x00\x00\x00\x00\x00\x00\x00\x8d\xcb\xa4\x85\xca6x%n\x04D(\xf7\x0fX)\x11\x96\xf6\xe9\xde\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x
00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00h\x98b\x13\xbdD\x85\xeaQS55\xe3\xfc\x9ex\x00zq\x1f\x00\x00\x00\x00\x
00\x00\x00\x00\x00\x00\x00-foo\x00362fef284ce2ca02aecc8de6d5e8a1c3af0556fe\n
i> \x00\x00\x00\x00\x00\x00\x00\x07foo\x00\x00\x00b6/\xef(L\xe2\xca\x02\xae\xcc\x8d\xe6\xd5\xe8\xa1\xc3\xa
f\x05V\xfe\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\

This is a wonky place to hang! For context, this is processing a command unbundle line. The lines that follow should be:

i> write(2) -> 2:
i>     0\n
i> flush() -> None
o> readline() -> 2:
o>     0\n
o> readline() -> 2:
o>     1\n
o> read(1) -> 1: 0
result: 0
remote output: 
e> read(115) -> 115:
e>     abort: incompatible Mercurial client; bundle2 required\n
e>     (see https://www.mercurial-scm.org/wiki/IncompatibleClient)\n

What happens under the hood during pushes is we send out chunks containing the bundle followed by an empty chunk. That code is:

for d in iter(lambda: fp.read(4096), ''):
    self._writeframed(d)
self._writeframed("", flush=True)

That empty chunk is apparently not getting sent. Or its logging is not getting written/printed.
This patch shouldn't have changed any behavior with regard to this part of the I/O. So I'm scratching my head over how this caused deadlock. Are you sure you can bisect it to this patch.

Yep. I imported this with hg phapread --stack, and can run the D2719 in 50 seconds or so. The debug run of this I let hang out for 20 minutes before killing it.

Here's the neighboring failure from D2719, because I can't explain the AWOL output. (But that shouldn't have stopped stdout above mid stream.)

--- e:/Projects/hg/tests/test-ssh-proto-unbundle.t
+++ e:/Projects/hg/tests/test-ssh-proto-unbundle.t.err
@@ -93,9 +93,6 @@
   o> read(1) -> 1: 0
   result: 0
   remote output:
-  e> read(-1) -> 115:
-  e>     abort: incompatible Mercurial client; bundle2 required\n
-  e>     (see https://www.mercurial-scm.org/wiki/IncompatibleClient)\n

   testing ssh2
   creating ssh peer from handshake results
@@ -143,9 +140,6 @@
   o> read(1) -> 1: 0
   result: 0
   remote output:
-  e> read(-1) -> 115:
-  e>     abort: incompatible Mercurial client; bundle2 required\n
-  e>     (see https://www.mercurial-scm.org/wiki/IncompatibleClient)\n

   $ cd ..

My only explanation is this is stdout output buffering and things are really hanging on the next read operation.

Maybe try sprinkling some ui.fout.flush() and/or util.stdout.flush() and/or sys.stdout.flush() calls throughout the debug command? Maybe at the first thing in the loop that evaluates commands to execute?

My only explanation is this is stdout output buffering and things are really hanging on the next read operation.
Maybe try sprinkling some ui.fout.flush() and/or util.stdout.flush() and/or sys.stdout.flush() calls throughout the debug command? Maybe at the first thing in the loop that evaluates commands to execute?

Both util.stdout.flush() and ui.fout.flush() get a little further in the same command; sys.stdout.flush() has no effect, when added at the top of the loop. But then it stalls on remote output:

...
o> readline() -> 2:
o>     0\n
i> write(4) -> 4:
i>     426\n
i> write(426) -> 426:
i>     HG10UN\x00\x00\x00\x9eh\x98b\x13\xbdD\x85\xeaQS55\xe3\xfc\x9ex\x00zq\x1f\x00\x00\x00\x00\x00\x00\x00\x0
0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
x00\x00\x00\x00\x00h\x98b\x13\xbdD\x85\xeaQS55\xe3\xfc\x9ex\x00zq\x1f\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
x00>cba485ca3678256e044428f70f58291196f6e9de\n
i>     test\n
i>     0 0\n
i>     foo\n
i>     \n
i>     initial\x00\x00\x00\x00\x00\x00\x00\x8d\xcb\xa4\x85\xca6x%n\x04D(\xf7\x0fX)\x11\x96\xf6\xe9\xde\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x
00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00h\x98b\x13\xbdD\x85\xeaQS55\xe3\xfc\x9ex\x00zq\x1f\x00\x00\x00\x00\x
00\x00\x00\x00\x00\x00\x00-foo\x00362fef284ce2ca02aecc8de6d5e8a1c3af0556fe\n
i>     \x00\x00\x00\x00\x00\x00\x00\x07foo\x00\x00\x00b6/\xef(L\xe2\xca\x02\xae\xcc\x8d\xe6\xd5\xe8\xa1\xc3\xa
f\x05V\xfe\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00h\x98b\x13\xbdD\x85\xeaQS55\xe3\xfc\x9ex\x00zq\x1f
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x020\n
i>     \x00\x00\x00\x00\x00\x00\x00\x00
i> write(2) -> 2:
i>     0\n
i> flush() -> None
o> readline() -> 2:
o>     0\n
o> readline() -> 2:
o>     1\n
o> read(1) -> 1: 0
result: 0
remote output:

So I added a self.flush() inside ui._write() because I figured this is a problem on the remote end (and probably not localized like the debug command), but that didn't do anything. (ui._write_err() already flushes, and has a comment about stderr may be buffered on Windows.)

IDK if this means anything, but when it is stuck and I hit Ctrl + C, instead of terminating the test, the test simply continues with the script, and then hangs on the next 'remote output:' line. Typically, Ctrl + C ends the test runner.

IDK if this means anything, but when it is stuck and I hit Ctrl+C, instead of terminating the test, the test simply continues with the script, and then hangs on the next 'remote output:' line. Typically, Ctrl+C ends the test runner.

A couple of more tidbits on this:

  1. When I hit Ctrl+C, it *does* print the shell prompt, but the process keeps printing test output after it, so I missed it. When it jams, I can hit Ctrl+C again to get it going again.
  2. I did that all the way through the test, and eventually it got to the last one, which prints:
i> write(2) -> 2:
i>     0\n
i> flush() -> None
o> readline() -> 2:
o>     0\n
o> readline() -> 2:
o>     1\n
o> read(1) -> 1: 1
result: 1
remote output:
e> read(152) -> 152:
e>     adding changesets\n
e>     adding manifests\n
e>     adding file changes\n
e>     added 1 changesets with 1 changes to 1 files\n
e>     ui.write 1\n
e>     ui.write_err 1\n
e>     ui.write 2\n
e>     ui.write_err 2\n
+ '[' -n 1 ']'
+ hg --config extensions.strip= strip --no-backup -r 'all() - ::0000000000000000000000000000000000000000'
+ echo SALT1520635194 2034 0
SALT1520635194 2034 0

Note the exit code is 1 here. So is it possible that the debug command is having trouble reading bytes from an empty stderr in the initial cases?

At this point, I've narrowed it down to stderr. I misunderstood the 'noreadstderr', and wondered why we were bothering to pipe stderr if True. So I set it to os.devnull, and the test sails through. (But it does drop all "e>" lines, so that's not an actual solution.)

Is there some way to send a signal to a python process to get it to dump its stacks, like Ctrl+\ in java? Sprinkling random prints and flushes hasn't been fruitful so far.

See the contrib/showstack.py extension for registering SIGQUIT and SIGINFO signals to dump the active stack. Not sure if that works on Windows though.

indygreg edited the summary of this revision. (Show Details)Mar 12 2018, 8:07 PM
indygreg updated this revision to Diff 6976.
indygreg added a subscriber: yuja.

I removed the test changes from this. So we should be able to land it without causing chaos to Windows tests. It may conflict with other patches that @yuja and @mharbison72 have in flight though.

Sounds good to me. I think the only thing I have inflight is the patch with explicit flushes in the observer. I haven't tried Yuya's series yet, because I'm not sure where it fits into the other stuff inflight. I don't mind mopping this up after it lands.

durin42 accepted this revision.Mar 19 2018, 9:35 PM
This revision is now accepted and ready to land.Mar 19 2018, 9:35 PM
This revision was automatically updated to reflect the committed changes.