Page MenuHomePhabricator

testlib: add a small scrip to help process to synchronise using file
ClosedPublic

Authored by marmoute on Feb 28 2020, 1:55 PM.

Details

Summary

Creating and waiting for files is a robust way to synchronise two processes
running concurrently. We already use this approach in various tests. I am adding
a official script to do so before adding more usage of this.

Diff Detail

Repository
rHG Mercurial
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

marmoute created this revision.Feb 28 2020, 1:55 PM

This script feels extremely like a flaky test waiting to happen. Is there no alternative for this?

By using explicit wait on signal (through the fs) that each process reached the appropriate file. We avoid flackyness. There are already multiple use of this approach in the test suite, that does not suffer from flackyness (unlike the wheelbarrow of flaky test relying on sleep for sync).

By using explicit wait on signal (through the fs) that each process reached the appropriate file. We avoid flackyness. There are already multiple use of this approach in the test suite, that does not suffer from flackyness (unlike the wheelbarrow of flaky test relying on sleep for sync).

You avoid flakyness iff the test manages to finish this step in under 20 seconds (in the next change, as an example). Which is to say, this is still a flake waiting to happen, you've just made it less likely. I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts?

I'm also not happy about the 1-second floor this puts on the step. Doesn't sleep(1) support sub-second sleeps on all platforms at this point?

I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts?

I filed a bug about this that self-archived, but:

$ echo '  $ sleep 10' > test-timeout.t
$ time ./run-tests.py --local test-timeout.t -t 5
running 1 tests using 1 parallel processes
t
Failed test-timeout.t: timed out
# Ran 1 tests, 0 skipped, 1 failed.
python hash seed: 204038743

real    0m10.363s
user    0m0.000s
sys     0m0.030s

So it looks like tests never timeout, but then the result is discarded afterward if the timeout period elapsed. I can reproduce it on Windows and macOS.

By using explicit wait on signal (through the fs) that each process reached the appropriate file. We avoid flackyness. There are already multiple use of this approach in the test suite, that does not suffer from flackyness (unlike the wheelbarrow of flaky test relying on sleep for sync).

You avoid flakyness iff the test manages to finish this step in under 20 seconds (in the next change, as an example). Which is to say, this is still a flake waiting to happen, you've just made it less likely. I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts?

The 20 seconds seems like a lots of margin already, but I am fine with bumping it more it that make your more confortable.

Waiting for the test timeout is not a reasonable option because the test is killed without any details (and it is LOONG). The most common case for reaching the timeout is for one of the process to crash before reaching the checkpoing. When that happens, we want to be able to read the traceback. The second most common is code misbehaving and not going through the expected codepath. We also was to get output in this case. So in short, we need a clean way out in case of error and I have no better option than a (possibly long) timeout right now.

I'm also not happy about the 1-second floor this puts on the step. Doesn't sleep(1) support sub-second sleeps on all platforms at this point?

I am extremly sad too. But last time I checked, it was still not the case. We detect plateform and use small increment on better plateform (but I would rather follow up for that).

I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts?

I filed a bug about this that self-archived, but:

$ echo '  $ sleep 10' > test-timeout.t
$ time ./run-tests.py --local test-timeout.t -t 5
running 1 tests using 1 parallel processes
t
Failed test-timeout.t: timed out
# Ran 1 tests, 0 skipped, 1 failed.
python hash seed: 204038743
real    0m10.363s
user    0m0.000s
sys     0m0.030s

So it looks like tests never timeout, but then the result is discarded afterward if the timeout period elapsed. I can reproduce it on Windows and macOS.

Do you have a link to the bug ?

I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts?

I filed a bug about this that self-archived, but:

$ echo '  $ sleep 10' > test-timeout.t
$ time ./run-tests.py --local test-timeout.t -t 5
running 1 tests using 1 parallel processes
t
Failed test-timeout.t: timed out
# Ran 1 tests, 0 skipped, 1 failed.
python hash seed: 204038743
real    0m10.363s
user    0m0.000s
sys     0m0.030s

So it looks like tests never timeout, but then the result is discarded afterward if the timeout period elapsed. I can reproduce it on Windows and macOS.

Do you have a link to the bug ?

https://bz.mercurial-scm.org/show_bug.cgi?id=6125

Waiting for the test timeout is not a reasonable option because the test is killed without any details (and it is LOONG). The most common case for reaching the timeout is for one of the process to crash before reaching the checkpoing. When that happens, we want to be able to read the traceback. The second most common is code misbehaving and not going through the expected codepath. We also was to get output in this case. So in short, we need a clean way out in case of error and I have no better option than a (possibly long) timeout right now.

I wonder if the test harness can be modified to process the data collected up to the point of the timeout, so that it's obvious what is getting stuck.

Gentle ping on this patch.

I'm still -0 on this: I'd rather we found an approach that didn't require sleeping for so long. Perhaps a Python script would be a better fit here?

(I won't block this landing, but I won't push it.)

durin42 removed a subscriber: durin42.Mar 20 2020, 11:47 AM

I'm still -0 on this: I'd rather we found an approach that didn't require sleeping for so long. Perhaps a Python script would be a better fit here?
(I won't block this landing, but I won't push it.)

What abotu replacing the sleep 1 by a `python -c "import time; time.sleep(0.1)" would that make you happy ?

I'm still -0 on this: I'd rather we found an approach that didn't require sleeping for so long. Perhaps a Python script would be a better fit here?
(I won't block this landing, but I won't push it.)

What abotu replacing the sleep 1 by a `python -c "import time; time.sleep(0.1)" would that make you happy ?

Happy? No, not really. I think I'd rather it was a Python script, but what I'd _really_ rather (and what would actually make me happy) is that we didn't have sleep-required steps like this at all. They're inherently racy , and I feel like there's got to be a better solution.

At this point I don't have the patience to try and work through this patch, so you'll need to find a different reviewer.

durin42 removed a subscriber: durin42.Mar 20 2020, 11:53 AM

[…]
I'm also not happy about the 1-second floor this puts on the step. Doesn't sleep(1) support sub-second sleeps on all platforms at this point?

I am extremly sad too. But last time I checked, it was still not the case. We detect plateform and use small increment on better plateform (but I would rather follow up for that).

Good news, even is sub-second is not expect to work on all plateforms, I found out that we already use it in our test suite. So any plateform that does not support it are already broken. I'll send an update soon.

I am adding a small change to have the local timeout adjust itself according to the global time out. I am not aware of real live issues with the local timeout, but adding that logic is simple enough.

pulkit added a subscriber: pulkit.Mar 31 2020, 9:21 AM

I wanted to help with things here but unfortunately I have ~0 experience with shell scripts and the kind of process testing going in next few patches.

This revision was not accepted when it landed; it landed in state Needs Review.
This revision was automatically updated to reflect the committed changes.