Creating and waiting for files is a robust way to synchronise two processes
running concurrently. We already use this approach in various tests. I am adding
a official script to do so before adding more usage of this.
Details
- Reviewers
- None
- Group Reviewers
hg-reviewers - Commits
- rHG1ed6293fc31b: testlib: add a small scrip to help process to synchronise using file
Diff Detail
- Repository
- rHG Mercurial
- Branch
- default
- Lint
No Linters Available - Unit
No Unit Test Coverage
Event Timeline
This script feels extremely like a flaky test waiting to happen. Is there no alternative for this?
By using explicit wait on signal (through the fs) that each process reached the appropriate file. We avoid flackyness. There are already multiple use of this approach in the test suite, that does not suffer from flackyness (unlike the wheelbarrow of flaky test relying on sleep for sync).
You avoid flakyness iff the test manages to finish this step in under 20 seconds (in the next change, as an example). Which is to say, this is still a flake waiting to happen, you've just made it less likely. I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts?
I'm also not happy about the 1-second floor this puts on the step. Doesn't sleep(1) support sub-second sleeps on all platforms at this point?
I filed a bug about this that self-archived, but:
$ echo ' $ sleep 10' > test-timeout.t $ time ./run-tests.py --local test-timeout.t -t 5 running 1 tests using 1 parallel processes t Failed test-timeout.t: timed out # Ran 1 tests, 0 skipped, 1 failed. python hash seed: 204038743 real 0m10.363s user 0m0.000s sys 0m0.030s
So it looks like tests never timeout, but then the result is discarded afterward if the timeout period elapsed. I can reproduce it on Windows and macOS.
The 20 seconds seems like a lots of margin already, but I am fine with bumping it more it that make your more confortable.
Waiting for the test timeout is not a reasonable option because the test is killed without any details (and it is LOONG). The most common case for reaching the timeout is for one of the process to crash before reaching the checkpoing. When that happens, we want to be able to read the traceback. The second most common is code misbehaving and not going through the expected codepath. We also was to get output in this case. So in short, we need a clean way out in case of error and I have no better option than a (possibly long) timeout right now.
I'm also not happy about the 1-second floor this puts on the step. Doesn't sleep(1) support sub-second sleeps on all platforms at this point?
I am extremly sad too. But last time I checked, it was still not the case. We detect plateform and use small increment on better plateform (but I would rather follow up for that).
https://bz.mercurial-scm.org/show_bug.cgi?id=6125
Waiting for the test timeout is not a reasonable option because the test is killed without any details (and it is LOONG). The most common case for reaching the timeout is for one of the process to crash before reaching the checkpoing. When that happens, we want to be able to read the traceback. The second most common is code misbehaving and not going through the expected codepath. We also was to get output in this case. So in short, we need a clean way out in case of error and I have no better option than a (possibly long) timeout right now.
I wonder if the test harness can be modified to process the data collected up to the point of the timeout, so that it's obvious what is getting stuck.
I'm still -0 on this: I'd rather we found an approach that didn't require sleeping for so long. Perhaps a Python script would be a better fit here?
(I won't block this landing, but I won't push it.)
What abotu replacing the sleep 1 by a `python -c "import time; time.sleep(0.1)" would that make you happy ?
Happy? No, not really. I think I'd rather it was a Python script, but what I'd _really_ rather (and what would actually make me happy) is that we didn't have sleep-required steps like this at all. They're inherently racy , and I feel like there's got to be a better solution.
At this point I don't have the patience to try and work through this patch, so you'll need to find a different reviewer.
Good news, even is sub-second is not expect to work on all plateforms, I found out that we already use it in our test suite. So any plateform that does not support it are already broken. I'll send an update soon.
I am adding a small change to have the local timeout adjust itself according to the global time out. I am not aware of real live issues with the local timeout, but adding that logic is simple enough.
I wanted to help with things here but unfortunately I have ~0 experience with shell scripts and the kind of process testing going in next few patches.