This is an archive of the discontinued Mercurial Phabricator instance.

Differential D8189

testlib: add a small scrip to help process to synchronise using file
ClosedPublic

Authored by marmoute on Feb 28 2020, 1:55 PM.

Details

Reviewers

None

Group Reviewers

hg-reviewers

Commits

rHG1ed6293fc31b: testlib: add a small scrip to help process to synchronise using file

Summary

Creating and waiting for files is a robust way to synchronise two processes
running concurrently. We already use this approach in various tests. I am adding
an official script to do so before adding more uses of it.
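
For illustration, the handshake described in this summary can be sketched in shell roughly as follows. This is a minimal sketch, not the script under review: the `wait_for_file` and `sync_demo` names, the file names, and the 0.01-second polling interval are all assumptions made for the example, and fractional `sleep` is not guaranteed on every platform.

```shell
#!/bin/bash
# Hypothetical helper: poll until FILE exists, giving up after STEPS polls.
wait_for_file() {
    local steps="$1" file="$2"
    while [ "$steps" -gt 0 ] && [ ! -f "$file" ]; do
        steps=$(( steps - 1 ))
        sleep 0.01   # assumes a sleep(1) that accepts fractions
    done
    [ -f "$file" ]   # status 0 only if the file showed up
}

sync_demo() {
    local dir
    dir="$(mktemp -d)" || return 1

    # Process A: announces its checkpoint, then waits until
    # process B allows it to continue.
    (
        echo "A: reached checkpoint"
        touch "$dir/a-ready"
        wait_for_file 500 "$dir/b-done" && echo "A: resuming"
    ) &

    # Process B: waits for A's checkpoint, acts, then releases A.
    wait_for_file 500 "$dir/a-ready" || { echo "B: A never got ready" >&2; return 1; }
    echo "B: running while A is paused"
    touch "$dir/b-done"

    wait   # collect process A
    rm -rf "$dir"
}
```

Calling `sync_demo` runs both sides. Each side only proceeds once the other has created its file, so the ordering does not depend on how fast either process happens to run.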

Diff Detail

Repository

rHG Mercurial

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

marmoute created this revision. Feb 28 2020, 1:55 PM

Herald added a reviewer: hg-reviewers. Feb 28 2020, 1:55 PM

Herald added a subscriber: mercurial-devel.

marmoute added a child revision: D8190: nodemap: test that concurrent process don't see the pending transaction. Feb 28 2020, 1:55 PM

This script feels extremely like a flaky test waiting to happen. Is there no alternative for this?

By using an explicit wait on a signal (through the filesystem) that each process has reached the appropriate point, we avoid flakiness. There are already multiple uses of this approach in the test suite, and they do not suffer from flakiness (unlike the wheelbarrow of flaky tests relying on sleep for sync).

In D8189#123280, @marmoute wrote:

By using an explicit wait on a signal (through the filesystem) that each process has reached the appropriate point, we avoid flakiness. There are already multiple uses of this approach in the test suite, and they do not suffer from flakiness (unlike the wheelbarrow of flaky tests relying on sleep for sync).

You avoid flakiness iff the test manages to finish this step in under 20 seconds (in the next change, as an example). Which is to say, this is still a flake waiting to happen; you've just made it less likely. I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts?

I'm also not happy about the 1-second floor this puts on the step. Doesn't sleep(1) support sub-second sleeps on all platforms at this point?
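
For illustration, the alternative proposed in this comment (poll frequently, take no local timeout, and rely on the runner to end a stuck test) might look like the sketch below. The function name and the 0.01-second interval are assumptions made for the example, not part of the patch.

```shell
# Hypothetical helper: block until FILE exists, with no local timeout.
# If FILE never appears this loops forever, so something external
# (for example the test runner's own timeout) has to end the process.
wait_forever_for_file() {
    local file="$1"
    while [ ! -f "$file" ]; do
        sleep 0.01   # assumes fractional sleep is supported
    done
}
```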

In D8189#123323, @durin42 wrote:

I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts?

I filed a bug about this that self-archived, but:

$ echo '  $ sleep 10' > test-timeout.t
$ time ./run-tests.py --local test-timeout.t -t 5
running 1 tests using 1 parallel processes
t
Failed test-timeout.t: timed out
# Ran 1 tests, 0 skipped, 1 failed.
python hash seed: 204038743

real    0m10.363s
user    0m0.000s
sys     0m0.030s

So it looks like tests never timeout, but then the result is discarded afterward if the timeout period elapsed. I can reproduce it on Windows and macOS.

In D8189#123323, @durin42 wrote:

In D8189#123280, @marmoute wrote:

By using an explicit wait on a signal (through the filesystem) that each process has reached the appropriate point, we avoid flakiness. There are already multiple uses of this approach in the test suite, and they do not suffer from flakiness (unlike the wheelbarrow of flaky tests relying on sleep for sync).

You avoid flakiness iff the test manages to finish this step in under 20 seconds (in the next change, as an example). Which is to say, this is still a flake waiting to happen; you've just made it less likely. I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts?

The 20 seconds seems like a lot of margin already, but I am fine with bumping it further if that makes you more comfortable.

Waiting for the test timeout is not a reasonable option because the test is killed without any details (and it is LOONG). The most common case for reaching the timeout is for one of the processes to crash before reaching the checkpoint. When that happens, we want to be able to read the traceback. The second most common is code misbehaving and not going through the expected codepath. We also want to get output in this case. So in short, we need a clean way out in case of error, and I have no better option than a (possibly long) timeout right now.
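
To illustrate this argument, here is a sketch of how a local timeout can turn a crash before the checkpoint into readable output. Everything in it (the function names, the log file, the simulated crash, the 20-poll limit) is made up for the example and is not taken from the patch.

```shell
# Hypothetical helper: poll for FILE, giving up after STEPS polls of 0.01s.
wait_with_timeout() {
    local steps="$1" file="$2"
    while [ "$steps" -gt 0 ] && [ ! -f "$file" ]; do
        steps=$(( steps - 1 ))
        sleep 0.01   # assumes fractional sleep is supported
    done
    [ -f "$file" ]
}

crash_demo() {
    local dir
    dir="$(mktemp -d)" || return 1

    # Simulated worker that crashes before creating its checkpoint file.
    ( echo "Traceback: simulated crash" >&2; exit 1 ) 2> "$dir/worker.log" &

    if ! wait_with_timeout 20 "$dir/checkpoint"; then
        echo "checkpoint not reached, worker output follows:"
        wait                      # make sure the worker has finished writing
        cat "$dir/worker.log"
        rm -rf "$dir"
        return 1
    fi
    rm -rf "$dir"
}
```

Because the waiter gives up on its own, the test can still print what the worker wrote before failing, which would be lost if the runner simply killed the whole test.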

I'm also not happy about the 1-second floor this puts on the step. Doesn't sleep(1) support sub-second sleeps on all platforms at this point?

I am extremely sad too. But last time I checked, it was still not the case. We could detect the platform and use a smaller increment on better platforms (but I would rather follow up for that).

In D8189#123521, @mharbison72 wrote:
In D8189#123323, @durin42 wrote:

I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts?

I filed a bug about this that self-archived, but:
$ echo '  $ sleep 10' > test-timeout.t
$ time ./run-tests.py --local test-timeout.t -t 5
running 1 tests using 1 parallel processes
t
Failed test-timeout.t: timed out
# Ran 1 tests, 0 skipped, 1 failed.
python hash seed: 204038743
real    0m10.363s
user    0m0.000s
sys     0m0.030s
So it looks like tests never timeout, but then the result is discarded afterward if the timeout period elapsed. I can reproduce it on Windows and macOS.

Do you have a link to the bug?

In D8189#123526, @marmoute wrote:
In D8189#123521, @mharbison72 wrote:
In D8189#123323, @durin42 wrote:

I think it might be better to poll more often in the script and not even take a timeout: sleep forever waiting for the condition, and if it never comes let the test timeout at the runner level. Thoughts?

I filed a bug about this that self-archived, but:
$ echo '  $ sleep 10' > test-timeout.t
$ time ./run-tests.py --local test-timeout.t -t 5
running 1 tests using 1 parallel processes
t
Failed test-timeout.t: timed out
# Ran 1 tests, 0 skipped, 1 failed.
python hash seed: 204038743
real    0m10.363s
user    0m0.000s
sys     0m0.030s
So it looks like tests never timeout, but then the result is discarded afterward if the timeout period elapsed. I can reproduce it on Windows and macOS.
Do you have a link to the bug?

https://bz.mercurial-scm.org/show_bug.cgi?id=6125

Waiting for the test timeout is not a reasonable option because the test is killed without any details (and it is LOONG). The most common case for reaching the timeout is for one of the processes to crash before reaching the checkpoint. When that happens, we want to be able to read the traceback. The second most common is code misbehaving and not going through the expected codepath. We also want to get output in this case. So in short, we need a clean way out in case of error, and I have no better option than a (possibly long) timeout right now.

I wonder if the test harness can be modified to process the data collected up to the point of the timeout, so that it's obvious what is getting stuck.

Gentle ping on this patch.

I'm still -0 on this: I'd rather we found an approach that didn't require sleeping for so long. Perhaps a Python script would be a better fit here?

(I won't block this landing, but I won't push it.)

durin42 removed a subscriber: durin42. Mar 20 2020, 11:47 AM

In D8189#124122, @durin42 wrote:

I'm still -0 on this: I'd rather we found an approach that didn't require sleeping for so long. Perhaps a Python script would be a better fit here?
(I won't block this landing, but I won't push it.)

What about replacing the `sleep 1` with a `python -c "import time; time.sleep(0.1)"`? Would that make you happy?

In D8189#124124, @marmoute wrote:

In D8189#124122, @durin42 wrote:

I'm still -0 on this: I'd rather we found an approach that didn't require sleeping for so long. Perhaps a Python script would be a better fit here?
(I won't block this landing, but I won't push it.)

What about replacing the `sleep 1` with a `python -c "import time; time.sleep(0.1)"`? Would that make you happy?

Happy? No, not really. I think I'd rather it was a Python script, but what I'd _really_ rather (and what would actually make me happy) is that we didn't have sleep-required steps like this at all. They're inherently racy, and I feel like there's got to be a better solution.

At this point I don't have the patience to try and work through this patch, so you'll need to find a different reviewer.

durin42 removed a subscriber: durin42. Mar 20 2020, 11:53 AM

martinvonz mentioned this in D8190: nodemap: test that concurrent process don't see the pending transaction. Mar 20 2020, 1:55 PM

In D8189#123522, @marmoute wrote:

In D8189#123323, @durin42 wrote:

[…]
I'm also not happy about the 1-second floor this puts on the step. Doesn't sleep(1) support sub-second sleeps on all platforms at this point?

I am extremely sad too. But last time I checked, it was still not the case. We could detect the platform and use a smaller increment on better platforms (but I would rather follow up for that).

Good news: even if sub-second sleep is not expected to work on all platforms, I found out that we already use it in our test suite. So any platform that does not support it is already broken. I'll send an update soon.

I am adding a small change to have the local timeout adjust itself according to the global timeout. I am not aware of real-life issues with the local timeout, but adding that logic is simple enough.
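
One possible shape for such an adjustment is sketched below. The follow-up revision itself is not shown on this page, so the function name, its three parameters, and the choice to only ever lengthen the wait are assumptions made for illustration.

```shell
# Hypothetical helper: scale a local timeout by the ratio between the
# global test timeout in effect and its default value.
#   $1 = local timeout in seconds
#   $2 = default global timeout in seconds
#   $3 = global timeout actually in effect, in seconds
scaled_timeout() {
    local base="$1" default="$2" current="$3"
    # Only ever lengthen the wait; a shortened global timeout leaves
    # the local one untouched in this sketch.
    if [ "$default" -gt 0 ] && [ "$current" -gt "$default" ]; then
        echo $(( base * current / default ))
    else
        echo "$base"
    fi
}
```

For example, `scaled_timeout 20 180 360` prints `40`: doubling the global timeout doubles the local one.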

marmoute added a child revision: D8316: testlib: adjust wait-on-file timeout according to the global test timeout. Mar 20 2020, 7:07 PM

marmoute removed a child revision: D8190: nodemap: test that concurrent process don't see the pending transaction.

marmoute updated this revision to Diff 20856.

I wanted to help with things here, but unfortunately I have ~0 experience with shell scripts and the kind of process testing going on in the next few patches.

marmoute added a commit: rHG1ed6293fc31b: testlib: add a small scrip to help process to synchronise using file. Apr 1 2020, 11:21 AM

This revision was not accepted when it landed; it landed in state Needs Review.

Closed by commit rHG1ed6293fc31b: testlib: add a small scrip to help process to synchronise using file (authored by marmoute).

This revision was automatically updated to reflect the committed changes.

Revision Contents
Changeset List

			Path	Packages
A	M		tests/testlib/wait-on-file (32 lines)

Diff	ID	Description	Created	Lint	Unit
Base		Base
Diff 1	20392		Feb 28 2020, 1:55 PM	★	★
Diff 2	20856		Mar 20 2020, 7:07 PM	★	★
Diff 3	20930	rHG1ed6293fc31bd70b0a40399bf2e10bd3b94d7462	Feb 27 2020, 8:23 PM	★	★

Status	Author	Revision
Closed	Alphare	D8162 hghave: add a `rust` keyword to detect the use of compiled rust code
Closed	marmoute	D8180 nodemap: check that a simple lookup works fine
Closed	marmoute	D8182 nodemap: document the docket attributes
Closed	marmoute	D8193 nodemap: automatically "vacuum" the persistent nodemap when too sparse
Closed	marmoute	D8192 nodemap: display percentage of unused in `hg debugnodemap`
Closed	marmoute	D8191 nodemap: make sure on disk change get rolled back with the transaction
Closed	marmoute	D8190 nodemap: test that concurrent process don't see the pending transaction
Closed	marmoute	D8316 testlib: adjust wait-on-file timeout according to the global test timeout
Closed	marmoute	D8189 testlib: add a small scrip to help process to synchronise using file
Closed	marmoute	D8188 nodemap: make sure the nodemap docket is updated after the changelog
Closed	marmoute	D8187 nodemap: make sure hooks have access to an up-to-date version
Closed	marmoute	D8186 nodemap: deal with the "debugupdatecache" case using a "fake" transaction
Closed	marmoute	D8185 changelog: change the implementation of `_divertopenener`
Closed	marmoute	D8184 nodemap: track the tip_node for validation
Closed	marmoute	D8183 nodemap: test that an outdated nodemap can catch up
Closed	marmoute	D8181 nodemap: add a todo list for getting out of experimental
Closed	Alphare	D8164 rust-nodemap: automatically use the rust index for persistent nodemap
Closed	Alphare	D8163 nodemap: use data from the index in debugnodemap --dump-new
Closed	Alphare	D8161 rust-nodemap: also clear Rust data in `clearcaches`
Closed	Alphare	D8160 rust-nodemap: add binding to `nodemap_update_data`
Closed	Alphare	D8159 rust-nodemap: add binding for `nodemap_data_incremental`
Closed	Alphare	D8158 rust-nodemap: add binding for `nodemap_data_all`
Closed	Alphare	D8157 rust-nodemap: use proper Index API instead of using the C API
Closed	Alphare	D8156 rust-nodemap: add utils for propagating errors
Closed	Alphare	D8155 rust-nodemap: add utils to create `Node`s from Python objects
Closed	Alphare	D8154 rust-index: add `append` method to cindex/Index
Closed	Alphare	D8153 rust-index: moved constructor in separate impl block
Closed	Alphare	D8152 revlog: using two new functions in C capsule from Rust code
Closed	marmoute	D8174 nodemap: refresh the persistent data on nodemap creation
Closed	marmoute	D8173 nodemap: warm the persistent nodemap on disk with debugupdatecache

Diff 20930

tests/testlib/wait-on-file

This file was added.

Property	Old Value	New Value
File Mode	null	100755

#!/bin/bash
#
# wait up to TIMEOUT seconds until a WAIT_ON_FILE is created.
#
# In addition, this script can create CREATE_FILE once it is ready to wait.

if [ $# -lt 2 ] || [ $# -gt 3 ]; then
    echo $#
    echo "USAGE: $0 TIMEOUT WAIT_ON_FILE [CREATE_FILE]"
fi

timer="$1"
wait_on="$2"
create=""
if [ $# -eq 3 ]; then
    create="$3"
fi

if [ -n "$create" ];
then
    touch "$create"
    create=""
fi
while [ "$timer" -gt 0 ] && [ ! -f "$wait_on" ];
do
    timer=$(( timer - 1))
    sleep 0.01
done
if [ "$timer" -le 0 ]; then
    echo "file not created after $1 seconds: $wait_on" >&2
    exit 1
fi
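
As displayed, the script continues past the usage message, decrements the timer once per 0.01-second sleep (so TIMEOUT is not really counted in seconds), and reports a timeout whenever the timer reaches zero, even if the file appeared during the last sleep. The sketch below is one possible variant addressing those points, written as a function so it can be sourced. It is not the committed script: the function name, the `steps` variable, and the return codes are assumptions.

```shell
#!/bin/bash
# Sketch of a variant of wait-on-file (hypothetical, not the landed code).
wait_on_file() {
    if [ $# -lt 2 ] || [ $# -gt 3 ]; then
        echo "USAGE: wait_on_file TIMEOUT WAIT_ON_FILE [CREATE_FILE]" >&2
        return 2   # stop here instead of falling through
    fi
    local timeout="$1" wait_on="$2" create="${3:-}"

    # Each loop iteration sleeps 0.01s, so TIMEOUT seconds is about
    # TIMEOUT * 100 iterations (a little longer in practice).
    local steps=$(( timeout * 100 ))

    # Signal that we are about to wait, if asked to.
    if [ -n "$create" ]; then
        touch "$create"
    fi

    while [ "$steps" -gt 0 ] && [ ! -f "$wait_on" ]; do
        steps=$(( steps - 1 ))
        sleep 0.01   # assumes fractional sleep is available
    done

    # Decide on the file itself, so a file that appears during the
    # last sleep is not reported as a timeout.
    if [ ! -f "$wait_on" ]; then
        echo "file not created after $timeout seconds: $wait_on" >&2
        return 1
    fi
}
```

A call such as `wait_on_file 20 "$TESTTMP/sync-b" "$TESTTMP/sync-a"` would create `sync-a`, then wait roughly 20 seconds for `sync-b`; those file names are placeholders.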