* [PATCH v2] test/service: fix spurious failures by extending timeout
2022-10-06 8:17 [PATCH] test/service: fix spurious failures by extending timeout Harry van Haaren
@ 2022-10-06 8:28 ` Harry van Haaren
2022-10-06 8:39 ` David Marchand
2022-10-06 8:37 ` [PATCH] " Mattias Rönnblom
2022-10-06 12:52 ` [PATCH v3] " Harry van Haaren
2 siblings, 1 reply; 19+ messages in thread
From: Harry van Haaren @ 2022-10-06 8:28 UTC (permalink / raw)
To: dev
Cc: david.marchand, dpdklab, ci, Honnappa.Nagarahalli, mb,
mattias.ronnblom, thomas, Harry van Haaren
This commit extends the timeout for service_may_be_active()
from 100ms to 1000ms. Local testing on a idle and loaded system
(compiling DPDK with all cores) always completes after 1 ms.
The wait time for a service-lcore to finish is also extended
from 100ms to 1000ms.
The same timeout waiting code was duplicated in two tests, and
is now refactored to a standalone function avoiding duplication.
Reported-by: David Marchand <david.marchand@redhat.com>
Suggested-by: Mattias Ronnblom <mattias.ronnblom@ericsson.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
---
Apologies for the quick respin noise; only the first diff-section
is added, no changes to the rest of the patch.
v2:
- v1 addressed only testcase 15 issue, v2 also addresses test
case 5, which has an service-lcore wait code-path.
---
app/test/test_service_cores.c | 47 ++++++++++++++++-------------------
1 file changed, 22 insertions(+), 25 deletions(-)
diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index 359b6dcd8b..4b147bd64c 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -123,14 +123,14 @@ unregister_all(void)
return TEST_SUCCESS;
}
-/* Wait until service lcore not active, or for 100x SERVICE_DELAY */
+/* Wait until service lcore not active, or for N times SERVICE_DELAY */
static void
wait_slcore_inactive(uint32_t slcore_id)
{
int i;
for (i = 0; rte_service_lcore_may_be_active(slcore_id) == 1 &&
- i < 100; i++)
+ i < 1000; i++)
rte_delay_ms(SERVICE_DELAY);
}
@@ -921,12 +921,26 @@ service_lcore_start_stop(void)
return unregister_all();
}
+static int
+service_ensure_stopped_with_timeout(uint32_t sid)
+{
+ /* give the service time to stop running */
+ int32_t timeout_ms = 1000;
+ int i;
+ for (i = 0; i < timeout_ms; i++) {
+ if (!rte_service_may_be_active(sid))
+ break;
+ rte_delay_ms(SERVICE_DELAY);
+ }
+
+ return rte_service_may_be_active(sid);
+}
+
/* stop a service and wait for it to become inactive */
static int
service_may_be_active(void)
{
const uint32_t sid = 0;
- int i;
/* expected failure cases */
TEST_ASSERT_EQUAL(-EINVAL, rte_service_may_be_active(10000),
@@ -946,19 +960,11 @@ service_may_be_active(void)
TEST_ASSERT_EQUAL(1, service_lcore_running_check(),
"Service core expected to poll service but it didn't");
- /* stop the service */
+ /* stop the service, and wait for not-active with timeout */
TEST_ASSERT_EQUAL(0, rte_service_runstate_set(sid, 0),
"Error: Service stop returned non-zero");
-
- /* give the service 100ms to stop running */
- for (i = 0; i < 100; i++) {
- if (!rte_service_may_be_active(sid))
- break;
- rte_delay_ms(SERVICE_DELAY);
- }
-
- TEST_ASSERT_EQUAL(0, rte_service_may_be_active(sid),
- "Error: Service not stopped after 100ms");
+ TEST_ASSERT_EQUAL(0, service_ensure_stopped_with_timeout(sid),
+ "Error: Service not stopped after timeout period.");
return unregister_all();
}
@@ -972,7 +978,6 @@ service_active_two_cores(void)
return TEST_SKIPPED;
const uint32_t sid = 0;
- int i;
uint32_t lcore = rte_get_next_lcore(/* start core */ -1,
/* skip main */ 1,
@@ -1002,16 +1007,8 @@ service_active_two_cores(void)
/* stop the service */
TEST_ASSERT_EQUAL(0, rte_service_runstate_set(sid, 0),
"Error: Service stop returned non-zero");
-
- /* give the service 100ms to stop running */
- for (i = 0; i < 100; i++) {
- if (!rte_service_may_be_active(sid))
- break;
- rte_delay_ms(SERVICE_DELAY);
- }
-
- TEST_ASSERT_EQUAL(0, rte_service_may_be_active(sid),
- "Error: Service not stopped after 100ms");
+ TEST_ASSERT_EQUAL(0, service_ensure_stopped_with_timeout(sid),
+ "Error: Service not stopped after timeout period.");
return unregister_all();
}
--
2.34.1
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2] test/service: fix spurious failures by extending timeout
2022-10-06 8:28 ` [PATCH v2] " Harry van Haaren
@ 2022-10-06 8:39 ` David Marchand
2022-10-06 8:54 ` Mattias Rönnblom
0 siblings, 1 reply; 19+ messages in thread
From: David Marchand @ 2022-10-06 8:39 UTC (permalink / raw)
To: Harry van Haaren
Cc: dev, dpdklab, ci, Honnappa.Nagarahalli, mb, mattias.ronnblom, thomas
On Thu, Oct 6, 2022 at 10:28 AM Harry van Haaren
<harry.van.haaren@intel.com> wrote:
>
> This commit extends the timeout for service_may_be_active()
> from 100ms to 1000ms. Local testing on a idle and loaded system
> (compiling DPDK with all cores) always completes after 1 ms.
>
> The wait time for a service-lcore to finish is also extended
> from 100ms to 1000ms.
>
> The same timeout waiting code was duplicated in two tests, and
> is now refactored to a standalone function avoiding duplication.
>
> Reported-by: David Marchand <david.marchand@redhat.com>
> Suggested-by: Mattias Ronnblom <mattias.ronnblom@ericsson.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Just to be sure, do we want such a timeout in the test logic itself?
Is it that you want to make sure that the synchronisation happens in a
"reasonable" (subject to discussion ;-)) amount of time?
Otherwise, the unit tests run in the CI are themselves subject to a
10s x mutiplier timeout (-t meson test option).
And then I would rely on this overall timeout.
--
David Marchand
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2] test/service: fix spurious failures by extending timeout
2022-10-06 8:39 ` David Marchand
@ 2022-10-06 8:54 ` Mattias Rönnblom
0 siblings, 0 replies; 19+ messages in thread
From: Mattias Rönnblom @ 2022-10-06 8:54 UTC (permalink / raw)
To: David Marchand, Harry van Haaren
Cc: dev, dpdklab, ci, Honnappa.Nagarahalli, mb, thomas
On 2022-10-06 10:39, David Marchand wrote:
> On Thu, Oct 6, 2022 at 10:28 AM Harry van Haaren
> <harry.van.haaren@intel.com> wrote:
>>
>> This commit extends the timeout for service_may_be_active()
>> from 100ms to 1000ms. Local testing on a idle and loaded system
>> (compiling DPDK with all cores) always completes after 1 ms.
>>
>> The wait time for a service-lcore to finish is also extended
>> from 100ms to 1000ms.
>>
>> The same timeout waiting code was duplicated in two tests, and
>> is now refactored to a standalone function avoiding duplication.
>>
>> Reported-by: David Marchand <david.marchand@redhat.com>
>> Suggested-by: Mattias Ronnblom <mattias.ronnblom@ericsson.com>
>> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
>
> Just to be sure, do we want such a timeout in the test logic itself?
I think it depends on how quickly you want to produce a failure, and
also if there are some follow-up tests in the same autotest that you
want to proceed with, regardless of the outcome.
> Is it that you want to make sure that the synchronisation happens in a
> "reasonable" (subject to discussion ;-)) amount of time?
>
> Otherwise, the unit tests run in the CI are themselves subject to a
> 10s x mutiplier timeout (-t meson test option).
> And then I would rely on this overall timeout.
>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] test/service: fix spurious failures by extending timeout
2022-10-06 8:17 [PATCH] test/service: fix spurious failures by extending timeout Harry van Haaren
2022-10-06 8:28 ` [PATCH v2] " Harry van Haaren
@ 2022-10-06 8:37 ` Mattias Rönnblom
2022-10-06 12:52 ` [PATCH v3] " Harry van Haaren
2 siblings, 0 replies; 19+ messages in thread
From: Mattias Rönnblom @ 2022-10-06 8:37 UTC (permalink / raw)
To: Harry van Haaren, dev
Cc: david.marchand, dpdklab, ci, Honnappa.Nagarahalli, mb, thomas
On 2022-10-06 10:17, Harry van Haaren wrote:
> This commit extends the timeout for service_may_be_active()
> from 100ms to 1000ms. Local testing on a idle and loaded system
> (compiling DPDK with all cores) always completes after 1 ms.
>
> The same timeout waiting code was duplicated in two tests, and
> is now refactored to a standalone function avoiding duplication.
>
> Reported-by: David Marchand <david.marchand@redhat.com>
> Suggested-by: Mattias Ronnblom <mattias.ronnblom@ericsson.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> ---
> app/test/test_service_cores.c | 43 ++++++++++++++++-------------------
> 1 file changed, 20 insertions(+), 23 deletions(-)
>
> diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
> index 359b6dcd8b..5260e078c4 100644
> --- a/app/test/test_service_cores.c
> +++ b/app/test/test_service_cores.c
> @@ -921,12 +921,26 @@ service_lcore_start_stop(void)
> return unregister_all();
> }
>
> +static int
> +service_ensure_stopped_with_timeout(uint32_t sid)
> +{
> + /* give the service time to stop running */
> + int32_t timeout_ms = 1000; > + int i;
i and timeout_ms should have the same type. Or better, have timeout_ms
as a #define (in ms).
1 s timeout seems reasonable to me.
> + for (i = 0; i < timeout_ms; i++) {
> + if (!rte_service_may_be_active(sid))
> + break;
> + rte_delay_ms(SERVICE_DELAY);
rte_delay_ms(1);
> + }
> +
> + return rte_service_may_be_active(sid);
> +}
> +
> /* stop a service and wait for it to become inactive */
> static int
> service_may_be_active(void)
> {
> const uint32_t sid = 0;
> - int i;
>
> /* expected failure cases */
> TEST_ASSERT_EQUAL(-EINVAL, rte_service_may_be_active(10000),
> @@ -946,19 +960,11 @@ service_may_be_active(void)
> TEST_ASSERT_EQUAL(1, service_lcore_running_check(),
> "Service core expected to poll service but it didn't");
>
> - /* stop the service */
> + /* stop the service, and wait for not-active with timeout */
> TEST_ASSERT_EQUAL(0, rte_service_runstate_set(sid, 0),
> "Error: Service stop returned non-zero");
> -
> - /* give the service 100ms to stop running */
> - for (i = 0; i < 100; i++) {
> - if (!rte_service_may_be_active(sid))
> - break;
> - rte_delay_ms(SERVICE_DELAY);
> - }
> -
> - TEST_ASSERT_EQUAL(0, rte_service_may_be_active(sid),
> - "Error: Service not stopped after 100ms");
> + TEST_ASSERT_EQUAL(0, service_ensure_stopped_with_timeout(sid),
> + "Error: Service not stopped after timeout period.");
>
> return unregister_all();
> }
> @@ -972,7 +978,6 @@ service_active_two_cores(void)
> return TEST_SKIPPED;
>
> const uint32_t sid = 0;
> - int i;
>
> uint32_t lcore = rte_get_next_lcore(/* start core */ -1,
> /* skip main */ 1,
> @@ -1002,16 +1007,8 @@ service_active_two_cores(void)
> /* stop the service */
> TEST_ASSERT_EQUAL(0, rte_service_runstate_set(sid, 0),
> "Error: Service stop returned non-zero");
> -
> - /* give the service 100ms to stop running */
> - for (i = 0; i < 100; i++) {
> - if (!rte_service_may_be_active(sid))
> - break;
> - rte_delay_ms(SERVICE_DELAY);
> - }
> -
> - TEST_ASSERT_EQUAL(0, rte_service_may_be_active(sid),
> - "Error: Service not stopped after 100ms");
> + TEST_ASSERT_EQUAL(0, service_ensure_stopped_with_timeout(sid),
> + "Error: Service not stopped after timeout period.");
>
> return unregister_all();
> }
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v3] test/service: fix spurious failures by extending timeout
2022-10-06 8:17 [PATCH] test/service: fix spurious failures by extending timeout Harry van Haaren
2022-10-06 8:28 ` [PATCH v2] " Harry van Haaren
2022-10-06 8:37 ` [PATCH] " Mattias Rönnblom
@ 2022-10-06 12:52 ` Harry van Haaren
2022-10-06 13:27 ` Morten Brørup
2022-10-06 14:00 ` Mattias Rönnblom
2 siblings, 2 replies; 19+ messages in thread
From: Harry van Haaren @ 2022-10-06 12:52 UTC (permalink / raw)
To: dev
Cc: david.marchand, dpdklab, ci, Honnappa.Nagarahalli, mb,
mattias.ronnblom, thomas, Harry van Haaren
This commit extends the timeout for service_may_be_active()
from 100ms to 1000ms. Local testing on a idle and loaded system
(compiling DPDK with all cores) always completes after 1 ms.
The wait time for a service-lcore to finish is also extended
from 100ms to 1000ms.
The same timeout waiting code was duplicated in two tests, and
is now refactored to a standalone function avoiding duplication.
Reported-by: David Marchand <david.marchand@redhat.com>
Suggested-by: Mattias Ronnblom <mattias.ronnblom@ericsson.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
---
v3:
- Use #define for timeout, and delay(1) (Mattias)
- Rework slcore-wait to use TIMEOUT_MS as well.
v2:
- v1 addressed only testcase 15 issue, v2 also addresses test
case 5, which has an service-lcore wait code-path.
---
app/test/test_service_cores.c | 49 ++++++++++++++++-------------------
1 file changed, 23 insertions(+), 26 deletions(-)
diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index 359b6dcd8b..637fcd7cf9 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -22,6 +22,7 @@ static uint64_t service_tick;
static uint32_t service_remote_launch_flag;
#define SERVICE_DELAY 1
+#define TIMEOUT_MS 1000
#define DUMMY_SERVICE_NAME "dummy_service"
#define MT_SAFE_SERVICE_NAME "mt_safe_service"
@@ -123,15 +124,15 @@ unregister_all(void)
return TEST_SUCCESS;
}
-/* Wait until service lcore not active, or for 100x SERVICE_DELAY */
+/* Wait until service lcore not active, or for TIMEOUT_MS */
static void
wait_slcore_inactive(uint32_t slcore_id)
{
int i;
for (i = 0; rte_service_lcore_may_be_active(slcore_id) == 1 &&
- i < 100; i++)
- rte_delay_ms(SERVICE_DELAY);
+ i < TIMEOUT_MS; i++)
+ rte_delay_ms(1);
}
/* register a single dummy service */
@@ -921,12 +922,25 @@ service_lcore_start_stop(void)
return unregister_all();
}
+static int
+service_ensure_stopped_with_timeout(uint32_t sid)
+{
+ /* give the service time to stop running */
+ int i;
+ for (i = 0; i < TIMEOUT_MS; i++) {
+ if (!rte_service_may_be_active(sid))
+ break;
+ rte_delay_ms(1);
+ }
+
+ return rte_service_may_be_active(sid);
+}
+
/* stop a service and wait for it to become inactive */
static int
service_may_be_active(void)
{
const uint32_t sid = 0;
- int i;
/* expected failure cases */
TEST_ASSERT_EQUAL(-EINVAL, rte_service_may_be_active(10000),
@@ -946,19 +960,11 @@ service_may_be_active(void)
TEST_ASSERT_EQUAL(1, service_lcore_running_check(),
"Service core expected to poll service but it didn't");
- /* stop the service */
+ /* stop the service, and wait for not-active with timeout */
TEST_ASSERT_EQUAL(0, rte_service_runstate_set(sid, 0),
"Error: Service stop returned non-zero");
-
- /* give the service 100ms to stop running */
- for (i = 0; i < 100; i++) {
- if (!rte_service_may_be_active(sid))
- break;
- rte_delay_ms(SERVICE_DELAY);
- }
-
- TEST_ASSERT_EQUAL(0, rte_service_may_be_active(sid),
- "Error: Service not stopped after 100ms");
+ TEST_ASSERT_EQUAL(0, service_ensure_stopped_with_timeout(sid),
+ "Error: Service not stopped after timeout period.");
return unregister_all();
}
@@ -972,7 +978,6 @@ service_active_two_cores(void)
return TEST_SKIPPED;
const uint32_t sid = 0;
- int i;
uint32_t lcore = rte_get_next_lcore(/* start core */ -1,
/* skip main */ 1,
@@ -1002,16 +1007,8 @@ service_active_two_cores(void)
/* stop the service */
TEST_ASSERT_EQUAL(0, rte_service_runstate_set(sid, 0),
"Error: Service stop returned non-zero");
-
- /* give the service 100ms to stop running */
- for (i = 0; i < 100; i++) {
- if (!rte_service_may_be_active(sid))
- break;
- rte_delay_ms(SERVICE_DELAY);
- }
-
- TEST_ASSERT_EQUAL(0, rte_service_may_be_active(sid),
- "Error: Service not stopped after 100ms");
+ TEST_ASSERT_EQUAL(0, service_ensure_stopped_with_timeout(sid),
+ "Error: Service not stopped after timeout period.");
return unregister_all();
}
--
2.34.1
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: [PATCH v3] test/service: fix spurious failures by extending timeout
2022-10-06 12:52 ` [PATCH v3] " Harry van Haaren
@ 2022-10-06 13:27 ` Morten Brørup
2022-10-06 19:33 ` David Marchand
2022-10-06 14:00 ` Mattias Rönnblom
1 sibling, 1 reply; 19+ messages in thread
From: Morten Brørup @ 2022-10-06 13:27 UTC (permalink / raw)
To: Harry van Haaren, dev
Cc: david.marchand, dpdklab, ci, Honnappa.Nagarahalli,
mattias.ronnblom, thomas
> From: Harry van Haaren [mailto:harry.van.haaren@intel.com]
> Sent: Thursday, 6 October 2022 14.53
>
> This commit extends the timeout for service_may_be_active()
> from 100ms to 1000ms. Local testing on a idle and loaded system
> (compiling DPDK with all cores) always completes after 1 ms.
>
> The wait time for a service-lcore to finish is also extended
> from 100ms to 1000ms.
>
> The same timeout waiting code was duplicated in two tests, and
> is now refactored to a standalone function avoiding duplication.
>
> Reported-by: David Marchand <david.marchand@redhat.com>
> Suggested-by: Mattias Ronnblom <mattias.ronnblom@ericsson.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
>
> ---
Acked-by: Morten Brørup <mb@smartsharesystems.com>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] test/service: fix spurious failures by extending timeout
2022-10-06 13:27 ` Morten Brørup
@ 2022-10-06 19:33 ` David Marchand
2023-01-26 9:29 ` David Marchand
0 siblings, 1 reply; 19+ messages in thread
From: David Marchand @ 2022-10-06 19:33 UTC (permalink / raw)
To: Harry van Haaren
Cc: dev, dpdklab, ci, Honnappa.Nagarahalli, mattias.ronnblom, thomas,
Morten Brørup
On Thu, Oct 6, 2022 at 3:27 PM Morten Brørup <mb@smartsharesystems.com> wrote:
> > This commit extends the timeout for service_may_be_active()
> > from 100ms to 1000ms. Local testing on a idle and loaded system
> > (compiling DPDK with all cores) always completes after 1 ms.
> >
> > The wait time for a service-lcore to finish is also extended
> > from 100ms to 1000ms.
> >
> > The same timeout waiting code was duplicated in two tests, and
> > is now refactored to a standalone function avoiding duplication.
> >
> > Reported-by: David Marchand <david.marchand@redhat.com>
> > Suggested-by: Mattias Ronnblom <mattias.ronnblom@ericsson.com>
> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
Reviewed-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Ok, let's see if the situation gets better with this.
Applied, thanks.
--
David Marchand
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] test/service: fix spurious failures by extending timeout
2022-10-06 19:33 ` David Marchand
@ 2023-01-26 9:29 ` David Marchand
2023-01-31 17:24 ` Van Haaren, Harry
0 siblings, 1 reply; 19+ messages in thread
From: David Marchand @ 2023-01-26 9:29 UTC (permalink / raw)
To: Harry van Haaren
Cc: dev, dpdklab, ci, Honnappa.Nagarahalli, mattias.ronnblom, thomas,
Morten Brørup, Tyler Retzlaff, Aaron Conole
Hello Harry,
On Thu, Oct 6, 2022 at 9:33 PM David Marchand <david.marchand@redhat.com> wrote:
>
> On Thu, Oct 6, 2022 at 3:27 PM Morten Brørup <mb@smartsharesystems.com> wrote:
> > > This commit extends the timeout for service_may_be_active()
> > > from 100ms to 1000ms. Local testing on a idle and loaded system
> > > (compiling DPDK with all cores) always completes after 1 ms.
> > >
> > > The wait time for a service-lcore to finish is also extended
> > > from 100ms to 1000ms.
> > >
> > > The same timeout waiting code was duplicated in two tests, and
> > > is now refactored to a standalone function avoiding duplication.
> > >
> > > Reported-by: David Marchand <david.marchand@redhat.com>
> > > Suggested-by: Mattias Ronnblom <mattias.ronnblom@ericsson.com>
> > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > Acked-by: Morten Brørup <mb@smartsharesystems.com>
> Reviewed-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>
> Ok, let's see if the situation gets better with this.
> Applied, thanks.
I took a look at the january month failures at UNH.
Downloads/dpdk_31608e4db568_2023-01-03_06-58-00_NA/out/testlog.txt:EAL:
Test assert service_lcore_attr_get line 422 failed: Service lcore not
stopped after waiting.
Extending the timeout just made it less likely.
On a similar note, other parts are failing every once in a while:
Downloads/dpdk_2a211079a92e_25064_2023-01-24_15-08-50_NA/out/testlog.txt:EAL:
Test assert service_attr_get line 319 failed: attr_get() failed to get
cycles (expected > zero)
Downloads/dpdk_2a211079a92e_25074_2023-01-25_05-40-46_NA/out/testlog.txt:EAL:
Test assert service_lcore_start_stop line 900 failed: Service core
expected to poll service but it didn't
Downloads/dpdk_2a211079a92e_25075_2023-01-25_09-15-58_NA/out/testlog.txt:EAL:
Test assert service_lcore_start_stop line 900 failed: Service core
expected to poll service but it didn't
Downloads/dpdk_373f4c7de8ff_24866_2023-01-03_22-56-01_NA/out/testlog.txt:EAL:
Test assert service_lcore_start_stop line 900 failed: Service core
expected to poll service but it didn't
Downloads/dpdk_83397b9f0739_25030_2023-01-18_18-30-19_NA/out/testlog.txt:EAL:
Test assert service_lcore_start_stop line 901 failed: Service core
expected to poll service but it didn't
The timeout approach just does not have its place in a functional test.
Either this test is rewritten, or it must go to the performance tests
list so that we stop getting false positives.
Can you work on this?
Thanks.
--
David Marchand
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: [PATCH v3] test/service: fix spurious failures by extending timeout
2023-01-26 9:29 ` David Marchand
@ 2023-01-31 17:24 ` Van Haaren, Harry
2023-02-03 15:03 ` Van Haaren, Harry
0 siblings, 1 reply; 19+ messages in thread
From: Van Haaren, Harry @ 2023-01-31 17:24 UTC (permalink / raw)
To: David Marchand
Cc: dev, dpdklab, ci, Honnappa.Nagarahalli, mattias.ronnblom, thomas,
Morten Brørup, Tyler Retzlaff, Aaron Conole
> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Thursday, January 26, 2023 9:30 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: dev@dpdk.org; dpdklab@iol.unh.edu; ci@dpdk.org;
> Honnappa.Nagarahalli@arm.com; mattias.ronnblom
> <mattias.ronnblom@ericsson.com>; thomas@monjalon.net; Morten Brørup
> <mb@smartsharesystems.com>; Tyler Retzlaff <roretzla@linux.microsoft.com>;
> Aaron Conole <aconole@redhat.com>
> Subject: Re: [PATCH v3] test/service: fix spurious failures by extending timeout
>
> Hello Harry,
Hi David,
> On Thu, Oct 6, 2022 at 9:33 PM David Marchand <david.marchand@redhat.com>
> wrote:
> >
> > On Thu, Oct 6, 2022 at 3:27 PM Morten Brørup <mb@smartsharesystems.com>
> wrote:
> > > > This commit extends the timeout for service_may_be_active()
> > > > from 100ms to 1000ms. Local testing on a idle and loaded system
> > > > (compiling DPDK with all cores) always completes after 1 ms.
> > > >
> > > > The wait time for a service-lcore to finish is also extended
> > > > from 100ms to 1000ms.
> > > >
> > > > The same timeout waiting code was duplicated in two tests, and
> > > > is now refactored to a standalone function avoiding duplication.
> > > >
> > > > Reported-by: David Marchand <david.marchand@redhat.com>
> > > > Suggested-by: Mattias Ronnblom <mattias.ronnblom@ericsson.com>
> > > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > > Acked-by: Morten Brørup <mb@smartsharesystems.com>
> > Reviewed-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >
> > Ok, let's see if the situation gets better with this.
> > Applied, thanks.
>
> I took a look at the january month failures at UNH.
>
> Downloads/dpdk_31608e4db568_2023-01-03_06-58-00_NA/out/testlog.txt:EAL:
> Test assert service_lcore_attr_get line 422 failed: Service lcore not
> stopped after waiting.
> Extending the timeout just made it less likely.
Aha, okay.
<snip>
> The timeout approach just does not have its place in a functional test.
> Either this test is rewritten, or it must go to the performance tests
> list so that we stop getting false positives.
> Can you work on this?
I'll investigate various approaches on Thursday and reply here with suggested next steps.
Regards, -Harry
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: [PATCH v3] test/service: fix spurious failures by extending timeout
2023-01-31 17:24 ` Van Haaren, Harry
@ 2023-02-03 15:03 ` Van Haaren, Harry
2023-02-03 15:12 ` Bruce Richardson
2023-02-03 15:16 ` Thomas Monjalon
0 siblings, 2 replies; 19+ messages in thread
From: Van Haaren, Harry @ 2023-02-03 15:03 UTC (permalink / raw)
To: David Marchand
Cc: dev, dpdklab, ci, Honnappa.Nagarahalli, mattias.ronnblom, thomas,
Morten Brørup, Tyler Retzlaff, Aaron Conole
> -----Original Message-----
> From: Van Haaren, Harry
> Sent: Tuesday, January 31, 2023 5:25 PM
> To: David Marchand <david.marchand@redhat.com>
> Cc: dev@dpdk.org; dpdklab@iol.unh.edu; ci@dpdk.org;
> Honnappa.Nagarahalli@arm.com; mattias.ronnblom
> <mattias.ronnblom@ericsson.com>; thomas@monjalon.net; Morten Brørup
> <mb@smartsharesystems.com>; Tyler Retzlaff <roretzla@linux.microsoft.com>;
> Aaron Conole <aconole@redhat.com>
> Subject: RE: [PATCH v3] test/service: fix spurious failures by extending timeout
<snip>
> <snip>
> > The timeout approach just does not have its place in a functional test.
> > Either this test is rewritten, or it must go to the performance tests
> > list so that we stop getting false positives.
> > Can you work on this?
>
> I'll investigate various approaches on Thursday and reply here with suggested
> next steps.
I've identified 3 checks that fail in CI (from the above log outputs), all 3 cases
Have different dlays: 100 ms delay, 200 ms delay and 1000ms.
In the CI, the service-core just hasn't been scheduled (yet) and causes the "failure".
Option 1)
One option is to while(1) loop, waiting for the service-thread to be scheduled. This can be
seen as "increasing the timeout", however in this case the test-case would be errored
not in the test-code, but in the meson-test runner as a timeout (with a 10sec default?)
The benefit here is that massively increasing (~1sec or less to 10 sec) will cover all/many
of the CI timeouts.
Option 2)
Move to perf-tests, and not run these in a noisy-CI environment where the results are not
consistent enough to have value. This would mean that the tests are not run in CI for the
3 checks in question are below, they all *require* the service core to be scheduled:
service_attr_get() -> requires service core to run for service stats to increment
service_lcore_attr_get() -> requires service core to run for lcore stats to increment
service_lcore_start_stop() -> requires service to run to to ensure service-func itself executes.
I don't see how we can "improve" option 2 to not require the service-thread to be scheduled by the OS..
And the only way to make the OS schedule it in the CI more consistently is to give it more time?
Thoughts and input welcomed, I'm happy to make the code changes themselves, its small effort
For both option 1 & 2.
Regards, -Harry
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] test/service: fix spurious failures by extending timeout
2023-02-03 15:03 ` Van Haaren, Harry
@ 2023-02-03 15:12 ` Bruce Richardson
2023-02-23 20:10 ` Thomas Monjalon
2023-02-03 15:16 ` Thomas Monjalon
1 sibling, 1 reply; 19+ messages in thread
From: Bruce Richardson @ 2023-02-03 15:12 UTC (permalink / raw)
To: Van Haaren, Harry
Cc: David Marchand, dev, dpdklab, ci, Honnappa.Nagarahalli,
mattias.ronnblom, thomas, Morten Brørup, Tyler Retzlaff,
Aaron Conole
On Fri, Feb 03, 2023 at 03:03:38PM +0000, Van Haaren, Harry wrote:
> > -----Original Message-----
> > From: Van Haaren, Harry
> > Sent: Tuesday, January 31, 2023 5:25 PM
> > To: David Marchand <david.marchand@redhat.com>
> > Cc: dev@dpdk.org; dpdklab@iol.unh.edu; ci@dpdk.org;
> > Honnappa.Nagarahalli@arm.com; mattias.ronnblom
> > <mattias.ronnblom@ericsson.com>; thomas@monjalon.net; Morten Brørup
> > <mb@smartsharesystems.com>; Tyler Retzlaff <roretzla@linux.microsoft.com>;
> > Aaron Conole <aconole@redhat.com>
> > Subject: RE: [PATCH v3] test/service: fix spurious failures by extending timeout
>
> <snip>
>
> > <snip>
> > > The timeout approach just does not have its place in a functional test.
> > > Either this test is rewritten, or it must go to the performance tests
> > > list so that we stop getting false positives.
> > > Can you work on this?
> >
> > I'll investigate various approaches on Thursday and reply here with suggested
> > next steps.
>
> I've identified 3 checks that fail in CI (from the above log outputs), all 3 cases
> Have different dlays: 100 ms delay, 200 ms delay and 1000ms.
> In the CI, the service-core just hasn't been scheduled (yet) and causes the "failure".
>
For me, the question is - why hasn't the service-core been scheduled? Can
we use sched-yield or some other mechanism to force a wakeup of it?
/Bruce
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] test/service: fix spurious failures by extending timeout
2023-02-03 15:12 ` Bruce Richardson
@ 2023-02-23 20:10 ` Thomas Monjalon
2023-02-27 8:41 ` Van Haaren, Harry
0 siblings, 1 reply; 19+ messages in thread
From: Thomas Monjalon @ 2023-02-23 20:10 UTC (permalink / raw)
To: Van Haaren, Harry, dev
Cc: David Marchand, dev, dpdklab, ci, Honnappa.Nagarahalli,
mattias. ronnblom, Morten Brørup, Tyler Retzlaff,
Aaron Conole, Bruce Richardson
03/02/2023 16:12, Bruce Richardson:
> On Fri, Feb 03, 2023 at 03:03:38PM +0000, Van Haaren, Harry wrote:
> > From: Van Haaren, Harry
> >
> > <snip>
> >
> > > > The timeout approach just does not have its place in a functional test.
> > > > Either this test is rewritten, or it must go to the performance tests
> > > > list so that we stop getting false positives.
> > > > Can you work on this?
> > >
> > > I'll investigate various approaches on Thursday and reply here with suggested
> > > next steps.
> >
> > I've identified 3 checks that fail in CI (from the above log outputs), all 3 cases
> > Have different dlays: 100 ms delay, 200 ms delay and 1000ms.
> > In the CI, the service-core just hasn't been scheduled (yet) and causes the "failure".
>
> For me, the question is - why hasn't the service-core been scheduled? Can
> we use sched-yield or some other mechanism to force a wakeup of it?
Harry, you didn't reply to this question please.
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: [PATCH v3] test/service: fix spurious failures by extending timeout
2023-02-23 20:10 ` Thomas Monjalon
@ 2023-02-27 8:41 ` Van Haaren, Harry
0 siblings, 0 replies; 19+ messages in thread
From: Van Haaren, Harry @ 2023-02-27 8:41 UTC (permalink / raw)
To: Thomas Monjalon, dev
Cc: David Marchand, dev, dpdklab, ci, Honnappa.Nagarahalli,
mattias.ronnblom, Morten Brørup, Tyler Retzlaff,
Aaron Conole, Richardson, Bruce
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Thursday, February 23, 2023 8:11 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> Cc: David Marchand <david.marchand@redhat.com>; dev@dpdk.org;
> dpdklab@iol.unh.edu; ci@dpdk.org; Honnappa.Nagarahalli@arm.com;
> mattias.ronnblom <mattias.ronnblom@ericsson.com>; Morten Brørup
> <mb@smartsharesystems.com>; Tyler Retzlaff <roretzla@linux.microsoft.com>;
> Aaron Conole <aconole@redhat.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Subject: Re: [PATCH v3] test/service: fix spurious failures by extending timeout
>
> 03/02/2023 16:12, Bruce Richardson:
> > On Fri, Feb 03, 2023 at 03:03:38PM +0000, Van Haaren, Harry wrote:
> > > From: Van Haaren, Harry
> > >
> > > <snip>
> > >
> > > > > The timeout approach just does not have its place in a functional test.
> > > > > Either this test is rewritten, or it must go to the performance tests
> > > > > list so that we stop getting false positives.
> > > > > Can you work on this?
> > > >
> > > > I'll investigate various approaches on Thursday and reply here with suggested
> > > > next steps.
> > >
> > > I've identified 3 checks that fail in CI (from the above log outputs), all 3 cases
> > > Have different dlays: 100 ms delay, 200 ms delay and 1000ms.
> > > In the CI, the service-core just hasn't been scheduled (yet) and causes the
> "failure".
> >
> > For me, the question is - why hasn't the service-core been scheduled? Can
> > we use sched-yield or some other mechanism to force a wakeup of it?
>
> Harry, you didn't reply to this question please.
Same answer as similar topic in the other thread:
> > I'm not aware of a way to make *a specific other pthread* wakeup. We could sacrifice
> > the current lcore that's waiting for the service-lcore, with a sched_yield() as you suggest.
> > It would potentially "churn" the scheduler enough to give the service core some CPU?
> > It's a guess/gamble in the end, kind of like the timeouts we have today..
Patch sent as requested in other email.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] test/service: fix spurious failures by extending timeout
2023-02-03 15:03 ` Van Haaren, Harry
2023-02-03 15:12 ` Bruce Richardson
@ 2023-02-03 15:16 ` Thomas Monjalon
2023-02-03 16:09 ` Van Haaren, Harry
1 sibling, 1 reply; 19+ messages in thread
From: Thomas Monjalon @ 2023-02-03 15:16 UTC (permalink / raw)
To: David Marchand, Van Haaren, Harry
Cc: dev, dpdklab, ci, Honnappa.Nagarahalli, mattias. ronnblom,
Morten Brørup, Tyler Retzlaff, Aaron Conole
03/02/2023 16:03, Van Haaren, Harry:
> From: Van Haaren, Harry
> > > The timeout approach just does not have its place in a functional test.
> > > Either this test is rewritten, or it must go to the performance tests
> > > list so that we stop getting false positives.
> > > Can you work on this?
> >
> > I'll investigate various approaches on Thursday and reply here with suggested
> > next steps.
>
> I've identified 3 checks that fail in CI (from the above log outputs), all 3 cases
> Have different dlays: 100 ms delay, 200 ms delay and 1000ms.
> In the CI, the service-core just hasn't been scheduled (yet) and causes the "failure".
>
> Option 1)
> One option is to while(1) loop, waiting for the service-thread to be scheduled. This can be
> seen as "increasing the timeout", however in this case the test-case would be errored
> not in the test-code, but in the meson-test runner as a timeout (with a 10sec default?)
> The benefit here is that massively increasing (~1sec or less to 10 sec) will cover all/many
> of the CI timeouts.
>
> Option 2)
> Move to perf-tests, and not run these in a noisy-CI environment where the results are not
> consistent enough to have value. This would mean that the tests are not run in CI for the
> 3 checks in question are below, they all *require* the service core to be scheduled:
> service_attr_get() -> requires service core to run for service stats to increment
> service_lcore_attr_get() -> requires service core to run for lcore stats to increment
> service_lcore_start_stop() -> requires service to run to to ensure service-func itself executes.
>
> I don't see how we can "improve" option 2 to not require the service-thread to be scheduled by the OS..
> And the only way to make the OS schedule it in the CI more consistently is to give it more time?
We are talking about seconds.
There are setups where scheduling a thread is taking seconds?
> Thoughts and input welcomed, I'm happy to make the code changes themselves, its small effort
> For both option 1 & 2.
For time-sensitive tests, yes they should be in perf tests category.
As David said earlier, no timeout approach in functional tests.
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: [PATCH v3] test/service: fix spurious failures by extending timeout
2023-02-03 15:16 ` Thomas Monjalon
@ 2023-02-03 16:09 ` Van Haaren, Harry
2023-02-23 20:15 ` Thomas Monjalon
0 siblings, 1 reply; 19+ messages in thread
From: Van Haaren, Harry @ 2023-02-03 16:09 UTC (permalink / raw)
To: Thomas Monjalon, David Marchand
Cc: dev, dpdklab, ci, Honnappa.Nagarahalli, mattias.ronnblom,
Morten Brørup, Tyler Retzlaff, Aaron Conole
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Friday, February 3, 2023 3:16 PM
> To: David Marchand <david.marchand@redhat.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>
> Cc: dev@dpdk.org; dpdklab@iol.unh.edu; ci@dpdk.org;
> Honnappa.Nagarahalli@arm.com; mattias.ronnblom
> <mattias.ronnblom@ericsson.com>; Morten Brørup
> <mb@smartsharesystems.com>; Tyler Retzlaff <roretzla@linux.microsoft.com>;
> Aaron Conole <aconole@redhat.com>
> Subject: Re: [PATCH v3] test/service: fix spurious failures by extending timeout
>
> 03/02/2023 16:03, Van Haaren, Harry:
> > From: Van Haaren, Harry
> > > > The timeout approach just does not have its place in a functional test.
> > > > Either this test is rewritten, or it must go to the performance tests
> > > > list so that we stop getting false positives.
> > > > Can you work on this?
> > >
> > > I'll investigate various approaches on Thursday and reply here with suggested
> > > next steps.
> >
> > I've identified 3 checks that fail in CI (from the above log outputs), all 3 cases
> > Have different dlays: 100 ms delay, 200 ms delay and 1000ms.
> > In the CI, the service-core just hasn't been scheduled (yet) and causes the
> "failure".
> >
> > Option 1)
> > One option is to while(1) loop, waiting for the service-thread to be scheduled.
> This can be
> > seen as "increasing the timeout", however in this case the test-case would be
> errored
> > not in the test-code, but in the meson-test runner as a timeout (with a 10sec
> default?)
> > The benefit here is that massively increasing (~1sec or less to 10 sec) will cover
> all/many
> > of the CI timeouts.
> >
> > Option 2)
> > Move to perf-tests, and not run these in a noisy-CI environment where the
> results are not
> > consistent enough to have value. This would mean that the tests are not run in
> CI for the
> > 3 checks in question are below, they all *require* the service core to be
> scheduled:
> > service_attr_get() -> requires service core to run for service stats to increment
> > service_lcore_attr_get() -> requires service core to run for lcore stats to
> increment
> > service_lcore_start_stop() -> requires service to run to to ensure service-func
> itself executes.
> >
> > I don't see how we can "improve" option 2 to not require the service-thread to
> be scheduled by the OS..
> > And the only way to make the OS schedule it in the CI more consistently is to
> give it more time?
>
> We are talking about seconds.
> There are setups where scheduling a thread is taking seconds?
Apparently so - otherwise these tests would always pass.
They *only* fail at random runs in CI, and reliably pass everywhere else.. I've not had
them fail locally, and that includes running in a loop for hours with a busy system..
but not a low-priority CI VM in a busy datacenter.
[Bruce wrote in separate mail]
>>> For me, the question is - why hasn't the service-core been scheduled? Can
>>> we use sched-yield or some other mechanism to force a wakeup of it?
I'm not aware of a way to make *a specific other pthread* wakeup. We could sacrifice
the current lcore that's waiting for the service-lcore, with a sched_yield() as you suggest.
It would potentially "churn" the scheduler enough to give the service core some CPU?
It's a guess/gamble in the end, kind of like the timeouts we have today..
> > Thoughts and input welcomed, I'm happy to make the code changes
> themselves, its small effort
> > For both option 1 & 2.
>
> For time-sensitive tests, yes they should be in perf tests category.
> As David said earlier, no timeout approach in functional tests.
Ok, as before, option 1) is to while(1) and wait for "success". Then there's
no timeout in the test code, but our meson test runner will time-out/fail after ~10sec IIRC.
Or we move the tests perf-tests, as per Option 2), and these simply won't run in CI.
I'm OK with all 3 (including testing with sched_yield() for a month or two and if that helps?)
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] test/service: fix spurious failures by extending timeout
2023-02-03 16:09 ` Van Haaren, Harry
@ 2023-02-23 20:15 ` Thomas Monjalon
2023-02-27 8:41 ` Van Haaren, Harry
0 siblings, 1 reply; 19+ messages in thread
From: Thomas Monjalon @ 2023-02-23 20:15 UTC (permalink / raw)
To: Van Haaren, Harry
Cc: David Marchand, dev, dpdklab, ci, Honnappa.Nagarahalli,
mattias. ronnblom, Morten Brørup, Tyler Retzlaff,
Aaron Conole, bruce.richardson
03/02/2023 17:09, Van Haaren, Harry:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 03/02/2023 16:03, Van Haaren, Harry:
> > > From: Van Haaren, Harry
> > > > > The timeout approach just does not have its place in a functional test.
> > > > > Either this test is rewritten, or it must go to the performance tests
> > > > > list so that we stop getting false positives.
> > > > > Can you work on this?
> > > >
> > > > I'll investigate various approaches on Thursday and reply here with suggested
> > > > next steps.
> > >
> > > I've identified 3 checks that fail in CI (from the above log outputs), all 3 cases
> > > Have different dlays: 100 ms delay, 200 ms delay and 1000ms.
> > > In the CI, the service-core just hasn't been scheduled (yet) and causes the
> > "failure".
> > >
> > > Option 1)
> > > One option is to while(1) loop, waiting for the service-thread to be scheduled.
> > This can be
> > > seen as "increasing the timeout", however in this case the test-case would be
> > errored
> > > not in the test-code, but in the meson-test runner as a timeout (with a 10sec
> > default?)
> > > The benefit here is that massively increasing (~1sec or less to 10 sec) will cover
> > all/many
> > > of the CI timeouts.
> > >
> > > Option 2)
> > > Move to perf-tests, and not run these in a noisy-CI environment where the
> > results are not
> > > consistent enough to have value. This would mean that the tests are not run in
> > CI for the
> > > 3 checks in question are below, they all *require* the service core to be
> > scheduled:
> > > service_attr_get() -> requires service core to run for service stats to increment
> > > service_lcore_attr_get() -> requires service core to run for lcore stats to
> > increment
> > > service_lcore_start_stop() -> requires service to run to to ensure service-func
> > itself executes.
> > >
> > > I don't see how we can "improve" option 2 to not require the service-thread to
> > be scheduled by the OS..
> > > And the only way to make the OS schedule it in the CI more consistently is to
> > give it more time?
> >
> > We are talking about seconds.
> > There are setups where scheduling a thread is taking seconds?
>
> Apparently so - otherwise these tests would always pass.
>
> They *only* fail at random runs in CI, and reliably pass everywhere else.. I've not had
> them fail locally, and that includes running in a loop for hours with a busy system..
> but not a low-priority CI VM in a busy datacenter.
>
>
> [Bruce wrote in separate mail]
Bruce was not Cc'ed in this reply.
> >>> For me, the question is - why hasn't the service-core been scheduled? Can
> >>> we use sched-yield or some other mechanism to force a wakeup of it?
>
> I'm not aware of a way to make *a specific other pthread* wakeup. We could sacrifice
> the current lcore that's waiting for the service-lcore, with a sched_yield() as you suggest.
> It would potentially "churn" the scheduler enough to give the service core some CPU?
> It's a guess/gamble in the end, kind of like the timeouts we have today..
>
> > > Thoughts and input welcomed, I'm happy to make the code changes
> > themselves, its small effort
> > > For both option 1 & 2.
> >
> > For time-sensitive tests, yes they should be in perf tests category.
> > As David said earlier, no timeout approach in functional tests.
>
> Ok, as before, option 1) is to while(1) and wait for "success". Then there's
> no timeout in the test code, but our meson test runner will time-out/fail after ~10sec IIRC.
>
> Or we move the tests perf-tests, as per Option 2), and these simply won't run in CI.
>
> I'm OK with all 3 (including testing with sched_yield() for a month or two and if that helps?)
Did you send a patch to go in a direction or another?
If not, please move the test to perf-test as suggested before.
We are still hitting the issues in the CI and it is *very* annoying.
It is consuming time of a lot of people for a lot of patches,
just to check it is again an issue with this test.
Please let's remove this test from the CI now.
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: [PATCH v3] test/service: fix spurious failures by extending timeout
2023-02-23 20:15 ` Thomas Monjalon
@ 2023-02-27 8:41 ` Van Haaren, Harry
0 siblings, 0 replies; 19+ messages in thread
From: Van Haaren, Harry @ 2023-02-27 8:41 UTC (permalink / raw)
To: Thomas Monjalon
Cc: David Marchand, dev, dpdklab, ci, Honnappa.Nagarahalli,
mattias.ronnblom, Morten Brørup, Tyler Retzlaff,
Aaron Conole, Richardson, Bruce
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Thursday, February 23, 2023 8:15 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: David Marchand <david.marchand@redhat.com>; dev@dpdk.org;
> dpdklab@iol.unh.edu; ci@dpdk.org; Honnappa.Nagarahalli@arm.com;
> mattias.ronnblom <mattias.ronnblom@ericsson.com>; Morten Brørup
> <mb@smartsharesystems.com>; Tyler Retzlaff <roretzla@linux.microsoft.com>;
> Aaron Conole <aconole@redhat.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Subject: Re: [PATCH v3] test/service: fix spurious failures by extending timeout
<snip>
> > > We are talking about seconds.
> > > There are setups where scheduling a thread is taking seconds?
> >
> > Apparently so - otherwise these tests would always pass.
> >
> > They *only* fail at random runs in CI, and reliably pass everywhere else.. I've not
> had
> > them fail locally, and that includes running in a loop for hours with a busy system..
> > but not a low-priority CI VM in a busy datacenter.
> >
> >
> > [Bruce wrote in separate mail]
>
> Bruce was not Cc'ed in this reply.
Correct, I missed that he wasn't on the thread already, thanks for adding him on CC.
> > >>> For me, the question is - why hasn't the service-core been scheduled? Can
> > >>> we use sched-yield or some other mechanism to force a wakeup of it?
> >
> > I'm not aware of a way to make *a specific other pthread* wakeup. We could
> sacrifice
> > the current lcore that's waiting for the service-lcore, with a sched_yield() as you
> suggest.
> > It would potentially "churn" the scheduler enough to give the service core some
> CPU?
> > It's a guess/gamble in the end, kind of like the timeouts we have today..
> >
> > > > Thoughts and input welcomed, I'm happy to make the code changes
> > > themselves, its small effort
> > > > For both option 1 & 2.
> > >
> > > For time-sensitive tests, yes they should be in perf tests category.
> > > As David said earlier, no timeout approach in functional tests.
> >
> > Ok, as before, option 1) is to while(1) and wait for "success". Then there's
> > no timeout in the test code, but our meson test runner will time-out/fail after
> ~10sec IIRC.
> >
> > Or we move the tests perf-tests, as per Option 2), and these simply won't run in
> CI.
> >
> > I'm OK with all 3 (including testing with sched_yield() for a month or two and if
> that helps?)
>
> Did you send a patch to go in a direction or another?
> If not, please move the test to perf-test as suggested before.
> We are still hitting the issues in the CI and it is *very* annoying.
> It is consuming time of a lot of people for a lot of patches,
> just to check it is again an issue with this test.
>
> Please let's remove this test from the CI now.
Patch sent: http://patches.dpdk.org/project/dpdk/patch/20230224173637.243266-1-harry.van.haaren@intel.com/
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] test/service: fix spurious failures by extending timeout
2022-10-06 12:52 ` [PATCH v3] " Harry van Haaren
2022-10-06 13:27 ` Morten Brørup
@ 2022-10-06 14:00 ` Mattias Rönnblom
1 sibling, 0 replies; 19+ messages in thread
From: Mattias Rönnblom @ 2022-10-06 14:00 UTC (permalink / raw)
To: Harry van Haaren, dev
Cc: david.marchand, dpdklab, ci, Honnappa.Nagarahalli, mb, thomas
On 2022-10-06 14:52, Harry van Haaren wrote:
> This commit extends the timeout for service_may_be_active()
> from 100ms to 1000ms. Local testing on a idle and loaded system
> (compiling DPDK with all cores) always completes after 1 ms.
>
> The wait time for a service-lcore to finish is also extended
> from 100ms to 1000ms.
>
> The same timeout waiting code was duplicated in two tests, and
> is now refactored to a standalone function avoiding duplication.
>
> Reported-by: David Marchand <david.marchand@redhat.com>
> Suggested-by: Mattias Ronnblom <mattias.ronnblom@ericsson.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
>
> ---
>
> v3:
> - Use #define for timeout, and delay(1) (Mattias)
> - Rework slcore-wait to use TIMEOUT_MS as well.
>
> v2:
> - v1 addressed only testcase 15 issue, v2 also addresses test
> case 5, which has an service-lcore wait code-path.
> ---
> app/test/test_service_cores.c | 49 ++++++++++++++++-------------------
> 1 file changed, 23 insertions(+), 26 deletions(-)
>
> diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
> index 359b6dcd8b..637fcd7cf9 100644
> --- a/app/test/test_service_cores.c
> +++ b/app/test/test_service_cores.c
> @@ -22,6 +22,7 @@ static uint64_t service_tick;
> static uint32_t service_remote_launch_flag;
>
> #define SERVICE_DELAY 1
> +#define TIMEOUT_MS 1000
>
> #define DUMMY_SERVICE_NAME "dummy_service"
> #define MT_SAFE_SERVICE_NAME "mt_safe_service"
> @@ -123,15 +124,15 @@ unregister_all(void)
> return TEST_SUCCESS;
> }
>
> -/* Wait until service lcore not active, or for 100x SERVICE_DELAY */
> +/* Wait until service lcore not active, or for TIMEOUT_MS */
> static void
> wait_slcore_inactive(uint32_t slcore_id)
> {
> int i;
>
> for (i = 0; rte_service_lcore_may_be_active(slcore_id) == 1 &&
> - i < 100; i++)
> - rte_delay_ms(SERVICE_DELAY);
> + i < TIMEOUT_MS; i++)
> + rte_delay_ms(1);
> }
>
> /* register a single dummy service */
> @@ -921,12 +922,25 @@ service_lcore_start_stop(void)
> return unregister_all();
> }
>
> +static int
> +service_ensure_stopped_with_timeout(uint32_t sid)
> +{
> + /* give the service time to stop running */
> + int i;
> + for (i = 0; i < TIMEOUT_MS; i++) {
> + if (!rte_service_may_be_active(sid))
> + break;
> + rte_delay_ms(1);
> + }
> +
> + return rte_service_may_be_active(sid);
> +}
> +
> /* stop a service and wait for it to become inactive */
> static int
> service_may_be_active(void)
> {
> const uint32_t sid = 0;
> - int i;
>
> /* expected failure cases */
> TEST_ASSERT_EQUAL(-EINVAL, rte_service_may_be_active(10000),
> @@ -946,19 +960,11 @@ service_may_be_active(void)
> TEST_ASSERT_EQUAL(1, service_lcore_running_check(),
> "Service core expected to poll service but it didn't");
>
> - /* stop the service */
> + /* stop the service, and wait for not-active with timeout */
> TEST_ASSERT_EQUAL(0, rte_service_runstate_set(sid, 0),
> "Error: Service stop returned non-zero");
> -
> - /* give the service 100ms to stop running */
> - for (i = 0; i < 100; i++) {
> - if (!rte_service_may_be_active(sid))
> - break;
> - rte_delay_ms(SERVICE_DELAY);
> - }
> -
> - TEST_ASSERT_EQUAL(0, rte_service_may_be_active(sid),
> - "Error: Service not stopped after 100ms");
> + TEST_ASSERT_EQUAL(0, service_ensure_stopped_with_timeout(sid),
> + "Error: Service not stopped after timeout period.");
>
> return unregister_all();
> }
> @@ -972,7 +978,6 @@ service_active_two_cores(void)
> return TEST_SKIPPED;
>
> const uint32_t sid = 0;
> - int i;
>
> uint32_t lcore = rte_get_next_lcore(/* start core */ -1,
> /* skip main */ 1,
> @@ -1002,16 +1007,8 @@ service_active_two_cores(void)
> /* stop the service */
> TEST_ASSERT_EQUAL(0, rte_service_runstate_set(sid, 0),
> "Error: Service stop returned non-zero");
> -
> - /* give the service 100ms to stop running */
> - for (i = 0; i < 100; i++) {
> - if (!rte_service_may_be_active(sid))
> - break;
> - rte_delay_ms(SERVICE_DELAY);
> - }
> -
> - TEST_ASSERT_EQUAL(0, rte_service_may_be_active(sid),
> - "Error: Service not stopped after 100ms");
> + TEST_ASSERT_EQUAL(0, service_ensure_stopped_with_timeout(sid),
> + "Error: Service not stopped after timeout period.");
>
> return unregister_all();
> }
Reviewed-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
^ permalink raw reply [flat|nested] 19+ messages in thread