DPDK patches and discussions
* [dpdk-dev] [PATCH] service: fix stop API to wait for service thread
@ 2020-07-20 12:09 ` Harry van Haaren
  2020-07-20 12:51   ` Lukasz Wojciechowski
  2020-07-20 14:38   ` [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active Harry van Haaren
  0 siblings, 2 replies; 31+ messages in thread
From: Harry van Haaren @ 2020-07-20 12:09 UTC (permalink / raw)
  To: dev
  Cc: david.marchand, igor.romanov, honnappa.nagarahalli, ferruh.yigit,
	nd, aconole, l.wojciechow, Harry van Haaren

This commit improves the service_lcore_stop() implementation,
making it wait for the service core in question to return. The
service thread now has a variable to indicate whether it is
active. When zero, the service thread has completed its service
and has returned from the service_runner_func() function.

This fixes a race condition observed in the DPDK CI, where the
statistics of the service were not consistent with the expectation
due to the service thread still running, and incrementing a stat
after stop was called.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>

---

This is one possible solution that avoids a class of race-conditions
based on the stop() API and the behaviour that follows it. Without a
change in the implementation of the service core thread, we could not
detect when the thread had actually finished. This is now possible, and
the stop API makes use of it to wait for up to 1000x one millisecond,
or log a warning that a service core didn't return quickly.

Thanks for the discussion/debug on the list - I'm not sure how to add
reported-by/suggested-by etc. tags, but I'll resend a V2 (or they can
be added on apply).

---
 lib/librte_eal/common/rte_service.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 6a0e0ff65..d2255587d 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -65,6 +65,7 @@ struct core_state {
 	/* map of services IDs are run on this core */
 	uint64_t service_mask;
 	uint8_t runstate; /* running or stopped */
+	uint8_t thread_active; /* indicates when the thread is in service_run() */
 	uint8_t is_service_core; /* set if core is currently a service core */
 	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
 	uint64_t loops;
@@ -457,6 +458,8 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
+	__atomic_store_n(&cs->thread_active, 1, __ATOMIC_RELAXED);
+
 	/* runstate act as the guard variable. Use load-acquire
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
@@ -475,6 +478,7 @@ service_runner_func(void *arg)
 		cs->loops++;
 	}
 
+	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_RELAXED);
 	return 0;
 }
 
@@ -765,6 +769,26 @@ rte_service_lcore_stop(uint32_t lcore)
 	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
 		__ATOMIC_RELEASE);
 
+	/* wait for service lcore to return */
+	i = 0;
+	uint8_t active;
+	uint64_t start = rte_rdtsc();
+	do {
+		active = __atomic_load_n(&lcore_states[lcore].thread_active,
+					 __ATOMIC_RELAXED);
+		if (active == 0)
+			break;
+		rte_delay_ms(1);
+		i++;
+	} while (i < 1000);
+
+	if (active != 0) {
+		uint64_t end = rte_rdtsc();
+		RTE_LOG(WARNING, EAL,
+			"service lcore stop() failed, waited for %ld cycles\n",
+			end - start);
+	}
+
 	return 0;
 }
 
-- 
2.17.1



* Re: [dpdk-dev] [PATCH] service: fix stop API to wait for service thread
  2020-07-20 12:09 ` [dpdk-dev] [PATCH] service: fix stop API to wait for service thread Harry van Haaren
@ 2020-07-20 12:51   ` Lukasz Wojciechowski
  2020-07-20 14:20     ` Van Haaren, Harry
  2020-07-20 14:38   ` [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active Harry van Haaren
  1 sibling, 1 reply; 31+ messages in thread
From: Lukasz Wojciechowski @ 2020-07-20 12:51 UTC (permalink / raw)
  To: Harry van Haaren, dev
  Cc: david.marchand, igor.romanov, honnappa.nagarahalli, ferruh.yigit,
	nd, aconole


W dniu 20.07.2020 o 14:09, Harry van Haaren pisze:
> This commit improves the service_lcore_stop() implementation,
> waiting for the service core in question to return. The service
> thread itself now has a variable to indicate if its thread is
> active. When zero the service thread has completed its service,
> and has returned from the service_runner_func() function.
>
> This fixes a race condition observed in the DPDK CI, where the
> statistics of the service were not consistent with the expectation
> due to the service thread still running, and incrementing a stat
> after stop was called.
>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
>
> ---
>
> This is one possible solution, that avoids a class of race-conditions
> based on stop() api and following behaviours. Without a change in
> implementation of the service core thread, we could not detect when
> the thread was actually finished. This is now possible, and the stop
> api makes use of it to wait for 1000x one millisecond, or log a warning
> that a service core didn't return quickly.
>
> Thanks for the discussion/debug on list - I'm not sure how to add
> reported-by/suggested-by etc tags: but I'll resend a V2 (or can add
> on apply).
>
> ---
>   lib/librte_eal/common/rte_service.c | 24 ++++++++++++++++++++++++
>   1 file changed, 24 insertions(+)
>
> diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
> index 6a0e0ff65..d2255587d 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -65,6 +65,7 @@ struct core_state {
>   	/* map of services IDs are run on this core */
>   	uint64_t service_mask;
>   	uint8_t runstate; /* running or stopped */
> +	uint8_t thread_active; /* indicates when the thread is in service_run() */
>   	uint8_t is_service_core; /* set if core is currently a service core */
>   	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
>   	uint64_t loops;
> @@ -457,6 +458,8 @@ service_runner_func(void *arg)
>   	const int lcore = rte_lcore_id();
>   	struct core_state *cs = &lcore_states[lcore];
>   
> +	__atomic_store_n(&cs->thread_active, 1, __ATOMIC_RELAXED);
> +
>   	/* runstate act as the guard variable. Use load-acquire
>   	 * memory order here to synchronize with store-release
>   	 * in runstate update functions.
> @@ -475,6 +478,7 @@ service_runner_func(void *arg)
>   		cs->loops++;
>   	}
>   
> +	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_RELAXED);
>   	return 0;
>   }
>   
> @@ -765,6 +769,26 @@ rte_service_lcore_stop(uint32_t lcore)
>   	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
>   		__ATOMIC_RELEASE);
>   
> +	/* wait for service lcore to return */
> +	i = 0;
> +	uint8_t active;
> +	uint64_t start = rte_rdtsc();
> +	do {
> +		active = __atomic_load_n(&lcore_states[lcore].thread_active,
> +					 __ATOMIC_RELAXED);
> +		if (active == 0)
> +			break;
> +		rte_delay_ms(1);
> +		i++;
> +	} while (i < 1000);
> +
> +	if (active != 0) {
> +		uint64_t end = rte_rdtsc();
> +		RTE_LOG(WARNING, EAL,
> +			"service lcore stop() failed, waited for %ld cycles\n",
> +			end - start);
> +	}
> +
>   	return 0;
>   }
>   
I don't like the idea of inserting this polling loop inside the API call.
And I don't like setting up a 1000-iteration constraint.
How about keeping the thread_active flag, but moving the check of this
flag's state to a separate function? This way the user of the API would
be able to write their own loop (see the sketch below).
Maybe he/she would like a custom loop, e.g. to:
* wait for more cores
* wait longer
* check if the service is finished less often...
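
A minimal sketch of the kind of caller-side wait loop this suggestion
enables (illustration only, not a tested implementation; it uses the
rte_service_lcore_active() call that this series goes on to add, and the
timeout policy is the caller's choice rather than anything in DPDK):

#include <rte_service.h>
#include <rte_cycles.h>

/* Stop a service lcore and wait, with a caller-chosen timeout, until its
 * thread has left the service loop. Returns 0 on success, -1 on timeout
 * or error.
 */
static int
wait_service_lcore_stopped(uint32_t lcore_id, uint32_t timeout_ms)
{
	uint32_t ms;

	if (rte_service_lcore_stop(lcore_id) != 0)
		return -1;

	for (ms = 0; ms < timeout_ms; ms++) {
		if (rte_service_lcore_active(lcore_id) == 0)
			return 0; /* thread has returned to EAL */
		rte_delay_ms(1);
	}
	return -1; /* still active after the caller-chosen timeout */
}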

-- 
Lukasz Wojciechowski
Principal Software Engineer

Samsung R&D Institute Poland
Samsung Electronics
Office +48 22 377 88 25
l.wojciechow@partner.samsung.com



* Re: [dpdk-dev] [PATCH] service: fix stop API to wait for service thread
  2020-07-20 12:51   ` Lukasz Wojciechowski
@ 2020-07-20 14:20     ` Van Haaren, Harry
  0 siblings, 0 replies; 31+ messages in thread
From: Van Haaren, Harry @ 2020-07-20 14:20 UTC (permalink / raw)
  To: Lukasz Wojciechowski, dev
  Cc: david.marchand, igor.romanov, honnappa.nagarahalli, Yigit,
	Ferruh, nd, aconole

> -----Original Message-----
> From: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
> Sent: Monday, July 20, 2020 1:52 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> Cc: david.marchand@redhat.com; igor.romanov@oktetlabs.ru;
> honnappa.nagarahalli@arm.com; Yigit, Ferruh <ferruh.yigit@intel.com>;
> nd@arm.com; aconole@redhat.com
> Subject: Re: [PATCH] service: fix stop API to wait for service thread
> 
> 
> W dniu 20.07.2020 o 14:09, Harry van Haaren pisze:
> > This commit improves the service_lcore_stop() implementation,
> > waiting for the service core in question to return. The service
> > thread itself now has a variable to indicate if its thread is
> > active. When zero the service thread has completed its service,
> > and has returned from the service_runner_func() function.
> >
> > This fixes a race condition observed in the DPDK CI, where the
> > statistics of the service were not consistent with the expectation
> > due to the service thread still running, and incrementing a stat
> > after stop was called.
> >
> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> >
> > ---
> >
> > This is one possible solution, that avoids a class of race-conditions
> > based on stop() api and following behaviours. Without a change in
> > implementation of the service core thread, we could not detect when
> > the thread was actually finished. This is now possible, and the stop
> > api makes use of it to wait for 1000x one millisecond, or log a warning
> > that a service core didn't return quickly.
> >
> > Thanks for the discussion/debug on list - I'm not sure how to add
> > reported-by/suggested-by etc tags: but I'll resend a V2 (or can add
> > on apply).
> >
> > ---
> >   lib/librte_eal/common/rte_service.c | 24 ++++++++++++++++++++++++
> >   1 file changed, 24 insertions(+)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> > index 6a0e0ff65..d2255587d 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -65,6 +65,7 @@ struct core_state {
> >   	/* map of services IDs are run on this core */
> >   	uint64_t service_mask;
> >   	uint8_t runstate; /* running or stopped */
> > +	uint8_t thread_active; /* indicates when the thread is in service_run() */
> >   	uint8_t is_service_core; /* set if core is currently a service core */
> >   	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
> >   	uint64_t loops;
> > @@ -457,6 +458,8 @@ service_runner_func(void *arg)
> >   	const int lcore = rte_lcore_id();
> >   	struct core_state *cs = &lcore_states[lcore];
> >
> > +	__atomic_store_n(&cs->thread_active, 1, __ATOMIC_RELAXED);
> > +
> >   	/* runstate act as the guard variable. Use load-acquire
> >   	 * memory order here to synchronize with store-release
> >   	 * in runstate update functions.
> > @@ -475,6 +478,7 @@ service_runner_func(void *arg)
> >   		cs->loops++;
> >   	}
> >
> > +	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_RELAXED);
> >   	return 0;
> >   }
> >
> > @@ -765,6 +769,26 @@ rte_service_lcore_stop(uint32_t lcore)
> >   	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
> >   		__ATOMIC_RELEASE);
> >
> > +	/* wait for service lcore to return */
> > +	i = 0;
> > +	uint8_t active;
> > +	uint64_t start = rte_rdtsc();
> > +	do {
> > +		active = __atomic_load_n(&lcore_states[lcore].thread_active,
> > +					 __ATOMIC_RELAXED);
> > +		if (active == 0)
> > +			break;
> > +		rte_delay_ms(1);
> > +		i++;
> > +	} while (i < 1000);
> > +
> > +	if (active != 0) {
> > +		uint64_t end = rte_rdtsc();
> > +		RTE_LOG(WARNING, EAL,
> > +			"service lcore stop() failed, waited for %ld cycles\n",
> > +			end - start);
> > +	}
> > +
> >   	return 0;
> >   }
> >
> I don't like the idea of inserting this polling loop inside API call.
> And I don't like setting up a 1000 iterations constraint.
> How about keeping the thread_active flag, but moving checking state of
> this flag to separate function. This way the user of the API would be
> able to write own loop.
> Maybe he/she would like a custom loop, because:
> * waiting for more cores
> * would like to wait longer
> * would like to check if service is finished less often...

Agree - good feedback, thanks. v2 on the way, with this approach.




* [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active
  2020-07-20 12:09 ` [dpdk-dev] [PATCH] service: fix stop API to wait for service thread Harry van Haaren
  2020-07-20 12:51   ` Lukasz Wojciechowski
@ 2020-07-20 14:38   ` Harry van Haaren
  2020-07-20 14:38     ` [dpdk-dev] [PATCH v2 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
                       ` (4 more replies)
  1 sibling, 5 replies; 31+ messages in thread
From: Harry van Haaren @ 2020-07-20 14:38 UTC (permalink / raw)
  To: dev
  Cc: david.marchand, igor.romanov, honnappa.nagarahalli, ferruh.yigit,
	nd, aconole, l.wojciechow, Harry van Haaren

This commit adds a new experimental API which allows the user
to retrieve the active state of an lcore. Knowing when the service
lcore has completed its polling loop can be useful to applications
to avoid race conditions when e.g. finalizing statistics.

The service thread now has a variable to indicate whether it is
active. When zero, the service thread has completed its service
and has returned from the service_runner_func() function.

Suggested-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>

---

Thanks for feedback Lukasz, please have a look and see if this was
what you were expecting?

Honnappa/Phil, are the __atomic_load/store's correct?

---
 lib/librte_eal/common/rte_service.c  | 14 ++++++++++++++
 lib/librte_eal/include/rte_service.h |  9 +++++++++
 lib/librte_eal/rte_eal_version.map   |  1 +
 3 files changed, 24 insertions(+)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 6a0e0ff65..4c276b006 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -65,6 +65,7 @@ struct core_state {
 	/* map of services IDs are run on this core */
 	uint64_t service_mask;
 	uint8_t runstate; /* running or stopped */
+	uint8_t thread_active; /* indicates when thread is in service_run() */
 	uint8_t is_service_core; /* set if core is currently a service core */
 	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
 	uint64_t loops;
@@ -457,6 +458,8 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
+	__atomic_store_n(&cs->thread_active, 1, __ATOMIC_RELAXED);
+
 	/* runstate act as the guard variable. Use load-acquire
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
@@ -475,9 +478,20 @@ service_runner_func(void *arg)
 		cs->loops++;
 	}
 
+	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_RELAXED);
 	return 0;
 }
 
+int32_t
+rte_service_lcore_active(uint32_t lcore)
+{
+	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+		return -EINVAL;
+
+	return __atomic_load_n(&lcore_states[lcore].thread_active,
+			       __ATOMIC_RELAXED);
+}
+
 int32_t
 rte_service_lcore_count(void)
 {
diff --git a/lib/librte_eal/include/rte_service.h b/lib/librte_eal/include/rte_service.h
index e2d0a6dd3..f7134b5c0 100644
--- a/lib/librte_eal/include/rte_service.h
+++ b/lib/librte_eal/include/rte_service.h
@@ -261,6 +261,15 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
  */
 int32_t rte_service_lcore_stop(uint32_t lcore_id);
 
+/**
+ * Reports if a service lcore is currently running.
+ * @retval 0 Service thread is not active, and has been returned to EAL.
+ * @retval 1 Service thread is in the service core polling loop.
+ * @retval -EINVAL Invalid *lcore_id* provided.
+ */
+__rte_experimental
+int32_t rte_service_lcore_active(uint32_t lcore_id);
+
 /**
  * Adds lcore to the list of service cores.
  *
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index bf0c17c23..d53d5d5b9 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -401,6 +401,7 @@ EXPERIMENTAL {
 	rte_lcore_dump;
 	rte_lcore_iterate;
 	rte_mp_disable;
+	rte_service_lcore_active;
 	rte_thread_register;
 	rte_thread_unregister;
 };
-- 
2.17.1



* [dpdk-dev] [PATCH v2 2/2] test/service: fix race condition on stopping lcore
  2020-07-20 14:38   ` [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active Harry van Haaren
@ 2020-07-20 14:38     ` Harry van Haaren
  2020-07-20 17:45       ` Lukasz Wojciechowski
  2020-07-21  8:38       ` Phil Yang
  2020-07-20 17:45     ` [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active Lukasz Wojciechowski
                       ` (3 subsequent siblings)
  4 siblings, 2 replies; 31+ messages in thread
From: Harry van Haaren @ 2020-07-20 14:38 UTC (permalink / raw)
  To: dev
  Cc: david.marchand, igor.romanov, honnappa.nagarahalli, ferruh.yigit,
	nd, aconole, l.wojciechow, Harry van Haaren

This commit fixes a potential race condition in the tests
where the lcore running a service would increment a counter
that was already reset by the test-suite thread. The value
incremented by this race condition could cause CI failures,
as indicated by DPDK's CI.

This patch fixes the race-condition by making use of the
added rte_service_lcore_active() API, which indicates when
a service-core is no longer in the service-core polling loop.

The unit test makes use of the above function to detect when
all statistics increments are done in the service-core thread,
and then the unit test continues finalizing and checking state.

Fixes: f28f3594ded2 ("service: add attribute API")

Reported-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>

---

Thanks for the discussion on v1; this is a v2 fixup for the CI,
incorporating previous feedback from the ML.
---
 app/test/test_service_cores.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index ef1d8fcb9..a45762915 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -362,6 +362,9 @@ service_lcore_attr_get(void)
 			"Service core add did not return zero");
 	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 1),
 			"Enabling valid service and core failed");
+	/* Ensure service is not active before starting */
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_active(slcore_id),
+			"Not-active service core reported as active");
 	TEST_ASSERT_EQUAL(0, rte_service_lcore_start(slcore_id),
 			"Starting service core failed");
 
@@ -382,7 +385,24 @@ service_lcore_attr_get(void)
 			lcore_attr_id, &lcore_attr_value),
 			"Invalid lcore attr didn't return -EINVAL");
 
-	rte_service_lcore_stop(slcore_id);
+	/* Ensure service is active */
+	TEST_ASSERT_EQUAL(1, rte_service_lcore_active(slcore_id),
+			"Active service core reported as not-active");
+
+	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 0),
+			"Disabling valid service and core failed");
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_stop(slcore_id),
+			"Failed to stop service lcore");
+
+	int i = 0;
+	while (rte_service_lcore_active(slcore_id) == 1) {
+		rte_delay_ms(1);
+		i++;
+		if (i > 100)
+			break;
+	}
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_active(slcore_id),
+			  "Service lcore not stopped after waiting.");
 
 	TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
 			  "Valid lcore_attr_reset_all() didn't return success");
-- 
2.17.1



* Re: [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active
  2020-07-20 14:38   ` [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active Harry van Haaren
  2020-07-20 14:38     ` [dpdk-dev] [PATCH v2 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
@ 2020-07-20 17:45     ` Lukasz Wojciechowski
  2020-07-21  7:47     ` Phil Yang
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 31+ messages in thread
From: Lukasz Wojciechowski @ 2020-07-20 17:45 UTC (permalink / raw)
  To: Harry van Haaren, dev
  Cc: david.marchand, igor.romanov, honnappa.nagarahalli, ferruh.yigit,
	nd, aconole


W dniu 20.07.2020 o 16:38, Harry van Haaren pisze:
> This commit adds a new experimental API which allows the user
> to retrieve the active state of an lcore. Knowing when the service
> lcore is completed its polling loop can be useful to applications
> to avoid race conditions when e.g. finalizing statistics.
>
> The service thread itself now has a variable to indicate if its
> thread is active. When zero the service thread has completed its
> service, and has returned from the service_runner_func() function.
>
> Suggested-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
>
> ---
>
> Thanks for feedback Lukasz, please have a look and see if this was
> what you were expecting?
>
> Honnappa/Phil, are the __atomic_load/store's correct?
>
> ---
>   lib/librte_eal/common/rte_service.c  | 14 ++++++++++++++
>   lib/librte_eal/include/rte_service.h |  9 +++++++++
>   lib/librte_eal/rte_eal_version.map   |  1 +
>   3 files changed, 24 insertions(+)
>
> diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
> index 6a0e0ff65..4c276b006 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -65,6 +65,7 @@ struct core_state {
>   	/* map of services IDs are run on this core */
>   	uint64_t service_mask;
>   	uint8_t runstate; /* running or stopped */
> +	uint8_t thread_active; /* indicates when thread is in service_run() */
>   	uint8_t is_service_core; /* set if core is currently a service core */
>   	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
>   	uint64_t loops;
> @@ -457,6 +458,8 @@ service_runner_func(void *arg)
>   	const int lcore = rte_lcore_id();
>   	struct core_state *cs = &lcore_states[lcore];
>   
> +	__atomic_store_n(&cs->thread_active, 1, __ATOMIC_RELAXED);
> +
>   	/* runstate act as the guard variable. Use load-acquire
>   	 * memory order here to synchronize with store-release
>   	 * in runstate update functions.
> @@ -475,9 +478,20 @@ service_runner_func(void *arg)
>   		cs->loops++;
>   	}
>   
> +	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_RELAXED);
>   	return 0;
>   }
>   
> +int32_t
> +rte_service_lcore_active(uint32_t lcore)
> +{
> +	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
> +		return -EINVAL;
> +
> +	return __atomic_load_n(&lcore_states[lcore].thread_active,
> +			       __ATOMIC_RELAXED);
> +}
> +
>   int32_t
>   rte_service_lcore_count(void)
>   {
> diff --git a/lib/librte_eal/include/rte_service.h b/lib/librte_eal/include/rte_service.h
> index e2d0a6dd3..f7134b5c0 100644
> --- a/lib/librte_eal/include/rte_service.h
> +++ b/lib/librte_eal/include/rte_service.h
> @@ -261,6 +261,15 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
>    */
>   int32_t rte_service_lcore_stop(uint32_t lcore_id);
>   
> +/**
> + * Reports if a service lcore is currently running.
> + * @retval 0 Service thread is not active, and has been returned to EAL.
> + * @retval 1 Service thread is in the service core polling loop.
> + * @retval -EINVAL Invalid *lcore_id* provided.
> + */
> +__rte_experimental
> +int32_t rte_service_lcore_active(uint32_t lcore_id);
> +
>   /**
>    * Adds lcore to the list of service cores.
>    *
> diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
> index bf0c17c23..d53d5d5b9 100644
> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -401,6 +401,7 @@ EXPERIMENTAL {
>   	rte_lcore_dump;
>   	rte_lcore_iterate;
>   	rte_mp_disable;
> +	rte_service_lcore_active;
>   	rte_thread_register;
>   	rte_thread_unregister;
>   };
Acked-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>

-- 
Lukasz Wojciechowski
Principal Software Engineer

Samsung R&D Institute Poland
Samsung Electronics
Office +48 22 377 88 25
l.wojciechow@partner.samsung.com



* Re: [dpdk-dev] [PATCH v2 2/2] test/service: fix race condition on stopping lcore
  2020-07-20 14:38     ` [dpdk-dev] [PATCH v2 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
@ 2020-07-20 17:45       ` Lukasz Wojciechowski
  2020-07-21  8:38       ` Phil Yang
  1 sibling, 0 replies; 31+ messages in thread
From: Lukasz Wojciechowski @ 2020-07-20 17:45 UTC (permalink / raw)
  To: Harry van Haaren, dev
  Cc: david.marchand, igor.romanov, honnappa.nagarahalli, ferruh.yigit,
	nd, aconole


W dniu 20.07.2020 o 16:38, Harry van Haaren pisze:
> This commit fixes a potential race condition in the tests
> where the lcore running a service would increment a counter
> that was already reset by the test-suite thread. The resulting
> race-condition incremented value could cause CI failures, as
> indicated by DPDK's CI.
>
> This patch fixes the race-condition by making use of the
> added rte_service_lcore_active() API, which indicates when
> a service-core is no longer in the service-core polling loop.
>
> The unit test makes use of the above function to detect when
> all statistics increments are done in the service-core thread,
> and then the unit test continues finalizing and checking state.
>
> Fixes: f28f3594ded2 ("service: add attribute API")
>
> Reported-by: David Marchand <david.marchand@redhat.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
>
> ---
>
> Thanks for discussion on v1, this v2 fixup for the CI
> including previous feedback on ML.
> ---
>   app/test/test_service_cores.c | 22 +++++++++++++++++++++-
>   1 file changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
> index ef1d8fcb9..a45762915 100644
> --- a/app/test/test_service_cores.c
> +++ b/app/test/test_service_cores.c
> @@ -362,6 +362,9 @@ service_lcore_attr_get(void)
>   			"Service core add did not return zero");
>   	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 1),
>   			"Enabling valid service and core failed");
> +	/* Ensure service is not active before starting */
> +	TEST_ASSERT_EQUAL(0, rte_service_lcore_active(slcore_id),
> +			"Not-active service core reported as active");
>   	TEST_ASSERT_EQUAL(0, rte_service_lcore_start(slcore_id),
>   			"Starting service core failed");
>   
> @@ -382,7 +385,24 @@ service_lcore_attr_get(void)
>   			lcore_attr_id, &lcore_attr_value),
>   			"Invalid lcore attr didn't return -EINVAL");
>   
> -	rte_service_lcore_stop(slcore_id);
> +	/* Ensure service is active */
> +	TEST_ASSERT_EQUAL(1, rte_service_lcore_active(slcore_id),
> +			"Active service core reported as not-active");
> +
> +	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 0),
> +			"Disabling valid service and core failed");
> +	TEST_ASSERT_EQUAL(0, rte_service_lcore_stop(slcore_id),
> +			"Failed to stop service lcore");
> +
> +	int i = 0;
> +	while (rte_service_lcore_active(slcore_id) == 1) {
> +		rte_delay_ms(1);
> +		i++;
> +		if (i > 100)
> +			break;
> +	}
> +	TEST_ASSERT_EQUAL(0, rte_service_lcore_active(slcore_id),
> +			  "Service lcore not stopped after waiting.");
>   
>   	TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
>   			  "Valid lcore_attr_reset_all() didn't return success");
Acked-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>

-- 
Lukasz Wojciechowski
Principal Software Engineer

Samsung R&D Institute Poland
Samsung Electronics
Office +48 22 377 88 25
l.wojciechow@partner.samsung.com



* Re: [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active
  2020-07-20 14:38   ` [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active Harry van Haaren
  2020-07-20 14:38     ` [dpdk-dev] [PATCH v2 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
  2020-07-20 17:45     ` [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active Lukasz Wojciechowski
@ 2020-07-21  7:47     ` Phil Yang
  2020-07-21 19:43     ` Honnappa Nagarahalli
  2020-07-22 10:37     ` [dpdk-dev] [PATCH v3 " Harry van Haaren
  4 siblings, 0 replies; 31+ messages in thread
From: Phil Yang @ 2020-07-21  7:47 UTC (permalink / raw)
  To: Harry van Haaren, dev
  Cc: david.marchand, igor.romanov, Honnappa Nagarahalli, ferruh.yigit,
	nd, aconole, l.wojciechow, nd

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Harry van Haaren
> Sent: Monday, July 20, 2020 10:38 PM
> To: dev@dpdk.org
> Cc: david.marchand@redhat.com; igor.romanov@oktetlabs.ru; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; ferruh.yigit@intel.com; nd
> <nd@arm.com>; aconole@redhat.com;
> l.wojciechow@partner.samsung.com; Harry van Haaren
> <harry.van.haaren@intel.com>
> Subject: [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core
> active
> 
> This commit adds a new experimental API which allows the user
> to retrieve the active state of an lcore. Knowing when the service
> lcore is completed its polling loop can be useful to applications
> to avoid race conditions when e.g. finalizing statistics.
> 
> The service thread itself now has a variable to indicate if its
> thread is active. When zero the service thread has completed its
> service, and has returned from the service_runner_func() function.
> 
> Suggested-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> 

It looks good to me.

Reviewed-by: Phil Yang <phil.yang@arm.com>

Thanks,
Phil
<...>


* Re: [dpdk-dev] [PATCH v2 2/2] test/service: fix race condition on stopping lcore
  2020-07-20 14:38     ` [dpdk-dev] [PATCH v2 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
  2020-07-20 17:45       ` Lukasz Wojciechowski
@ 2020-07-21  8:38       ` Phil Yang
  2020-07-22 10:26         ` Van Haaren, Harry
  1 sibling, 1 reply; 31+ messages in thread
From: Phil Yang @ 2020-07-21  8:38 UTC (permalink / raw)
  To: Harry van Haaren, dev
  Cc: david.marchand, igor.romanov, Honnappa Nagarahalli, ferruh.yigit,
	nd, aconole, l.wojciechow, nd

<...>

> Subject: [dpdk-dev] [PATCH v2 2/2] test/service: fix race condition on
> stopping lcore
> 
> This commit fixes a potential race condition in the tests
> where the lcore running a service would increment a counter
> that was already reset by the test-suite thread. The resulting
> race-condition incremented value could cause CI failures, as
> indicated by DPDK's CI.
> 
> This patch fixes the race-condition by making use of the
> added rte_service_lcore_active() API, which indicates when
> a service-core is no longer in the service-core polling loop.
> 
> The unit test makes use of the above function to detect when
> all statistics increments are done in the service-core thread,
> and then the unit test continues finalizing and checking state.
> 
> Fixes: f28f3594ded2 ("service: add attribute API")
> 
> Reported-by: David Marchand <david.marchand@redhat.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>

Minor nit, otherwise it looks good to me.

Reviewed-by: Phil Yang <phil.yang@arm.com>

> 
> ---
> 
> Thanks for discussion on v1, this v2 fixup for the CI
> including previous feedback on ML.
> ---
>  app/test/test_service_cores.c | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
> index ef1d8fcb9..a45762915 100644
> --- a/app/test/test_service_cores.c
> +++ b/app/test/test_service_cores.c
> @@ -362,6 +362,9 @@ service_lcore_attr_get(void)
>  			"Service core add did not return zero");
>  	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 1),
>  			"Enabling valid service and core failed");
> +	/* Ensure service is not active before starting */
> +	TEST_ASSERT_EQUAL(0, rte_service_lcore_active(slcore_id),
> +			"Not-active service core reported as active");
>  	TEST_ASSERT_EQUAL(0, rte_service_lcore_start(slcore_id),
>  			"Starting service core failed");
> 
> @@ -382,7 +385,24 @@ service_lcore_attr_get(void)
>  			lcore_attr_id, &lcore_attr_value),
>  			"Invalid lcore attr didn't return -EINVAL");
> 
> -	rte_service_lcore_stop(slcore_id);
> +	/* Ensure service is active */
> +	TEST_ASSERT_EQUAL(1, rte_service_lcore_active(slcore_id),
> +			"Active service core reported as not-active");
> +
> +	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 0),
> +			"Disabling valid service and core failed");
> +	TEST_ASSERT_EQUAL(0, rte_service_lcore_stop(slcore_id),
> +			"Failed to stop service lcore");
> +
> +	int i = 0;
> +	while (rte_service_lcore_active(slcore_id) == 1) {
> +		rte_delay_ms(1);

Just as it does in other functions, use the macro instead of the magic number would be better.
rte_delay_ms(SERVICE_DELAY); 

> +		i++;
> +		if (i > 100)
> +			break;
> +	}
> +	TEST_ASSERT_EQUAL(0, rte_service_lcore_active(slcore_id),
> +			  "Service lcore not stopped after waiting.");
> 
>  	TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
>  			  "Valid lcore_attr_reset_all() didn't return success");
> --
> 2.17.1



* Re: [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active
  2020-07-20 14:38   ` [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active Harry van Haaren
                       ` (2 preceding siblings ...)
  2020-07-21  7:47     ` Phil Yang
@ 2020-07-21 19:43     ` Honnappa Nagarahalli
  2020-07-21 19:50       ` David Marchand
  2020-07-22 10:37     ` [dpdk-dev] [PATCH v3 " Harry van Haaren
  4 siblings, 1 reply; 31+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-21 19:43 UTC (permalink / raw)
  To: Harry van Haaren, dev
  Cc: david.marchand, igor.romanov, ferruh.yigit, nd, aconole,
	l.wojciechow, Honnappa Nagarahalli, Phil Yang, nd

<snip>

> Subject: [PATCH v2 1/2] service: add API to retrieve service core active
> 
> This commit adds a new experimental API which allows the user to retrieve
> the active state of an lcore. Knowing when the service lcore is completed its
> polling loop can be useful to applications to avoid race conditions when e.g.
> finalizing statistics.
> 
> The service thread itself now has a variable to indicate if its thread is active.
> When zero the service thread has completed its service, and has returned
> from the service_runner_func() function.
> 
> Suggested-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> 
> ---
> 
> Thanks for feedback Lukasz, please have a look and see if this was what you
> were expecting?
> 
> Honnappa/Phil, are the __atomic_load/store's correct?
> 
> ---
>  lib/librte_eal/common/rte_service.c  | 14 ++++++++++++++
> lib/librte_eal/include/rte_service.h |  9 +++++++++
>  lib/librte_eal/rte_eal_version.map   |  1 +
>  3 files changed, 24 insertions(+)
> 
> diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> index 6a0e0ff65..4c276b006 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -65,6 +65,7 @@ struct core_state {
>  	/* map of services IDs are run on this core */
>  	uint64_t service_mask;
>  	uint8_t runstate; /* running or stopped */
> +	uint8_t thread_active; /* indicates when thread is in service_run() */
Nit, if possible it would be good to describe the difference between 'runstate' and 'thread_active'.

>  	uint8_t is_service_core; /* set if core is currently a service core */
>  	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
>  	uint64_t loops;
> @@ -457,6 +458,8 @@ service_runner_func(void *arg)
>  	const int lcore = rte_lcore_id();
>  	struct core_state *cs = &lcore_states[lcore];
> 
> +	__atomic_store_n(&cs->thread_active, 1, __ATOMIC_RELAXED);
Essentially, we have to ensure that all the memory operations are contained within both stores (this one and the one below) to 'thread_active'.
We should use __ATOMIC_SEQ_CST, which will prevent any memory operations from getting hoisted above this line.
Performance should not be an issue as it will get executed only when the service core is started.
It would be good to add comment reasoning the memory order used.

Also, what happens if the user calls 'rte_service_lcore_active' just before the above statement is executed? It might not be a valid use case, but it is good to document the race conditions and correct sequence of API calls.

> +
>  	/* runstate act as the guard variable. Use load-acquire
>  	 * memory order here to synchronize with store-release
>  	 * in runstate update functions.
> @@ -475,9 +478,20 @@ service_runner_func(void *arg)
>  		cs->loops++;
>  	}
> 
> +	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_RELAXED);
__ATOMIC_RELEASE would be enough. But, __ATOMIC_SEQ_CST should not cause any performance issues.

>  	return 0;
>  }
> 
> +int32_t
> +rte_service_lcore_active(uint32_t lcore) {
> +	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
> +		return -EINVAL;
> +
> +	return __atomic_load_n(&lcore_states[lcore].thread_active,
> +			       __ATOMIC_RELAXED);
> +}
> +
>  int32_t
>  rte_service_lcore_count(void)
>  {
> diff --git a/lib/librte_eal/include/rte_service.h
> b/lib/librte_eal/include/rte_service.h
> index e2d0a6dd3..f7134b5c0 100644
> --- a/lib/librte_eal/include/rte_service.h
> +++ b/lib/librte_eal/include/rte_service.h
> @@ -261,6 +261,15 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
>   */
>  int32_t rte_service_lcore_stop(uint32_t lcore_id);
> 
> +/**
> + * Reports if a service lcore is currently running.
> + * @retval 0 Service thread is not active, and has been returned to EAL.
> + * @retval 1 Service thread is in the service core polling loop.
> + * @retval -EINVAL Invalid *lcore_id* provided.
> + */
> +__rte_experimental
> +int32_t rte_service_lcore_active(uint32_t lcore_id);
Would 'rte_service_lcore_may_be_active' be better? It would be in line with 'rte_service_may_be_active'.

I think we need additional documentation for 'rte_service_lcore_stop' to indicate that the caller should not assume that the service thread on the lcore has stopped running and should call the above API to check.

> +
>  /**
>   * Adds lcore to the list of service cores.
>   *
> diff --git a/lib/librte_eal/rte_eal_version.map
> b/lib/librte_eal/rte_eal_version.map
> index bf0c17c23..d53d5d5b9 100644
> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -401,6 +401,7 @@ EXPERIMENTAL {
>  	rte_lcore_dump;
>  	rte_lcore_iterate;
>  	rte_mp_disable;
> +	rte_service_lcore_active;
>  	rte_thread_register;
>  	rte_thread_unregister;
>  };
> --
> 2.17.1



* Re: [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active
  2020-07-21 19:43     ` Honnappa Nagarahalli
@ 2020-07-21 19:50       ` David Marchand
  2020-07-21 20:23         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 31+ messages in thread
From: David Marchand @ 2020-07-21 19:50 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Harry van Haaren, dev, igor.romanov, ferruh.yigit, nd, aconole,
	l.wojciechow, Phil Yang

On Tue, Jul 21, 2020 at 9:43 PM Honnappa Nagarahalli
<Honnappa.Nagarahalli@arm.com> wrote:
> > @@ -457,6 +458,8 @@ service_runner_func(void *arg)
> >       const int lcore = rte_lcore_id();
> >       struct core_state *cs = &lcore_states[lcore];
> >
> > +     __atomic_store_n(&cs->thread_active, 1, __ATOMIC_RELAXED);
> Essentially, we have to ensure that all the memory operations are contained within both stores (this one and the one below) to 'thread_active'.
> We should use __ATOMIC_SEQ_CST, which will avoid any memory operations from getting hoisted above this line.
> Performance should not be an issue as it will get executed only when the service core is started.
> It would be good to add comment reasoning the memory order used.
>
> Also, what happens if the user calls 'rte_service_lcore_active' just before the above statement is executed? It might not be a valid use case, but it is good to document the race conditions and correct sequence of API calls.
>
> > +
> >       /* runstate act as the guard variable. Use load-acquire
> >        * memory order here to synchronize with store-release
> >        * in runstate update functions.
> > @@ -475,9 +478,20 @@ service_runner_func(void *arg)
> >               cs->loops++;
> >       }
> >
> > +     __atomic_store_n(&cs->thread_active, 0, __ATOMIC_RELAXED);
> __ATOMIC_RELEASE would be enough. But, __ATOMIC_SEQ_CST should not cause any performance issues.

But then are we missing an ACQUIRE barrier in rte_service_lcore_active?


>
> >       return 0;
> >  }
> >
> > +int32_t
> > +rte_service_lcore_active(uint32_t lcore) {
> > +     if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
> > +             return -EINVAL;
> > +
> > +     return __atomic_load_n(&lcore_states[lcore].thread_active,
> > +                            __ATOMIC_RELAXED);
> > +}
> > +
> >  int32_t
> >  rte_service_lcore_count(void)
> >  {
> > diff --git a/lib/librte_eal/include/rte_service.h
> > b/lib/librte_eal/include/rte_service.h
> > index e2d0a6dd3..f7134b5c0 100644
> > --- a/lib/librte_eal/include/rte_service.h
> > +++ b/lib/librte_eal/include/rte_service.h
> > @@ -261,6 +261,15 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
> >   */
> >  int32_t rte_service_lcore_stop(uint32_t lcore_id);
> >
> > +/**
> > + * Reports if a service lcore is currently running.
> > + * @retval 0 Service thread is not active, and has been returned to EAL.
> > + * @retval 1 Service thread is in the service core polling loop.
> > + * @retval -EINVAL Invalid *lcore_id* provided.
> > + */
> > +__rte_experimental
> > +int32_t rte_service_lcore_active(uint32_t lcore_id);
> Would 'rte_service_lcore_may_be_active' better? It would be inline with 'rte_service_may_be_active'?
>
> I think we need additional documentation for 'rte_service_lcore_stop' to indicate that the caller should not assume that the service thread on the lcore has stopped running and should call the above API to check.

+1
Additional documentation can't hurt.


-- 
David Marchand



* Re: [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active
  2020-07-21 19:50       ` David Marchand
@ 2020-07-21 20:23         ` Honnappa Nagarahalli
  2020-07-22 10:14           ` Van Haaren, Harry
  0 siblings, 1 reply; 31+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-21 20:23 UTC (permalink / raw)
  To: David Marchand
  Cc: Harry van Haaren, dev, igor.romanov, ferruh.yigit, nd, aconole,
	l.wojciechow, Phil Yang, Honnappa Nagarahalli, nd

> Subject: Re: [PATCH v2 1/2] service: add API to retrieve service core active
> 
> On Tue, Jul 21, 2020 at 9:43 PM Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com> wrote:
> > > @@ -457,6 +458,8 @@ service_runner_func(void *arg)
> > >       const int lcore = rte_lcore_id();
> > >       struct core_state *cs = &lcore_states[lcore];
> > >
> > > +     __atomic_store_n(&cs->thread_active, 1, __ATOMIC_RELAXED);
> > Essentially, we have to ensure that all the memory operations are contained
> within both stores (this one and the one below) to 'thread_active'.
> > We should use __ATOMIC_SEQ_CST, which will avoid any memory
> operations from getting hoisted above this line.
> > Performance should not be an issue as it will get executed only when the
> service core is started.
> > It would be good to add comment reasoning the memory order used.
> >
> > Also, what happens if the user calls 'rte_service_lcore_active' just before the
> above statement is executed? It might not be a valid use case, but it is good
> to document the race conditions and correct sequence of API calls.
> >
> > > +
> > >       /* runstate act as the guard variable. Use load-acquire
> > >        * memory order here to synchronize with store-release
> > >        * in runstate update functions.
> > > @@ -475,9 +478,20 @@ service_runner_func(void *arg)
> > >               cs->loops++;
> > >       }
> > >
> > > +     __atomic_store_n(&cs->thread_active, 0, __ATOMIC_RELAXED);
> > __ATOMIC_RELEASE would be enough. But, __ATOMIC_SEQ_CST should not
> cause any performance issues.
> 
> But then are we missing a ACQUIRE barrier in rte_service_lcore_active?
+1 (see below)

> 
> 
> >
> > >       return 0;
> > >  }
> > >
> > > +int32_t
> > > +rte_service_lcore_active(uint32_t lcore) {
> > > +     if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
> > > +             return -EINVAL;
> > > +
> > > +     return __atomic_load_n(&lcore_states[lcore].thread_active,
> > > +                            __ATOMIC_RELAXED); }
Agree with David. ACQUIRE is the safer order to ensure memory operations are not hoisted in cases like:

if (rte_service_lcore_active(lcore) == 0) {
	<do something>; /* Do not allow the memory operations to hoist above 'if' statement */
}
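
A sketch of the pairing being argued for here (illustration only, using
the field and variable names from the patch; the v3 later in this thread
uses SEQ_CST for the stores and ACQUIRE for the load):

	/* Service lcore, end of service_runner_func(): make all prior
	 * stat updates visible before thread_active reads as 0.
	 */
	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_RELEASE);

	/* Control thread, e.g. inside rte_service_lcore_active(): the
	 * ACQUIRE load prevents reads that depend on the result (such as
	 * reading or resetting statistics) from being hoisted above it.
	 */
	if (__atomic_load_n(&lcore_states[lcore].thread_active,
			__ATOMIC_ACQUIRE) == 0) {
		/* service thread is done; safe to finalize statistics */
	}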

> > > +
> > >  int32_t
> > >  rte_service_lcore_count(void)
> > >  {
> > > diff --git a/lib/librte_eal/include/rte_service.h
> > > b/lib/librte_eal/include/rte_service.h
> > > index e2d0a6dd3..f7134b5c0 100644
> > > --- a/lib/librte_eal/include/rte_service.h
> > > +++ b/lib/librte_eal/include/rte_service.h
> > > @@ -261,6 +261,15 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
> > >   */
> > >  int32_t rte_service_lcore_stop(uint32_t lcore_id);
> > >
> > > +/**
> > > + * Reports if a service lcore is currently running.
> > > + * @retval 0 Service thread is not active, and has been returned to EAL.
> > > + * @retval 1 Service thread is in the service core polling loop.
> > > + * @retval -EINVAL Invalid *lcore_id* provided.
> > > + */
> > > +__rte_experimental
> > > +int32_t rte_service_lcore_active(uint32_t lcore_id);
> > Would 'rte_service_lcore_may_be_active' better? It would be inline with
> 'rte_service_may_be_active'?
> >
> > I think we need additional documentation for 'rte_service_lcore_stop' to
> indicate that the caller should not assume that the service thread on the lcore
> has stopped running and should call the above API to check.
> 
> +1
> Additional documentation can't hurt.
> 
> 
> --
> David Marchand



* Re: [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active
  2020-07-21 20:23         ` Honnappa Nagarahalli
@ 2020-07-22 10:14           ` Van Haaren, Harry
  2020-07-22 18:50             ` Honnappa Nagarahalli
  0 siblings, 1 reply; 31+ messages in thread
From: Van Haaren, Harry @ 2020-07-22 10:14 UTC (permalink / raw)
  To: Honnappa Nagarahalli, David Marchand
  Cc: dev, igor.romanov, Yigit, Ferruh, nd, aconole, l.wojciechow,
	Phil Yang, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Tuesday, July 21, 2020 9:24 PM
> To: David Marchand <david.marchand@redhat.com>
> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org;
> igor.romanov@oktetlabs.ru; Yigit, Ferruh <ferruh.yigit@intel.com>; nd
> <nd@arm.com>; aconole@redhat.com; l.wojciechow@partner.samsung.com; Phil
> Yang <Phil.Yang@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v2 1/2] service: add API to retrieve service core active
> 
> > Subject: Re: [PATCH v2 1/2] service: add API to retrieve service core active
> >
> > On Tue, Jul 21, 2020 at 9:43 PM Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com> wrote:
> > > > @@ -457,6 +458,8 @@ service_runner_func(void *arg)
> > > >       const int lcore = rte_lcore_id();
> > > >       struct core_state *cs = &lcore_states[lcore];
> > > >
> > > > +     __atomic_store_n(&cs->thread_active, 1, __ATOMIC_RELAXED);
> > > Essentially, we have to ensure that all the memory operations are contained
> > within both stores (this one and the one below) to 'thread_active'.
> > > We should use __ATOMIC_SEQ_CST, which will avoid any memory
> > operations from getting hoisted above this line.
> > > Performance should not be an issue as it will get executed only when the
> > service core is started.
> > > It would be good to add comment reasoning the memory order used.

OK, will update to SEQ_CST in v3, and add comment.

> > > Also, what happens if the user calls 'rte_service_lcore_active' just before the
> > above statement is executed? It might not be a valid use case, but it is good
> > to document the race conditions and correct sequence of API calls.
> > >
> > > > +
> > > >       /* runstate act as the guard variable. Use load-acquire
> > > >        * memory order here to synchronize with store-release
> > > >        * in runstate update functions.
> > > > @@ -475,9 +478,20 @@ service_runner_func(void *arg)
> > > >               cs->loops++;
> > > >       }
> > > >
> > > > +     __atomic_store_n(&cs->thread_active, 0, __ATOMIC_RELAXED);
> > > __ATOMIC_RELEASE would be enough. But, __ATOMIC_SEQ_CST should not
> > cause any performance issues.
> >
> > But then are we missing a ACQUIRE barrier in rte_service_lcore_active?
> +1 (see below)

OK, will update to SEQ_CST in v3, with comment.


> > > > +int32_t
> > > > +rte_service_lcore_active(uint32_t lcore) {
> > > > +     if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
> > > > +             return -EINVAL;
> > > > +
> > > > +     return __atomic_load_n(&lcore_states[lcore].thread_active,
> > > > +                            __ATOMIC_RELAXED); }
> Agree with David. ACQUIRE is the safer order to ensure memory operations are not
> hoisted in cases like:
> 
> if (rte_service_lcore_active(lcore) == 0) {
> 	<do something>; /* Do not allow the memory operations to hoist above 'if'
> statement */
> }

OK, will change to ACQUIRE in v3.

<snip>

> > > > +/**
> > > > + * Reports if a service lcore is currently running.
> > > > + * @retval 0 Service thread is not active, and has been returned to EAL.
> > > > + * @retval 1 Service thread is in the service core polling loop.
> > > > + * @retval -EINVAL Invalid *lcore_id* provided.
> > > > + */
> > > > +__rte_experimental
> > > > +int32_t rte_service_lcore_active(uint32_t lcore_id);
> > > Would 'rte_service_lcore_may_be_active' better? It would be inline with
> > 'rte_service_may_be_active'?

I think the implementation behind the API is different, so I think _may_be_ is not appropriate for service_lcore_active; keeping the same function name for v3.

rte_service_lcore_active() checks, at a particular point in the calling thread, if another thread is active *at that time*. It is either active or not. This is well defined: it is deterministic in that the result is either yes or no, and there is no ambiguity at any given check. You're right that the value can change *just* after the check - but at the time of the check the answer was deterministic.

rte_service_may_be_active() checks if a service *could* be run by a service core. It is not deterministic. A service lcore only sets a service as "active on lcore" (or not active) when it polls it - this opens a window where the result is nondeterministic. When a runstate is set to off, there is a window of "unknown" before we know for certain that the service is no longer run on a service core. That is why I believe _may_be_ is appropriate for this API: it reflects this non-determinism.
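
A compact sketch of that distinction (illustration only; assumes a
registered service 'id' mapped to service lcore 'slcore_id'):

	rte_service_runstate_set(id, 0);
	/* rte_service_may_be_active(id) may still return 1 here: the
	 * service lcore only clears "active on lcore" the next time it
	 * polls, so the answer stays "maybe" until then.
	 */

	rte_service_lcore_stop(slcore_id);
	/* rte_service_lcore_active(slcore_id) is a point-in-time check:
	 * for the instant of the call it returns exactly 0 or 1, though
	 * the value can change right after the call returns.
	 */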

> > > I think we need additional documentation for 'rte_service_lcore_stop' to
> > indicate that the caller should not assume that the service thread on the lcore
> > has stopped running and should call the above API to check.
> >
> > +1
> > Additional documentation can't hurt.

Will add a section to the _stop() and _lcore_active() in doxygen for v3.


* Re: [dpdk-dev] [PATCH v2 2/2] test/service: fix race condition on stopping lcore
  2020-07-21  8:38       ` Phil Yang
@ 2020-07-22 10:26         ` Van Haaren, Harry
  0 siblings, 0 replies; 31+ messages in thread
From: Van Haaren, Harry @ 2020-07-22 10:26 UTC (permalink / raw)
  To: Phil Yang, dev
  Cc: david.marchand, igor.romanov, Honnappa Nagarahalli, Yigit,
	Ferruh, nd, aconole, l.wojciechow, nd

> -----Original Message-----
> From: Phil Yang <Phil.Yang@arm.com>
> Sent: Tuesday, July 21, 2020 9:39 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> Cc: david.marchand@redhat.com; igor.romanov@oktetlabs.ru; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>; nd <nd@arm.com>; aconole@redhat.com;
> l.wojciechow@partner.samsung.com; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v2 2/2] test/service: fix race condition on
> stopping lcore
> 
> <...>
> 
> > Subject: [dpdk-dev] [PATCH v2 2/2] test/service: fix race condition on
> > stopping lcore
> >
> > This commit fixes a potential race condition in the tests
> > where the lcore running a service would increment a counter
> > that was already reset by the test-suite thread. The resulting
> > race-condition incremented value could cause CI failures, as
> > indicated by DPDK's CI.
> >
> > This patch fixes the race-condition by making use of the
> > added rte_service_lcore_active() API, which indicates when
> > a service-core is no longer in the service-core polling loop.
> >
> > The unit test makes use of the above function to detect when
> > all statistics increments are done in the service-core thread,
> > and then the unit test continues finalizing and checking state.
> >
> > Fixes: f28f3594ded2 ("service: add attribute API")
> >
> > Reported-by: David Marchand <david.marchand@redhat.com>
> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> 
> Minor nit, otherwise it looks good to me.
> 
> Reviewed-by: Phil Yang <phil.yang@arm.com>

Thanks, will add in v3.

<snip>

> > +	int i = 0;
> > +	while (rte_service_lcore_active(slcore_id) == 1) {
> > +		rte_delay_ms(1);
> 
> Just as it does in other functions, use the macro instead of the magic number
> would be better.
> rte_delay_ms(SERVICE_DELAY);

Sure, will change. I've refactored the while() to a for() too; I think it cleans up a little.

<snip>


* [dpdk-dev] [PATCH v3 1/2] service: add API to retrieve service core active
  2020-07-20 14:38   ` [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active Harry van Haaren
                       ` (3 preceding siblings ...)
  2020-07-21 19:43     ` Honnappa Nagarahalli
@ 2020-07-22 10:37     ` Harry van Haaren
  2020-07-22 10:37       ` [dpdk-dev] [PATCH v3 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
                         ` (2 more replies)
  4 siblings, 3 replies; 31+ messages in thread
From: Harry van Haaren @ 2020-07-22 10:37 UTC (permalink / raw)
  To: dev
  Cc: david.marchand, igor.romanov, honnappa.nagarahalli, ferruh.yigit,
	nd, aconole, l.wojciechow, phil.yang, Harry van Haaren

This commit adds a new experimental API which allows the user
to retrieve the active state of an lcore. Knowing when the service
lcore has completed its polling loop can be useful to applications
to avoid race conditions when e.g. finalizing statistics.

The service thread now has a variable to indicate whether it is
active. When zero, the service thread has completed its service
and has returned from the service_runner_func() function.

Suggested-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>

---

v3:
- Change service lcore stores to SEQ_CST (Honnappa, David)
- Change control thread load to ACQ (Honnappa, David)
- Comment reasons for SEQ_CST/ACQ (Honnappa, David)
- Add comments to Doxygen for _stop() and _lcore_active() (Honnappa, David)
- Add Phil's review tag from ML
---
 lib/librte_eal/common/rte_service.c  | 23 ++++++++++++++++++++++-
 lib/librte_eal/include/rte_service.h | 20 +++++++++++++++++++-
 lib/librte_eal/rte_eal_version.map   |  1 +
 3 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 6a0e0ff65..b0cf7c4ab 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -65,6 +65,7 @@ struct core_state {
 	/* map of services IDs are run on this core */
 	uint64_t service_mask;
 	uint8_t runstate; /* running or stopped */
+	uint8_t thread_active; /* indicates when thread is in service_run() */
 	uint8_t is_service_core; /* set if core is currently a service core */
 	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
 	uint64_t loops;
@@ -422,7 +423,7 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+	if (lcore_states[ids[i]].service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -457,6 +458,8 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
+	__atomic_store_n(&cs->thread_active, 1, __ATOMIC_SEQ_CST);
+
 	/* runstate act as the guard variable. Use load-acquire
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
@@ -475,9 +478,27 @@ service_runner_func(void *arg)
 		cs->loops++;
 	}
 
+	/* Use SEQ CST memory ordering to avoid any re-ordering around
+	 * this store, ensuring that once this store is visible, the service
+	 * lcore thread really is done in service cores code.
+	 */
+	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_SEQ_CST);
 	return 0;
 }
 
+int32_t
+rte_service_lcore_active(uint32_t lcore)
+{
+	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+		return -EINVAL;
+
+	/* Load thread_active using ACQUIRE to avoid instructions dependent on
+	 * the result being re-ordered before this load completes.
+	 */
+	return __atomic_load_n(&lcore_states[lcore].thread_active,
+			       __ATOMIC_ACQUIRE);
+}
+
 int32_t
 rte_service_lcore_count(void)
 {
diff --git a/lib/librte_eal/include/rte_service.h b/lib/librte_eal/include/rte_service.h
index e2d0a6dd3..84245020d 100644
--- a/lib/librte_eal/include/rte_service.h
+++ b/lib/librte_eal/include/rte_service.h
@@ -249,7 +249,9 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
  * Stop a service core.
  *
  * Stopping a core makes the core become idle, but remains  assigned as a
- * service core.
+ * service core. Note that the service lcore thread may not have returned from
+ * the service it is running when this API returns. To check if the service
+ * lcore is still active, the *rte_service_lcore_active* API is added.
  *
  * @retval 0 Success
  * @retval -EINVAL Invalid *lcore_id* provided
@@ -261,6 +263,22 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
  */
 int32_t rte_service_lcore_stop(uint32_t lcore_id);
 
+/**
+ * Reports if a service lcore is currently running.
+ *
+ * This function returns if the core has finished service cores code, and has
+ * returned to EAL control. If *rte_service_lcore_stop* has been called but
+ * the lcore has not returned to EAL yet, it might be required to wait and call
+ * this function again. The amount of time to wait before the core returns
+ * depends on the duration of the services being run.
+ *
+ * @retval 0 Service thread is not active, and has been returned to EAL.
+ * @retval 1 Service thread is in the service core polling loop.
+ * @retval -EINVAL Invalid *lcore_id* provided.
+ */
+__rte_experimental
+int32_t rte_service_lcore_active(uint32_t lcore_id);
+
 /**
  * Adds lcore to the list of service cores.
  *
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index bf0c17c23..d53d5d5b9 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -401,6 +401,7 @@ EXPERIMENTAL {
 	rte_lcore_dump;
 	rte_lcore_iterate;
 	rte_mp_disable;
+	rte_service_lcore_active;
 	rte_thread_register;
 	rte_thread_unregister;
 };
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v3 2/2] test/service: fix race condition on stopping lcore
  2020-07-22 10:37     ` [dpdk-dev] [PATCH v3 " Harry van Haaren
@ 2020-07-22 10:37       ` Harry van Haaren
  2020-07-22 21:40         ` Honnappa Nagarahalli
  2020-07-22 21:39       ` [dpdk-dev] [PATCH v3 1/2] service: add API to retrieve service core active Honnappa Nagarahalli
  2020-07-24 12:45       ` [dpdk-dev] [PATCH v4 " Harry van Haaren
  2 siblings, 1 reply; 31+ messages in thread
From: Harry van Haaren @ 2020-07-22 10:37 UTC (permalink / raw)
  To: dev
  Cc: david.marchand, igor.romanov, honnappa.nagarahalli, ferruh.yigit,
	nd, aconole, l.wojciechow, phil.yang, Harry van Haaren

This commit fixes a potential race condition in the tests
where the lcore running a service would increment a counter
that was already reset by the test-suite thread. The resulting
race-condition incremented value could cause CI failures, as
indicated by DPDK's CI.

This patch fixes the race-condition by making use of the
added rte_service_lcore_active() API, which indicates when
a service-core is no longer in the service-core polling loop.

The unit test makes use of the above function to detect when
all statistics increments are done in the service-core thread,
and then the unit test continues finalizing and checking state.

Fixes: f28f3594ded2 ("service: add attribute API")

Reported-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>

---

v3:
- Refactor while() to for() to simplify (Harry)
- Use SERVICE_DELAY instead of magic const 1 (Phil)
- Add Phil's reviewed by tag from ML

v2:
Thanks for discussion on v1, this v2 fixup for the CI
including previous feedback on ML.
---
 app/test/test_service_cores.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index ef1d8fcb9..a6bc4487e 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -362,6 +362,9 @@ service_lcore_attr_get(void)
 			"Service core add did not return zero");
 	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 1),
 			"Enabling valid service and core failed");
+	/* Ensure service is not active before starting */
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_active(slcore_id),
+			"Not-active service core reported as active");
 	TEST_ASSERT_EQUAL(0, rte_service_lcore_start(slcore_id),
 			"Starting service core failed");
 
@@ -382,7 +385,22 @@ service_lcore_attr_get(void)
 			lcore_attr_id, &lcore_attr_value),
 			"Invalid lcore attr didn't return -EINVAL");
 
-	rte_service_lcore_stop(slcore_id);
+	/* Ensure service is active */
+	TEST_ASSERT_EQUAL(1, rte_service_lcore_active(slcore_id),
+			"Active service core reported as not-active");
+
+	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 0),
+			"Disabling valid service and core failed");
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_stop(slcore_id),
+			"Failed to stop service lcore");
+
+	/* Wait until service lcore not active, or for 100x SERVICE_DELAY */
+	for (int i = 0; i < 100 && rte_service_lcore_active(slcore_id) == 1;
+			i++)
+		rte_delay_ms(SERVICE_DELAY);
+
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_active(slcore_id),
+			  "Service lcore not stopped after waiting.");
 
 	TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
 			  "Valid lcore_attr_reset_all() didn't return success");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active
  2020-07-22 10:14           ` Van Haaren, Harry
@ 2020-07-22 18:50             ` Honnappa Nagarahalli
  2020-07-23 16:59               ` Van Haaren, Harry
  0 siblings, 1 reply; 31+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-22 18:50 UTC (permalink / raw)
  To: Van Haaren, Harry, David Marchand
  Cc: dev, igor.romanov, Yigit, Ferruh, nd, aconole, l.wojciechow,
	Phil Yang, Honnappa Nagarahalli, Gage Eads, nd

+ Gage (as I referring to his commit below)

<snip>

> 
> > > > > +/**
> > > > > + * Reports if a service lcore is currently running.
> > > > > + * @retval 0 Service thread is not active, and has been returned to
> EAL.
> > > > > + * @retval 1 Service thread is in the service core polling loop.
> > > > > + * @retval -EINVAL Invalid *lcore_id* provided.
> > > > > + */
> > > > > +__rte_experimental
> > > > > +int32_t rte_service_lcore_active(uint32_t lcore_id);
> > > > Would 'rte_service_lcore_may_be_active' better? It would be inline
> > > > with
> > > 'rte_service_may_be_active'?
> 
> I think the implementation behind the API is different, so I think _may_be_ is
> not appropriate for service_lcore_active, keeping same function name for v3.
> 
> rte_service_lcore_active() checks at a particular point in the calling thread if
> another thread is active *at that time*. It is either active or not. This is
> defined, it is deterministic in that the result is either yes or no, and there is
> no ambiguity at any given check. You're right the value can change *just* after
> the check - but at the time of the check the answer was deterministic.
> 
> rte_service_may_be_active() checks if a service *could* be run by a service
> core. It is not deterministic. A service lcore only sets a service as "active on
> lcore" (or not active) when it polls it - this opens a window of
> nondeterministic result. When a runstate is set to off, there is a window of
> "unknown" before we know certainly that the service is not run on a service
> core anymore. That is why I believe the _may_be_ is appropriate for this API,
> it shows this non determinism.
> 

I am looking at this from the application usage perspective (not the implementation). I am pointing to the similarity that exists with the new API. i.e. when 'rte_service_lcore_stop' is called, it is not known if the service lcore has stopped. If 'rte_service_lcore_active' returns 1, it indicates the lcore 'may be' active (the reasoning could be different in this case), it is not guaranteed that it is active by the time caller checks it. But when the API returns 0, it guarantees that the service lcore (assuming it was started and verified that it was started), has stopped.

Looking at the commit e30dd31847d212cd1b766612cbd980c7d8240baa that added the 'rte_service_may_be_active', the use case is a mechanism to identify a quiescent state after a service was stopped.
The use case for the new API is also the same. We want to identify a quiescent state after a service lcore is stopped.
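
To illustrate the usage pattern from the application's perspective (a rough
sketch only, not taken from any patch in this thread; the helper name and the
1 ms / 100-iteration bound are arbitrary, and it uses the currently proposed
name rte_service_lcore_active()):

	#include <rte_service.h>
	#include <rte_cycles.h>

	/* Stop a service lcore and wait until it has left the service polling
	 * loop, i.e. the quiescent state discussed above. Returns 0 once the
	 * lcore reports inactive, -1 on error or timeout. Note: services mapped
	 * to this lcore may need to be stopped or unmapped first for the
	 * stop() call to succeed.
	 */
	static int
	stop_and_quiesce(uint32_t lcore_id)
	{
		int i;

		if (rte_service_lcore_stop(lcore_id) != 0)
			return -1;

		for (i = 0; i < 100; i++) {
			/* 0 => the lcore has returned to EAL control */
			if (rte_service_lcore_active(lcore_id) == 0)
				return 0;
			/* 1 => it may still be executing a service */
			rte_delay_ms(1);
		}
		return -1;
	}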

<snip>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/2] service: add API to retrieve service core active
  2020-07-22 10:37     ` [dpdk-dev] [PATCH v3 " Harry van Haaren
  2020-07-22 10:37       ` [dpdk-dev] [PATCH v3 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
@ 2020-07-22 21:39       ` Honnappa Nagarahalli
  2020-07-24 12:45       ` [dpdk-dev] [PATCH v4 " Harry van Haaren
  2 siblings, 0 replies; 31+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-22 21:39 UTC (permalink / raw)
  To: Harry van Haaren, dev
  Cc: david.marchand, igor.romanov, ferruh.yigit, nd, aconole,
	l.wojciechow, Phil Yang, Honnappa Nagarahalli, nd

<snip>

> Subject: [PATCH v3 1/2] service: add API to retrieve service core active
> 
> This commit adds a new experimental API which allows the user to retrieve
> the active state of an lcore. Knowing when the service lcore has completed its
> polling loop can be useful to applications to avoid race conditions when e.g.
> finalizing statistics.
> 
> The service thread itself now has a variable to indicate if its thread is active.
> When zero the service thread has completed its service, and has returned
> from the service_runner_func() function.
> 
> Suggested-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>

Over all looks good, few nits inline.
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> 
> ---
> 
> v3:
> - Change service lcore stores to SEQ_CST (Honnappa, David)
> - Change control thread load to ACQ (Honnappa, David)
> - Comment reasons for SEQ_CST/ACQ (Honnappa, David)
> - Add comments to Doxygen for _stop() and _lcore_active() (Honnappa, David)
> - Add Phil's review tag from ML
> ---
>  lib/librte_eal/common/rte_service.c  | 23 ++++++++++++++++++++++-
> lib/librte_eal/include/rte_service.h | 20 +++++++++++++++++++-
>  lib/librte_eal/rte_eal_version.map   |  1 +
>  3 files changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> index 6a0e0ff65..b0cf7c4ab 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -65,6 +65,7 @@ struct core_state {
>  	/* map of services IDs are run on this core */
>  	uint64_t service_mask;
>  	uint8_t runstate; /* running or stopped */
> +	uint8_t thread_active; /* indicates when thread is in service_run() */
>  	uint8_t is_service_core; /* set if core is currently a service core */
>  	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
>  	uint64_t loops;
> @@ -422,7 +423,7 @@ rte_service_may_be_active(uint32_t id)
>  		return -EINVAL;
> 
>  	for (i = 0; i < lcore_count; i++) {
> -		if (lcore_states[ids[i]].service_active_on_lcore[id])
> +	if (lcore_states[ids[i]].service_active_on_lcore[id])
nit, I think the <tab> is required here.

>  			return 1;
>  	}
> 
> @@ -457,6 +458,8 @@ service_runner_func(void *arg)
>  	const int lcore = rte_lcore_id();
>  	struct core_state *cs = &lcore_states[lcore];
> 
> +	__atomic_store_n(&cs->thread_active, 1, __ATOMIC_SEQ_CST);
> +
>  	/* runstate act as the guard variable. Use load-acquire
>  	 * memory order here to synchronize with store-release
>  	 * in runstate update functions.
> @@ -475,9 +478,27 @@ service_runner_func(void *arg)
>  		cs->loops++;
>  	}
> 
> +	/* Use SEQ CST memory ordering to avoid any re-ordering around
> +	 * this store, ensuring that once this store is visible, the service
> +	 * lcore thread really is done in service cores code.
> +	 */
> +	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_SEQ_CST);
>  	return 0;
>  }
> 
> +int32_t
> +rte_service_lcore_active(uint32_t lcore) {
> +	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
> +		return -EINVAL;
> +
> +	/* Load thread_active using ACQUIRE to avoid instructions dependent
> on
> +	 * the result being re-ordered before this load completes.
> +	 */
> +	return __atomic_load_n(&lcore_states[lcore].thread_active,
> +			       __ATOMIC_ACQUIRE);
> +}
> +
>  int32_t
>  rte_service_lcore_count(void)
>  {
> diff --git a/lib/librte_eal/include/rte_service.h
> b/lib/librte_eal/include/rte_service.h
> index e2d0a6dd3..84245020d 100644
> --- a/lib/librte_eal/include/rte_service.h
> +++ b/lib/librte_eal/include/rte_service.h
> @@ -249,7 +249,9 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
>   * Stop a service core.
>   *
>   * Stopping a core makes the core become idle, but remains  assigned as a
> - * service core.
> + * service core. Note that the service lcore thread may not have
> + returned from
> + * the service it is running when this API returns. To check if the
> + service
> + * lcore is still active, the *rte_service_lcore_active* API is added.
nit
Instead of "To check if the service lcore is still active, the *rte_service_lcore_active* API is added.", may be "*rte_service_lcore_active* API can be used to check if the service lcore is still active"?

>   *
>   * @retval 0 Success
>   * @retval -EINVAL Invalid *lcore_id* provided @@ -261,6 +263,22 @@
> int32_t rte_service_lcore_start(uint32_t lcore_id);
>   */
>  int32_t rte_service_lcore_stop(uint32_t lcore_id);
> 
> +/**
> + * Reports if a service lcore is currently running.
nit                                                     ^^^^^^^^^^^^^^^ may be "currently running mapped services"? (matching with the documentation for existing APIs)

> + *
> + * This function returns if the core has finished service cores code,
nit                                                                     ^^^^^^^^^^^^^^^^^^^^^^ may be "completed running assigned services"?

> +and has
> + * returned to EAL control. If *rte_service_lcore_stop* has been called
> +but
> + * the lcore has not returned to EAL yet, it might be required to wait
> +and call
> + * this function again. The amount of time to wait before the core
> +returns
> + * depends on the duration of the services being run.
> + *
> + * @retval 0 Service thread is not active, and has been returned to EAL.
nit,                                    ^^^^^ lcore?
> + * @retval 1 Service thread is in the service core polling loop.
nit, may be "Service lcore is running assigned services."? (matching the terms used in existing APIs)

> + * @retval -EINVAL Invalid *lcore_id* provided.
> + */
> +__rte_experimental
> +int32_t rte_service_lcore_active(uint32_t lcore_id);
> +
>  /**
>   * Adds lcore to the list of service cores.
>   *
> diff --git a/lib/librte_eal/rte_eal_version.map
> b/lib/librte_eal/rte_eal_version.map
> index bf0c17c23..d53d5d5b9 100644
> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -401,6 +401,7 @@ EXPERIMENTAL {
>  	rte_lcore_dump;
>  	rte_lcore_iterate;
>  	rte_mp_disable;
> +	rte_service_lcore_active;
>  	rte_thread_register;
>  	rte_thread_unregister;
>  };
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/2] test/service: fix race condition on stopping lcore
  2020-07-22 10:37       ` [dpdk-dev] [PATCH v3 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
@ 2020-07-22 21:40         ` Honnappa Nagarahalli
  0 siblings, 0 replies; 31+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-22 21:40 UTC (permalink / raw)
  To: Harry van Haaren, dev
  Cc: david.marchand, igor.romanov, ferruh.yigit, nd, aconole,
	l.wojciechow, Phil Yang, Honnappa Nagarahalli, nd


<snip>

> Subject: [PATCH v3 2/2] test/service: fix race condition on stopping lcore
> 
> This commit fixes a potential race condition in the tests where the lcore
> running a service would increment a counter that was already reset by the
> test-suite thread. The resulting race-condition incremented value could cause
> CI failures, as indicated by DPDK's CI.
> 
> This patch fixes the race-condition by making use of the added
> rte_service_lcore_active() API, which indicates when a service-core is no
> longer in the service-core polling loop.
> 
> The unit test makes use of the above function to detect when all statistics
> increments are done in the service-core thread, and then the unit test
> continues finalizing and checking state.
> 
> Fixes: f28f3594ded2 ("service: add attribute API")
> 
> Reported-by: David Marchand <david.marchand@redhat.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> 
> ---
> 
> v3:
> - Refactor while() to for() to simplify (Harry)
> - Use SERVICE_DELAY instead of magic const 1 (Phil)
> - Add Phil's reviewed by tag from ML
> 
> v2:
> Thanks for discussion on v1, this v2 fixup for the CI including previous
> feedback on ML.
> ---
>  app/test/test_service_cores.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c index
> ef1d8fcb9..a6bc4487e 100644
> --- a/app/test/test_service_cores.c
> +++ b/app/test/test_service_cores.c
> @@ -362,6 +362,9 @@ service_lcore_attr_get(void)
>  			"Service core add did not return zero");
>  	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 1),
>  			"Enabling valid service and core failed");
> +	/* Ensure service is not active before starting */
> +	TEST_ASSERT_EQUAL(0, rte_service_lcore_active(slcore_id),
> +			"Not-active service core reported as active");
>  	TEST_ASSERT_EQUAL(0, rte_service_lcore_start(slcore_id),
>  			"Starting service core failed");
> 
> @@ -382,7 +385,22 @@ service_lcore_attr_get(void)
>  			lcore_attr_id, &lcore_attr_value),
>  			"Invalid lcore attr didn't return -EINVAL");
> 
> -	rte_service_lcore_stop(slcore_id);
> +	/* Ensure service is active */
> +	TEST_ASSERT_EQUAL(1, rte_service_lcore_active(slcore_id),
> +			"Active service core reported as not-active");
> +
> +	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 0),
> +			"Disabling valid service and core failed");
> +	TEST_ASSERT_EQUAL(0, rte_service_lcore_stop(slcore_id),
> +			"Failed to stop service lcore");
> +
> +	/* Wait until service lcore not active, or for 100x SERVICE_DELAY */
> +	for (int i = 0; i < 100 && rte_service_lcore_active(slcore_id) == 1;
> +			i++)
> +		rte_delay_ms(SERVICE_DELAY);
> +
> +	TEST_ASSERT_EQUAL(0, rte_service_lcore_active(slcore_id),
> +			  "Service lcore not stopped after waiting.");
> 
>  	TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
>  			  "Valid lcore_attr_reset_all() didn't return success");
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active
  2020-07-22 18:50             ` Honnappa Nagarahalli
@ 2020-07-23 16:59               ` Van Haaren, Harry
  0 siblings, 0 replies; 31+ messages in thread
From: Van Haaren, Harry @ 2020-07-23 16:59 UTC (permalink / raw)
  To: Honnappa Nagarahalli, David Marchand
  Cc: dev, igor.romanov, Yigit, Ferruh, nd, aconole, l.wojciechow,
	Phil Yang, Eads, Gage, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Wednesday, July 22, 2020 7:50 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; David Marchand
> <david.marchand@redhat.com>
> Cc: dev@dpdk.org; igor.romanov@oktetlabs.ru; Yigit, Ferruh
> <ferruh.yigit@intel.com>; nd <nd@arm.com>; aconole@redhat.com;
> l.wojciechow@partner.samsung.com; Phil Yang <Phil.Yang@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Eads, Gage
> <gage.eads@intel.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v2 1/2] service: add API to retrieve service core active
> 
> + Gage (as I referring to his commit below)
> 
> <snip>
> 
> >
> > > > > > +/**
> > > > > > + * Reports if a service lcore is currently running.
> > > > > > + * @retval 0 Service thread is not active, and has been returned to
> > EAL.
> > > > > > + * @retval 1 Service thread is in the service core polling loop.
> > > > > > + * @retval -EINVAL Invalid *lcore_id* provided.
> > > > > > + */
> > > > > > +__rte_experimental
> > > > > > +int32_t rte_service_lcore_active(uint32_t lcore_id);
> > > > > Would 'rte_service_lcore_may_be_active' better? It would be inline
> > > > > with
> > > > 'rte_service_may_be_active'?
> >
> > I think the implementation behind the API is different, so I think _may_be_ is
> > not appropriate for service_lcore_active, keeping same function name for v3.
> >
> > rte_service_lcore_active() checks at a particular point in the calling thread if
> > another thread is active *at that time*. It is either active or not. This is
> > defined, it is deterministic in that the result is either yes or no, and there is
> > no ambiguity at any given check. You're right the value can change *just* after
> > the check - but at the time of the check the answer was deterministic.
> >
> > rte_service_may_be_active() checks if a service *could* be run by a service
> > core. It is not deterministic. A service lcore only sets a service as "active on
> > lcore" (or not active) when it polls it - this opens a window of
> > nondeterministic result. When a runstate is set to off, there is a window of
> > "unknown" before we know certainly that the service is not run on a service
> > core anymore. That is why I believe the _may_be_ is appropriate for this API,
> > it shows this non determinism.
> >
> 
> I am looking at this from the application usage perspective (not the
> implementation). I am pointing to the similarity that exists with the new API. i.e.
> when 'rte_service_lcore_stop' is called, it is not known if the service lcore has
> stopped. If 'rte_service_lcore_active' returns 1, it indicates the lcore 'may be'
> active (the reasoning could be different in this case), it is not guaranteed that it
> is active by the time caller checks it. But when the API returns 0, it guarantees
> that the service lcore (assuming it was started and verified that it was started),
> has stopped.
> 
> Looking at the commit e30dd31847d212cd1b766612cbd980c7d8240baa that
> added the 'rte_service_may_be_active', the use case is a mechanism to identify a
> quiescent state after a service was stopped.
> The use case for the new API is also the same. We want to identify a quiescent
> state after a service lcore is stopped.
> 
> <snip>

Sure - if you feel strongly that we need the _may_be_ to focus the end-user's
attention that this is a cross-thread check and has some quiescent property, let's
add it to err on the side of obvious. Will change API name for v3.

Saw the other feedback on patches too - will incorporate and send ASAP, might need
till tomorrow to get it done.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v4 1/2] service: add API to retrieve service core active
  2020-07-22 10:37     ` [dpdk-dev] [PATCH v3 " Harry van Haaren
  2020-07-22 10:37       ` [dpdk-dev] [PATCH v3 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
  2020-07-22 21:39       ` [dpdk-dev] [PATCH v3 1/2] service: add API to retrieve service core active Honnappa Nagarahalli
@ 2020-07-24 12:45       ` Harry van Haaren
  2020-07-24 12:45         ` [dpdk-dev] [PATCH v4 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
  2020-07-24 13:45         ` [dpdk-dev] [PATCH v5 1/2] service: add API to retrieve service core active Harry van Haaren
  2 siblings, 2 replies; 31+ messages in thread
From: Harry van Haaren @ 2020-07-24 12:45 UTC (permalink / raw)
  To: dev
  Cc: david.marchand, igor.romanov, honnappa.nagarahalli, ferruh.yigit,
	nd, aconole, l.wojciechow, phil.yang, Harry van Haaren

This commit adds a new experimental API which allows the user
to retrieve the active state of an lcore. Knowing when the service
lcore has completed its polling loop can be useful to applications
to avoid race conditions when e.g. finalizing statistics.

The service thread itself now has a variable to indicate if its
thread is active. When zero the service thread has completed its
service, and has returned from the service_runner_func() function.

Suggested-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

---

v4:
- Use _may_be_ style API for lcore_active (Honnappa)
- Fix missing tab indent (Honnappa)
- Add 'lcore' to doxygen retval description (Honnappa)

@Honnappa: Please note I did not update the doxygen title of the
lcore_may_be_active() function, as the current description is more
accurate than making it more consistent with other functions. The
function returns core active state - and I don't want to confuse
mappings of services to lcores with what the function returns.

v3:
- Change service lcore stores to SEQ_CST (Honnappa, David)
- Change control thread load to ACQ (Honnappa, David)
- Comment reasons for SEQ_CST/ACQ (Honnappa, David)
- Add comments to Doxygen for _stop() and _lcore_active() (Honnappa, David)
- Add Phil's review tag from ML
---
 lib/librte_eal/common/rte_service.c  | 21 +++++++++++++++++++++
 lib/librte_eal/include/rte_service.h | 22 +++++++++++++++++++++-
 lib/librte_eal/rte_eal_version.map   |  1 +
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 6a0e0ff65..35f1887fd 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -65,6 +65,7 @@ struct core_state {
 	/* map of services IDs are run on this core */
 	uint64_t service_mask;
 	uint8_t runstate; /* running or stopped */
+	uint8_t thread_active; /* indicates when thread is in service_run() */
 	uint8_t is_service_core; /* set if core is currently a service core */
 	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
 	uint64_t loops;
@@ -457,6 +458,8 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
+	__atomic_store_n(&cs->thread_active, 1, __ATOMIC_SEQ_CST);
+
 	/* runstate act as the guard variable. Use load-acquire
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
@@ -475,9 +478,27 @@ service_runner_func(void *arg)
 		cs->loops++;
 	}
 
+	/* Use SEQ CST memory ordering to avoid any re-ordering around
+	 * this store, ensuring that once this store is visible, the service
+	 * lcore thread really is done in service cores code.
+	 */
+	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_SEQ_CST);
 	return 0;
 }
 
+int32_t
+rte_service_lcore_may_be_active(uint32_t lcore)
+{
+	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+		return -EINVAL;
+
+	/* Load thread_active using ACQUIRE to avoid instructions dependant on
+	 * the result being re-ordered before this load completes.
+	 */
+	return __atomic_load_n(&lcore_states[lcore].thread_active,
+			       __ATOMIC_ACQUIRE);
+}
+
 int32_t
 rte_service_lcore_count(void)
 {
diff --git a/lib/librte_eal/include/rte_service.h b/lib/librte_eal/include/rte_service.h
index e2d0a6dd3..09f0d08b1 100644
--- a/lib/librte_eal/include/rte_service.h
+++ b/lib/librte_eal/include/rte_service.h
@@ -249,7 +249,11 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
  * Stop a service core.
  *
  * Stopping a core makes the core become idle, but remains  assigned as a
- * service core.
+ * service core. Note that the serivce lcore thread may not have returned from
+ * the service it is running when this API returns.
+ *
+ * The *rte_service_lcore_may_be_active* API can be used to check if the
+ * service lcore is still active.
  *
  * @retval 0 Success
  * @retval -EINVAL Invalid *lcore_id* provided
@@ -261,6 +265,22 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
  */
 int32_t rte_service_lcore_stop(uint32_t lcore_id);
 
+/**
+ * Reports if a service lcore is currently running.
+ *
+ * This function returns if the core has finished service cores code, and has
+ * returned to EAL control. If *rte_service_lcore_stop* has been called but
+ * the lcore has not returned to EAL yet, it might be required to wait and call
+ * this function again. The amount of time to wait before the core returns
+ * depends on the duration of the services being run.
+ *
+ * @retval 0 Service thread is not active, and lcore has been returned to EAL.
+ * @retval 1 Service thread is in the service core polling loop.
+ * @retval -EINVAL Invalid *lcore_id* provided.
+ */
+__rte_experimental
+int32_t rte_service_lcore_may_be_active(uint32_t lcore_id);
+
 /**
  * Adds lcore to the list of service cores.
  *
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index bf0c17c23..39826ef91 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -401,6 +401,7 @@ EXPERIMENTAL {
 	rte_lcore_dump;
 	rte_lcore_iterate;
 	rte_mp_disable;
+	rte_service_lcore_may_be_active;
 	rte_thread_register;
 	rte_thread_unregister;
 };
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v4 2/2] test/service: fix race condition on stopping lcore
  2020-07-24 12:45       ` [dpdk-dev] [PATCH v4 " Harry van Haaren
@ 2020-07-24 12:45         ` Harry van Haaren
  2020-07-24 13:45         ` [dpdk-dev] [PATCH v5 1/2] service: add API to retrieve service core active Harry van Haaren
  1 sibling, 0 replies; 31+ messages in thread
From: Harry van Haaren @ 2020-07-24 12:45 UTC (permalink / raw)
  To: dev
  Cc: david.marchand, igor.romanov, honnappa.nagarahalli, ferruh.yigit,
	nd, aconole, l.wojciechow, phil.yang, Harry van Haaren

This commit fixes a potential race condition in the tests
where the lcore running a service would increment a counter
that was already reset by the test-suite thread. The resulting
race-condition incremented value could cause CI failures, as
indicated by DPDK's CI.

This patch fixes the race-condition by making use of the
added rte_service_lcore_active() API, which indicates when
a service-core is no longer in the service-core polling loop.

The unit test makes use of the above function to detect when
all statistics increments are done in the service-core thread,
and then the unit test continues finalizing and checking state.

Fixes: f28f3594ded2 ("service: add attribute API")

Reported-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

---

v4:
- Update test to new _may_be_ style API (Honnappa)
- Add reviewed by from ML

v3:
- Refactor while() to for() to simplify (Harry)
- Use SERVICE_DELAY instead of magic const 1 (Phil)
- Add Phil's reviewed by tag from ML

v2:
Thanks for discussion on v1, this v2 fixup for the CI
including previous feedback on ML.
---
 app/test/test_service_cores.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index ef1d8fcb9..8bc1d9913 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -362,6 +362,9 @@ service_lcore_attr_get(void)
 			"Service core add did not return zero");
 	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 1),
 			"Enabling valid service and core failed");
+	/* Ensure service is not active before starting */
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_may_be_active(slcore_id),
+			"Not-active service core reported as active");
 	TEST_ASSERT_EQUAL(0, rte_service_lcore_start(slcore_id),
 			"Starting service core failed");
 
@@ -382,7 +385,22 @@ service_lcore_attr_get(void)
 			lcore_attr_id, &lcore_attr_value),
 			"Invalid lcore attr didn't return -EINVAL");
 
-	rte_service_lcore_stop(slcore_id);
+	/* Ensure service is active */
+	TEST_ASSERT_EQUAL(1, rte_service_lcore_may_be_active(slcore_id),
+			"Active service core reported as not-active");
+
+	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 0),
+			"Disabling valid service and core failed");
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_stop(slcore_id),
+			"Failed to stop service lcore");
+
+	/* Wait until service lcore not active, or for 100x SERVICE_DELAY */
+	for (int i = 0; rte_service_lcore_may_be_active(slcore_id) == 1 &&
+			i < 100; i++)
+		rte_delay_ms(SERVICE_DELAY);
+
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_may_be_active(slcore_id),
+			  "Service lcore not stopped after waiting.");
 
 	TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
 			  "Valid lcore_attr_reset_all() didn't return success");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v5 1/2] service: add API to retrieve service core active
  2020-07-24 12:45       ` [dpdk-dev] [PATCH v4 " Harry van Haaren
  2020-07-24 12:45         ` [dpdk-dev] [PATCH v4 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
@ 2020-07-24 13:45         ` Harry van Haaren
  2020-07-24 13:45           ` [dpdk-dev] [PATCH v5 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
  2020-09-14 14:31           ` [dpdk-dev] [PATCH v6 1/2] service: add API to retrieve service core active Harry van Haaren
  1 sibling, 2 replies; 31+ messages in thread
From: Harry van Haaren @ 2020-07-24 13:45 UTC (permalink / raw)
  To: dev
  Cc: david.marchand, igor.romanov, honnappa.nagarahalli, ferruh.yigit,
	nd, aconole, l.wojciechow, phil.yang, Harry van Haaren

This commit adds a new experimental API which allows the user
to retrieve the active state of an lcore. Knowing when the service
lcore has completed its polling loop can be useful to applications
to avoid race conditions when e.g. finalizing statistics.

The service thread itself now has a variable to indicate if its
thread is active. When zero the service thread has completed its
service, and has returned from the service_runner_func() function.

Suggested-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

---

v5:
- Fix typos (robot)

v4:
- Use _may_be_ style API for lcore_active (Honnappa)
- Fix missing tab indent (Honnappa)
- Add 'lcore' to doxygen retval description (Honnappa)

@Honnappa: Please note I did not update the doxygen title of the
lcore_may_be_active() function, as the current description is more
accurate than making it more consistent with other functions.

v3:
- Change service lcore stores to SEQ_CST (Honnappa, David)
- Change control thread load to ACQ (Honnappa, David)
- Comment reasons for SEQ_CST/ACQ (Honnappa, David)
- Add comments to Doxygen for _stop() and _lcore_active() (Honnappa, David)
- Add Phil's review tag from ML
---
 lib/librte_eal/common/rte_service.c  | 21 +++++++++++++++++++++
 lib/librte_eal/include/rte_service.h | 22 +++++++++++++++++++++-
 lib/librte_eal/rte_eal_version.map   |  1 +
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 6a0e0ff65..98565bbef 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -65,6 +65,7 @@ struct core_state {
 	/* map of services IDs are run on this core */
 	uint64_t service_mask;
 	uint8_t runstate; /* running or stopped */
+	uint8_t thread_active; /* indicates when thread is in service_run() */
 	uint8_t is_service_core; /* set if core is currently a service core */
 	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
 	uint64_t loops;
@@ -457,6 +458,8 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
+	__atomic_store_n(&cs->thread_active, 1, __ATOMIC_SEQ_CST);
+
 	/* runstate act as the guard variable. Use load-acquire
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
@@ -475,9 +478,27 @@ service_runner_func(void *arg)
 		cs->loops++;
 	}
 
+	/* Use SEQ CST memory ordering to avoid any re-ordering around
+	 * this store, ensuring that once this store is visible, the service
+	 * lcore thread really is done in service cores code.
+	 */
+	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_SEQ_CST);
 	return 0;
 }
 
+int32_t
+rte_service_lcore_may_be_active(uint32_t lcore)
+{
+	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+		return -EINVAL;
+
+	/* Load thread_active using ACQUIRE to avoid instructions dependent on
+	 * the result being re-ordered before this load completes.
+	 */
+	return __atomic_load_n(&lcore_states[lcore].thread_active,
+			       __ATOMIC_ACQUIRE);
+}
+
 int32_t
 rte_service_lcore_count(void)
 {
diff --git a/lib/librte_eal/include/rte_service.h b/lib/librte_eal/include/rte_service.h
index e2d0a6dd3..ca9950d09 100644
--- a/lib/librte_eal/include/rte_service.h
+++ b/lib/librte_eal/include/rte_service.h
@@ -249,7 +249,11 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
  * Stop a service core.
  *
  * Stopping a core makes the core become idle, but remains  assigned as a
- * service core.
+ * service core. Note that the service lcore thread may not have returned from
+ * the service it is running when this API returns.
+ *
+ * The *rte_service_lcore_may_be_active* API can be used to check if the
+ * service lcore is still active.
  *
  * @retval 0 Success
  * @retval -EINVAL Invalid *lcore_id* provided
@@ -261,6 +265,22 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
  */
 int32_t rte_service_lcore_stop(uint32_t lcore_id);
 
+/**
+ * Reports if a service lcore is currently running.
+ *
+ * This function returns if the core has finished service cores code, and has
+ * returned to EAL control. If *rte_service_lcore_stop* has been called but
+ * the lcore has not returned to EAL yet, it might be required to wait and call
+ * this function again. The amount of time to wait before the core returns
+ * depends on the duration of the services being run.
+ *
+ * @retval 0 Service thread is not active, and lcore has been returned to EAL.
+ * @retval 1 Service thread is in the service core polling loop.
+ * @retval -EINVAL Invalid *lcore_id* provided.
+ */
+__rte_experimental
+int32_t rte_service_lcore_may_be_active(uint32_t lcore_id);
+
 /**
  * Adds lcore to the list of service cores.
  *
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index bf0c17c23..39826ef91 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -401,6 +401,7 @@ EXPERIMENTAL {
 	rte_lcore_dump;
 	rte_lcore_iterate;
 	rte_mp_disable;
+	rte_service_lcore_may_be_active;
 	rte_thread_register;
 	rte_thread_unregister;
 };
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v5 2/2] test/service: fix race condition on stopping lcore
  2020-07-24 13:45         ` [dpdk-dev] [PATCH v5 1/2] service: add API to retrieve service core active Harry van Haaren
@ 2020-07-24 13:45           ` Harry van Haaren
  2020-09-14  8:36             ` David Marchand
  2020-09-14 14:31           ` [dpdk-dev] [PATCH v6 1/2] service: add API to retrieve service core active Harry van Haaren
  1 sibling, 1 reply; 31+ messages in thread
From: Harry van Haaren @ 2020-07-24 13:45 UTC (permalink / raw)
  To: dev
  Cc: david.marchand, igor.romanov, honnappa.nagarahalli, ferruh.yigit,
	nd, aconole, l.wojciechow, phil.yang, Harry van Haaren

This commit fixes a potential race condition in the tests
where the lcore running a service would increment a counter
that was already reset by the test-suite thread. The resulting
race-condition incremented value could cause CI failures, as
indicated by DPDK's CI.

This patch fixes the race-condition by making use of the
added rte_service_lcore_active() API, which indicates when
a service-core is no longer in the service-core polling loop.

The unit test makes use of the above function to detect when
all statistics increments are done in the service-core thread,
and then the unit test continues finalizing and checking state.

Fixes: f28f3594ded2 ("service: add attribute API")

Reported-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

---

v4:
- Update test to new _may_be_ style API (Honnappa)
- Add reviewed by from ML

v3:
- Refactor while() to for() to simplify (Harry)
- Use SERVICE_DELAY instead of magic const 1 (Phil)
- Add Phil's reviewed by tag from ML

v2:
Thanks for discussion on v1, this v2 fixup for the CI
including previous feedback on ML.
---
 app/test/test_service_cores.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index ef1d8fcb9..8bc1d9913 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -362,6 +362,9 @@ service_lcore_attr_get(void)
 			"Service core add did not return zero");
 	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 1),
 			"Enabling valid service and core failed");
+	/* Ensure service is not active before starting */
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_may_be_active(slcore_id),
+			"Not-active service core reported as active");
 	TEST_ASSERT_EQUAL(0, rte_service_lcore_start(slcore_id),
 			"Starting service core failed");
 
@@ -382,7 +385,22 @@ service_lcore_attr_get(void)
 			lcore_attr_id, &lcore_attr_value),
 			"Invalid lcore attr didn't return -EINVAL");
 
-	rte_service_lcore_stop(slcore_id);
+	/* Ensure service is active */
+	TEST_ASSERT_EQUAL(1, rte_service_lcore_may_be_active(slcore_id),
+			"Active service core reported as not-active");
+
+	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 0),
+			"Disabling valid service and core failed");
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_stop(slcore_id),
+			"Failed to stop service lcore");
+
+	/* Wait until service lcore not active, or for 100x SERVICE_DELAY */
+	for (int i = 0; rte_service_lcore_may_be_active(slcore_id) == 1 &&
+			i < 100; i++)
+		rte_delay_ms(SERVICE_DELAY);
+
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_may_be_active(slcore_id),
+			  "Service lcore not stopped after waiting.");
 
 	TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
 			  "Valid lcore_attr_reset_all() didn't return success");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v5 2/2] test/service: fix race condition on stopping lcore
  2020-07-24 13:45           ` [dpdk-dev] [PATCH v5 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
@ 2020-09-14  8:36             ` David Marchand
  2020-09-14 14:33               ` Van Haaren, Harry
  0 siblings, 1 reply; 31+ messages in thread
From: David Marchand @ 2020-09-14  8:36 UTC (permalink / raw)
  To: Harry van Haaren
  Cc: dev, Igor Romanov, Honnappa Nagarahalli, Yigit, Ferruh, nd,
	Aaron Conole, Lukasz Wojciechowski, Phil Yang

On Fri, Jul 24, 2020 at 3:44 PM Harry van Haaren
<harry.van.haaren@intel.com> wrote:
>
> This commit fixes a potential race condition in the tests
> where the lcore running a service would increment a counter
> that was already reset by the test-suite thread. The resulting
> race-condition incremented value could cause CI failures, as
> indicated by DPDK's CI.
>
> This patch fixes the race-condition by making use of the
> added rte_service_lcore_active() API, which indicates when
> a service-core is no longer in the service-core polling loop.
>
> The unit test makes use of the above function to detect when
> all statistics increments are done in the service-core thread,
> and then the unit test continues finalizing and checking state.
>
> Fixes: f28f3594ded2 ("service: add attribute API")
>
> Reported-by: David Marchand <david.marchand@redhat.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

There is still a compilation issue reported by the Intel CI.
http://mails.dpdk.org/archives/test-report/2020-July/146535.html

Please fix.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v6 1/2] service: add API to retrieve service core active
  2020-07-24 13:45         ` [dpdk-dev] [PATCH v5 1/2] service: add API to retrieve service core active Harry van Haaren
  2020-07-24 13:45           ` [dpdk-dev] [PATCH v5 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
@ 2020-09-14 14:31           ` Harry van Haaren
  2020-09-14 14:31             ` [dpdk-dev] [PATCH v6 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
  1 sibling, 1 reply; 31+ messages in thread
From: Harry van Haaren @ 2020-09-14 14:31 UTC (permalink / raw)
  To: dev; +Cc: david.marchand, Harry van Haaren

This commit adds a new experimental API which allows the user
to retrieve the active state of an lcore. Knowing when the service
lcore has completed its polling loop can be useful to applications
to avoid race conditions when e.g. finalizing statistics.

The service thread itself now has a variable to indicate if its
thread is active. When zero the service thread has completed its
service, and has returned from the service_runner_func() function.

Suggested-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

---

v5:
- Fix typos (robot)

v4:
- Use _may_be_ style API for lcore_active (Honnappa)
- Fix missing tab indent (Honnappa)
- Add 'lcore' to doxygen retval description (Honnappa)

@Honnappa: Please note I did not update the doxygen title of the
lcore_may_be_active() function, as the current description is more
accurate than making it more consistent with other functions.

v3:
- Change service lcore stores to SEQ_CST (Honnappa, David)
- Change control thread load to ACQ (Honnappa, David)
- Comment reasons for SEQ_CST/ACQ (Honnappa, David)
- Add comments to Doxygen for _stop() and _lcore_active() (Honnappa, David)
- Add Phil's review tag from ML
---
 lib/librte_eal/common/rte_service.c  | 21 +++++++++++++++++++++
 lib/librte_eal/include/rte_service.h | 22 +++++++++++++++++++++-
 lib/librte_eal/rte_eal_version.map   |  1 +
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 6a0e0ff65d..98565bbef3 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -65,6 +65,7 @@ struct core_state {
 	/* map of services IDs are run on this core */
 	uint64_t service_mask;
 	uint8_t runstate; /* running or stopped */
+	uint8_t thread_active; /* indicates when thread is in service_run() */
 	uint8_t is_service_core; /* set if core is currently a service core */
 	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
 	uint64_t loops;
@@ -457,6 +458,8 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
+	__atomic_store_n(&cs->thread_active, 1, __ATOMIC_SEQ_CST);
+
 	/* runstate act as the guard variable. Use load-acquire
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
@@ -475,9 +478,27 @@ service_runner_func(void *arg)
 		cs->loops++;
 	}
 
+	/* Use SEQ CST memory ordering to avoid any re-ordering around
+	 * this store, ensuring that once this store is visible, the service
+	 * lcore thread really is done in service cores code.
+	 */
+	__atomic_store_n(&cs->thread_active, 0, __ATOMIC_SEQ_CST);
 	return 0;
 }
 
+int32_t
+rte_service_lcore_may_be_active(uint32_t lcore)
+{
+	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+		return -EINVAL;
+
+	/* Load thread_active using ACQUIRE to avoid instructions dependent on
+	 * the result being re-ordered before this load completes.
+	 */
+	return __atomic_load_n(&lcore_states[lcore].thread_active,
+			       __ATOMIC_ACQUIRE);
+}
+
 int32_t
 rte_service_lcore_count(void)
 {
diff --git a/lib/librte_eal/include/rte_service.h b/lib/librte_eal/include/rte_service.h
index e2d0a6dd32..ca9950d091 100644
--- a/lib/librte_eal/include/rte_service.h
+++ b/lib/librte_eal/include/rte_service.h
@@ -249,7 +249,11 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
  * Stop a service core.
  *
  * Stopping a core makes the core become idle, but remains  assigned as a
- * service core.
+ * service core. Note that the service lcore thread may not have returned from
+ * the service it is running when this API returns.
+ *
+ * The *rte_service_lcore_may_be_active* API can be used to check if the
+ * service lcore is still active.
  *
  * @retval 0 Success
  * @retval -EINVAL Invalid *lcore_id* provided
@@ -261,6 +265,22 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
  */
 int32_t rte_service_lcore_stop(uint32_t lcore_id);
 
+/**
+ * Reports if a service lcore is currently running.
+ *
+ * This function returns if the core has finished service cores code, and has
+ * returned to EAL control. If *rte_service_lcore_stop* has been called but
+ * the lcore has not returned to EAL yet, it might be required to wait and call
+ * this function again. The amount of time to wait before the core returns
+ * depends on the duration of the services being run.
+ *
+ * @retval 0 Service thread is not active, and lcore has been returned to EAL.
+ * @retval 1 Service thread is in the service core polling loop.
+ * @retval -EINVAL Invalid *lcore_id* provided.
+ */
+__rte_experimental
+int32_t rte_service_lcore_may_be_active(uint32_t lcore_id);
+
 /**
  * Adds lcore to the list of service cores.
  *
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index 0b18e2ef85..55f611d071 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -395,6 +395,7 @@ EXPERIMENTAL {
 	rte_lcore_dump;
 	rte_lcore_iterate;
 	rte_mp_disable;
+	rte_service_lcore_may_be_active;
 	rte_thread_register;
 	rte_thread_unregister;
 };
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [dpdk-dev] [PATCH v6 2/2] test/service: fix race condition on stopping lcore
  2020-09-14 14:31           ` [dpdk-dev] [PATCH v6 1/2] service: add API to retrieve service core active Harry van Haaren
@ 2020-09-14 14:31             ` Harry van Haaren
  2020-09-21 14:51               ` David Marchand
  0 siblings, 1 reply; 31+ messages in thread
From: Harry van Haaren @ 2020-09-14 14:31 UTC (permalink / raw)
  To: dev; +Cc: david.marchand, Harry van Haaren

This commit fixes a potential race condition in the tests
where the lcore running a service would increment a counter
that was already reset by the test-suite thread. The resulting
race-condition incremented value could cause CI failures, as
indicated by DPDK's CI.

This patch fixes the race-condition by making use of the
added rte_service_lcore_active() API, which indicates when
a service-core is no longer in the service-core polling loop.

The unit test makes use of the above function to detect when
all statistics increments are done in the service-core thread,
and then the unit test continues finalizing and checking state.

Fixes: f28f3594ded2 ("service: add attribute API")

Reported-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

---

v6:
- Fix CI issue on C99 style loop initializer (David)

v4:
- Update test to new _may_be_ style API (Honnappa)
- Add reviewed by from ML

v3:
- Refactor while() to for() to simplify (Harry)
- Use SERVICE_DELAY instead of magic const 1 (Phil)
- Add Phil's reviewed by tag from ML

v2:
Thanks for discussion on v1, this v2 fixup for the CI
including previous feedback on ML.
---
 app/test/test_service_cores.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index ef1d8fcb9b..5d92bea8af 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -362,6 +362,9 @@ service_lcore_attr_get(void)
 			"Service core add did not return zero");
 	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 1),
 			"Enabling valid service and core failed");
+	/* Ensure service is not active before starting */
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_may_be_active(slcore_id),
+			"Not-active service core reported as active");
 	TEST_ASSERT_EQUAL(0, rte_service_lcore_start(slcore_id),
 			"Starting service core failed");
 
@@ -382,7 +385,23 @@ service_lcore_attr_get(void)
 			lcore_attr_id, &lcore_attr_value),
 			"Invalid lcore attr didn't return -EINVAL");
 
-	rte_service_lcore_stop(slcore_id);
+	/* Ensure service is active */
+	TEST_ASSERT_EQUAL(1, rte_service_lcore_may_be_active(slcore_id),
+			"Active service core reported as not-active");
+
+	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 0),
+			"Disabling valid service and core failed");
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_stop(slcore_id),
+			"Failed to stop service lcore");
+
+	/* Wait until service lcore not active, or for 100x SERVICE_DELAY */
+	int i;
+	for (i = 0; rte_service_lcore_may_be_active(slcore_id) == 1 &&
+			i < 100; i++)
+		rte_delay_ms(SERVICE_DELAY);
+
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_may_be_active(slcore_id),
+			  "Service lcore not stopped after waiting.");
 
 	TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
 			  "Valid lcore_attr_reset_all() didn't return success");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v5 2/2] test/service: fix race condition on stopping lcore
  2020-09-14  8:36             ` David Marchand
@ 2020-09-14 14:33               ` Van Haaren, Harry
  0 siblings, 0 replies; 31+ messages in thread
From: Van Haaren, Harry @ 2020-09-14 14:33 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, Igor Romanov, Honnappa Nagarahalli, Yigit, Ferruh, nd,
	Aaron Conole, Lukasz Wojciechowski, Phil Yang

> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Monday, September 14, 2020 9:37 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: dev <dev@dpdk.org>; Igor Romanov <igor.romanov@oktetlabs.ru>; Honnappa
> Nagarahalli <honnappa.nagarahalli@arm.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>; nd <nd@arm.com>; Aaron Conole
> <aconole@redhat.com>; Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com>; Phil Yang <phil.yang@arm.com>
> Subject: Re: [PATCH v5 2/2] test/service: fix race condition on stopping lcore
> 
> On Fri, Jul 24, 2020 at 3:44 PM Harry van Haaren
> <harry.van.haaren@intel.com> wrote:
> >
> > This commit fixes a potential race condition in the tests
> > where the lcore running a service would increment a counter
> > that was already reset by the test-suite thread. The resulting
> > race-condition incremented value could cause CI failures, as
> > indicated by DPDK's CI.
> >
> > This patch fixes the race-condition by making use of the
> > added rte_service_lcore_active() API, which indicates when
> > a service-core is no longer in the service-core polling loop.
> >
> > The unit test makes use of the above function to detect when
> > all statistics increments are done in the service-core thread,
> > and then the unit test continues finalizing and checking state.
> >
> > Fixes: f28f3594ded2 ("service: add attribute API")
> >
> > Reported-by: David Marchand <david.marchand@redhat.com>
> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> There is still a compilation issue reported by the Intel CI.
> http://mails.dpdk.org/archives/test-report/2020-July/146535.html
> 
> Please fix.

Done, v6 sent; http://patches.dpdk.org/project/dpdk/list/?series=12198


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v6 2/2] test/service: fix race condition on stopping lcore
  2020-09-14 14:31             ` [dpdk-dev] [PATCH v6 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
@ 2020-09-21 14:51               ` David Marchand
  2020-10-13 19:45                 ` David Marchand
  0 siblings, 1 reply; 31+ messages in thread
From: David Marchand @ 2020-09-21 14:51 UTC (permalink / raw)
  To: Harry van Haaren
  Cc: dev, Lukasz Wojciechowski, Honnappa Nagarahalli, Phil Yang, Aaron Conole

On Mon, Sep 14, 2020 at 4:30 PM Harry van Haaren
<harry.van.haaren@intel.com> wrote:
>
> This commit fixes a potential race condition in the tests
> where the lcore running a service would increment a counter
> that was already reset by the test-suite thread. The resulting
> race-condition incremented value could cause CI failures, as
> indicated by DPDK's CI.
>
> This patch fixes the race-condition by making use of the
> added rte_service_lcore_active() API, which indicates when
> a service-core is no longer in the service-core polling loop.
>
> The unit test makes use of the above function to detect when
> all statistics increments are done in the service-core thread,
> and then the unit test continues finalizing and checking state.
>
> Fixes: f28f3594ded2 ("service: add attribute API")
>
> Reported-by: David Marchand <david.marchand@redhat.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Series applied, thanks.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v6 2/2] test/service: fix race condition on stopping lcore
  2020-09-21 14:51               ` David Marchand
@ 2020-10-13 19:45                 ` David Marchand
  2020-10-15  8:11                   ` Van Haaren, Harry
  0 siblings, 1 reply; 31+ messages in thread
From: David Marchand @ 2020-10-13 19:45 UTC (permalink / raw)
  To: Harry van Haaren
  Cc: dev, Lukasz Wojciechowski, Honnappa Nagarahalli, Phil Yang, Aaron Conole

Hello Harry,

Long time no see :-)

On Mon, Sep 21, 2020 at 4:51 PM David Marchand
<david.marchand@redhat.com> wrote:
>
> On Mon, Sep 14, 2020 at 4:30 PM Harry van Haaren
> <harry.van.haaren@intel.com> wrote:
> >
> > This commit fixes a potential race condition in the tests
> > where the lcore running a service would increment a counter
> > that was already reset by the test-suite thread. The resulting
> > race-condition incremented value could cause CI failures, as
> > indicated by DPDK's CI.
> >
> > This patch fixes the race-condition by making use of the
> > added rte_service_lcore_active() API, which indicates when
> > a service-core is no longer in the service-core polling loop.
> >
> > The unit test makes use of the above function to detect when
> > all statistics increments are done in the service-core thread,
> > and then the unit test continues finalizing and checking state.
> >
> > Fixes: f28f3594ded2 ("service: add attribute API")
> >
> > Reported-by: David Marchand <david.marchand@redhat.com>
> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>

We probably need a followup fix for:
https://travis-ci.com/github/DPDK/dpdk/jobs/398954463#L10088


The race is in service_attr_get where we look at/reset spent cycles
while a service lcore is still running.
Quoting this test code:

rte_service_lcore_stop(slcore_id);

TEST_ASSERT_EQUAL(0, rte_service_attr_get(id, attr_calls, &attr_value),
"Valid attr_get() call didn't return success");
TEST_ASSERT_EQUAL(1, (attr_value > 0),
"attr_get() call didn't get call count (zero)");

TEST_ASSERT_EQUAL(0, rte_service_attr_reset_all(id),
"Valid attr_reset_all() return success");

TEST_ASSERT_EQUAL(0, rte_service_attr_get(id, attr_id, &attr_value),
"Valid attr_get() call didn't return success");


-- 
David Marchand


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dpdk-dev] [PATCH v6 2/2] test/service: fix race condition on stopping lcore
  2020-10-13 19:45                 ` David Marchand
@ 2020-10-15  8:11                   ` Van Haaren, Harry
  0 siblings, 0 replies; 31+ messages in thread
From: Van Haaren, Harry @ 2020-10-15  8:11 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, Lukasz Wojciechowski, Honnappa Nagarahalli, Phil Yang, Aaron Conole

> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Tuesday, October 13, 2020 8:45 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: dev <dev@dpdk.org>; Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Phil Yang <phil.yang@arm.com>; Aaron
> Conole <aconole@redhat.com>
> Subject: Re: [PATCH v6 2/2] test/service: fix race condition on stopping lcore
> 
> Hello Harry,
> 
> Long time no see :-)
> 
> On Mon, Sep 21, 2020 at 4:51 PM David Marchand
> <david.marchand@redhat.com> wrote:
> >
> > On Mon, Sep 14, 2020 at 4:30 PM Harry van Haaren
> > <harry.van.haaren@intel.com> wrote:
> > >
> > > This commit fixes a potential race condition in the tests
> > > where the lcore running a service would increment a counter
> > > that was already reset by the test-suite thread. The resulting
> > > race-condition incremented value could cause CI failures, as
> > > indicated by DPDK's CI.
> > >
> > > This patch fixes the race-condition by making use of the
> > > added rte_service_lcore_active() API, which indicates when
> > > a service-core is no longer in the service-core polling loop.
> > >
> > > The unit test makes use of the above function to detect when
> > > all statistics increments are done in the service-core thread,
> > > and then the unit test continues finalizing and checking state.
> > >
> > > Fixes: f28f3594ded2 ("service: add attribute API")
> > >
> > > Reported-by: David Marchand <david.marchand@redhat.com>
> > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >
> 
> We probably need a followup fix for:
> https://travis-ci.com/github/DPDK/dpdk/jobs/398954463#L10088
> 
> 
> The race is in service_attr_get where we look at/reset spent cycles
> while a service lcore is still running.
> Quoting this test code:
> 
> rte_service_lcore_stop(slcore_id);

/* TODO: implement wait for slcore_id to stop polling here */
 
> TEST_ASSERT_EQUAL(0, rte_service_attr_get(id, attr_calls, &attr_value),
> "Valid attr_get() call didn't return success");
> TEST_ASSERT_EQUAL(1, (attr_value > 0),
> "attr_get() call didn't get call count (zero)");
> 
> TEST_ASSERT_EQUAL(0, rte_service_attr_reset_all(id),
> "Valid attr_reset_all() return success");
> 
> TEST_ASSERT_EQUAL(0, rte_service_attr_get(id, attr_id, &attr_value),
> "Valid attr_get() call didn't return success");

Based on the output you provided ( https://travis-ci.com/github/DPDK/dpdk/jobs/398954463#L10088 )
and the above, indeed it seems that the core may not have stopped yet (race-cond).

Will send a patch - thanks for reporting with details, -Harry 
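
For what it is worth, a sketch of what that follow-up could look like in
service_attr_get(), assuming it reuses the wait loop already added to
service_lcore_attr_get() in this series; this is an illustration only,
not the actual patch:

	rte_service_lcore_stop(slcore_id);

	/* Wait until the service lcore has left its polling loop, or for
	 * 100x SERVICE_DELAY, before reading or resetting the statistics.
	 */
	int i;
	for (i = 0; rte_service_lcore_may_be_active(slcore_id) == 1 &&
			i < 100; i++)
		rte_delay_ms(SERVICE_DELAY);

	TEST_ASSERT_EQUAL(0, rte_service_attr_get(id, attr_calls, &attr_value),
			"Valid attr_get() call didn't return success");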

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2020-10-15  8:12 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20200720120850eucas1p10debcafa273d244a7e63d225c50cc9df@eucas1p1.samsung.com>
2020-07-20 12:09 ` [dpdk-dev] [PATCH] service: fix stop API to wait for service thread Harry van Haaren
2020-07-20 12:51   ` Lukasz Wojciechowski
2020-07-20 14:20     ` Van Haaren, Harry
2020-07-20 14:38   ` [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active Harry van Haaren
2020-07-20 14:38     ` [dpdk-dev] [PATCH v2 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
2020-07-20 17:45       ` Lukasz Wojciechowski
2020-07-21  8:38       ` Phil Yang
2020-07-22 10:26         ` Van Haaren, Harry
2020-07-20 17:45     ` [dpdk-dev] [PATCH v2 1/2] service: add API to retrieve service core active Lukasz Wojciechowski
2020-07-21  7:47     ` Phil Yang
2020-07-21 19:43     ` Honnappa Nagarahalli
2020-07-21 19:50       ` David Marchand
2020-07-21 20:23         ` Honnappa Nagarahalli
2020-07-22 10:14           ` Van Haaren, Harry
2020-07-22 18:50             ` Honnappa Nagarahalli
2020-07-23 16:59               ` Van Haaren, Harry
2020-07-22 10:37     ` [dpdk-dev] [PATCH v3 " Harry van Haaren
2020-07-22 10:37       ` [dpdk-dev] [PATCH v3 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
2020-07-22 21:40         ` Honnappa Nagarahalli
2020-07-22 21:39       ` [dpdk-dev] [PATCH v3 1/2] service: add API to retrieve service core active Honnappa Nagarahalli
2020-07-24 12:45       ` [dpdk-dev] [PATCH v4 " Harry van Haaren
2020-07-24 12:45         ` [dpdk-dev] [PATCH v4 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
2020-07-24 13:45         ` [dpdk-dev] [PATCH v5 1/2] service: add API to retrieve service core active Harry van Haaren
2020-07-24 13:45           ` [dpdk-dev] [PATCH v5 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
2020-09-14  8:36             ` David Marchand
2020-09-14 14:33               ` Van Haaren, Harry
2020-09-14 14:31           ` [dpdk-dev] [PATCH v6 1/2] service: add API to retrieve service core active Harry van Haaren
2020-09-14 14:31             ` [dpdk-dev] [PATCH v6 2/2] test/service: fix race condition on stopping lcore Harry van Haaren
2020-09-21 14:51               ` David Marchand
2020-10-13 19:45                 ` David Marchand
2020-10-15  8:11                   ` Van Haaren, Harry
