DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH v1 0/2] remove smp barriers in app/test
@ 2020-12-22  6:30 Feifei Wang
  2020-12-22  6:30 ` [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test Feifei Wang
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Feifei Wang @ 2020-12-22  6:30 UTC (permalink / raw)
  Cc: dev, nd, Honnappa.Nagarahalli, ruifeng.wang, Feifei Wang

For smp barriers in app/test, remove the unnecessary barriers and fix
some bugs.

Feifei Wang (2):
  app/test: remove unnecessary barriers for ring stress test
  app/test: collect perf data after worker threads exit

 app/test/test_ring_stress_impl.h | 5 +----
 app/test/test_trace_perf.c       | 5 ++---
 2 files changed, 3 insertions(+), 7 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
  2020-12-22  6:30 [dpdk-dev] [PATCH v1 0/2] remove smp barriers in app/test Feifei Wang
@ 2020-12-22  6:30 ` Feifei Wang
  2020-12-22 12:42   ` Ananyev, Konstantin
  2020-12-22  6:30 ` [dpdk-dev] [PATCH v1 2/2] app/test: collect perf data after worker threads exit Feifei Wang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Feifei Wang @ 2020-12-22  6:30 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Konstantin Ananyev
  Cc: dev, nd, Honnappa.Nagarahalli, ruifeng.wang, Feifei Wang

The variable "wrk_cmd" is a signal to control threads from running and
stopping. When worker lcores load "wrk_cmd == WRK_CMD_RUN", they start
running and when worker lcores load "wrk_cmd == WRK_CMD_STOP", they
stop.

For the wmb in test_mt1, no storing operations must keep the order
after storing "wrk_cmd". Thus the wmb is unnecessary.

For the rmb in test_worker, the parameters have been prepared when
worker lcores call "test_worker". It is unnessary to wait wrk_cmd to be
loaded, then the parameters can be loaded, So the rmb can be removed.

In the meanwhile, fix a typo. The note above storing "stop" into
"wrk_cmd" should be "stop test" rather than "start test".

Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 app/test/test_ring_stress_impl.h | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/app/test/test_ring_stress_impl.h b/app/test/test_ring_stress_impl.h
index f9ca63b90..384555ef9 100644
--- a/app/test/test_ring_stress_impl.h
+++ b/app/test/test_ring_stress_impl.h
@@ -198,7 +198,6 @@ test_worker(void *arg, const char *fname, int32_t prcs)
 	fill_ring_elm(&loc_elm, lc);
 
 	while (wrk_cmd != WRK_CMD_RUN) {
-		rte_smp_rmb();
 		rte_pause();
 	}
 
@@ -357,13 +356,11 @@ test_mt1(int (*test)(void *))
 
 	/* signal worker to start test */
 	wrk_cmd = WRK_CMD_RUN;
-	rte_smp_wmb();
 
 	usleep(run_time * US_PER_S);
 
-	/* signal worker to start test */
+	/* signal worker to stop test */
 	wrk_cmd = WRK_CMD_STOP;
-	rte_smp_wmb();
 
 	/* wait for workers and collect stats. */
 	mc = rte_lcore_id();
-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [dpdk-dev] [PATCH v1 2/2] app/test: collect perf data after worker threads exit
  2020-12-22  6:30 [dpdk-dev] [PATCH v1 0/2] remove smp barriers in app/test Feifei Wang
  2020-12-22  6:30 ` [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test Feifei Wang
@ 2020-12-22  6:30 ` Feifei Wang
  2021-03-10  2:12 ` [dpdk-dev] [PATCH v3 0/1] remove smp barriers in app/test Feifei Wang
  2021-03-10  2:15 ` [dpdk-dev] [PATCH v3 0/1] remove smp barriers in app/test Feifei Wang
  3 siblings, 0 replies; 19+ messages in thread
From: Feifei Wang @ 2020-12-22  6:30 UTC (permalink / raw)
  To: Jerin Jacob, Sunil Kumar Kori, David Marchand
  Cc: dev, nd, Honnappa.Nagarahalli, ruifeng.wang, Feifei Wang, stable

The measure_perf function should be excuted after worker threads exit
to collect correct perf data. Otherwise, while workers are running, the
main thread may get incomplete data from workers.

In the meanwhile, remove unnecessary barrier in the test.
For signal variables "ldata.done" and "ldata.start", no operations
should keep the order that being executed after them. So the wmb after
them can be moved.

Fixes: 16a277a24c9f ("test/trace: add performance test cases")
Cc: jerinj@marvell.com
Cc: stable@dpdk.org

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 app/test/test_trace_perf.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index e1ad8e6f5..46ae7d807 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -79,7 +79,6 @@ signal_workers_to_finish(struct test_data *data)
 
 	for (workers = 0; workers < data->nb_workers; workers++) {
 		data->ldata[workers].done = 1;
-		rte_smp_wmb();
 	}
 }
 
@@ -102,7 +101,6 @@ worker_fn_##func(void *arg) \
 { \
 	struct lcore_data *ldata = arg; \
 	ldata->started = 1; \
-	rte_smp_wmb(); \
 	__worker_##func(ldata); \
 	return 0; \
 }
@@ -137,11 +135,12 @@ run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
 
 	wait_till_workers_are_ready(data);
 	rte_delay_ms(100); /* Wait for some time to accumulate the stats */
-	measure_perf(str, data);
 	signal_workers_to_finish(data);
 
 	RTE_LCORE_FOREACH_WORKER(id)
 		rte_eal_wait_lcore(id);
+
+	measure_perf(str, data);
 }
 
 static int
-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
  2020-12-22  6:30 ` [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test Feifei Wang
@ 2020-12-22 12:42   ` Ananyev, Konstantin
  2021-01-27 23:00     ` Honnappa Nagarahalli
  0 siblings, 1 reply; 19+ messages in thread
From: Ananyev, Konstantin @ 2020-12-22 12:42 UTC (permalink / raw)
  To: Feifei Wang, Honnappa Nagarahalli
  Cc: dev, nd, Honnappa.Nagarahalli, ruifeng.wang

Hi Feifei,

> 
> The variable "wrk_cmd" is a signal to control threads from running and
> stopping. When worker lcores load "wrk_cmd == WRK_CMD_RUN", they start
> running and when worker lcores load "wrk_cmd == WRK_CMD_STOP", they
> stop.
> 
> For the wmb in test_mt1, no storing operations must keep the order
> after storing "wrk_cmd". Thus the wmb is unnecessary.

I think there is a bug in my original code, we should do smp_wmb() *before*  
setting wrk_cmd, not after:

        /* launch on all workers */
        RTE_LCORE_FOREACH_WORKER(lc) {
                arg[lc].rng = r;
                arg[lc].stats = init_stat;
                rte_eal_remote_launch(test, &arg[lc], lc);
        }

        /* signal worker to start test */
+      rte_smp_wmb();
        wrk_cmd = WRK_CMD_RUN;
-       rte_smp_wmb();

        usleep(run_time * US_PER_S);


I still think we'd better have some synchronisation here.
Otherwise what would prevent compiler and/or cpu to update wrk_cmd out of order
(before _init_ phase is completed)?
We probably can safely assume no reordering from the compiler here,
as we have function calls straight before and after 'wrk_cmd = WRK_CMD_RUN;'
But for consistency and easier maintenance, I still think it is better
to have something here, after all it is not performance critical pass. 

> For the rmb in test_worker, the parameters have been prepared when
> worker lcores call "test_worker". It is unnessary to wait wrk_cmd to be
> loaded, then the parameters can be loaded, So the rmb can be removed.

It is not only about parameters loading,  it is to prevent worker core to start too early.

As I understand, your goal is to get rid of rte_smp_*() calls.
Might be better to replace such places here with _atomic_ semantics.
Then, as I can see, we also can get rid of 'volatile' fo wrk_cmd.
 
> In the meanwhile, fix a typo. The note above storing "stop" into
> "wrk_cmd" should be "stop test" rather than "start test".
> 
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  app/test/test_ring_stress_impl.h | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/app/test/test_ring_stress_impl.h b/app/test/test_ring_stress_impl.h
> index f9ca63b90..384555ef9 100644
> --- a/app/test/test_ring_stress_impl.h
> +++ b/app/test/test_ring_stress_impl.h
> @@ -198,7 +198,6 @@ test_worker(void *arg, const char *fname, int32_t prcs)
>  	fill_ring_elm(&loc_elm, lc);
> 
>  	while (wrk_cmd != WRK_CMD_RUN) {
> -		rte_smp_rmb();
>  		rte_pause();
>  	}
> 
> @@ -357,13 +356,11 @@ test_mt1(int (*test)(void *))
> 
>  	/* signal worker to start test */
>  	wrk_cmd = WRK_CMD_RUN;
> -	rte_smp_wmb();
> 
>  	usleep(run_time * US_PER_S);
> 
> -	/* signal worker to start test */
> +	/* signal worker to stop test */
>  	wrk_cmd = WRK_CMD_STOP;
> -	rte_smp_wmb();
> 
>  	/* wait for workers and collect stats. */
>  	mc = rte_lcore_id();
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
  2020-12-22 12:42   ` Ananyev, Konstantin
@ 2021-01-27 23:00     ` Honnappa Nagarahalli
  2021-01-28 14:43       ` Ananyev, Konstantin
  0 siblings, 1 reply; 19+ messages in thread
From: Honnappa Nagarahalli @ 2021-01-27 23:00 UTC (permalink / raw)
  To: Ananyev, Konstantin, Feifei Wang
  Cc: dev, nd, Ruifeng Wang, Honnappa Nagarahalli, nd

<snip>

> 
> Hi Feifei,
> 
> >
> > The variable "wrk_cmd" is a signal to control threads from running and
> > stopping. When worker lcores load "wrk_cmd == WRK_CMD_RUN", they
> start
> > running and when worker lcores load "wrk_cmd == WRK_CMD_STOP",
> they
> > stop.
> >
> > For the wmb in test_mt1, no storing operations must keep the order
> > after storing "wrk_cmd". Thus the wmb is unnecessary.
> 
> I think there is a bug in my original code, we should do smp_wmb() *before*
> setting wrk_cmd, not after:
> 
>         /* launch on all workers */
>         RTE_LCORE_FOREACH_WORKER(lc) {
>                 arg[lc].rng = r;
>                 arg[lc].stats = init_stat;
>                 rte_eal_remote_launch(test, &arg[lc], lc);
>         }
> 
>         /* signal worker to start test */
> +      rte_smp_wmb();
>         wrk_cmd = WRK_CMD_RUN;
> -       rte_smp_wmb();
> 
>         usleep(run_time * US_PER_S);
> 
> 
> I still think we'd better have some synchronisation here.
> Otherwise what would prevent compiler and/or cpu to update wrk_cmd out
> of order (before _init_ phase is completed)?
> We probably can safely assume no reordering from the compiler here, as we
> have function calls straight before and after 'wrk_cmd = WRK_CMD_RUN;'
> But for consistency and easier maintenance, I still think it is better to have
> something here, after all it is not performance critical pass.
Agree that this is not performance critical.

This is more about correctness (as usually people refer to code to understand the concepts). You can refer to video [1]. Essentially, the pthread_create has 'happens-before' behavior. i.e. all the memory operations before the pthread_create are visible to the new thread. The rte_smp_rmb() barrier in the thread function is not required as it reads the data that was set before the thread was launched.

I do not know why rte_smp_wmb is required. The update to 'wrk_cmd' is seen by the thread eventually. rte_smp_wmb does not result in update being seen sooner/immediately. 

[1] https://www.youtube.com/watch?t=4170&v=KeLBd2EJLOU&feature=youtu.be
> 
> > For the rmb in test_worker, the parameters have been prepared when
> > worker lcores call "test_worker". It is unnessary to wait wrk_cmd to
> > be loaded, then the parameters can be loaded, So the rmb can be
> removed.
> 
> It is not only about parameters loading,  it is to prevent worker core to start
> too early.
Because 'pthread_launch' provides the 'happens-before' behavior, the worker core will see the updates that happened before the worker was launched.

I suggest changing the commit log to provide the reasoning around pthread_create.

> 
> As I understand, your goal is to get rid of rte_smp_*() calls.
> Might be better to replace such places here with _atomic_ semantics.
> Then, as I can see, we also can get rid of 'volatile' fo wrk_cmd.
> 
> > In the meanwhile, fix a typo. The note above storing "stop" into
> > "wrk_cmd" should be "stop test" rather than "start test".
> >
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> >  app/test/test_ring_stress_impl.h | 5 +----
> >  1 file changed, 1 insertion(+), 4 deletions(-)
> >
> > diff --git a/app/test/test_ring_stress_impl.h
> > b/app/test/test_ring_stress_impl.h
> > index f9ca63b90..384555ef9 100644
> > --- a/app/test/test_ring_stress_impl.h
> > +++ b/app/test/test_ring_stress_impl.h
> > @@ -198,7 +198,6 @@ test_worker(void *arg, const char *fname, int32_t
> prcs)
> >  	fill_ring_elm(&loc_elm, lc);
> >
> >  	while (wrk_cmd != WRK_CMD_RUN) {
> > -		rte_smp_rmb();
> >  		rte_pause();
> >  	}
> >
> > @@ -357,13 +356,11 @@ test_mt1(int (*test)(void *))
> >
> >  	/* signal worker to start test */
> >  	wrk_cmd = WRK_CMD_RUN;
> > -	rte_smp_wmb();
> >
> >  	usleep(run_time * US_PER_S);
> >
> > -	/* signal worker to start test */
> > +	/* signal worker to stop test */
> >  	wrk_cmd = WRK_CMD_STOP;
> > -	rte_smp_wmb();
> >
> >  	/* wait for workers and collect stats. */
> >  	mc = rte_lcore_id();
> > --
> > 2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
  2021-01-27 23:00     ` Honnappa Nagarahalli
@ 2021-01-28 14:43       ` Ananyev, Konstantin
  2021-01-29  3:17         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 19+ messages in thread
From: Ananyev, Konstantin @ 2021-01-28 14:43 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Feifei Wang; +Cc: dev, nd, Ruifeng Wang, nd


> >
> > Hi Feifei,
> >
> > >
> > > The variable "wrk_cmd" is a signal to control threads from running and
> > > stopping. When worker lcores load "wrk_cmd == WRK_CMD_RUN", they
> > start
> > > running and when worker lcores load "wrk_cmd == WRK_CMD_STOP",
> > they
> > > stop.
> > >
> > > For the wmb in test_mt1, no storing operations must keep the order
> > > after storing "wrk_cmd". Thus the wmb is unnecessary.
> >
> > I think there is a bug in my original code, we should do smp_wmb() *before*
> > setting wrk_cmd, not after:
> >
> >         /* launch on all workers */
> >         RTE_LCORE_FOREACH_WORKER(lc) {
> >                 arg[lc].rng = r;
> >                 arg[lc].stats = init_stat;
> >                 rte_eal_remote_launch(test, &arg[lc], lc);
> >         }
> >
> >         /* signal worker to start test */
> > +      rte_smp_wmb();
> >         wrk_cmd = WRK_CMD_RUN;
> > -       rte_smp_wmb();
> >
> >         usleep(run_time * US_PER_S);
> >
> >
> > I still think we'd better have some synchronisation here.
> > Otherwise what would prevent compiler and/or cpu to update wrk_cmd out
> > of order (before _init_ phase is completed)?
> > We probably can safely assume no reordering from the compiler here, as we
> > have function calls straight before and after 'wrk_cmd = WRK_CMD_RUN;'
> > But for consistency and easier maintenance, I still think it is better to have
> > something here, after all it is not performance critical pass.
> Agree that this is not performance critical.
> 
> This is more about correctness (as usually people refer to code to understand the concepts). You can refer to video [1]. Essentially, the
> pthread_create has 'happens-before' behavior. i.e. all the memory operations before the pthread_create are visible to the new thread. The
> rte_smp_rmb() barrier in the thread function is not required as it reads the data that was set before the thread was launched.

rte_eal_remote_launch() doesn't call pthread_create().
All it does -  updates global variable (lcore_config) and writes/reads to/from the pipe.

> 
> I do not know why rte_smp_wmb is required. The update to 'wrk_cmd' is seen by the thread eventually. rte_smp_wmb does not result in
> update being seen sooner/immediately.

We don't need it sooner.
We need to make sure it wouldn't be seen by any worker thread before all workers are launched.
To make sure all workers start the test at approximately same moment.
That's why I think wmb() should be before 'wrk_cmd = WRK_CMD_RUN;' in my original code.

> 
> [1] https://www.youtube.com/watch?t=4170&v=KeLBd2EJLOU&feature=youtu.be
> >
> > > For the rmb in test_worker, the parameters have been prepared when
> > > worker lcores call "test_worker". It is unnessary to wait wrk_cmd to
> > > be loaded, then the parameters can be loaded, So the rmb can be
> > removed.
> >
> > It is not only about parameters loading,  it is to prevent worker core to start
> > too early.
> Because 'pthread_launch' provides the 'happens-before' behavior, the worker core will see the updates that happened before the worker
> was launched.
> 
> I suggest changing the commit log to provide the reasoning around pthread_create.
> 
> >
> > As I understand, your goal is to get rid of rte_smp_*() calls.
> > Might be better to replace such places here with _atomic_ semantics.
> > Then, as I can see, we also can get rid of 'volatile' fo wrk_cmd.
> >
> > > In the meanwhile, fix a typo. The note above storing "stop" into
> > > "wrk_cmd" should be "stop test" rather than "start test".
> > >
> > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > ---
> > >  app/test/test_ring_stress_impl.h | 5 +----
> > >  1 file changed, 1 insertion(+), 4 deletions(-)
> > >
> > > diff --git a/app/test/test_ring_stress_impl.h
> > > b/app/test/test_ring_stress_impl.h
> > > index f9ca63b90..384555ef9 100644
> > > --- a/app/test/test_ring_stress_impl.h
> > > +++ b/app/test/test_ring_stress_impl.h
> > > @@ -198,7 +198,6 @@ test_worker(void *arg, const char *fname, int32_t
> > prcs)
> > >  	fill_ring_elm(&loc_elm, lc);
> > >
> > >  	while (wrk_cmd != WRK_CMD_RUN) {
> > > -		rte_smp_rmb();
> > >  		rte_pause();
> > >  	}
> > >
> > > @@ -357,13 +356,11 @@ test_mt1(int (*test)(void *))
> > >
> > >  	/* signal worker to start test */
> > >  	wrk_cmd = WRK_CMD_RUN;
> > > -	rte_smp_wmb();
> > >
> > >  	usleep(run_time * US_PER_S);
> > >
> > > -	/* signal worker to start test */
> > > +	/* signal worker to stop test */
> > >  	wrk_cmd = WRK_CMD_STOP;
> > > -	rte_smp_wmb();
> > >
> > >  	/* wait for workers and collect stats. */
> > >  	mc = rte_lcore_id();
> > > --
> > > 2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
  2021-01-28 14:43       ` Ananyev, Konstantin
@ 2021-01-29  3:17         ` Honnappa Nagarahalli
  2021-01-29  4:58           ` Stephen Hemminger
  0 siblings, 1 reply; 19+ messages in thread
From: Honnappa Nagarahalli @ 2021-01-29  3:17 UTC (permalink / raw)
  To: Ananyev, Konstantin, Feifei Wang
  Cc: dev, nd, Ruifeng Wang, Honnappa Nagarahalli, nd

<snip>

> 
> > >
> > > Hi Feifei,
> > >
> > > >
> > > > The variable "wrk_cmd" is a signal to control threads from running
> > > > and stopping. When worker lcores load "wrk_cmd ==
> WRK_CMD_RUN",
> > > > they
> > > start
> > > > running and when worker lcores load "wrk_cmd == WRK_CMD_STOP",
> > > they
> > > > stop.
> > > >
> > > > For the wmb in test_mt1, no storing operations must keep the order
> > > > after storing "wrk_cmd". Thus the wmb is unnecessary.
> > >
> > > I think there is a bug in my original code, we should do smp_wmb()
> > > *before* setting wrk_cmd, not after:
> > >
> > >         /* launch on all workers */
> > >         RTE_LCORE_FOREACH_WORKER(lc) {
> > >                 arg[lc].rng = r;
> > >                 arg[lc].stats = init_stat;
> > >                 rte_eal_remote_launch(test, &arg[lc], lc);
> > >         }
> > >
> > >         /* signal worker to start test */
> > > +      rte_smp_wmb();
> > >         wrk_cmd = WRK_CMD_RUN;
> > > -       rte_smp_wmb();
> > >
> > >         usleep(run_time * US_PER_S);
> > >
> > >
> > > I still think we'd better have some synchronisation here.
> > > Otherwise what would prevent compiler and/or cpu to update wrk_cmd
> > > out of order (before _init_ phase is completed)?
> > > We probably can safely assume no reordering from the compiler here,
> > > as we have function calls straight before and after 'wrk_cmd =
> WRK_CMD_RUN;'
> > > But for consistency and easier maintenance, I still think it is
> > > better to have something here, after all it is not performance critical pass.
> > Agree that this is not performance critical.
> >
> > This is more about correctness (as usually people refer to code to
> > understand the concepts). You can refer to video [1]. Essentially, the
> > pthread_create has 'happens-before' behavior. i.e. all the memory
> > operations before the pthread_create are visible to the new thread.
> > The
> > rte_smp_rmb() barrier in the thread function is not required as it reads the
> data that was set before the thread was launched.
> 
> rte_eal_remote_launch() doesn't call pthread_create().
> All it does -  updates global variable (lcore_config) and writes/reads to/from
> the pipe.
> 
Thanks for the reminder ☹
I think rte_eal_remote_launch and rte_eal_wait_lcore need to provide behavior similar to pthread_launch and pthread_join respectively.

There is use of rte_smp_*mb in those functions as well. Those need to be fixed first and then look at these.

> >
> > I do not know why rte_smp_wmb is required. The update to 'wrk_cmd' is
> > seen by the thread eventually. rte_smp_wmb does not result in update
> being seen sooner/immediately.
> 
> We don't need it sooner.
> We need to make sure it wouldn't be seen by any worker thread before all
> workers are launched.
> To make sure all workers start the test at approximately same moment.
> That's why I think wmb() should be before 'wrk_cmd = WRK_CMD_RUN;' in
> my original code.
> 
> >
> > [1]
> >
> https://www.youtube.com/watch?t=4170&v=KeLBd2EJLOU&feature=youtu.
> be
> > >
> > > > For the rmb in test_worker, the parameters have been prepared when
> > > > worker lcores call "test_worker". It is unnessary to wait wrk_cmd
> > > > to be loaded, then the parameters can be loaded, So the rmb can be
> > > removed.
> > >
> > > It is not only about parameters loading,  it is to prevent worker
> > > core to start too early.
> > Because 'pthread_launch' provides the 'happens-before' behavior, the
> > worker core will see the updates that happened before the worker was
> launched.
> >
> > I suggest changing the commit log to provide the reasoning around
> pthread_create.
> >
> > >
> > > As I understand, your goal is to get rid of rte_smp_*() calls.
> > > Might be better to replace such places here with _atomic_ semantics.
> > > Then, as I can see, we also can get rid of 'volatile' fo wrk_cmd.
> > >
> > > > In the meanwhile, fix a typo. The note above storing "stop" into
> > > > "wrk_cmd" should be "stop test" rather than "start test".
> > > >
> > > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > ---
> > > >  app/test/test_ring_stress_impl.h | 5 +----
> > > >  1 file changed, 1 insertion(+), 4 deletions(-)
> > > >
> > > > diff --git a/app/test/test_ring_stress_impl.h
> > > > b/app/test/test_ring_stress_impl.h
> > > > index f9ca63b90..384555ef9 100644
> > > > --- a/app/test/test_ring_stress_impl.h
> > > > +++ b/app/test/test_ring_stress_impl.h
> > > > @@ -198,7 +198,6 @@ test_worker(void *arg, const char *fname,
> > > > int32_t
> > > prcs)
> > > >  	fill_ring_elm(&loc_elm, lc);
> > > >
> > > >  	while (wrk_cmd != WRK_CMD_RUN) {
> > > > -		rte_smp_rmb();
> > > >  		rte_pause();
> > > >  	}
> > > >
> > > > @@ -357,13 +356,11 @@ test_mt1(int (*test)(void *))
> > > >
> > > >  	/* signal worker to start test */
> > > >  	wrk_cmd = WRK_CMD_RUN;
> > > > -	rte_smp_wmb();
> > > >
> > > >  	usleep(run_time * US_PER_S);
> > > >
> > > > -	/* signal worker to start test */
> > > > +	/* signal worker to stop test */
> > > >  	wrk_cmd = WRK_CMD_STOP;
> > > > -	rte_smp_wmb();
> > > >
> > > >  	/* wait for workers and collect stats. */
> > > >  	mc = rte_lcore_id();
> > > > --
> > > > 2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
  2021-01-29  3:17         ` Honnappa Nagarahalli
@ 2021-01-29  4:58           ` Stephen Hemminger
  2021-01-30  1:24             ` Honnappa Nagarahalli
  0 siblings, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2021-01-29  4:58 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Ananyev, Konstantin, Feifei Wang, dev, nd, Ruifeng Wang

On Fri, 29 Jan 2021 03:17:50 +0000
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:

> <snip>
> 
> >   
> > > >
> > > > Hi Feifei,
> > > >  
> > > > >
> > > > > The variable "wrk_cmd" is a signal to control threads from running
> > > > > and stopping. When worker lcores load "wrk_cmd ==  
> > WRK_CMD_RUN",  
> > > > > they  
> > > > start  
> > > > > running and when worker lcores load "wrk_cmd == WRK_CMD_STOP",  
> > > > they  
> > > > > stop.
> > > > >
> > > > > For the wmb in test_mt1, no storing operations must keep the order
> > > > > after storing "wrk_cmd". Thus the wmb is unnecessary.  
> > > >
> > > > I think there is a bug in my original code, we should do smp_wmb()
> > > > *before* setting wrk_cmd, not after:
> > > >
> > > >         /* launch on all workers */
> > > >         RTE_LCORE_FOREACH_WORKER(lc) {
> > > >                 arg[lc].rng = r;
> > > >                 arg[lc].stats = init_stat;
> > > >                 rte_eal_remote_launch(test, &arg[lc], lc);
> > > >         }
> > > >
> > > >         /* signal worker to start test */
> > > > +      rte_smp_wmb();
> > > >         wrk_cmd = WRK_CMD_RUN;
> > > > -       rte_smp_wmb();
> > > >
> > > >         usleep(run_time * US_PER_S);
> > > >
> > > >
> > > > I still think we'd better have some synchronisation here.
> > > > Otherwise what would prevent compiler and/or cpu to update wrk_cmd
> > > > out of order (before _init_ phase is completed)?
> > > > We probably can safely assume no reordering from the compiler here,
> > > > as we have function calls straight before and after 'wrk_cmd =  
> > WRK_CMD_RUN;'  
> > > > But for consistency and easier maintenance, I still think it is
> > > > better to have something here, after all it is not performance critical pass.  
> > > Agree that this is not performance critical.
> > >
> > > This is more about correctness (as usually people refer to code to
> > > understand the concepts). You can refer to video [1]. Essentially, the
> > > pthread_create has 'happens-before' behavior. i.e. all the memory
> > > operations before the pthread_create are visible to the new thread.
> > > The
> > > rte_smp_rmb() barrier in the thread function is not required as it reads the  
> > data that was set before the thread was launched.
> > 
> > rte_eal_remote_launch() doesn't call pthread_create().
> > All it does -  updates global variable (lcore_config) and writes/reads to/from
> > the pipe.
> >   
> Thanks for the reminder ☹
> I think rte_eal_remote_launch and rte_eal_wait_lcore need to provide behavior similar to pthread_launch and pthread_join respectively.
> 
> There is use of rte_smp_*mb in those functions as well. Those need to be fixed first and then look at these.

Looks like you want __atomic_thread_fence() here.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
  2021-01-29  4:58           ` Stephen Hemminger
@ 2021-01-30  1:24             ` Honnappa Nagarahalli
  2021-02-01  8:37               ` [dpdk-dev] 回复: " Feifei Wang
  2021-02-01  8:48               ` [dpdk-dev] 回复: " Feifei Wang
  0 siblings, 2 replies; 19+ messages in thread
From: Honnappa Nagarahalli @ 2021-01-30  1:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Ananyev, Konstantin, Feifei Wang, dev, nd, Ruifeng Wang,
	Honnappa Nagarahalli, nd

<snip>

> >
> > >
> > > > >
> > > > > Hi Feifei,
> > > > >
> > > > > >
> > > > > > The variable "wrk_cmd" is a signal to control threads from
> > > > > > running and stopping. When worker lcores load "wrk_cmd ==
> > > WRK_CMD_RUN",
> > > > > > they
> > > > > start
> > > > > > running and when worker lcores load "wrk_cmd == WRK_CMD_STOP",
> > > > > they
> > > > > > stop.
> > > > > >
> > > > > > For the wmb in test_mt1, no storing operations must keep the
> > > > > > order after storing "wrk_cmd". Thus the wmb is unnecessary.
> > > > >
> > > > > I think there is a bug in my original code, we should do
> > > > > smp_wmb()
> > > > > *before* setting wrk_cmd, not after:
> > > > >
> > > > >         /* launch on all workers */
> > > > >         RTE_LCORE_FOREACH_WORKER(lc) {
> > > > >                 arg[lc].rng = r;
> > > > >                 arg[lc].stats = init_stat;
> > > > >                 rte_eal_remote_launch(test, &arg[lc], lc);
> > > > >         }
> > > > >
> > > > >         /* signal worker to start test */
> > > > > +      rte_smp_wmb();
> > > > >         wrk_cmd = WRK_CMD_RUN;
> > > > > -       rte_smp_wmb();
> > > > >
> > > > >         usleep(run_time * US_PER_S);
> > > > >
> > > > >
> > > > > I still think we'd better have some synchronisation here.
> > > > > Otherwise what would prevent compiler and/or cpu to update
> > > > > wrk_cmd out of order (before _init_ phase is completed)?
> > > > > We probably can safely assume no reordering from the compiler
> > > > > here, as we have function calls straight before and after
> > > > > 'wrk_cmd =
> > > WRK_CMD_RUN;'
> > > > > But for consistency and easier maintenance, I still think it is
> > > > > better to have something here, after all it is not performance critical
> pass.
> > > > Agree that this is not performance critical.
> > > >
> > > > This is more about correctness (as usually people refer to code to
> > > > understand the concepts). You can refer to video [1]. Essentially,
> > > > the pthread_create has 'happens-before' behavior. i.e. all the
> > > > memory operations before the pthread_create are visible to the new
> thread.
> > > > The
> > > > rte_smp_rmb() barrier in the thread function is not required as it
> > > > reads the
> > > data that was set before the thread was launched.
> > >
> > > rte_eal_remote_launch() doesn't call pthread_create().
> > > All it does -  updates global variable (lcore_config) and
> > > writes/reads to/from the pipe.
> > >
> > Thanks for the reminder ☹
> > I think rte_eal_remote_launch and rte_eal_wait_lcore need to provide
> behavior similar to pthread_launch and pthread_join respectively.
> >
> > There is use of rte_smp_*mb in those functions as well. Those need to be fixed
> first and then look at these.
> 
> Looks like you want __atomic_thread_fence() here.
> 
In the rte_eal_remote_launch case, all the memory operations before the API call need to be visible to the worker. If this is the only requirement, we can use the function pointer as the guard variable and use store-release. In the eal_thread_loop function we could do load-acquire on the function pointer.

I do not think that there is a requirement to ensure that the memory operations after the API call do not happen before the worker thread starts running the function (As there is no guarantee on when the worker thread will run. If the main thread needs to know if the worker thread is running explicit hand-shaking needs to happen).

The rte_eal_wait_lcore API needs to ensure that the memory operations in the worker are visible to the main. rte_eal_wait_lcore and eal_thread_loop are synchronizing using lcore_config[worker_id].state. I need to understand what else 'state' is used for. If there are no issues, we can do a store-release on 'state' in eal_thread_loop and a load-acquire in rte_eal_wait_lcore.

So, we do not have to use the __atomic_thread_fence.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [dpdk-dev] 回复:  [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
  2021-01-30  1:24             ` Honnappa Nagarahalli
@ 2021-02-01  8:37               ` Feifei Wang
  2021-02-01 13:50                 ` [dpdk-dev] " Ananyev, Konstantin
  2021-02-01  8:48               ` [dpdk-dev] 回复: " Feifei Wang
  1 sibling, 1 reply; 19+ messages in thread
From: Feifei Wang @ 2021-02-01  8:37 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Stephen Hemminger
  Cc: Ananyev, Konstantin, dev, nd, Ruifeng Wang, nd, nd

Hi, Honnappa, Konstantin and Stephen

Thanks very much for your attention of this patch. Based on your opinion, Ruifeng and I discuss about this and make a summary:
____________________________________________________________________________________________________________________
						main thread						worker thread
rte_eal_remote_launch:															[ Honnappa focus ]
																		To ensure f can load correct arg,
																		arg store should before f
						lcore_config[worker_id].f = f;
						lcore_config[worker_id].arg = arg;
						wmb()? or store-relase on f?
						
										eal_thread_loop:
						pipeline_communication	---------------------->	pipeline_communication
													if (lcore_config[lcore_id].f == NULL)
													rte_panic("NULL function pointer\n");
		
													fct_arg = lcore_config[lcore_id].arg;
													ret = lcore_config[lcore_id].f(fct_arg);
___________________________________________________________________________________________________________________

test_ring_stress:				wmb()?												[ Konstantin focus ]
										test_worker:							Main thread can use wrk_cmd to
						Wrk_cmd =WRK_CMD_RUN;	---------------------->	Wrk_cmd == WRK_CMD_RUN;		control multiple threads to start running
													wmb()?					at the same time as much as possible
													ring_dequeue;
													ring_enqueue;
						Wrk_cmd =WRK_CMD_STOP;	---------------------->	Wrk_cmd == WRK_CMD_STOP;
____________________________________________________________________________________________________________________

rte_eal_wait_lcore:											wmb()					[ Honnappa focus ]
				lcore_config[lcore_id].state == FINISHED	<---------------------	lcore_config[lcore_id].state = FINISHED	Load-acquire and store-release
																		are used on the variable “state”
						rmb();
____________________________________________________________________________________________________________________
						
From the picture above,

1.First, for the underlying function rte_eal_remote_launch, Honnappa focuses on that,
pipeline_communication cannot ensure ‘arg’ parameters is loaded correctly by
the worker thread. 
This is because in weak memory order framework, maybe the main thread and worker
thread firstly finish pipeline communication, and then the worker thread receive signal
and execute the function ‘ f ’. However, it maybe load a wrong value of ‘arg’ due to that
the main thread stores ‘arg’ after pipeline communication. So wmb or store_release is
necessary for ‘arg’.

2.Second, for the upper-layer test_ring_stress, Konstantin foucese on that,
Whether the main thread can use ‘wrk_cmd’ to control multiple threads to run at the
same time as much as possible.
Because rte_eal_remote_launch only can communicates with one worker thread
at the same time. This means some worker thread maybe start working very early but other
worker threads maybe need to wait a long time to start working if  ‘wrk_cmd' is stored 'RUN' flag
before rte_remote_launch.
At last, for unit test, this may cause that the test results are not stable.

3.Third, for rte_eal_wait_lcore, Honnappa focuses on that the ‘state’ as a   synchronous bariable,
we should add load-acquire and store-release on it. However, there have been rmb and wmb
after and before ‘state’, So I’m not sure whether we should replace them.

In summary, I think Honnappa and Konstantin have different concerns.
For Honnappa, we can add wmb or store-release to ensure the ‘arg’ can be loaded correctly
in rte_eal_remote_launch.
For Konstantin, we can add wmb and rmb to ensure the main thread can control the worker
Threads to run at the same time, and then make the test results more accurate in the
ring_stress_test.


Best Regards
Feifei

> -----邮件原件-----
> 发件人: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> 发送时间: 2021年1月30日 9:24
> 收件人: Stephen Hemminger <stephen@networkplumber.org>
> 抄送: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Feifei Wang
> <Feifei.Wang2@arm.com>; dev@dpdk.org; nd <nd@arm.com>; Ruifeng
> Wang <Ruifeng.Wang@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> 主题: RE: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers
> for ring stress test
> 
> <snip>
> 
> > >
> > > >
> > > > > >
> > > > > > Hi Feifei,
> > > > > >
> > > > > > >
> > > > > > > The variable "wrk_cmd" is a signal to control threads from
> > > > > > > running and stopping. When worker lcores load "wrk_cmd ==
> > > > WRK_CMD_RUN",
> > > > > > > they
> > > > > > start
> > > > > > > running and when worker lcores load "wrk_cmd ==
> > > > > > > WRK_CMD_STOP",
> > > > > > they
> > > > > > > stop.
> > > > > > >
> > > > > > > For the wmb in test_mt1, no storing operations must keep the
> > > > > > > order after storing "wrk_cmd". Thus the wmb is unnecessary.
> > > > > >
> > > > > > I think there is a bug in my original code, we should do
> > > > > > smp_wmb()
> > > > > > *before* setting wrk_cmd, not after:
> > > > > >
> > > > > >         /* launch on all workers */
> > > > > >         RTE_LCORE_FOREACH_WORKER(lc) {
> > > > > >                 arg[lc].rng = r;
> > > > > >                 arg[lc].stats = init_stat;
> > > > > >                 rte_eal_remote_launch(test, &arg[lc], lc);
> > > > > >         }
> > > > > >
> > > > > >         /* signal worker to start test */
> > > > > > +      rte_smp_wmb();
> > > > > >         wrk_cmd = WRK_CMD_RUN;
> > > > > > -       rte_smp_wmb();
> > > > > >
> > > > > >         usleep(run_time * US_PER_S);
> > > > > >
> > > > > >
> > > > > > I still think we'd better have some synchronisation here.
> > > > > > Otherwise what would prevent compiler and/or cpu to update
> > > > > > wrk_cmd out of order (before _init_ phase is completed)?
> > > > > > We probably can safely assume no reordering from the compiler
> > > > > > here, as we have function calls straight before and after
> > > > > > 'wrk_cmd =
> > > > WRK_CMD_RUN;'
> > > > > > But for consistency and easier maintenance, I still think it
> > > > > > is better to have something here, after all it is not
> > > > > > performance critical
> > pass.
> > > > > Agree that this is not performance critical.
> > > > >
> > > > > This is more about correctness (as usually people refer to code
> > > > > to understand the concepts). You can refer to video [1].
> > > > > Essentially, the pthread_create has 'happens-before' behavior.
> > > > > i.e. all the memory operations before the pthread_create are
> > > > > visible to the new
> > thread.
> > > > > The
> > > > > rte_smp_rmb() barrier in the thread function is not required as
> > > > > it reads the
> > > > data that was set before the thread was launched.
> > > >
> > > > rte_eal_remote_launch() doesn't call pthread_create().
> > > > All it does -  updates global variable (lcore_config) and
> > > > writes/reads to/from the pipe.
> > > >
> > > Thanks for the reminder ☹
> > > I think rte_eal_remote_launch and rte_eal_wait_lcore need to provide
> > behavior similar to pthread_launch and pthread_join respectively.
> > >
> > > There is use of rte_smp_*mb in those functions as well. Those need
> > > to be fixed
> > first and then look at these.
> >
> > Looks like you want __atomic_thread_fence() here.
> >
> In the rte_eal_remote_launch case, all the memory operations before the
> API call need to be visible to the worker. If this is the only requirement, we
> can use the function pointer as the guard variable and use store-release. In
> the eal_thread_loop function we could do load-acquire on the function
> pointer.
> 
> I do not think that there is a requirement to ensure that the memory
> operations after the API call do not happen before the worker thread starts
> running the function (As there is no guarantee on when the worker thread
> will run. If the main thread needs to know if the worker thread is running
> explicit hand-shaking needs to happen).
> 
> The rte_eal_wait_lcore API needs to ensure that the memory operations in
> the worker are visible to the main. rte_eal_wait_lcore and eal_thread_loop
> are synchronizing using lcore_config[worker_id].state. I need to understand
> what else 'state' is used for. If there are no issues, we can do a store-release
> on 'state' in eal_thread_loop and a load-acquire in rte_eal_wait_lcore.
> 
> So, we do not have to use the __atomic_thread_fence.
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [dpdk-dev] 回复:  [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
  2021-01-30  1:24             ` Honnappa Nagarahalli
  2021-02-01  8:37               ` [dpdk-dev] 回复: " Feifei Wang
@ 2021-02-01  8:48               ` Feifei Wang
  2021-02-01  9:07                 ` Feifei Wang
  1 sibling, 1 reply; 19+ messages in thread
From: Feifei Wang @ 2021-02-01  8:48 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Stephen Hemminger
  Cc: Ananyev, Konstantin, dev, nd, Ruifeng Wang, nd

Sorry, a mistake happens in the picture, after Wrk_cmd == WRK_CMD_RUN, it should be a rmb rather than wmb.

> -----邮件原件-----
> 发件人: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> 发送时间: 2021年1月30日 9:24
> 收件人: Stephen Hemminger <stephen@networkplumber.org>
> 抄送: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Feifei Wang
> <Feifei.Wang2@arm.com>; dev@dpdk.org; nd <nd@arm.com>; Ruifeng
> Wang <Ruifeng.Wang@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> 主题: RE: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers
> for ring stress test
> 
> <snip>
> 
> > >
> > > >
> > > > > >
> > > > > > Hi Feifei,
> > > > > >
> > > > > > >
> > > > > > > The variable "wrk_cmd" is a signal to control threads from
> > > > > > > running and stopping. When worker lcores load "wrk_cmd ==
> > > > WRK_CMD_RUN",
> > > > > > > they
> > > > > > start
> > > > > > > running and when worker lcores load "wrk_cmd ==
> > > > > > > WRK_CMD_STOP",
> > > > > > they
> > > > > > > stop.
> > > > > > >
> > > > > > > For the wmb in test_mt1, no storing operations must keep the
> > > > > > > order after storing "wrk_cmd". Thus the wmb is unnecessary.
> > > > > >
> > > > > > I think there is a bug in my original code, we should do
> > > > > > smp_wmb()
> > > > > > *before* setting wrk_cmd, not after:
> > > > > >
> > > > > >         /* launch on all workers */
> > > > > >         RTE_LCORE_FOREACH_WORKER(lc) {
> > > > > >                 arg[lc].rng = r;
> > > > > >                 arg[lc].stats = init_stat;
> > > > > >                 rte_eal_remote_launch(test, &arg[lc], lc);
> > > > > >         }
> > > > > >
> > > > > >         /* signal worker to start test */
> > > > > > +      rte_smp_wmb();
> > > > > >         wrk_cmd = WRK_CMD_RUN;
> > > > > > -       rte_smp_wmb();
> > > > > >
> > > > > >         usleep(run_time * US_PER_S);
> > > > > >
> > > > > >
> > > > > > I still think we'd better have some synchronisation here.
> > > > > > Otherwise what would prevent compiler and/or cpu to update
> > > > > > wrk_cmd out of order (before _init_ phase is completed)?
> > > > > > We probably can safely assume no reordering from the compiler
> > > > > > here, as we have function calls straight before and after
> > > > > > 'wrk_cmd =
> > > > WRK_CMD_RUN;'
> > > > > > But for consistency and easier maintenance, I still think it
> > > > > > is better to have something here, after all it is not
> > > > > > performance critical
> > pass.
> > > > > Agree that this is not performance critical.
> > > > >
> > > > > This is more about correctness (as usually people refer to code
> > > > > to understand the concepts). You can refer to video [1].
> > > > > Essentially, the pthread_create has 'happens-before' behavior.
> > > > > i.e. all the memory operations before the pthread_create are
> > > > > visible to the new
> > thread.
> > > > > The
> > > > > rte_smp_rmb() barrier in the thread function is not required as
> > > > > it reads the
> > > > data that was set before the thread was launched.
> > > >
> > > > rte_eal_remote_launch() doesn't call pthread_create().
> > > > All it does -  updates global variable (lcore_config) and
> > > > writes/reads to/from the pipe.
> > > >
> > > Thanks for the reminder ☹
> > > I think rte_eal_remote_launch and rte_eal_wait_lcore need to provide
> > behavior similar to pthread_launch and pthread_join respectively.
> > >
> > > There is use of rte_smp_*mb in those functions as well. Those need
> > > to be fixed
> > first and then look at these.
> >
> > Looks like you want __atomic_thread_fence() here.
> >
> In the rte_eal_remote_launch case, all the memory operations before the
> API call need to be visible to the worker. If this is the only requirement, we
> can use the function pointer as the guard variable and use store-release. In
> the eal_thread_loop function we could do load-acquire on the function
> pointer.
> 
> I do not think that there is a requirement to ensure that the memory
> operations after the API call do not happen before the worker thread starts
> running the function (As there is no guarantee on when the worker thread
> will run. If the main thread needs to know if the worker thread is running
> explicit hand-shaking needs to happen).
> 
> The rte_eal_wait_lcore API needs to ensure that the memory operations in
> the worker are visible to the main. rte_eal_wait_lcore and eal_thread_loop
> are synchronizing using lcore_config[worker_id].state. I need to understand
> what else 'state' is used for. If there are no issues, we can do a store-release
> on 'state' in eal_thread_loop and a load-acquire in rte_eal_wait_lcore.
> 
> So, we do not have to use the __atomic_thread_fence.
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [dpdk-dev] 回复:  [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
  2021-02-01  8:48               ` [dpdk-dev] 回复: " Feifei Wang
@ 2021-02-01  9:07                 ` Feifei Wang
  0 siblings, 0 replies; 19+ messages in thread
From: Feifei Wang @ 2021-02-01  9:07 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Stephen Hemminger
  Cc: Ananyev, Konstantin, dev, nd, Ruifeng Wang, nd, nd

Hi, everyone

Sorry for that there may be a problem in the e-mail format. 
Please see the picture according to the following link:
https://ibb.co/SQ7yGfW

Best Regards
Feifei

> -----邮件原件-----
> 发件人: Feifei Wang
> 发送时间: 2021年2月1日 16:49
> 收件人: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Stephen
> Hemminger <stephen@networkplumber.org>
> 抄送: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> dev@dpdk.org; nd <nd@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> 主题: 回复: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary
> barriers for ring stress test
> 
> Sorry, a mistake happens in the picture, after Wrk_cmd == WRK_CMD_RUN,
> it should be a rmb rather than wmb.
> 
> > -----邮件原件-----
> > 发件人: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > 发送时间: 2021年1月30日 9:24
> > 收件人: Stephen Hemminger <stephen@networkplumber.org>
> > 抄送: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Feifei Wang
> > <Feifei.Wang2@arm.com>; dev@dpdk.org; nd <nd@arm.com>; Ruifeng
> Wang
> > <Ruifeng.Wang@arm.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> > 主题: RE: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary
> > barriers for ring stress test
> >
> > <snip>
> >
> > > >
> > > > >
> > > > > > >
> > > > > > > Hi Feifei,
> > > > > > >
> > > > > > > >
> > > > > > > > The variable "wrk_cmd" is a signal to control threads from
> > > > > > > > running and stopping. When worker lcores load "wrk_cmd ==
> > > > > WRK_CMD_RUN",
> > > > > > > > they
> > > > > > > start
> > > > > > > > running and when worker lcores load "wrk_cmd ==
> > > > > > > > WRK_CMD_STOP",
> > > > > > > they
> > > > > > > > stop.
> > > > > > > >
> > > > > > > > For the wmb in test_mt1, no storing operations must keep
> > > > > > > > the order after storing "wrk_cmd". Thus the wmb is
> unnecessary.
> > > > > > >
> > > > > > > I think there is a bug in my original code, we should do
> > > > > > > smp_wmb()
> > > > > > > *before* setting wrk_cmd, not after:
> > > > > > >
> > > > > > >         /* launch on all workers */
> > > > > > >         RTE_LCORE_FOREACH_WORKER(lc) {
> > > > > > >                 arg[lc].rng = r;
> > > > > > >                 arg[lc].stats = init_stat;
> > > > > > >                 rte_eal_remote_launch(test, &arg[lc], lc);
> > > > > > >         }
> > > > > > >
> > > > > > >         /* signal worker to start test */
> > > > > > > +      rte_smp_wmb();
> > > > > > >         wrk_cmd = WRK_CMD_RUN;
> > > > > > > -       rte_smp_wmb();
> > > > > > >
> > > > > > >         usleep(run_time * US_PER_S);
> > > > > > >
> > > > > > >
> > > > > > > I still think we'd better have some synchronisation here.
> > > > > > > Otherwise what would prevent compiler and/or cpu to update
> > > > > > > wrk_cmd out of order (before _init_ phase is completed)?
> > > > > > > We probably can safely assume no reordering from the
> > > > > > > compiler here, as we have function calls straight before and
> > > > > > > after 'wrk_cmd =
> > > > > WRK_CMD_RUN;'
> > > > > > > But for consistency and easier maintenance, I still think it
> > > > > > > is better to have something here, after all it is not
> > > > > > > performance critical
> > > pass.
> > > > > > Agree that this is not performance critical.
> > > > > >
> > > > > > This is more about correctness (as usually people refer to
> > > > > > code to understand the concepts). You can refer to video [1].
> > > > > > Essentially, the pthread_create has 'happens-before' behavior.
> > > > > > i.e. all the memory operations before the pthread_create are
> > > > > > visible to the new
> > > thread.
> > > > > > The
> > > > > > rte_smp_rmb() barrier in the thread function is not required
> > > > > > as it reads the
> > > > > data that was set before the thread was launched.
> > > > >
> > > > > rte_eal_remote_launch() doesn't call pthread_create().
> > > > > All it does -  updates global variable (lcore_config) and
> > > > > writes/reads to/from the pipe.
> > > > >
> > > > Thanks for the reminder ☹
> > > > I think rte_eal_remote_launch and rte_eal_wait_lcore need to
> > > > provide
> > > behavior similar to pthread_launch and pthread_join respectively.
> > > >
> > > > There is use of rte_smp_*mb in those functions as well. Those need
> > > > to be fixed
> > > first and then look at these.
> > >
> > > Looks like you want __atomic_thread_fence() here.
> > >
> > In the rte_eal_remote_launch case, all the memory operations before
> > the API call need to be visible to the worker. If this is the only
> > requirement, we can use the function pointer as the guard variable and
> > use store-release. In the eal_thread_loop function we could do
> > load-acquire on the function pointer.
> >
> > I do not think that there is a requirement to ensure that the memory
> > operations after the API call do not happen before the worker thread
> > starts running the function (As there is no guarantee on when the
> > worker thread will run. If the main thread needs to know if the worker
> > thread is running explicit hand-shaking needs to happen).
> >
> > The rte_eal_wait_lcore API needs to ensure that the memory operations
> > in the worker are visible to the main. rte_eal_wait_lcore and
> > eal_thread_loop are synchronizing using lcore_config[worker_id].state.
> > I need to understand what else 'state' is used for. If there are no
> > issues, we can do a store-release on 'state' in eal_thread_loop and a load-
> acquire in rte_eal_wait_lcore.
> >
> > So, we do not have to use the __atomic_thread_fence.
> >


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
  2021-02-01  8:37               ` [dpdk-dev] 回复: " Feifei Wang
@ 2021-02-01 13:50                 ` Ananyev, Konstantin
  2021-02-03 16:24                   ` Honnappa Nagarahalli
  0 siblings, 1 reply; 19+ messages in thread
From: Ananyev, Konstantin @ 2021-02-01 13:50 UTC (permalink / raw)
  To: Feifei Wang, Honnappa Nagarahalli, Stephen Hemminger
  Cc: dev, nd, Ruifeng Wang, nd, nd

Hi Feifei,

> 
> Hi, Honnappa, Konstantin and Stephen
> 
> Thanks very much for your attention of this patch. Based on your opinion, Ruifeng and I discuss about this and make a summary:
> _________________________________________________________________________________________________________________
> ___
> 						main thread						worker thread
> rte_eal_remote_launch:
> 	[ Honnappa focus ]
> 
> 	To ensure f can load correct arg,
> 
> 	arg store should before f
> 						lcore_config[worker_id].f = f;
> 						lcore_config[worker_id].arg = arg;
> 						wmb()? or store-relase on f?
> 
> 										eal_thread_loop:
> 						pipeline_communication	---------------------->	pipeline_communication
> 													if (lcore_config[lcore_id].f ==
> NULL)
> 													rte_panic("NULL function
> pointer\n");
> 
> 													fct_arg =
> lcore_config[lcore_id].arg;
> 													ret =
> lcore_config[lcore_id].f(fct_arg);
> _________________________________________________________________________________________________________________
> __
> 
> test_ring_stress:				wmb()?
> 	[ Konstantin focus ]
> 										test_worker:
> 	Main thread can use wrk_cmd to
> 						Wrk_cmd =WRK_CMD_RUN;	---------------------->	Wrk_cmd == WRK_CMD_RUN;
> 	control multiple threads to start running
> 													wmb()?
> 	at the same time as much as possible
> 													ring_dequeue;
> 													ring_enqueue;
> 						Wrk_cmd =WRK_CMD_STOP;	---------------------->	Wrk_cmd == WRK_CMD_STOP;
> _________________________________________________________________________________________________________________
> ___
> 
> rte_eal_wait_lcore:											wmb()
> 	[ Honnappa focus ]
> 				lcore_config[lcore_id].state == FINISHED	<---------------------	lcore_config[lcore_id].state =
> FINISHED	Load-acquire and store-release
> 
> 	are used on the variable “state”
> 						rmb();
> _________________________________________________________________________________________________________________
> ___
> 
> From the picture above,
> 
> 1.First, for the underlying function rte_eal_remote_launch, Honnappa focuses on that,
> pipeline_communication cannot ensure ‘arg’ parameters is loaded correctly by
> the worker thread.
> This is because in weak memory order framework, maybe the main thread and worker
> thread firstly finish pipeline communication, and then the worker thread receive signal
> and execute the function ‘ f ’. However, it maybe load a wrong value of ‘arg’ due to that
> the main thread stores ‘arg’ after pipeline communication. So wmb or store_release is
> necessary for ‘arg’.
> 
> 2.Second, for the upper-layer test_ring_stress, Konstantin foucese on that,
> Whether the main thread can use ‘wrk_cmd’ to control multiple threads to run at the
> same time as much as possible.
> Because rte_eal_remote_launch only can communicates with one worker thread
> at the same time. This means some worker thread maybe start working very early but other
> worker threads maybe need to wait a long time to start working if  ‘wrk_cmd' is stored 'RUN' flag
> before rte_remote_launch.
> At last, for unit test, this may cause that the test results are not stable.
> 
> 3.Third, for rte_eal_wait_lcore, Honnappa focuses on that the ‘state’ as a   synchronous bariable,
> we should add load-acquire and store-release on it. However, there have been rmb and wmb
> after and before ‘state’, So I’m not sure whether we should replace them.
> 
> In summary, I think Honnappa and Konstantin have different concerns.
> For Honnappa, we can add wmb or store-release to ensure the ‘arg’ can be loaded correctly
> in rte_eal_remote_launch.
> For Konstantin, we can add wmb and rmb to ensure the main thread can control the worker
> Threads to run at the same time, and then make the test results more accurate in the
> ring_stress_test.

Agree with both.

> 
> 
> Best Regards
> Feifei
> 
> > -----邮件原件-----
> > 发件人: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > 发送时间: 2021年1月30日 9:24
> > 收件人: Stephen Hemminger <stephen@networkplumber.org>
> > 抄送: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Feifei Wang
> > <Feifei.Wang2@arm.com>; dev@dpdk.org; nd <nd@arm.com>; Ruifeng
> > Wang <Ruifeng.Wang@arm.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> > 主题: RE: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers
> > for ring stress test
> >
> > <snip>
> >
> > > >
> > > > >
> > > > > > >
> > > > > > > Hi Feifei,
> > > > > > >
> > > > > > > >
> > > > > > > > The variable "wrk_cmd" is a signal to control threads from
> > > > > > > > running and stopping. When worker lcores load "wrk_cmd ==
> > > > > WRK_CMD_RUN",
> > > > > > > > they
> > > > > > > start
> > > > > > > > running and when worker lcores load "wrk_cmd ==
> > > > > > > > WRK_CMD_STOP",
> > > > > > > they
> > > > > > > > stop.
> > > > > > > >
> > > > > > > > For the wmb in test_mt1, no storing operations must keep the
> > > > > > > > order after storing "wrk_cmd". Thus the wmb is unnecessary.
> > > > > > >
> > > > > > > I think there is a bug in my original code, we should do
> > > > > > > smp_wmb()
> > > > > > > *before* setting wrk_cmd, not after:
> > > > > > >
> > > > > > >         /* launch on all workers */
> > > > > > >         RTE_LCORE_FOREACH_WORKER(lc) {
> > > > > > >                 arg[lc].rng = r;
> > > > > > >                 arg[lc].stats = init_stat;
> > > > > > >                 rte_eal_remote_launch(test, &arg[lc], lc);
> > > > > > >         }
> > > > > > >
> > > > > > >         /* signal worker to start test */
> > > > > > > +      rte_smp_wmb();
> > > > > > >         wrk_cmd = WRK_CMD_RUN;
> > > > > > > -       rte_smp_wmb();
> > > > > > >
> > > > > > >         usleep(run_time * US_PER_S);
> > > > > > >
> > > > > > >
> > > > > > > I still think we'd better have some synchronisation here.
> > > > > > > Otherwise what would prevent compiler and/or cpu to update
> > > > > > > wrk_cmd out of order (before _init_ phase is completed)?
> > > > > > > We probably can safely assume no reordering from the compiler
> > > > > > > here, as we have function calls straight before and after
> > > > > > > 'wrk_cmd =
> > > > > WRK_CMD_RUN;'
> > > > > > > But for consistency and easier maintenance, I still think it
> > > > > > > is better to have something here, after all it is not
> > > > > > > performance critical
> > > pass.
> > > > > > Agree that this is not performance critical.
> > > > > >
> > > > > > This is more about correctness (as usually people refer to code
> > > > > > to understand the concepts). You can refer to video [1].
> > > > > > Essentially, the pthread_create has 'happens-before' behavior.
> > > > > > i.e. all the memory operations before the pthread_create are
> > > > > > visible to the new
> > > thread.
> > > > > > The
> > > > > > rte_smp_rmb() barrier in the thread function is not required as
> > > > > > it reads the
> > > > > data that was set before the thread was launched.
> > > > >
> > > > > rte_eal_remote_launch() doesn't call pthread_create().
> > > > > All it does -  updates global variable (lcore_config) and
> > > > > writes/reads to/from the pipe.
> > > > >
> > > > Thanks for the reminder ☹
> > > > I think rte_eal_remote_launch and rte_eal_wait_lcore need to provide
> > > behavior similar to pthread_launch and pthread_join respectively.
> > > >
> > > > There is use of rte_smp_*mb in those functions as well. Those need
> > > > to be fixed
> > > first and then look at these.
> > >
> > > Looks like you want __atomic_thread_fence() here.
> > >
> > In the rte_eal_remote_launch case, all the memory operations before the
> > API call need to be visible to the worker. If this is the only requirement, we
> > can use the function pointer as the guard variable and use store-release. In
> > the eal_thread_loop function we could do load-acquire on the function
> > pointer.
> >
> > I do not think that there is a requirement to ensure that the memory
> > operations after the API call do not happen before the worker thread starts
> > running the function (As there is no guarantee on when the worker thread
> > will run. If the main thread needs to know if the worker thread is running
> > explicit hand-shaking needs to happen).
> >
> > The rte_eal_wait_lcore API needs to ensure that the memory operations in
> > the worker are visible to the main. rte_eal_wait_lcore and eal_thread_loop
> > are synchronizing using lcore_config[worker_id].state. I need to understand
> > what else 'state' is used for. If there are no issues, we can do a store-release
> > on 'state' in eal_thread_loop and a load-acquire in rte_eal_wait_lcore.
> >
> > So, we do not have to use the __atomic_thread_fence.
> >


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
  2021-02-01 13:50                 ` [dpdk-dev] " Ananyev, Konstantin
@ 2021-02-03 16:24                   ` Honnappa Nagarahalli
  0 siblings, 0 replies; 19+ messages in thread
From: Honnappa Nagarahalli @ 2021-02-03 16:24 UTC (permalink / raw)
  To: Ananyev, Konstantin, Feifei Wang, Stephen Hemminger
  Cc: dev, nd, Ruifeng Wang, Honnappa Nagarahalli, nd

<snip>

> 
> Hi Feifei,
> 
> >
> > Hi, Honnappa, Konstantin and Stephen
> >
> > Thanks very much for your attention of this patch. Based on your
> > opinion, Ruifeng and I discuss about this and make a summary:
> >
> __________________________________________________________
> ____________
> > ___________________________________________
> > ___
> > 						main thread
> 				worker thread
> > rte_eal_remote_launch:
> > 	[ Honnappa focus ]
> >
> > 	To ensure f can load correct arg,
> >
> > 	arg store should before f
> > 						lcore_config[worker_id].f = f;
> > 						lcore_config[worker_id].arg =
> arg;
> > 						wmb()? or store-relase on f?
> >
> >
> 	eal_thread_loop:
> > 						pipeline_communication
> 	---------------------->	pipeline_communication
> >
> 				if (lcore_config[lcore_id].f ==
> > NULL)
> >
> 				rte_panic("NULL function
> > pointer\n");
> >
> >
> 				fct_arg =
> > lcore_config[lcore_id].arg;
> >
> 				ret =
> > lcore_config[lcore_id].f(fct_arg);
> >
> __________________________________________________________
> ____________
> > ___________________________________________
> > __
> >
> > test_ring_stress:				wmb()?
> > 	[ Konstantin focus ]
> >
> 	test_worker:
> > 	Main thread can use wrk_cmd to
> > 						Wrk_cmd =WRK_CMD_RUN;
> 	---------------------->	Wrk_cmd == WRK_CMD_RUN;
> > 	control multiple threads to start running
> >
> 				wmb()?
> > 	at the same time as much as possible
> >
> 				ring_dequeue;
> >
> 				ring_enqueue;
> > 						Wrk_cmd =WRK_CMD_STOP;
> 	---------------------->	Wrk_cmd == WRK_CMD_STOP;
> >
> __________________________________________________________
> ____________
> > ___________________________________________
> > ___
> >
> > rte_eal_wait_lcore:
> 				wmb()
> > 	[ Honnappa focus ]
> > 				lcore_config[lcore_id].state == FINISHED
> 	<---------------------	lcore_config[lcore_id].state =
> > FINISHED	Load-acquire and store-release
> >
> > 	are used on the variable “state”
> > 						rmb();
> >
> __________________________________________________________
> ____________
> > ___________________________________________
> > ___
> >
> > From the picture above,
> >
> > 1.First, for the underlying function rte_eal_remote_launch, Honnappa
> > focuses on that, pipeline_communication cannot ensure ‘arg’ parameters
> > is loaded correctly by the worker thread.
> > This is because in weak memory order framework, maybe the main thread
> > and worker thread firstly finish pipeline communication, and then the
> > worker thread receive signal and execute the function ‘ f ’. However,
> > it maybe load a wrong value of ‘arg’ due to that the main thread
> > stores ‘arg’ after pipeline communication. So wmb or store_release is
> necessary for ‘arg’.
> >
> > 2.Second, for the upper-layer test_ring_stress, Konstantin foucese on
> > that, Whether the main thread can use ‘wrk_cmd’ to control multiple
> > threads to run at the same time as much as possible.
> > Because rte_eal_remote_launch only can communicates with one worker
> > thread at the same time. This means some worker thread maybe start
> > working very early but other worker threads maybe need to wait a long
> > time to start working if  ‘wrk_cmd' is stored 'RUN' flag before
> rte_remote_launch.
> > At last, for unit test, this may cause that the test results are not stable.
> >
> > 3.Third, for rte_eal_wait_lcore, Honnappa focuses on that the ‘state’ as a
> synchronous bariable,
> > we should add load-acquire and store-release on it. However, there
> > have been rmb and wmb after and before ‘state’, So I’m not sure whether
> we should replace them.
> >
> > In summary, I think Honnappa and Konstantin have different concerns.
> > For Honnappa, we can add wmb or store-release to ensure the ‘arg’ can
> > be loaded correctly in rte_eal_remote_launch.
> > For Konstantin, we can add wmb and rmb to ensure the main thread can
> > control the worker Threads to run at the same time, and then make the
> > test results more accurate in the ring_stress_test.
> 
> Agree with both.
Thanks Feifei, understood. I am just trying to take a step back and see what kind of ordering guarantees rte_eal_remote_launch should provide so that we do not have to deal with adding additional barriers in the applications. For ex: if we can avoid the barriers around 'wrk_cmd' (kind of use cases) it will benefit all the applications.

> 
> >
> >
> > Best Regards
> > Feifei
> >
> > > -----邮件原件-----
> > > 发件人: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > 发送时间: 2021年1月30日 9:24
> > > 收件人: Stephen Hemminger <stephen@networkplumber.org>
> > > 抄送: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Feifei
> Wang
> > > <Feifei.Wang2@arm.com>; dev@dpdk.org; nd <nd@arm.com>; Ruifeng
> Wang
> > > <Ruifeng.Wang@arm.com>; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> > > 主题: RE: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary
> > > barriers for ring stress test
> > >
> > > <snip>
> > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > Hi Feifei,
> > > > > > > >
> > > > > > > > >
> > > > > > > > > The variable "wrk_cmd" is a signal to control threads
> > > > > > > > > from running and stopping. When worker lcores load
> > > > > > > > > "wrk_cmd ==
> > > > > > WRK_CMD_RUN",
> > > > > > > > > they
> > > > > > > > start
> > > > > > > > > running and when worker lcores load "wrk_cmd ==
> > > > > > > > > WRK_CMD_STOP",
> > > > > > > > they
> > > > > > > > > stop.
> > > > > > > > >
> > > > > > > > > For the wmb in test_mt1, no storing operations must keep
> > > > > > > > > the order after storing "wrk_cmd". Thus the wmb is
> unnecessary.
> > > > > > > >
> > > > > > > > I think there is a bug in my original code, we should do
> > > > > > > > smp_wmb()
> > > > > > > > *before* setting wrk_cmd, not after:
> > > > > > > >
> > > > > > > >         /* launch on all workers */
> > > > > > > >         RTE_LCORE_FOREACH_WORKER(lc) {
> > > > > > > >                 arg[lc].rng = r;
> > > > > > > >                 arg[lc].stats = init_stat;
> > > > > > > >                 rte_eal_remote_launch(test, &arg[lc], lc);
> > > > > > > >         }
> > > > > > > >
> > > > > > > >         /* signal worker to start test */
> > > > > > > > +      rte_smp_wmb();
> > > > > > > >         wrk_cmd = WRK_CMD_RUN;
> > > > > > > > -       rte_smp_wmb();
> > > > > > > >
> > > > > > > >         usleep(run_time * US_PER_S);
> > > > > > > >
> > > > > > > >
> > > > > > > > I still think we'd better have some synchronisation here.
> > > > > > > > Otherwise what would prevent compiler and/or cpu to update
> > > > > > > > wrk_cmd out of order (before _init_ phase is completed)?
> > > > > > > > We probably can safely assume no reordering from the
> > > > > > > > compiler here, as we have function calls straight before
> > > > > > > > and after 'wrk_cmd =
> > > > > > WRK_CMD_RUN;'
> > > > > > > > But for consistency and easier maintenance, I still think
> > > > > > > > it is better to have something here, after all it is not
> > > > > > > > performance critical
> > > > pass.
> > > > > > > Agree that this is not performance critical.
> > > > > > >
> > > > > > > This is more about correctness (as usually people refer to
> > > > > > > code to understand the concepts). You can refer to video [1].
> > > > > > > Essentially, the pthread_create has 'happens-before' behavior.
> > > > > > > i.e. all the memory operations before the pthread_create are
> > > > > > > visible to the new
> > > > thread.
> > > > > > > The
> > > > > > > rte_smp_rmb() barrier in the thread function is not required
> > > > > > > as it reads the
> > > > > > data that was set before the thread was launched.
> > > > > >
> > > > > > rte_eal_remote_launch() doesn't call pthread_create().
> > > > > > All it does -  updates global variable (lcore_config) and
> > > > > > writes/reads to/from the pipe.
> > > > > >
> > > > > Thanks for the reminder ☹
> > > > > I think rte_eal_remote_launch and rte_eal_wait_lcore need to
> > > > > provide
> > > > behavior similar to pthread_launch and pthread_join respectively.
> > > > >
> > > > > There is use of rte_smp_*mb in those functions as well. Those
> > > > > need to be fixed
> > > > first and then look at these.
> > > >
> > > > Looks like you want __atomic_thread_fence() here.
> > > >
> > > In the rte_eal_remote_launch case, all the memory operations before
> > > the API call need to be visible to the worker. If this is the only
> > > requirement, we can use the function pointer as the guard variable
> > > and use store-release. In the eal_thread_loop function we could do
> > > load-acquire on the function pointer.
> > >
> > > I do not think that there is a requirement to ensure that the memory
> > > operations after the API call do not happen before the worker thread
> > > starts running the function (As there is no guarantee on when the
> > > worker thread will run. If the main thread needs to know if the
> > > worker thread is running explicit hand-shaking needs to happen).
> > >
> > > The rte_eal_wait_lcore API needs to ensure that the memory
> > > operations in the worker are visible to the main. rte_eal_wait_lcore
> > > and eal_thread_loop are synchronizing using
> > > lcore_config[worker_id].state. I need to understand what else
> > > 'state' is used for. If there are no issues, we can do a store-release on
> 'state' in eal_thread_loop and a load-acquire in rte_eal_wait_lcore.
> > >
> > > So, we do not have to use the __atomic_thread_fence.
> > >


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [dpdk-dev] [PATCH v3 0/1] remove smp barriers in app/test
  2020-12-22  6:30 [dpdk-dev] [PATCH v1 0/2] remove smp barriers in app/test Feifei Wang
  2020-12-22  6:30 ` [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test Feifei Wang
  2020-12-22  6:30 ` [dpdk-dev] [PATCH v1 2/2] app/test: collect perf data after worker threads exit Feifei Wang
@ 2021-03-10  2:12 ` Feifei Wang
  2021-03-10  2:12   ` [dpdk-dev] [PATCH v3 1/1] test/trace: collect perf data after worker threads exit Feifei Wang
  2021-03-10  2:15 ` [dpdk-dev] [PATCH v3 0/1] remove smp barriers in app/test Feifei Wang
  3 siblings, 1 reply; 19+ messages in thread
From: Feifei Wang @ 2021-03-10  2:12 UTC (permalink / raw)
  Cc: dev, nd, Feifei Wang

For smp barriers in app/test, remove the unnecessary barriers
and fix some bugs.

v2:
1. remove ring stress test patch for that Honnappa will
do some changes for rte_remote_launch APIs

v3:
1. change the patch tag (Jerin)

Feifei Wang (1):
  test/trace: collect perf data after worker threads exit

 app/test/test_trace_perf.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [dpdk-dev] [PATCH v3 1/1] test/trace: collect perf data after worker threads exit
  2021-03-10  2:12 ` [dpdk-dev] [PATCH v3 0/1] remove smp barriers in app/test Feifei Wang
@ 2021-03-10  2:12   ` Feifei Wang
  0 siblings, 0 replies; 19+ messages in thread
From: Feifei Wang @ 2021-03-10  2:12 UTC (permalink / raw)
  To: Jerin Jacob, Sunil Kumar Kori, David Marchand
  Cc: dev, nd, Feifei Wang, stable, Honnappa Nagarahalli, Pavan Nikhilesh

The measure_perf function should be executed after worker threads exit
to collect correct perf data. Otherwise, while workers are running, the
main thread may get incomplete data from workers.

In the meanwhile, remove unnecessary barrier in the test.
For signal variables "ldata.done" and "ldata.start", no operations
should keep the order that being executed after them. So the wmb after
them can be moved.

Fixes: 16a277a24c9f ("test/trace: add performance test cases")
Cc: jerinj@marvell.com
Cc: stable@dpdk.org

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 app/test/test_trace_perf.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index e1ad8e6f5..46ae7d807 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -79,7 +79,6 @@ signal_workers_to_finish(struct test_data *data)
 
 	for (workers = 0; workers < data->nb_workers; workers++) {
 		data->ldata[workers].done = 1;
-		rte_smp_wmb();
 	}
 }
 
@@ -102,7 +101,6 @@ worker_fn_##func(void *arg) \
 { \
 	struct lcore_data *ldata = arg; \
 	ldata->started = 1; \
-	rte_smp_wmb(); \
 	__worker_##func(ldata); \
 	return 0; \
 }
@@ -137,11 +135,12 @@ run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
 
 	wait_till_workers_are_ready(data);
 	rte_delay_ms(100); /* Wait for some time to accumulate the stats */
-	measure_perf(str, data);
 	signal_workers_to_finish(data);
 
 	RTE_LCORE_FOREACH_WORKER(id)
 		rte_eal_wait_lcore(id);
+
+	measure_perf(str, data);
 }
 
 static int
-- 
2.25.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [dpdk-dev] [PATCH v3 0/1] remove smp barriers in app/test
  2020-12-22  6:30 [dpdk-dev] [PATCH v1 0/2] remove smp barriers in app/test Feifei Wang
                   ` (2 preceding siblings ...)
  2021-03-10  2:12 ` [dpdk-dev] [PATCH v3 0/1] remove smp barriers in app/test Feifei Wang
@ 2021-03-10  2:15 ` Feifei Wang
  2021-03-10  2:15   ` [dpdk-dev] [PATCH v3 1/1] test/trace: collect perf data after worker threads exit Feifei Wang
  3 siblings, 1 reply; 19+ messages in thread
From: Feifei Wang @ 2021-03-10  2:15 UTC (permalink / raw)
  Cc: dev, nd, Feifei Wang

For smp barriers in app/test, remove the unnecessary barriers
and fix some bugs.

v2:
1. remove ring stress test patch for that Honnappa will
do some changes for rte_remote_launch APIs

v3:
1. change the patch tag (Jerin)

Feifei Wang (1):
  test/trace: collect perf data after worker threads exit

 app/test/test_trace_perf.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [dpdk-dev] [PATCH v3 1/1] test/trace: collect perf data after worker threads exit
  2021-03-10  2:15 ` [dpdk-dev] [PATCH v3 0/1] remove smp barriers in app/test Feifei Wang
@ 2021-03-10  2:15   ` Feifei Wang
  2021-04-14 14:14     ` David Marchand
  0 siblings, 1 reply; 19+ messages in thread
From: Feifei Wang @ 2021-03-10  2:15 UTC (permalink / raw)
  To: Jerin Jacob, Sunil Kumar Kori, David Marchand
  Cc: dev, nd, Feifei Wang, stable, Honnappa Nagarahalli,
	Honnappa Nagarahalli, Ruifeng Wang, Pavan Nikhilesh

The measure_perf function should be executed after worker threads exit
to collect correct perf data. Otherwise, while workers are running, the
main thread may get incomplete data from workers.

In the meanwhile, remove unnecessary barrier in the test.
For signal variables "ldata.done" and "ldata.start", no operations
should keep the order that being executed after them. So the wmb after
them can be moved.

Fixes: 16a277a24c9f ("test/trace: add performance test cases")
Cc: jerinj@marvell.com
Cc: stable@dpdk.org

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 app/test/test_trace_perf.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index e1ad8e6f5..46ae7d807 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -79,7 +79,6 @@ signal_workers_to_finish(struct test_data *data)
 
 	for (workers = 0; workers < data->nb_workers; workers++) {
 		data->ldata[workers].done = 1;
-		rte_smp_wmb();
 	}
 }
 
@@ -102,7 +101,6 @@ worker_fn_##func(void *arg) \
 { \
 	struct lcore_data *ldata = arg; \
 	ldata->started = 1; \
-	rte_smp_wmb(); \
 	__worker_##func(ldata); \
 	return 0; \
 }
@@ -137,11 +135,12 @@ run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
 
 	wait_till_workers_are_ready(data);
 	rte_delay_ms(100); /* Wait for some time to accumulate the stats */
-	measure_perf(str, data);
 	signal_workers_to_finish(data);
 
 	RTE_LCORE_FOREACH_WORKER(id)
 		rte_eal_wait_lcore(id);
+
+	measure_perf(str, data);
 }
 
 static int
-- 
2.25.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/1] test/trace: collect perf data after worker threads exit
  2021-03-10  2:15   ` [dpdk-dev] [PATCH v3 1/1] test/trace: collect perf data after worker threads exit Feifei Wang
@ 2021-04-14 14:14     ` David Marchand
  0 siblings, 0 replies; 19+ messages in thread
From: David Marchand @ 2021-04-14 14:14 UTC (permalink / raw)
  To: Feifei Wang
  Cc: Jerin Jacob, Sunil Kumar Kori, dev, nd, dpdk stable,
	Honnappa Nagarahalli, Ruifeng Wang, Pavan Nikhilesh

On Wed, Mar 10, 2021 at 3:15 AM Feifei Wang <feifei.wang2@arm.com> wrote:
>
> The measure_perf function should be executed after worker threads exit
> to collect correct perf data. Otherwise, while workers are running, the
> main thread may get incomplete data from workers.
>
> In the meanwhile, remove unnecessary barrier in the test.
> For signal variables "ldata.done" and "ldata.start", no operations
> should keep the order that being executed after them. So the wmb after
> them can be moved.
>
> Fixes: 16a277a24c9f ("test/trace: add performance test cases")
> Cc: stable@dpdk.org
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>

Applied, thanks.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-04-14 14:15 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-22  6:30 [dpdk-dev] [PATCH v1 0/2] remove smp barriers in app/test Feifei Wang
2020-12-22  6:30 ` [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test Feifei Wang
2020-12-22 12:42   ` Ananyev, Konstantin
2021-01-27 23:00     ` Honnappa Nagarahalli
2021-01-28 14:43       ` Ananyev, Konstantin
2021-01-29  3:17         ` Honnappa Nagarahalli
2021-01-29  4:58           ` Stephen Hemminger
2021-01-30  1:24             ` Honnappa Nagarahalli
2021-02-01  8:37               ` [dpdk-dev] 回复: " Feifei Wang
2021-02-01 13:50                 ` [dpdk-dev] " Ananyev, Konstantin
2021-02-03 16:24                   ` Honnappa Nagarahalli
2021-02-01  8:48               ` [dpdk-dev] 回复: " Feifei Wang
2021-02-01  9:07                 ` Feifei Wang
2020-12-22  6:30 ` [dpdk-dev] [PATCH v1 2/2] app/test: collect perf data after worker threads exit Feifei Wang
2021-03-10  2:12 ` [dpdk-dev] [PATCH v3 0/1] remove smp barriers in app/test Feifei Wang
2021-03-10  2:12   ` [dpdk-dev] [PATCH v3 1/1] test/trace: collect perf data after worker threads exit Feifei Wang
2021-03-10  2:15 ` [dpdk-dev] [PATCH v3 0/1] remove smp barriers in app/test Feifei Wang
2021-03-10  2:15   ` [dpdk-dev] [PATCH v3 1/1] test/trace: collect perf data after worker threads exit Feifei Wang
2021-04-14 14:14     ` David Marchand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).