DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer
@ 2017-12-01 14:55 Konstantin Ananyev
  2017-12-14  4:54 ` Jerin Jacob
  0 siblings, 1 reply; 9+ messages in thread
From: Konstantin Ananyev @ 2017-12-01 14:55 UTC (permalink / raw)
  To: dev, dev; +Cc: Konstantin Ananyev

The series introduces 2 main changes:

1.Introduce a separate data structure (rte_eth_queue_local)
to store local to given process (i.e. no-shareable) information
for each configured rx/tx queue.
Memory for that structure is allocated/freed dynamically during
rte_eth_dev_configure().
Reserve a space for queue specific (rx|tx)_pkt_burst(),
tx_pkt_prepare() function pointers inside that structure.
Move rx/tx callback related information inside that structure.
That introduces a change in current behavior: all callbacks for
un-configured queues will be automatically removed.
Also as size of struct rte_eth_dev changes that patch is an ABI breakage,
so deprecation notice for 18.05 is filled.
Further suggestions how to introduce the same functionality
without ABI breakage are welcome.

2. Make it safe to remove rx/tx callback at runtime.
Right now it is not possible for the application to figure out
when it is safe to free removed callback handle and
associated with it resources(unless the queue is stopped).
That's probably not a big problem if all callbacks are static
hange through whole application lifetime)
and/or application doesn't allocate any resources for the callback handler.
Though if callbacks have to be added/removed dynamically and
callback handler would require more resources to operate properly -
then it might become an issue.
So patch #2 fixes that problem - now as soon as
rte_eth_remove_(rx|tx)_callback() completes successfully, application
can safely free all associated with the removed callback resources.

Performance impact:
If application doesn't use RX/TX callbacks, then the tests I run didn't
reveal any performance degradation.
Though if application do use RX/TX callbacks - patch #2 does introduce
some slowdown.
 
To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10Gb (X520-4)
with http://dpdk.org/dev/patchwork/patch/31864/ patch installed I got:
1) testpmd ... --latencystats=1 - slowdown < 1%
2) examples//l3fwd ... --parse-ptype - - slowdown < 1%
3) examples/rxtx_callbacks - slowdown ~8%
All that in terms of packet throughput (Mpps).

Ability to safely remove callbacks at runtime implies
some sort of synchronization.
Even I tried to make it as light as possible,
probably some slowdown is unavoidable.
Of course instead of introducing these changes at rte_ethdev layer
similar technique could be applied on individual callback basis.
In that case it would be up to callback writer/installer to decide
does he/she need a removable at runtime callback or not.
Though in that case, each installed callback might introduce extra
synchronization overhead and slowdown.

Konstantin Ananyev (3):
  ethdev: introduce eth_queue_local
  ethdev: make it safe to remove rx/tx callback at runtime
  doc: ethdev ABI change deprecation notice

 doc/guides/rel_notes/deprecation.rst |   5 +
 lib/librte_ether/rte_ethdev.c        | 390 ++++++++++++++++++++++-------------
 lib/librte_ether/rte_ethdev.h        | 174 ++++++++++++----
 3 files changed, 387 insertions(+), 182 deletions(-)

-- 
2.13.5

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer
  2017-12-01 14:55 [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer Konstantin Ananyev
@ 2017-12-14  4:54 ` Jerin Jacob
  2017-12-14 13:57   ` Ananyev, Konstantin
  0 siblings, 1 reply; 9+ messages in thread
From: Jerin Jacob @ 2017-12-14  4:54 UTC (permalink / raw)
  To: Konstantin Ananyev; +Cc: dev, shahafs

-----Original Message-----
> Date: Fri, 1 Dec 2017 14:55:12 +0000
> From: Konstantin Ananyev <konstantin.ananyev@intel.com>
> To: dev@dpdk.org, dev@dpdk.org
> CC: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Subject: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer
> X-Mailer: git-send-email 1.7.0.7
> 

Hi Konstantin,

> The series introduces 2 main changes:
> 
> 1.Introduce a separate data structure (rte_eth_queue_local)
> to store local to given process (i.e. no-shareable) information
> for each configured rx/tx queue.
> Memory for that structure is allocated/freed dynamically during
> rte_eth_dev_configure().
> Reserve a space for queue specific (rx|tx)_pkt_burst(),
> tx_pkt_prepare() function pointers inside that structure.
> Move rx/tx callback related information inside that structure.
> That introduces a change in current behavior: all callbacks for
> un-configured queues will be automatically removed.
> Also as size of struct rte_eth_dev changes that patch is an ABI breakage,
> so deprecation notice for 18.05 is filled.
> Further suggestions how to introduce the same functionality
> without ABI breakage are welcome.

Are we doing this to enable, Tx/Rx queue specific burst functions to
invoke based on new TX/RX queue specific offloads framework.
Meaning, if a offload configured for the given queue,
driver can select a different function pointer for Rx/Tx burst functions.

> 
> 2. Make it safe to remove rx/tx callback at runtime.
> Right now it is not possible for the application to figure out
> when it is safe to free removed callback handle and
> associated with it resources(unless the queue is stopped).
> That's probably not a big problem if all callbacks are static
> hange through whole application lifetime)
> and/or application doesn't allocate any resources for the callback handler.
> Though if callbacks have to be added/removed dynamically and
> callback handler would require more resources to operate properly -
> then it might become an issue.
> So patch #2 fixes that problem - now as soon as
> rte_eth_remove_(rx|tx)_callback() completes successfully, application
> can safely free all associated with the removed callback resources.
> 
> Performance impact:
> If application doesn't use RX/TX callbacks, then the tests I run didn't
> reveal any performance degradation.
> Though if application do use RX/TX callbacks - patch #2 does introduce
> some slowdown.
>  
> To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10Gb (X520-4)
> with http://dpdk.org/dev/patchwork/patch/31864/ patch installed I got:
> 1) testpmd ... --latencystats=1 - slowdown < 1%
> 2) examples//l3fwd ... --parse-ptype - - slowdown < 1%
> 3) examples/rxtx_callbacks - slowdown ~8%
> All that in terms of packet throughput (Mpps).

We tried on an arm64 machine; We got following result on host and guest
using l3fwd.


Note:
+ve number means "Performance regression with patches"
-ve number means "Performance improvement with patches"

Relative performance difference in percentage:

% difference on host (2 x 40G)	
pkt_sz  1c      2c      4c      8c
64	1.41	0.18	0.51	0.29
72	0.50	0.53	0.09	0.19
128	0.31	0.31	0.42	0.00
256	-0.44	-0.44	0.00	0.00
512	1.94	1.94	0.01	0.01
1024	0.00	0.01	0.00	0.01
1518	0.02	0.01	0.02	0.02

% difference on guest (2 x 40G)	
pkt_sz  1c      2c      4c      8c
64      5.78	2.30	-2.45	-0.01
72	1.13	1.13	1.31	-4.29
128	1.36	1.36	1.54	-0.82
256	2.02	2.03	-0.01	-0.35
512	4.05	4.04	0.00	-0.46
1024	0.39	0.38	0.00	-0.74
1518	0.00	0.00	-0.05	-4.20

I think, it is because we added more code under 
the RTE_ETHDEV_RXTX_CALLBACKS and it is enabled by
default.

I think, the impact will vary based on 
architecture and underneath i-cache, d-cache and IPC.

I am just thinking, How we can avoid such impact?
How about,
# Remove RTE_ETHDEV_RXTX_CALLBACKS option
# Hook Rx/TX callbacks under TX/RX offload flags
# ethdev layer gives helper function to invoke
callbacks to driver. But, don't invoke from common code
# When application needs the callback support, it
configured the given RX/TX queue offloads
# If the Rx/TX callback configure by the application
then driver calls the helper functions exposed by
the common code to invoke RX callbacks.

Jerin

> 
> Ability to safely remove callbacks at runtime implies
> some sort of synchronization.
> Even I tried to make it as light as possible,
> probably some slowdown is unavoidable.
> Of course instead of introducing these changes at rte_ethdev layer
> similar technique could be applied on individual callback basis.
> In that case it would be up to callback writer/installer to decide
> does he/she need a removable at runtime callback or not.
> Though in that case, each installed callback might introduce extra
> synchronization overhead and slowdown.
> 
> Konstantin Ananyev (3):
>   ethdev: introduce eth_queue_local
>   ethdev: make it safe to remove rx/tx callback at runtime
>   doc: ethdev ABI change deprecation notice
> 
>  doc/guides/rel_notes/deprecation.rst |   5 +
>  lib/librte_ether/rte_ethdev.c        | 390 ++++++++++++++++++++++-------------
>  lib/librte_ether/rte_ethdev.h        | 174 ++++++++++++----
>  3 files changed, 387 insertions(+), 182 deletions(-)
> 
> -- 
> 2.13.5
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer
  2017-12-14  4:54 ` Jerin Jacob
@ 2017-12-14 13:57   ` Ananyev, Konstantin
  2017-12-18 16:38     ` Jerin Jacob
  0 siblings, 1 reply; 9+ messages in thread
From: Ananyev, Konstantin @ 2017-12-14 13:57 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dev, shahafs

Hi Jerin,

> 
> Hi Konstantin,
> 
> > The series introduces 2 main changes:
> >
> > 1.Introduce a separate data structure (rte_eth_queue_local)
> > to store local to given process (i.e. no-shareable) information
> > for each configured rx/tx queue.
> > Memory for that structure is allocated/freed dynamically during
> > rte_eth_dev_configure().
> > Reserve a space for queue specific (rx|tx)_pkt_burst(),
> > tx_pkt_prepare() function pointers inside that structure.
> > Move rx/tx callback related information inside that structure.	
> > That introduces a change in current behavior: all callbacks for
> > un-configured queues will be automatically removed.
> > Also as size of struct rte_eth_dev changes that patch is an ABI breakage,
> > so deprecation notice for 18.05 is filled.
> > Further suggestions how to introduce the same functionality
> > without ABI breakage are welcome.
> 
> Are we doing this to enable, Tx/Rx queue specific burst functions to
> invoke based on new TX/RX queue specific offloads framework.
> Meaning, if a offload configured for the given queue,
> driver can select a different function pointer for Rx/Tx burst functions.

Yes that's the stretch goal.
We do need some extra space for rx/tx callbacks (use field) that needs to
be cache aligned and we plan to make rx/tx function per queue.
So I thought it would be a good opportunity to do these 2 things in one go -
to minimize number of ABI breakages. 

> 
> >
> > 2. Make it safe to remove rx/tx callback at runtime.
> > Right now it is not possible for the application to figure out
> > when it is safe to free removed callback handle and
> > associated with it resources(unless the queue is stopped).
> > That's probably not a big problem if all callbacks are static
> > hange through whole application lifetime)
> > and/or application doesn't allocate any resources for the callback handler.
> > Though if callbacks have to be added/removed dynamically and
> > callback handler would require more resources to operate properly -
> > then it might become an issue.
> > So patch #2 fixes that problem - now as soon as
> > rte_eth_remove_(rx|tx)_callback() completes successfully, application
> > can safely free all associated with the removed callback resources.
> >
> > Performance impact:
> > If application doesn't use RX/TX callbacks, then the tests I run didn't
> > reveal any performance degradation.
> > Though if application do use RX/TX callbacks - patch #2 does introduce
> > some slowdown.
> >
> > To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10Gb (X520-4)
> > with http://dpdk.org/dev/patchwork/patch/31864/ patch installed I got:
> > 1) testpmd ... --latencystats=1 - slowdown < 1%
> > 2) examples//l3fwd ... --parse-ptype - - slowdown < 1%
> > 3) examples/rxtx_callbacks - slowdown ~8%
> > All that in terms of packet throughput (Mpps).
> 
> We tried on an arm64 machine; We got following result on host and guest
> using l3fwd.

Thanks for testing it.
So these numbers below - are for l3fwd running with or without rx callback installed?
i.e. with or without '--parse-ptype' option (and on a NIC with/without ptype HW support)?

> 
> 
> Note:
> +ve number means "Performance regression with patches"
> -ve number means "Performance improvement with patches"
> 
> Relative performance difference in percentage:
> 
> % difference on host (2 x 40G)
> pkt_sz  1c      2c      4c      8c
> 64	1.41	0.18	0.51	0.29
> 72	0.50	0.53	0.09	0.19
> 128	0.31	0.31	0.42	0.00
> 256	-0.44	-0.44	0.00	0.00
> 512	1.94	1.94	0.01	0.01
> 1024	0.00	0.01	0.00	0.01
> 1518	0.02	0.01	0.02	0.02
> 
> % difference on guest (2 x 40G)
> pkt_sz  1c      2c      4c      8c
> 64      5.78	2.30	-2.45	-0.01
> 72	1.13	1.13	1.31	-4.29
> 128	1.36	1.36	1.54	-0.82
> 256	2.02	2.03	-0.01	-0.35
> 512	4.05	4.04	0.00	-0.46
> 1024	0.39	0.38	0.00	-0.74
> 1518	0.00	0.00	-0.05	-4.20
> 
> I think, it is because we added more code under
> the RTE_ETHDEV_RXTX_CALLBACKS and it is enabled by
> default.
> 
> I think, the impact will vary based on
> architecture and underneath i-cache, d-cache and IPC.
> 
> I am just thinking, How we can avoid such impact?
> How about,
> # Remove RTE_ETHDEV_RXTX_CALLBACKS option
> # Hook Rx/TX callbacks under TX/RX offload flags
> # ethdev layer gives helper function to invoke
> callbacks to driver. 

I suppose you suggest that user would have to decide at queue_setup()
stage does he wants a callback(s) for RX/TX or no?
As I can see, the main drawback with that approach - it would introduce new limitation.
Right now it is possible to add/remove RX/TX callback at runtime
(without stopping and reconfiguring the queue), and I suppose we do want to keep that ability.
Again in some cases user can't decide does he wants a callback installed or not,
before he actually calls dev_start().
l3fwd ptype-parse callback is an example of such situation:
right now user has to start device first to figure out which ptypes are supported by this PMD
with desired configuration, then based on this information he will (or will not) setup the callback.
What I thought in same direction - introduce a notion of 'static' and 'dynamic' callbacks.
Static ones can't be removed while queue is not stopped while dynamic can.
But that also requires user to select at queue_setup() stage does he wants dynamic
callbacks for that queue or not (new offload flag or extra field in rxconf/txconf structure). 
So again - same limitation for the user.
Another possibility here - have a flag 'dynamic' or 'static' not per queue, but per individual callback.
Yes, that would mean API change for add/remove callback functions.

Konstantin

> But, don't invoke from common code
> # When application needs the callback support, it
> configured the given RX/TX queue offloads
> # If the Rx/TX callback configure by the application
> then driver calls the helper functions exposed by
> the common code to invoke RX callbacks.


> 
> Jerin
> 
> >
> > Ability to safely remove callbacks at runtime implies
> > some sort of synchronization.
> > Even I tried to make it as light as possible,
> > probably some slowdown is unavoidable.
> > Of course instead of introducing these changes at rte_ethdev layer
> > similar technique could be applied on individual callback basis.
> > In that case it would be up to callback writer/installer to decide
> > does he/she need a removable at runtime callback or not.
> > Though in that case, each installed callback might introduce extra
> > synchronization overhead and slowdown.
> >
> > Konstantin Ananyev (3):
> >   ethdev: introduce eth_queue_local
> >   ethdev: make it safe to remove rx/tx callback at runtime
> >   doc: ethdev ABI change deprecation notice
> >
> >  doc/guides/rel_notes/deprecation.rst |   5 +
> >  lib/librte_ether/rte_ethdev.c        | 390 ++++++++++++++++++++++-------------
> >  lib/librte_ether/rte_ethdev.h        | 174 ++++++++++++----
> >  3 files changed, 387 insertions(+), 182 deletions(-)
> >
> > --
> > 2.13.5
> >

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer
  2017-12-14 13:57   ` Ananyev, Konstantin
@ 2017-12-18 16:38     ` Jerin Jacob
  2017-12-19 15:27       ` Ananyev, Konstantin
  0 siblings, 1 reply; 9+ messages in thread
From: Jerin Jacob @ 2017-12-18 16:38 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, shahafs

-----Original Message-----
> Date: Thu, 14 Dec 2017 13:57:02 +0000
> From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> CC: "dev@dpdk.org" <dev@dpdk.org>, "shahafs@mellanox.com"
>  <shahafs@mellanox.com>
> Subject: RE: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev
>  layer
> 
> Hi Jerin,

Hi Konstantin,

> 
> > 
> > Hi Konstantin,
> > 
> > > The series introduces 2 main changes:
> > >
> > > 1.Introduce a separate data structure (rte_eth_queue_local)
> > > to store local to given process (i.e. no-shareable) information
> > > for each configured rx/tx queue.
> > > Memory for that structure is allocated/freed dynamically during
> > > rte_eth_dev_configure().
> > > Reserve a space for queue specific (rx|tx)_pkt_burst(),
> > > tx_pkt_prepare() function pointers inside that structure.
> > > Move rx/tx callback related information inside that structure.	
> > > That introduces a change in current behavior: all callbacks for
> > > un-configured queues will be automatically removed.
> > > Also as size of struct rte_eth_dev changes that patch is an ABI breakage,
> > > so deprecation notice for 18.05 is filled.
> > > Further suggestions how to introduce the same functionality
> > > without ABI breakage are welcome.
> > 
> > Are we doing this to enable, Tx/Rx queue specific burst functions to
> > invoke based on new TX/RX queue specific offloads framework.
> > Meaning, if a offload configured for the given queue,
> > driver can select a different function pointer for Rx/Tx burst functions.
> 
> Yes that's the stretch goal.
> We do need some extra space for rx/tx callbacks (use field) that needs to
> be cache aligned and we plan to make rx/tx function per queue.
> So I thought it would be a good opportunity to do these 2 things in one go -
> to minimize number of ABI breakages. 

Makes sense.

> 
> > 
> > >
> > > 2. Make it safe to remove rx/tx callback at runtime.
> > > Right now it is not possible for the application to figure out
> > > when it is safe to free removed callback handle and
> > > associated with it resources(unless the queue is stopped).
> > > That's probably not a big problem if all callbacks are static
> > > hange through whole application lifetime)
> > > and/or application doesn't allocate any resources for the callback handler.
> > > Though if callbacks have to be added/removed dynamically and
> > > callback handler would require more resources to operate properly -
> > > then it might become an issue.
> > > So patch #2 fixes that problem - now as soon as
> > > rte_eth_remove_(rx|tx)_callback() completes successfully, application
> > > can safely free all associated with the removed callback resources.
> > >
> > > Performance impact:
> > > If application doesn't use RX/TX callbacks, then the tests I run didn't
> > > reveal any performance degradation.
> > > Though if application do use RX/TX callbacks - patch #2 does introduce
> > > some slowdown.
> > >
> > > To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10Gb (X520-4)
> > > with http://dpdk.org/dev/patchwork/patch/31864/ patch installed I got:
> > > 1) testpmd ... --latencystats=1 - slowdown < 1%
> > > 2) examples//l3fwd ... --parse-ptype - - slowdown < 1%
> > > 3) examples/rxtx_callbacks - slowdown ~8%
> > > All that in terms of packet throughput (Mpps).
> > 
> > We tried on an arm64 machine; We got following result on host and guest
> > using l3fwd.
> 
> Thanks for testing it.
> So these numbers below - are for l3fwd running with or without rx callback installed?
> i.e. with or without '--parse-ptype' option (and on a NIC with/without ptype HW support)?

without '--parse-ptype' option(i.e without rx callback installed).
That makes it sad, Even though application is not using the rx callback,
We need to pay on a average 3% drop in performance(based on the
use case/arch/SoC/topology etc)

> 
> > 
> > 
> > Note:
> > +ve number means "Performance regression with patches"
> > -ve number means "Performance improvement with patches"
> > 
> > Relative performance difference in percentage:
> > 
> > % difference on host (2 x 40G)
> > pkt_sz  1c      2c      4c      8c
> > 64	1.41	0.18	0.51	0.29
> > 72	0.50	0.53	0.09	0.19
> > 128	0.31	0.31	0.42	0.00
> > 256	-0.44	-0.44	0.00	0.00
> > 512	1.94	1.94	0.01	0.01
> > 1024	0.00	0.01	0.00	0.01
> > 1518	0.02	0.01	0.02	0.02
> > 
> > % difference on guest (2 x 40G)
> > pkt_sz  1c      2c      4c      8c
> > 64      5.78	2.30	-2.45	-0.01
> > 72	1.13	1.13	1.31	-4.29
> > 128	1.36	1.36	1.54	-0.82
> > 256	2.02	2.03	-0.01	-0.35
> > 512	4.05	4.04	0.00	-0.46
> > 1024	0.39	0.38	0.00	-0.74
> > 1518	0.00	0.00	-0.05	-4.20
> > 
> > I think, it is because we added more code under
> > the RTE_ETHDEV_RXTX_CALLBACKS and it is enabled by
> > default.
> > 
> > I think, the impact will vary based on
> > architecture and underneath i-cache, d-cache and IPC.
> > 
> > I am just thinking, How we can avoid such impact?
> > How about,
> > # Remove RTE_ETHDEV_RXTX_CALLBACKS option
> > # Hook Rx/TX callbacks under TX/RX offload flags
> > # ethdev layer gives helper function to invoke
> > callbacks to driver. 
> 
> I suppose you suggest that user would have to decide at queue_setup()
> stage does he wants a callback(s) for RX/TX or no?

Yes. queue_setup or configure() time.

> As I can see, the main drawback with that approach - it would introduce new limitation.
> Right now it is possible to add/remove RX/TX callback at runtime
> (without stopping and reconfiguring the queue), and I suppose we do want to keep that ability.

For those _application_ where it can not decide or it need to add without
stopping and re configuring the queue still can configure/enable the ethdev
device with RX/TX callback option(and pay price for it)

> Again in some cases user can't decide does he wants a callback installed or not,
> before he actually calls dev_start().
> l3fwd ptype-parse callback is an example of such situation:
> right now user has to start device first to figure out which ptypes are supported by this PMD
> with desired configuration, then based on this information he will (or will not) setup the callback.

Good point. Can we remove that limitation from PMD perspective? IMO, ptype-parse is more of HW feature,
All we need to do a lookup(single load) to convert to DPDK packet type.
http://dpdk.org/browse/dpdk/tree/drivers/net/thunderx/nicvf_rxtx.c#n258

I see almost all the drivers has dummy check like this, I think, That makes it
dependency with start().

        if (dev->rx_pkt_burst == i40e_recv_pkts ||
            dev->rx_pkt_burst == i40e_recv_pkts_bulk_alloc ||
            dev->rx_pkt_burst == i40e_recv_scattered_pkts ||
            dev->rx_pkt_burst == i40e_recv_scattered_pkts_vec ||
            dev->rx_pkt_burst == i40e_recv_pkts_vec)
                return ptypes;


> What I thought in same direction - introduce a notion of 'static' and 'dynamic' callbacks.
> Static ones can't be removed while queue is not stopped while dynamic can.
> But that also requires user to select at queue_setup() stage does he wants dynamic
> callbacks for that queue or not (new offload flag or extra field in rxconf/txconf structure). 
> So again - same limitation for the user.
> Another possibility here - have a flag 'dynamic' or 'static' not per queue, but per individual callback.
> Yes, that would mean API change for add/remove callback functions.

At least on my test, the regression was caused because adding the code in
tx_burst and rx_burst under the RTE_ETHDEV_RXTX_CALLBACKS(which enabled
by default). By disabling the RTE_ETHDEV_RXTX_CALLBACKS option, I don't
see any regression. If so, Will such regression fixed by introducing
'static' or 'dynamic' callbacks? I am happy to test and provide feedback.

For me, I think, The better option is request from application for the
need for Rx/TX callbacks and set the handler appropriate in the driver to
have zero impact for the application which does not need callback at all.

at minimum, on those PMDs that support detecting the supported ptypes without
device start()




> 
> Konstantin
> 
> > But, don't invoke from common code
> > # When application needs the callback support, it
> > configured the given RX/TX queue offloads
> > # If the Rx/TX callback configure by the application
> > then driver calls the helper functions exposed by
> > the common code to invoke RX callbacks.
> 
> 
> > 
> > Jerin
> > 
> > >
> > > Ability to safely remove callbacks at runtime implies
> > > some sort of synchronization.
> > > Even I tried to make it as light as possible,
> > > probably some slowdown is unavoidable.
> > > Of course instead of introducing these changes at rte_ethdev layer
> > > similar technique could be applied on individual callback basis.
> > > In that case it would be up to callback writer/installer to decide
> > > does he/she need a removable at runtime callback or not.
> > > Though in that case, each installed callback might introduce extra
> > > synchronization overhead and slowdown.
> > >
> > > Konstantin Ananyev (3):
> > >   ethdev: introduce eth_queue_local
> > >   ethdev: make it safe to remove rx/tx callback at runtime
> > >   doc: ethdev ABI change deprecation notice
> > >
> > >  doc/guides/rel_notes/deprecation.rst |   5 +
> > >  lib/librte_ether/rte_ethdev.c        | 390 ++++++++++++++++++++++-------------
> > >  lib/librte_ether/rte_ethdev.h        | 174 ++++++++++++----
> > >  3 files changed, 387 insertions(+), 182 deletions(-)
> > >
> > > --
> > > 2.13.5
> > >

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer
  2017-12-18 16:38     ` Jerin Jacob
@ 2017-12-19 15:27       ` Ananyev, Konstantin
  2017-12-20  7:28         ` Jerin Jacob
  0 siblings, 1 reply; 9+ messages in thread
From: Ananyev, Konstantin @ 2017-12-19 15:27 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dev, shahafs


Hi Jerin,

> > > >
> > > > 2. Make it safe to remove rx/tx callback at runtime.
> > > > Right now it is not possible for the application to figure out
> > > > when it is safe to free removed callback handle and
> > > > associated with it resources(unless the queue is stopped).
> > > > That's probably not a big problem if all callbacks are static
> > > > hange through whole application lifetime)
> > > > and/or application doesn't allocate any resources for the callback handler.
> > > > Though if callbacks have to be added/removed dynamically and
> > > > callback handler would require more resources to operate properly -
> > > > then it might become an issue.
> > > > So patch #2 fixes that problem - now as soon as
> > > > rte_eth_remove_(rx|tx)_callback() completes successfully, application
> > > > can safely free all associated with the removed callback resources.
> > > >
> > > > Performance impact:
> > > > If application doesn't use RX/TX callbacks, then the tests I run didn't
> > > > reveal any performance degradation.
> > > > Though if application do use RX/TX callbacks - patch #2 does introduce
> > > > some slowdown.
> > > >
> > > > To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10Gb (X520-4)
> > > > with http://dpdk.org/dev/patchwork/patch/31864/ patch installed I got:
> > > > 1) testpmd ... --latencystats=1 - slowdown < 1%
> > > > 2) examples//l3fwd ... --parse-ptype - - slowdown < 1%
> > > > 3) examples/rxtx_callbacks - slowdown ~8%
> > > > All that in terms of packet throughput (Mpps).
> > >
> > > We tried on an arm64 machine; We got following result on host and guest
> > > using l3fwd.
> >
> > Thanks for testing it.
> > So these numbers below - are for l3fwd running with or without rx callback installed?
> > i.e. with or without '--parse-ptype' option (and on a NIC with/without ptype HW support)?
> 
> without '--parse-ptype' option(i.e without rx callback installed).
> That makes it sad, Even though application is not using the rx callback,
> We need to pay on a average 3% drop in performance(based on the
> use case/arch/SoC/topology etc)

Yes it is sad...
Do you have any idea why that happens?
Is that the same situation you described before: for arm compiler
often replaces small branches with conditional execution instructions
and miss-prediction for such instructions is not free?
Or is it something else?

Another thing - the series contains 2 main patches:
1) http://dpdk.org/dev/patchwork/patch/31866/
Introduces eth_queue_local and moves callback data into it.
That adds one extra memory reference for rx/tx function to read cb func and data.
2) http://dpdk.org/dev/patchwork/patch/31867/
Introduces some synchronization for cb data.

Is it possible to rerun your tests with just patch #1 installed?
Hopefully it would help us to figure out which of the patches (#1 or #2)  
is the main source of slowdown on arm machines.

> 
> >
> > >
> > >
> > > Note:
> > > +ve number means "Performance regression with patches"
> > > -ve number means "Performance improvement with patches"
> > >
> > > Relative performance difference in percentage:
> > >
> > > % difference on host (2 x 40G)
> > > pkt_sz  1c      2c      4c      8c
> > > 64	1.41	0.18	0.51	0.29
> > > 72	0.50	0.53	0.09	0.19
> > > 128	0.31	0.31	0.42	0.00
> > > 256	-0.44	-0.44	0.00	0.00
> > > 512	1.94	1.94	0.01	0.01
> > > 1024	0.00	0.01	0.00	0.01
> > > 1518	0.02	0.01	0.02	0.02
> > >
> > > % difference on guest (2 x 40G)
> > > pkt_sz  1c      2c      4c      8c
> > > 64      5.78	2.30	-2.45	-0.01
> > > 72	1.13	1.13	1.31	-4.29
> > > 128	1.36	1.36	1.54	-0.82
> > > 256	2.02	2.03	-0.01	-0.35
> > > 512	4.05	4.04	0.00	-0.46
> > > 1024	0.39	0.38	0.00	-0.74
> > > 1518	0.00	0.00	-0.05	-4.20
> > >
> > > I think, it is because we added more code under
> > > the RTE_ETHDEV_RXTX_CALLBACKS and it is enabled by
> > > default.
> > >
> > > I think, the impact will vary based on
> > > architecture and underneath i-cache, d-cache and IPC.
> > >
> > > I am just thinking, How we can avoid such impact?
> > > How about,
> > > # Remove RTE_ETHDEV_RXTX_CALLBACKS option
> > > # Hook Rx/TX callbacks under TX/RX offload flags
> > > # ethdev layer gives helper function to invoke
> > > callbacks to driver.
> >
> > I suppose you suggest that user would have to decide at queue_setup()
> > stage does he wants a callback(s) for RX/TX or no?
> 
> Yes. queue_setup or configure() time.
> 
> > As I can see, the main drawback with that approach - it would introduce new limitation.
> > Right now it is possible to add/remove RX/TX callback at runtime
> > (without stopping and reconfiguring the queue), and I suppose we do want to keep that ability.
> 
> For those _application_ where it can not decide or it need to add without
> stopping and re configuring the queue still can configure/enable the ethdev
> device with RX/TX callback option(and pay price for it)
> 
> > Again in some cases user can't decide does he wants a callback installed or not,
> > before he actually calls dev_start().
> > l3fwd ptype-parse callback is an example of such situation:
> > right now user has to start device first to figure out which ptypes are supported by this PMD
> > with desired configuration, then based on this information he will (or will not) setup the callback.
> 
> Good point. Can we remove that limitation from PMD perspective? IMO, ptype-parse is more of HW feature,
> All we need to do a lookup(single load) to convert to DPDK packet type.
> http://dpdk.org/browse/dpdk/tree/drivers/net/thunderx/nicvf_rxtx.c#n258
> 
> I see almost all the drivers has dummy check like this, I think, That makes it
> dependency with start().
> 
>         if (dev->rx_pkt_burst == i40e_recv_pkts ||
>             dev->rx_pkt_burst == i40e_recv_pkts_bulk_alloc ||
>             dev->rx_pkt_burst == i40e_recv_scattered_pkts ||
>             dev->rx_pkt_burst == i40e_recv_scattered_pkts_vec ||
>             dev->rx_pkt_burst == i40e_recv_pkts_vec)
>                 return ptypes;
> 

We probably can, but it would require extra work and changes inside the PMDs.
i40e is a good one - all rx functions support the same ptypes.
But let say for ixgbe ARM version of vRX doesn't support all ptypes that x86 vRX
and generic scalar RX do.
For fm10k - vector RX and scalar RX also support different ptype sets.  

> 
> > What I thought in same direction - introduce a notion of 'static' and 'dynamic' callbacks.
> > Static ones can't be removed while queue is not stopped while dynamic can.
> > But that also requires user to select at queue_setup() stage does he wants dynamic
> > callbacks for that queue or not (new offload flag or extra field in rxconf/txconf structure).
> > So again - same limitation for the user.
> > Another possibility here - have a flag 'dynamic' or 'static' not per queue, but per individual callback.
> > Yes, that would mean API change for add/remove callback functions.
> 
> At least on my test, the regression was caused because adding the code in
> tx_burst and rx_burst under the RTE_ETHDEV_RXTX_CALLBACKS(which enabled
> by default). By disabling the RTE_ETHDEV_RXTX_CALLBACKS option, I don't
> see any regression. If so, Will such regression fixed by introducing
> 'static' or 'dynamic' callbacks? I am happy to test and provide feedback.

Probably no then, but let's try the experiment I mentioned above first.

> 
> For me, I think, The better option is request from application for the
> need for Rx/TX callbacks and set the handler appropriate in the driver to
> have zero impact for the application which does not need callback at all.

It is possible, but:
- that would be quite significant change and would affect each RX and TX function   in each PMD.
- from other side - even now there are several libs inside DPDK that rely on ability to add/remove
  callbacks on the fly - pdump, latencystats.
  ethdev rx profiling  also relies on ability to install RX callback.

With all these things I presume most apps will configure their devices with rx/tx callback support enabled anyway.
Which makes me think - is it worth to introduce such changes at all.
Anyway, my first suggestion would be let's try to figure out what exactly causes such big slowdown on ARM.
Konstantin

> 
> at minimum, on those PMDs that support detecting the supported ptypes without
> device start()
> 
> 
> 
> 
> >
> > Konstantin
> >
> > > But, don't invoke from common code
> > > # When application needs the callback support, it
> > > configured the given RX/TX queue offloads
> > > # If the Rx/TX callback configure by the application
> > > then driver calls the helper functions exposed by
> > > the common code to invoke RX callbacks.
> >
> >
> > >
> > > Jerin
> > >
> > > >
> > > > Ability to safely remove callbacks at runtime implies
> > > > some sort of synchronization.
> > > > Even I tried to make it as light as possible,
> > > > probably some slowdown is unavoidable.
> > > > Of course instead of introducing these changes at rte_ethdev layer
> > > > similar technique could be applied on individual callback basis.
> > > > In that case it would be up to callback writer/installer to decide
> > > > does he/she need a removable at runtime callback or not.
> > > > Though in that case, each installed callback might introduce extra
> > > > synchronization overhead and slowdown.
> > > >
> > > > Konstantin Ananyev (3):
> > > >   ethdev: introduce eth_queue_local
> > > >   ethdev: make it safe to remove rx/tx callback at runtime
> > > >   doc: ethdev ABI change deprecation notice
> > > >
> > > >  doc/guides/rel_notes/deprecation.rst |   5 +
> > > >  lib/librte_ether/rte_ethdev.c        | 390 ++++++++++++++++++++++-------------
> > > >  lib/librte_ether/rte_ethdev.h        | 174 ++++++++++++----
> > > >  3 files changed, 387 insertions(+), 182 deletions(-)
> > > >
> > > > --
> > > > 2.13.5
> > > >

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer
  2017-12-19 15:27       ` Ananyev, Konstantin
@ 2017-12-20  7:28         ` Jerin Jacob
  2017-12-20 18:23           ` Ananyev, Konstantin
  0 siblings, 1 reply; 9+ messages in thread
From: Jerin Jacob @ 2017-12-20  7:28 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, shahafs

-----Original Message-----
> Date: Tue, 19 Dec 2017 15:27:26 +0000
> From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> CC: "dev@dpdk.org" <dev@dpdk.org>, "shahafs@mellanox.com"
>  <shahafs@mellanox.com>
> Subject: RE: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev
>  layer
> 
> 

Hi Konstantin,

> > > > >
> > > > > 2. Make it safe to remove rx/tx callback at runtime.
> > > > > Right now it is not possible for the application to figure out
> > > > > when it is safe to free removed callback handle and
> > > > > associated with it resources(unless the queue is stopped).
> > > > > That's probably not a big problem if all callbacks are static
> > > > > hange through whole application lifetime)
> > > > > and/or application doesn't allocate any resources for the callback handler.
> > > > > Though if callbacks have to be added/removed dynamically and
> > > > > callback handler would require more resources to operate properly -
> > > > > then it might become an issue.
> > > > > So patch #2 fixes that problem - now as soon as
> > > > > rte_eth_remove_(rx|tx)_callback() completes successfully, application
> > > > > can safely free all associated with the removed callback resources.
> > > > >
> > > > > Performance impact:
> > > > > If application doesn't use RX/TX callbacks, then the tests I run didn't
> > > > > reveal any performance degradation.
> > > > > Though if application do use RX/TX callbacks - patch #2 does introduce
> > > > > some slowdown.
> > > > >
> > > > > To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10Gb (X520-4)
> > > > > with http://dpdk.org/dev/patchwork/patch/31864/ patch installed I got:
> > > > > 1) testpmd ... --latencystats=1 - slowdown < 1%
> > > > > 2) examples//l3fwd ... --parse-ptype - - slowdown < 1%
> > > > > 3) examples/rxtx_callbacks - slowdown ~8%
> > > > > All that in terms of packet throughput (Mpps).
> > > >
> > > > We tried on an arm64 machine; We got following result on host and guest
> > > > using l3fwd.
> > >
> > > Thanks for testing it.
> > > So these numbers below - are for l3fwd running with or without rx callback installed?
> > > i.e. with or without '--parse-ptype' option (and on a NIC with/without ptype HW support)?
> > 
> > without '--parse-ptype' option(i.e without rx callback installed).
> > That makes it sad, Even though application is not using the rx callback,
> > We need to pay on a average 3% drop in performance(based on the
> > use case/arch/SoC/topology etc)
> 
> Yes it is sad...
> Do you have any idea why that happens?
> Is that the same situation you described before: for arm compiler
> often replaces small branches with conditional execution instructions
> and miss-prediction for such instructions is not free?

Yes.

Another thing I noticed on these inline functions(rx/tx burst)even if we add 
a lot of code which may note be executed, based on the parent
function layout, the compete register access layout gets change(like
push/pop from stack etc, as more code gets added).
I.e This change does not have much impact with testpmd(iofwd) where the
parent function is not doing much thing.It is showing up only on l3fwd.

> Or is it something else?
> 
> Another thing - the series contains 2 main patches:
> 1) http://dpdk.org/dev/patchwork/patch/31866/
> Introduces eth_queue_local and moves callback data into it.
> That adds one extra memory reference for rx/tx function to read cb func and data.
> 2) http://dpdk.org/dev/patchwork/patch/31867/
> Introduces some synchronization for cb data.
> 
> Is it possible to rerun your tests with just patch #1 installed?
> Hopefully it would help us to figure out which of the patches (#1 or #2)  
> is the main source of slowdown on arm machines.

I re ran the test as requested:
With first patch - 1.17% regression
With first patch + Second patch - 2.84% regression

I am fine with slowdown, but it should be effected only with those
application which request the callback feature.

> 
> > 
> > >
> > > >
> > > >
> > > > Note:
> > > > +ve number means "Performance regression with patches"
> > > > -ve number means "Performance improvement with patches"
> > > >
> > > > Relative performance difference in percentage:
> > > >
> > > > % difference on host (2 x 40G)
> > > > pkt_sz  1c      2c      4c      8c
> > > > 64	1.41	0.18	0.51	0.29
> > > > 72	0.50	0.53	0.09	0.19
> > > > 128	0.31	0.31	0.42	0.00
> > > > 256	-0.44	-0.44	0.00	0.00
> > > > 512	1.94	1.94	0.01	0.01
> > > > 1024	0.00	0.01	0.00	0.01
> > > > 1518	0.02	0.01	0.02	0.02
> > > >
> > > > % difference on guest (2 x 40G)
> > > > pkt_sz  1c      2c      4c      8c
> > > > 64      5.78	2.30	-2.45	-0.01
> > > > 72	1.13	1.13	1.31	-4.29
> > > > 128	1.36	1.36	1.54	-0.82
> > > > 256	2.02	2.03	-0.01	-0.35
> > > > 512	4.05	4.04	0.00	-0.46
> > > > 1024	0.39	0.38	0.00	-0.74
> > > > 1518	0.00	0.00	-0.05	-4.20
> > > >
> > > > I think, it is because we added more code under
> > > > the RTE_ETHDEV_RXTX_CALLBACKS and it is enabled by
> > > > default.
> > > >
> > > > I think, the impact will vary based on
> > > > architecture and underneath i-cache, d-cache and IPC.
> > > >
> > > > I am just thinking, How we can avoid such impact?
> > > > How about,
> > > > # Remove RTE_ETHDEV_RXTX_CALLBACKS option
> > > > # Hook Rx/TX callbacks under TX/RX offload flags
> > > > # ethdev layer gives helper function to invoke
> > > > callbacks to driver.
> > >
> > > I suppose you suggest that user would have to decide at queue_setup()
> > > stage does he wants a callback(s) for RX/TX or no?
> > 
> > Yes. queue_setup or configure() time.
> > 
> > > As I can see, the main drawback with that approach - it would introduce new limitation.
> > > Right now it is possible to add/remove RX/TX callback at runtime
> > > (without stopping and reconfiguring the queue), and I suppose we do want to keep that ability.
> > 
> > For those _application_ where it can not decide or it need to add without
> > stopping and re configuring the queue still can configure/enable the ethdev
> > device with RX/TX callback option(and pay price for it)
> > 
> > > Again in some cases user can't decide does he wants a callback installed or not,
> > > before he actually calls dev_start().
> > > l3fwd ptype-parse callback is an example of such situation:
> > > right now user has to start device first to figure out which ptypes are supported by this PMD
> > > with desired configuration, then based on this information he will (or will not) setup the callback.
> > 
> > Good point. Can we remove that limitation from PMD perspective? IMO, ptype-parse is more of HW feature,
> > All we need to do a lookup(single load) to convert to DPDK packet type.
> > http://dpdk.org/browse/dpdk/tree/drivers/net/thunderx/nicvf_rxtx.c#n258
> > 
> > I see almost all the drivers has dummy check like this, I think, That makes it
> > dependency with start().
> > 
> >         if (dev->rx_pkt_burst == i40e_recv_pkts ||
> >             dev->rx_pkt_burst == i40e_recv_pkts_bulk_alloc ||
> >             dev->rx_pkt_burst == i40e_recv_scattered_pkts ||
> >             dev->rx_pkt_burst == i40e_recv_scattered_pkts_vec ||
> >             dev->rx_pkt_burst == i40e_recv_pkts_vec)
> >                 return ptypes;
> > 
> 
> We probably can, but it would require extra work and changes inside the PMDs.

Yes, If we want do it in clean way. Another intermediate solution could
be return some valid error code on those PMD. On such error code, application can
start the device and get the supported ptype and then stop/reconfigure it again

> i40e is a good one - all rx functions support the same ptypes.
> But let say for ixgbe ARM version of vRX doesn't support all ptypes that x86 vRX
> and generic scalar RX do.
> For fm10k - vector RX and scalar RX also support different ptype sets.  

> 
> > 
> > > What I thought in same direction - introduce a notion of 'static' and 'dynamic' callbacks.
> > > Static ones can't be removed while queue is not stopped while dynamic can.
> > > But that also requires user to select at queue_setup() stage does he wants dynamic
> > > callbacks for that queue or not (new offload flag or extra field in rxconf/txconf structure).
> > > So again - same limitation for the user.
> > > Another possibility here - have a flag 'dynamic' or 'static' not per queue, but per individual callback.
> > > Yes, that would mean API change for add/remove callback functions.
> > 
> > At least on my test, the regression was caused because adding the code in
> > tx_burst and rx_burst under the RTE_ETHDEV_RXTX_CALLBACKS(which enabled
> > by default). By disabling the RTE_ETHDEV_RXTX_CALLBACKS option, I don't
> > see any regression. If so, Will such regression fixed by introducing
> > 'static' or 'dynamic' callbacks? I am happy to test and provide feedback.
> 
> Probably no then, but let's try the experiment I mentioned above first.
> 
> > 
> > For me, I think, The better option is request from application for the
> > need for Rx/TX callbacks and set the handler appropriate in the driver to
> > have zero impact for the application which does not need callback at all.
> 
> It is possible, but:
> - that would be quite significant change and would affect each RX and TX function   in each PMD.

May be not, assume if we  have helper function which invoke the callbacks.
PMD can just call that before returning it.

> - from other side - even now there are several libs inside DPDK that rely on ability to add/remove
>   callbacks on the fly - pdump, latencystats.
I think, it is application who decides to use this API,then I think,
application can request callback feature support. I think, we have
followed the same method in latest RX/TX offload flag scheme as well.

>   ethdev rx profiling  also relies on ability to install RX callback.
> 
> With all these things I presume most apps will configure their devices with rx/tx callback support enabled anyway.
> Which makes me think - is it worth to introduce such changes at all.
> Anyway, my first suggestion would be let's try to figure out what exactly causes such big slowdown on ARM.

It may not be ARM in general, Under our roof, We have different class of SoCs based
on the use case like 4 cores, 16 core, 96 cores part etc with different
micro architecture etc based on the use case(embedded/server etc) like
Intel. So it may not very generic with all ARM machines, it is really
based on micro architecture. ie. The same patch may behave
differently(less than < 1%) on IA-Atom vs IA-XEON. So IMO, Common code 
change should use helper functions, it up to the driver to use it.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer
  2017-12-20  7:28         ` Jerin Jacob
@ 2017-12-20 18:23           ` Ananyev, Konstantin
  2018-01-04  9:22             ` Jerin Jacob
  0 siblings, 1 reply; 9+ messages in thread
From: Ananyev, Konstantin @ 2017-12-20 18:23 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dev, shahafs

Hi Jerin,

> 
> Hi Konstantin,
> 
> > > > > >
> > > > > > 2. Make it safe to remove rx/tx callback at runtime.
> > > > > > Right now it is not possible for the application to figure out
> > > > > > when it is safe to free removed callback handle and
> > > > > > associated with it resources(unless the queue is stopped).
> > > > > > That's probably not a big problem if all callbacks are static
> > > > > > hange through whole application lifetime)
> > > > > > and/or application doesn't allocate any resources for the callback handler.
> > > > > > Though if callbacks have to be added/removed dynamically and
> > > > > > callback handler would require more resources to operate properly -
> > > > > > then it might become an issue.
> > > > > > So patch #2 fixes that problem - now as soon as
> > > > > > rte_eth_remove_(rx|tx)_callback() completes successfully, application
> > > > > > can safely free all associated with the removed callback resources.
> > > > > >
> > > > > > Performance impact:
> > > > > > If application doesn't use RX/TX callbacks, then the tests I run didn't
> > > > > > reveal any performance degradation.
> > > > > > Though if application do use RX/TX callbacks - patch #2 does introduce
> > > > > > some slowdown.
> > > > > >
> > > > > > To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10Gb (X520-4)
> > > > > > with http://dpdk.org/dev/patchwork/patch/31864/ patch installed I got:
> > > > > > 1) testpmd ... --latencystats=1 - slowdown < 1%
> > > > > > 2) examples//l3fwd ... --parse-ptype - - slowdown < 1%
> > > > > > 3) examples/rxtx_callbacks - slowdown ~8%
> > > > > > All that in terms of packet throughput (Mpps).
> > > > >
> > > > > We tried on an arm64 machine; We got following result on host and guest
> > > > > using l3fwd.
> > > >
> > > > Thanks for testing it.
> > > > So these numbers below - are for l3fwd running with or without rx callback installed?
> > > > i.e. with or without '--parse-ptype' option (and on a NIC with/without ptype HW support)?
> > >
> > > without '--parse-ptype' option(i.e without rx callback installed).
> > > That makes it sad, Even though application is not using the rx callback,
> > > We need to pay on a average 3% drop in performance(based on the
> > > use case/arch/SoC/topology etc)
> >
> > Yes it is sad...
> > Do you have any idea why that happens?
> > Is that the same situation you described before: for arm compiler
> > often replaces small branches with conditional execution instructions
> > and miss-prediction for such instructions is not free?
> 
> Yes.

Probably a dumb question - is it possible to disable such optimization
for these particular functions (rx_burst/tx_burst)?
Or would it make things even worse?

> 
> Another thing I noticed on these inline functions(rx/tx burst)even if we add
> a lot of code which may note be executed, based on the parent
> function layout, the compete register access layout gets change(like
> push/pop from stack etc, as more code gets added).
> I.e This change does not have much impact with testpmd(iofwd) where the
> parent function is not doing much thing.It is showing up only on l3fwd.

Ok sort of register spillage...
But if compiler does it unconditionally (for the code path when it never needed)
seems like a problem with the compile, no?
Wonder if we'll move at least RX/TX CBs invocation into a separate function -
would it help somehow?
I.E. if (q->cb != NULL) execut_cbs(...);
?
Another thought - as I remember a while ago Bruce and other lads suggested to move
rx_burst/tx)burst (and whole struct rte_eth_dev) definitions into rte_ethdev.c...
That would help to avoid such problems, again we'll be able to modify rte_eth_dev
without breaking ABI...
Though it would mean extra call overhead, so don't know of-hand how expensive
it could be.

> 
> > Or is it something else?
> >
> > Another thing - the series contains 2 main patches:
> > 1) http://dpdk.org/dev/patchwork/patch/31866/
> > Introduces eth_queue_local and moves callback data into it.
> > That adds one extra memory reference for rx/tx function to read cb func and data.
> > 2) http://dpdk.org/dev/patchwork/patch/31867/
> > Introduces some synchronization for cb data.
> >
> > Is it possible to rerun your tests with just patch #1 installed?
> > Hopefully it would help us to figure out which of the patches (#1 or #2)
> > is the main source of slowdown on arm machines.
> 
> I re ran the test as requested:
> With first patch - 1.17% regression
> With first patch + Second patch - 2.84% regression

Thanks

> 
> I am fine with slowdown, but it should be effected only with those
> application which request the callback feature.

Ok, but you do realize that if we'll go with rx/tx function per queue,
patch #1 slowdown (1.17%)  will probably reappear anyway? 

> 
> >
> > >
> > > >
> > > > >
> > > > >
> > > > > Note:
> > > > > +ve number means "Performance regression with patches"
> > > > > -ve number means "Performance improvement with patches"
> > > > >
> > > > > Relative performance difference in percentage:
> > > > >
> > > > > % difference on host (2 x 40G)
> > > > > pkt_sz  1c      2c      4c      8c
> > > > > 64	1.41	0.18	0.51	0.29
> > > > > 72	0.50	0.53	0.09	0.19
> > > > > 128	0.31	0.31	0.42	0.00
> > > > > 256	-0.44	-0.44	0.00	0.00
> > > > > 512	1.94	1.94	0.01	0.01
> > > > > 1024	0.00	0.01	0.00	0.01
> > > > > 1518	0.02	0.01	0.02	0.02
> > > > >
> > > > > % difference on guest (2 x 40G)
> > > > > pkt_sz  1c      2c      4c      8c
> > > > > 64      5.78	2.30	-2.45	-0.01
> > > > > 72	1.13	1.13	1.31	-4.29
> > > > > 128	1.36	1.36	1.54	-0.82
> > > > > 256	2.02	2.03	-0.01	-0.35
> > > > > 512	4.05	4.04	0.00	-0.46
> > > > > 1024	0.39	0.38	0.00	-0.74
> > > > > 1518	0.00	0.00	-0.05	-4.20
> > > > >
> > > > > I think, it is because we added more code under
> > > > > the RTE_ETHDEV_RXTX_CALLBACKS and it is enabled by
> > > > > default.
> > > > >
> > > > > I think, the impact will vary based on
> > > > > architecture and underneath i-cache, d-cache and IPC.
> > > > >
> > > > > I am just thinking, How we can avoid such impact?
> > > > > How about,
> > > > > # Remove RTE_ETHDEV_RXTX_CALLBACKS option
> > > > > # Hook Rx/TX callbacks under TX/RX offload flags
> > > > > # ethdev layer gives helper function to invoke
> > > > > callbacks to driver.
> > > >
> > > > I suppose you suggest that user would have to decide at queue_setup()
> > > > stage does he wants a callback(s) for RX/TX or no?
> > >
> > > Yes. queue_setup or configure() time.
> > >
> > > > As I can see, the main drawback with that approach - it would introduce new limitation.
> > > > Right now it is possible to add/remove RX/TX callback at runtime
> > > > (without stopping and reconfiguring the queue), and I suppose we do want to keep that ability.
> > >
> > > For those _application_ where it can not decide or it need to add without
> > > stopping and re configuring the queue still can configure/enable the ethdev
> > > device with RX/TX callback option(and pay price for it)
> > >
> > > > Again in some cases user can't decide does he wants a callback installed or not,
> > > > before he actually calls dev_start().
> > > > l3fwd ptype-parse callback is an example of such situation:
> > > > right now user has to start device first to figure out which ptypes are supported by this PMD
> > > > with desired configuration, then based on this information he will (or will not) setup the callback.
> > >
> > > Good point. Can we remove that limitation from PMD perspective? IMO, ptype-parse is more of HW feature,
> > > All we need to do a lookup(single load) to convert to DPDK packet type.
> > > http://dpdk.org/browse/dpdk/tree/drivers/net/thunderx/nicvf_rxtx.c#n258
> > >
> > > I see almost all the drivers has dummy check like this, I think, That makes it
> > > dependency with start().
> > >
> > >         if (dev->rx_pkt_burst == i40e_recv_pkts ||
> > >             dev->rx_pkt_burst == i40e_recv_pkts_bulk_alloc ||
> > >             dev->rx_pkt_burst == i40e_recv_scattered_pkts ||
> > >             dev->rx_pkt_burst == i40e_recv_scattered_pkts_vec ||
> > >             dev->rx_pkt_burst == i40e_recv_pkts_vec)
> > >                 return ptypes;
> > >
> >
> > We probably can, but it would require extra work and changes inside the PMDs.
> 
> Yes, If we want do it in clean way. Another intermediate solution could
> be return some valid error code on those PMD. On such error code, application can
> start the device and get the supported ptype and then stop/reconfigure it again
> 
> > i40e is a good one - all rx functions support the same ptypes.
> > But let say for ixgbe ARM version of vRX doesn't support all ptypes that x86 vRX
> > and generic scalar RX do.
> > For fm10k - vector RX and scalar RX also support different ptype sets.
> 
> >
> > >
> > > > What I thought in same direction - introduce a notion of 'static' and 'dynamic' callbacks.
> > > > Static ones can't be removed while queue is not stopped while dynamic can.
> > > > But that also requires user to select at queue_setup() stage does he wants dynamic
> > > > callbacks for that queue or not (new offload flag or extra field in rxconf/txconf structure).
> > > > So again - same limitation for the user.
> > > > Another possibility here - have a flag 'dynamic' or 'static' not per queue, but per individual callback.
> > > > Yes, that would mean API change for add/remove callback functions.
> > >
> > > At least on my test, the regression was caused because adding the code in
> > > tx_burst and rx_burst under the RTE_ETHDEV_RXTX_CALLBACKS(which enabled
> > > by default). By disabling the RTE_ETHDEV_RXTX_CALLBACKS option, I don't
> > > see any regression. If so, Will such regression fixed by introducing
> > > 'static' or 'dynamic' callbacks? I am happy to test and provide feedback.
> >
> > Probably no then, but let's try the experiment I mentioned above first.
> >
> > >
> > > For me, I think, The better option is request from application for the
> > > need for Rx/TX callbacks and set the handler appropriate in the driver to
> > > have zero impact for the application which does not need callback at all.
> >
> > It is possible, but:
> > - that would be quite significant change and would affect each RX and TX function   in each PMD.
> 
> May be not, assume if we  have helper function which invoke the callbacks.
> PMD can just call that before returning it.

Even with helpers it means we'll have to double number of RX/TX functions -
one with a CB support, another without.  

> 
> > - from other side - even now there are several libs inside DPDK that rely on ability to add/remove
> >   callbacks on the fly - pdump, latencystats.
> I think, it is application who decides to use this API,then I think,
> application can request callback feature support. I think, we have
> followed the same method in latest RX/TX offload flag scheme as well.
> 
> >   ethdev rx profiling  also relies on ability to install RX callback.
> >
> > With all these things I presume most apps will configure their devices with rx/tx callback support enabled anyway.
> > Which makes me think - is it worth to introduce such changes at all.
> > Anyway, my first suggestion would be let's try to figure out what exactly causes such big slowdown on ARM.
> 
> It may not be ARM in general, Under our roof, We have different class of SoCs based
> on the use case like 4 cores, 16 core, 96 cores part etc with different
> micro architecture etc based on the use case(embedded/server etc) like
> Intel. So it may not very generic with all ARM machines, it is really
> based on micro architecture. ie. The same patch may behave
> differently(less than < 1%) on IA-Atom vs IA-XEON. So IMO, Common code
> change should use helper functions, it up to the driver to use it.

Konstantin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer
  2017-12-20 18:23           ` Ananyev, Konstantin
@ 2018-01-04  9:22             ` Jerin Jacob
  2018-01-15 16:29               ` Ananyev, Konstantin
  0 siblings, 1 reply; 9+ messages in thread
From: Jerin Jacob @ 2018-01-04  9:22 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, shahafs

-----Original Message-----
> Date: Wed, 20 Dec 2017 18:23:46 +0000
> From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> CC: "dev@dpdk.org" <dev@dpdk.org>, "shahafs@mellanox.com"
>  <shahafs@mellanox.com>
> Subject: RE: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev
>  layer
> 

Hi Konstantin, 

> > > > > > > 2. Make it safe to remove rx/tx callback at runtime.
> > > > > > > Right now it is not possible for the application to figure out
> > > > > > > when it is safe to free removed callback handle and
> > > > > > > associated with it resources(unless the queue is stopped).
> > > > > > > That's probably not a big problem if all callbacks are static
> > > > > > > hange through whole application lifetime)
> > > > > > > and/or application doesn't allocate any resources for the callback handler.
> > > > > > > Though if callbacks have to be added/removed dynamically and
> > > > > > > callback handler would require more resources to operate properly -
> > > > > > > then it might become an issue.
> > > > > > > So patch #2 fixes that problem - now as soon as
> > > > > > > rte_eth_remove_(rx|tx)_callback() completes successfully, application
> > > > > > > can safely free all associated with the removed callback resources.
> > > > > > >
> > > > > > > Performance impact:
> > > > > > > If application doesn't use RX/TX callbacks, then the tests I run didn't
> > > > > > > reveal any performance degradation.
> > > > > > > Though if application do use RX/TX callbacks - patch #2 does introduce
> > > > > > > some slowdown.
> > > > > > >
> > > > > > > To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10Gb (X520-4)
> > > > > > > with http://dpdk.org/dev/patchwork/patch/31864/ patch installed I got:
> > > > > > > 1) testpmd ... --latencystats=1 - slowdown < 1%
> > > > > > > 2) examples//l3fwd ... --parse-ptype - - slowdown < 1%
> > > > > > > 3) examples/rxtx_callbacks - slowdown ~8%
> > > > > > > All that in terms of packet throughput (Mpps).
> > > > > >
> > > > > > We tried on an arm64 machine; We got following result on host and guest
> > > > > > using l3fwd.
> > > > >
> > > > > Thanks for testing it.
> > > > > So these numbers below - are for l3fwd running with or without rx callback installed?
> > > > > i.e. with or without '--parse-ptype' option (and on a NIC with/without ptype HW support)?
> > > >
> > > > without '--parse-ptype' option(i.e without rx callback installed).
> > > > That makes it sad, Even though application is not using the rx callback,
> > > > We need to pay on a average 3% drop in performance(based on the
> > > > use case/arch/SoC/topology etc)
> > >
> > > Yes it is sad...
> > > Do you have any idea why that happens?
> > > Is that the same situation you described before: for arm compiler
> > > often replaces small branches with conditional execution instructions
> > > and miss-prediction for such instructions is not free?
> > 
> > Yes.
> 
> Probably a dumb question - is it possible to disable such optimization
> for these particular functions (rx_burst/tx_burst)?
> Or would it make things even worse?
> 
> > 
> > Another thing I noticed on these inline functions(rx/tx burst)even if we add
> > a lot of code which may note be executed, based on the parent
> > function layout, the compete register access layout gets change(like
> > push/pop from stack etc, as more code gets added).
> > I.e This change does not have much impact with testpmd(iofwd) where the
> > parent function is not doing much thing.It is showing up only on l3fwd.
> 
> Ok sort of register spillage...
> But if compiler does it unconditionally (for the code path when it never needed)
> seems like a problem with the compile, no?

I think, it will be depended on parent function usage.

> Wonder if we'll move at least RX/TX CBs invocation into a separate function -
> would it help somehow?
> I.E. if (q->cb != NULL) execut_cbs(...);
> ?

Yes.I tried to move both (Rx, Tx) execut_cbs as separate in .c func ton.
That is helping the regression in some extended(Reduced from 2.84% to 1.27%)

example:
+uint16_t
+__rte_eth_execut_tx_cbs(uint16_t port_id, uint16_t queue_id,
+                       struct rte_mbuf **pkts, uint16_t nb_pkts,
+                       struct rte_eth_queue_local *ql);
+
 static inline uint16_t
 rte_eth_tx_burst(uint16_t port_id, uint16_t queue_id,
                 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
@@ -3256,19 +3259,11 @@ rte_eth_tx_burst(uint16_t port_id, uint16_t
queue_id,
 #ifdef RTE_ETHDEV_RXTX_CALLBACKS
        {
                struct rte_eth_queue_local *ql;
-               struct rte_eth_rxtx_callback *cb;
 
                ql = dev->tx_ql + queue_id;
-               if (unlikely(ql->cbs.head != NULL)) {
-
-                       __rte_eth_rxtx_cbs_inuse(&ql->cbs);
-
-                       for (cb = ql->cbs.head; cb != NULL; cb = cb->next)
-                               nb_pkts = cb->fn.tx(port_id, queue_id, tx_pkts,
-                                       nb_pkts, cb->param);
-
-                       __rte_eth_rxtx_cbs_unuse(&ql->cbs);
-               }
+               if (unlikely(ql->cbs.head != NULL))
+                       nb_pkts = __rte_eth_execut_tx_cbs(port_id,queue_id,
+                       tx_pkts, nb_pkts, ql);
        }
 #endif


> Another thought - as I remember a while ago Bruce and other lads suggested to move
> rx_burst/tx)burst (and whole struct rte_eth_dev) definitions into rte_ethdev.c...
> That would help to avoid such problems, again we'll be able to modify rte_eth_dev
> without breaking ABI...
> Though it would mean extra call overhead, so don't know of-hand how expensive
> it could be.

>From my experiments it shows more expensive than above scheme.

> 
> > 
> > > Or is it something else?
> > >
> > > Another thing - the series contains 2 main patches:
> > > 1) http://dpdk.org/dev/patchwork/patch/31866/
> > > Introduces eth_queue_local and moves callback data into it.
> > > That adds one extra memory reference for rx/tx function to read cb func and data.
> > > 2) http://dpdk.org/dev/patchwork/patch/31867/
> > > Introduces some synchronization for cb data.
> > >
> > > Is it possible to rerun your tests with just patch #1 installed?
> > > Hopefully it would help us to figure out which of the patches (#1 or #2)
> > > is the main source of slowdown on arm machines.
> > 
> > I re ran the test as requested:
> > With first patch - 1.17% regression
> > With first patch + Second patch - 2.84% regression
> 
> Thanks
> 
> > 
> > I am fine with slowdown, but it should be effected only with those
> > application which request the callback feature.
> 
> Ok, but you do realize that if we'll go with rx/tx function per queue,
> patch #1 slowdown (1.17%)  will probably reappear anyway? 

May be not, it will be one more redirection to get the rx/tx function
pointer(may be a additionally load instruction)

> 
> > 
> > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > Note:
> > > > > > +ve number means "Performance regression with patches"
> > > > > > -ve number means "Performance improvement with patches"
> > > > > >
> > > > > > Relative performance difference in percentage:
> > > > > >
> > > > > > % difference on host (2 x 40G)
> > > > > > pkt_sz  1c      2c      4c      8c
> > > > > > 64	1.41	0.18	0.51	0.29
> > > > > > 72	0.50	0.53	0.09	0.19
> > > > > > 128	0.31	0.31	0.42	0.00
> > > > > > 256	-0.44	-0.44	0.00	0.00
> > > > > > 512	1.94	1.94	0.01	0.01
> > > > > > 1024	0.00	0.01	0.00	0.01
> > > > > > 1518	0.02	0.01	0.02	0.02
> > > > > >
> > > > > > % difference on guest (2 x 40G)
> > > > > > pkt_sz  1c      2c      4c      8c
> > > > > > 64      5.78	2.30	-2.45	-0.01
> > > > > > 72	1.13	1.13	1.31	-4.29
> > > > > > 128	1.36	1.36	1.54	-0.82
> > > > > > 256	2.02	2.03	-0.01	-0.35
> > > > > > 512	4.05	4.04	0.00	-0.46
> > > > > > 1024	0.39	0.38	0.00	-0.74
> > > > > > 1518	0.00	0.00	-0.05	-4.20
> > > > > >
> > > > > > I think, it is because we added more code under
> > > > > > the RTE_ETHDEV_RXTX_CALLBACKS and it is enabled by
> > > > > > default.
> > > > > >
> > > > > > I think, the impact will vary based on
> > > > > > architecture and underneath i-cache, d-cache and IPC.
> > > > > >
> > > > > > I am just thinking, How we can avoid such impact?
> > > > > > How about,
> > > > > > # Remove RTE_ETHDEV_RXTX_CALLBACKS option
> > > > > > # Hook Rx/TX callbacks under TX/RX offload flags
> > > > > > # ethdev layer gives helper function to invoke
> > > > > > callbacks to driver.
> > > > >
> > > > > I suppose you suggest that user would have to decide at queue_setup()
> > > > > stage does he wants a callback(s) for RX/TX or no?
> > > >
> > > > Yes. queue_setup or configure() time.
> > > >
> > > > > As I can see, the main drawback with that approach - it would introduce new limitation.
> > > > > Right now it is possible to add/remove RX/TX callback at runtime
> > > > > (without stopping and reconfiguring the queue), and I suppose we do want to keep that ability.
> > > >
> > > > For those _application_ where it can not decide or it need to add without
> > > > stopping and re configuring the queue still can configure/enable the ethdev
> > > > device with RX/TX callback option(and pay price for it)
> > > >
> > > > > Again in some cases user can't decide does he wants a callback installed or not,
> > > > > before he actually calls dev_start().
> > > > > l3fwd ptype-parse callback is an example of such situation:
> > > > > right now user has to start device first to figure out which ptypes are supported by this PMD
> > > > > with desired configuration, then based on this information he will (or will not) setup the callback.
> > > >
> > > > Good point. Can we remove that limitation from PMD perspective? IMO, ptype-parse is more of HW feature,
> > > > All we need to do a lookup(single load) to convert to DPDK packet type.
> > > > http://dpdk.org/browse/dpdk/tree/drivers/net/thunderx/nicvf_rxtx.c#n258
> > > >
> > > > I see almost all the drivers has dummy check like this, I think, That makes it
> > > > dependency with start().
> > > >
> > > >         if (dev->rx_pkt_burst == i40e_recv_pkts ||
> > > >             dev->rx_pkt_burst == i40e_recv_pkts_bulk_alloc ||
> > > >             dev->rx_pkt_burst == i40e_recv_scattered_pkts ||
> > > >             dev->rx_pkt_burst == i40e_recv_scattered_pkts_vec ||
> > > >             dev->rx_pkt_burst == i40e_recv_pkts_vec)
> > > >                 return ptypes;
> > > >
> > >
> > > We probably can, but it would require extra work and changes inside the PMDs.
> > 
> > Yes, If we want do it in clean way. Another intermediate solution could
> > be return some valid error code on those PMD. On such error code, application can
> > start the device and get the supported ptype and then stop/reconfigure it again
> > 
> > > i40e is a good one - all rx functions support the same ptypes.
> > > But let say for ixgbe ARM version of vRX doesn't support all ptypes that x86 vRX
> > > and generic scalar RX do.
> > > For fm10k - vector RX and scalar RX also support different ptype sets.
> > 
> > >
> > > >
> > > > > What I thought in same direction - introduce a notion of 'static' and 'dynamic' callbacks.
> > > > > Static ones can't be removed while queue is not stopped while dynamic can.
> > > > > But that also requires user to select at queue_setup() stage does he wants dynamic
> > > > > callbacks for that queue or not (new offload flag or extra field in rxconf/txconf structure).
> > > > > So again - same limitation for the user.
> > > > > Another possibility here - have a flag 'dynamic' or 'static' not per queue, but per individual callback.
> > > > > Yes, that would mean API change for add/remove callback functions.
> > > >
> > > > At least on my test, the regression was caused because adding the code in
> > > > tx_burst and rx_burst under the RTE_ETHDEV_RXTX_CALLBACKS(which enabled
> > > > by default). By disabling the RTE_ETHDEV_RXTX_CALLBACKS option, I don't
> > > > see any regression. If so, Will such regression fixed by introducing
> > > > 'static' or 'dynamic' callbacks? I am happy to test and provide feedback.
> > >
> > > Probably no then, but let's try the experiment I mentioned above first.
> > >
> > > >
> > > > For me, I think, The better option is request from application for the
> > > > need for Rx/TX callbacks and set the handler appropriate in the driver to
> > > > have zero impact for the application which does not need callback at all.
> > >
> > > It is possible, but:
> > > - that would be quite significant change and would affect each RX and TX function   in each PMD.
> > 
> > May be not, assume if we  have helper function which invoke the callbacks.
> > PMD can just call that before returning it.
> 
> Even with helpers it means we'll have to double number of RX/TX functions -
> one with a CB support, another without.  

IMO, It is up to the PMD driver. It can still choose to enable callbacks
all the time irrespective of input from application(i.e no need for double
number of RX/TX functions).

But it will enable a PMD to choose do optimization if the PMD wants
it. As a NW PMD maintainer, I will be OK to make minimal change(if the
helper function is available) in my PMD, if I get 1-2%
improvement. On the upside, We can remove RTE_ETHDEV_RXTX_CALLBACKS
conditional complication flag.


> 
> > 
> > > - from other side - even now there are several libs inside DPDK that rely on ability to add/remove
> > >   callbacks on the fly - pdump, latencystats.
> > I think, it is application who decides to use this API,then I think,
> > application can request callback feature support. I think, we have
> > followed the same method in latest RX/TX offload flag scheme as well.
> > 
> > >   ethdev rx profiling  also relies on ability to install RX callback.
> > >
> > > With all these things I presume most apps will configure their devices with rx/tx callback support enabled anyway.
> > > Which makes me think - is it worth to introduce such changes at all.
> > > Anyway, my first suggestion would be let's try to figure out what exactly causes such big slowdown on ARM.
> > 
> > It may not be ARM in general, Under our roof, We have different class of SoCs based
> > on the use case like 4 cores, 16 core, 96 cores part etc with different
> > micro architecture etc based on the use case(embedded/server etc) like
> > Intel. So it may not very generic with all ARM machines, it is really
> > based on micro architecture. ie. The same patch may behave
> > differently(less than < 1%) on IA-Atom vs IA-XEON. So IMO, Common code
> > change should use helper functions, it up to the driver to use it.
> 
> Konstantin
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer
  2018-01-04  9:22             ` Jerin Jacob
@ 2018-01-15 16:29               ` Ananyev, Konstantin
  0 siblings, 0 replies; 9+ messages in thread
From: Ananyev, Konstantin @ 2018-01-15 16:29 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dev, shahafs



Hi Jerin,

> 
> Hi Konstantin,
> 
> > > > > > > > 2. Make it safe to remove rx/tx callback at runtime.
> > > > > > > > Right now it is not possible for the application to figure out
> > > > > > > > when it is safe to free removed callback handle and
> > > > > > > > associated with it resources(unless the queue is stopped).
> > > > > > > > That's probably not a big problem if all callbacks are static
> > > > > > > > hange through whole application lifetime)
> > > > > > > > and/or application doesn't allocate any resources for the callback handler.
> > > > > > > > Though if callbacks have to be added/removed dynamically and
> > > > > > > > callback handler would require more resources to operate properly -
> > > > > > > > then it might become an issue.
> > > > > > > > So patch #2 fixes that problem - now as soon as
> > > > > > > > rte_eth_remove_(rx|tx)_callback() completes successfully, application
> > > > > > > > can safely free all associated with the removed callback resources.
> > > > > > > >
> > > > > > > > Performance impact:
> > > > > > > > If application doesn't use RX/TX callbacks, then the tests I run didn't
> > > > > > > > reveal any performance degradation.
> > > > > > > > Though if application do use RX/TX callbacks - patch #2 does introduce
> > > > > > > > some slowdown.
> > > > > > > >
> > > > > > > > To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10Gb (X520-4)
> > > > > > > > with http://dpdk.org/dev/patchwork/patch/31864/ patch installed I got:
> > > > > > > > 1) testpmd ... --latencystats=1 - slowdown < 1%
> > > > > > > > 2) examples//l3fwd ... --parse-ptype - - slowdown < 1%
> > > > > > > > 3) examples/rxtx_callbacks - slowdown ~8%
> > > > > > > > All that in terms of packet throughput (Mpps).
> > > > > > >
> > > > > > > We tried on an arm64 machine; We got following result on host and guest
> > > > > > > using l3fwd.
> > > > > >
> > > > > > Thanks for testing it.
> > > > > > So these numbers below - are for l3fwd running with or without rx callback installed?
> > > > > > i.e. with or without '--parse-ptype' option (and on a NIC with/without ptype HW support)?
> > > > >
> > > > > without '--parse-ptype' option(i.e without rx callback installed).
> > > > > That makes it sad, Even though application is not using the rx callback,
> > > > > We need to pay on a average 3% drop in performance(based on the
> > > > > use case/arch/SoC/topology etc)
> > > >
> > > > Yes it is sad...
> > > > Do you have any idea why that happens?
> > > > Is that the same situation you described before: for arm compiler
> > > > often replaces small branches with conditional execution instructions
> > > > and miss-prediction for such instructions is not free?
> > >
> > > Yes.
> >
> > Probably a dumb question - is it possible to disable such optimization
> > for these particular functions (rx_burst/tx_burst)?
> > Or would it make things even worse?
> >
> > >
> > > Another thing I noticed on these inline functions(rx/tx burst)even if we add
> > > a lot of code which may note be executed, based on the parent
> > > function layout, the compete register access layout gets change(like
> > > push/pop from stack etc, as more code gets added).
> > > I.e This change does not have much impact with testpmd(iofwd) where the
> > > parent function is not doing much thing.It is showing up only on l3fwd.
> >
> > Ok sort of register spillage...
> > But if compiler does it unconditionally (for the code path when it never needed)
> > seems like a problem with the compile, no?
> 
> I think, it will be depended on parent function usage.
> 
> > Wonder if we'll move at least RX/TX CBs invocation into a separate function -
> > would it help somehow?
> > I.E. if (q->cb != NULL) execut_cbs(...);
> > ?
> 
> Yes.I tried to move both (Rx, Tx) execut_cbs as separate in .c func ton.
> That is helping the regression in some extended(Reduced from 2.84% to 1.27%)
> 
> example:
> +uint16_t
> +__rte_eth_execut_tx_cbs(uint16_t port_id, uint16_t queue_id,
> +                       struct rte_mbuf **pkts, uint16_t nb_pkts,
> +                       struct rte_eth_queue_local *ql);
> +
>  static inline uint16_t
>  rte_eth_tx_burst(uint16_t port_id, uint16_t queue_id,
>                  struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> @@ -3256,19 +3259,11 @@ rte_eth_tx_burst(uint16_t port_id, uint16_t
> queue_id,
>  #ifdef RTE_ETHDEV_RXTX_CALLBACKS
>         {
>                 struct rte_eth_queue_local *ql;
> -               struct rte_eth_rxtx_callback *cb;
> 
>                 ql = dev->tx_ql + queue_id;
> -               if (unlikely(ql->cbs.head != NULL)) {
> -
> -                       __rte_eth_rxtx_cbs_inuse(&ql->cbs);
> -
> -                       for (cb = ql->cbs.head; cb != NULL; cb = cb->next)
> -                               nb_pkts = cb->fn.tx(port_id, queue_id, tx_pkts,
> -                                       nb_pkts, cb->param);
> -
> -                       __rte_eth_rxtx_cbs_unuse(&ql->cbs);
> -               }
> +               if (unlikely(ql->cbs.head != NULL))
> +                       nb_pkts = __rte_eth_execut_tx_cbs(port_id,queue_id,
> +                       tx_pkts, nb_pkts, ql);
>         }
>  #endif
> 
> 
> > Another thought - as I remember a while ago Bruce and other lads suggested to move
> > rx_burst/tx)burst (and whole struct rte_eth_dev) definitions into rte_ethdev.c...
> > That would help to avoid such problems, again we'll be able to modify rte_eth_dev
> > without breaking ABI...
> > Though it would mean extra call overhead, so don't know of-hand how expensive
> > it could be.
> 
> From my experiments it shows more expensive than above scheme.
> 
> >
> > >
> > > > Or is it something else?
> > > >
> > > > Another thing - the series contains 2 main patches:
> > > > 1) http://dpdk.org/dev/patchwork/patch/31866/
> > > > Introduces eth_queue_local and moves callback data into it.
> > > > That adds one extra memory reference for rx/tx function to read cb func and data.
> > > > 2) http://dpdk.org/dev/patchwork/patch/31867/
> > > > Introduces some synchronization for cb data.
> > > >
> > > > Is it possible to rerun your tests with just patch #1 installed?
> > > > Hopefully it would help us to figure out which of the patches (#1 or #2)
> > > > is the main source of slowdown on arm machines.
> > >
> > > I re ran the test as requested:
> > > With first patch - 1.17% regression
> > > With first patch + Second patch - 2.84% regression
> >
> > Thanks
> >
> > >
> > > I am fine with slowdown, but it should be effected only with those
> > > application which request the callback feature.
> >
> > Ok, but you do realize that if we'll go with rx/tx function per queue,
> > patch #1 slowdown (1.17%)  will probably reappear anyway?
> 
> May be not, it will be one more redirection to get the rx/tx function
> pointer(may be a additionally load instruction)
> 
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Note:
> > > > > > > +ve number means "Performance regression with patches"
> > > > > > > -ve number means "Performance improvement with patches"
> > > > > > >
> > > > > > > Relative performance difference in percentage:
> > > > > > >
> > > > > > > % difference on host (2 x 40G)
> > > > > > > pkt_sz  1c      2c      4c      8c
> > > > > > > 64	1.41	0.18	0.51	0.29
> > > > > > > 72	0.50	0.53	0.09	0.19
> > > > > > > 128	0.31	0.31	0.42	0.00
> > > > > > > 256	-0.44	-0.44	0.00	0.00
> > > > > > > 512	1.94	1.94	0.01	0.01
> > > > > > > 1024	0.00	0.01	0.00	0.01
> > > > > > > 1518	0.02	0.01	0.02	0.02
> > > > > > >
> > > > > > > % difference on guest (2 x 40G)
> > > > > > > pkt_sz  1c      2c      4c      8c
> > > > > > > 64      5.78	2.30	-2.45	-0.01
> > > > > > > 72	1.13	1.13	1.31	-4.29
> > > > > > > 128	1.36	1.36	1.54	-0.82
> > > > > > > 256	2.02	2.03	-0.01	-0.35
> > > > > > > 512	4.05	4.04	0.00	-0.46
> > > > > > > 1024	0.39	0.38	0.00	-0.74
> > > > > > > 1518	0.00	0.00	-0.05	-4.20
> > > > > > >
> > > > > > > I think, it is because we added more code under
> > > > > > > the RTE_ETHDEV_RXTX_CALLBACKS and it is enabled by
> > > > > > > default.
> > > > > > >
> > > > > > > I think, the impact will vary based on
> > > > > > > architecture and underneath i-cache, d-cache and IPC.
> > > > > > >
> > > > > > > I am just thinking, How we can avoid such impact?
> > > > > > > How about,
> > > > > > > # Remove RTE_ETHDEV_RXTX_CALLBACKS option
> > > > > > > # Hook Rx/TX callbacks under TX/RX offload flags
> > > > > > > # ethdev layer gives helper function to invoke
> > > > > > > callbacks to driver.
> > > > > >
> > > > > > I suppose you suggest that user would have to decide at queue_setup()
> > > > > > stage does he wants a callback(s) for RX/TX or no?
> > > > >
> > > > > Yes. queue_setup or configure() time.
> > > > >
> > > > > > As I can see, the main drawback with that approach - it would introduce new limitation.
> > > > > > Right now it is possible to add/remove RX/TX callback at runtime
> > > > > > (without stopping and reconfiguring the queue), and I suppose we do want to keep that ability.
> > > > >
> > > > > For those _application_ where it can not decide or it need to add without
> > > > > stopping and re configuring the queue still can configure/enable the ethdev
> > > > > device with RX/TX callback option(and pay price for it)
> > > > >
> > > > > > Again in some cases user can't decide does he wants a callback installed or not,
> > > > > > before he actually calls dev_start().
> > > > > > l3fwd ptype-parse callback is an example of such situation:
> > > > > > right now user has to start device first to figure out which ptypes are supported by this PMD
> > > > > > with desired configuration, then based on this information he will (or will not) setup the callback.
> > > > >
> > > > > Good point. Can we remove that limitation from PMD perspective? IMO, ptype-parse is more of HW feature,
> > > > > All we need to do a lookup(single load) to convert to DPDK packet type.
> > > > > http://dpdk.org/browse/dpdk/tree/drivers/net/thunderx/nicvf_rxtx.c#n258
> > > > >
> > > > > I see almost all the drivers has dummy check like this, I think, That makes it
> > > > > dependency with start().
> > > > >
> > > > >         if (dev->rx_pkt_burst == i40e_recv_pkts ||
> > > > >             dev->rx_pkt_burst == i40e_recv_pkts_bulk_alloc ||
> > > > >             dev->rx_pkt_burst == i40e_recv_scattered_pkts ||
> > > > >             dev->rx_pkt_burst == i40e_recv_scattered_pkts_vec ||
> > > > >             dev->rx_pkt_burst == i40e_recv_pkts_vec)
> > > > >                 return ptypes;
> > > > >
> > > >
> > > > We probably can, but it would require extra work and changes inside the PMDs.
> > >
> > > Yes, If we want do it in clean way. Another intermediate solution could
> > > be return some valid error code on those PMD. On such error code, application can
> > > start the device and get the supported ptype and then stop/reconfigure it again
> > >
> > > > i40e is a good one - all rx functions support the same ptypes.
> > > > But let say for ixgbe ARM version of vRX doesn't support all ptypes that x86 vRX
> > > > and generic scalar RX do.
> > > > For fm10k - vector RX and scalar RX also support different ptype sets.
> > >
> > > >
> > > > >
> > > > > > What I thought in same direction - introduce a notion of 'static' and 'dynamic' callbacks.
> > > > > > Static ones can't be removed while queue is not stopped while dynamic can.
> > > > > > But that also requires user to select at queue_setup() stage does he wants dynamic
> > > > > > callbacks for that queue or not (new offload flag or extra field in rxconf/txconf structure).
> > > > > > So again - same limitation for the user.
> > > > > > Another possibility here - have a flag 'dynamic' or 'static' not per queue, but per individual callback.
> > > > > > Yes, that would mean API change for add/remove callback functions.
> > > > >
> > > > > At least on my test, the regression was caused because adding the code in
> > > > > tx_burst and rx_burst under the RTE_ETHDEV_RXTX_CALLBACKS(which enabled
> > > > > by default). By disabling the RTE_ETHDEV_RXTX_CALLBACKS option, I don't
> > > > > see any regression. If so, Will such regression fixed by introducing
> > > > > 'static' or 'dynamic' callbacks? I am happy to test and provide feedback.
> > > >
> > > > Probably no then, but let's try the experiment I mentioned above first.
> > > >
> > > > >
> > > > > For me, I think, The better option is request from application for the
> > > > > need for Rx/TX callbacks and set the handler appropriate in the driver to
> > > > > have zero impact for the application which does not need callback at all.
> > > >
> > > > It is possible, but:
> > > > - that would be quite significant change and would affect each RX and TX function   in each PMD.
> > >
> > > May be not, assume if we  have helper function which invoke the callbacks.
> > > PMD can just call that before returning it.
> >
> > Even with helpers it means we'll have to double number of RX/TX functions -
> > one with a CB support, another without.
> 
> IMO, It is up to the PMD driver. It can still choose to enable callbacks
> all the time irrespective of input from application(i.e no need for double
> number of RX/TX functions).

Good point.

> 
> But it will enable a PMD to choose do optimization if the PMD wants
> it. As a NW PMD maintainer, I will be OK to make minimal change(if the
> helper function is available) in my PMD, if I get 1-2%
> improvement. On the upside, We can remove RTE_ETHDEV_RXTX_CALLBACKS
> conditional complication flag.

Yes, removing RTE_ETHDEV_RXTX_CALLBACKS would definitely be a good thing.
I tried the approach you suggested for ixgbe PMD - on IA there seems no regression.
The only thing I don't really happy with it - it would require changes  for every RX/TX function in each PMD.
Yes, changes are quite mechanical, but still every PMD will need to be retested. 
Before going forward would be really interesting to hear other people opinions here.
Might be there are some other ways to overcome that problem.
Thanks
Konstantin

> 
> 
> >
> > >
> > > > - from other side - even now there are several libs inside DPDK that rely on ability to add/remove
> > > >   callbacks on the fly - pdump, latencystats.
> > > I think, it is application who decides to use this API,then I think,
> > > application can request callback feature support. I think, we have
> > > followed the same method in latest RX/TX offload flag scheme as well.
> > >
> > > >   ethdev rx profiling  also relies on ability to install RX callback.
> > > >
> > > > With all these things I presume most apps will configure their devices with rx/tx callback support enabled anyway.
> > > > Which makes me think - is it worth to introduce such changes at all.
> > > > Anyway, my first suggestion would be let's try to figure out what exactly causes such big slowdown on ARM.
> > >
> > > It may not be ARM in general, Under our roof, We have different class of SoCs based
> > > on the use case like 4 cores, 16 core, 96 cores part etc with different
> > > micro architecture etc based on the use case(embedded/server etc) like
> > > Intel. So it may not very generic with all ARM machines, it is really
> > > based on micro architecture. ie. The same patch may behave
> > > differently(less than < 1%) on IA-Atom vs IA-XEON. So IMO, Common code
> > > change should use helper functions, it up to the driver to use it.
> >
> > Konstantin
> >

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2018-01-15 16:29 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-01 14:55 [dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer Konstantin Ananyev
2017-12-14  4:54 ` Jerin Jacob
2017-12-14 13:57   ` Ananyev, Konstantin
2017-12-18 16:38     ` Jerin Jacob
2017-12-19 15:27       ` Ananyev, Konstantin
2017-12-20  7:28         ` Jerin Jacob
2017-12-20 18:23           ` Ananyev, Konstantin
2018-01-04  9:22             ` Jerin Jacob
2018-01-15 16:29               ` Ananyev, Konstantin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).