DPDK patches and discussions
* [dpdk-dev] [RFC] ethdev: introduce shared Rx queue
@ 2021-07-27  3:42 Xueming Li
  2021-07-28  7:56 ` Andrew Rybchenko
                   ` (16 more replies)
  0 siblings, 17 replies; 266+ messages in thread
From: Xueming Li @ 2021-07-27  3:42 UTC (permalink / raw)
  Cc: dev, Viacheslav Ovsiienko, xuemingl, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko

In the eth PMD driver model, each Rx queue is pre-loaded with mbufs to
hold incoming packets. When the number of SFs or VFs scales out in a
switch domain, the memory consumption becomes significant. More
importantly, polling all ports leads to high cache miss rates, high
latency and low throughput.

To save memory and speed up polling, this patch introduces a shared Rx
queue. Ports with the same configuration in a switch domain can share
an Rx queue set by specifying the offload flag
RTE_ETH_RX_OFFLOAD_SHARED_RXQ. Polling a member port of a shared Rx
queue receives packets for all member ports. The source port is
identified by mbuf->port.

The queue number of all ports in a shared group should be identical,
and queue indexes are 1:1 mapped within the group.

A shared Rx queue is supposed to be polled on the same thread.

Multiple groups are supported via group ID.
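
For illustration, a minimal setup sketch, assuming the API as posted
(the port list, descriptor count and mbuf_pool are application
placeholders, error handling is mostly omitted):

#include <rte_ethdev.h>
#include <rte_lcore.h>

/* Sketch: enable the proposed shared Rx queue on every member port of
 * group 0. Queue index is 1:1 mapped across member ports.
 */
static void
setup_shared_rxq(uint16_t *ports, int nb_ports, uint16_t nb_rxq,
		 struct rte_mempool *mbuf_pool)
{
	struct rte_eth_conf conf = { 0 };
	struct rte_eth_rxconf rxconf = { 0 };
	struct rte_eth_dev_info info;
	uint16_t qi;
	int i;

	for (i = 0; i < nb_ports; i++) {
		/* The flag must be present in the Rx offload capability. */
		rte_eth_dev_info_get(ports[i], &info);
		if (!(info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
			return;
		conf.rxmode.offloads = RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
		rte_eth_dev_configure(ports[i], nb_rxq, 1, &conf);
		rxconf.offloads = RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
		rxconf.shared_group = 0; /* all members join group 0 */
		for (qi = 0; qi < nb_rxq; qi++)
			rte_eth_rx_queue_setup(ports[i], qi, 512,
					       rte_socket_id(), &rxconf,
					       mbuf_pool);
		rte_eth_dev_start(ports[i]);
	}
}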

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 lib/ethdev/rte_ethdev.c | 1 +
 lib/ethdev/rte_ethdev.h | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index a1106f5896..632a0e890b 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -127,6 +127,7 @@ static const struct {
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
 	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index d2b27c351f..5c63751be0 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	uint32_t shared_group; /**< Shared port group index in switch domain. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1373,6 +1374,12 @@ struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
 #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
+/**
+ * RXQ is shared within ports in switch domain to save memory and avoid
+ * polling every port. Any port in group could be used to receive packets.
+ * Real source port number saved in mbuf->port field.
+ */
+#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \
-- 
2.25.1



* Re: [dpdk-dev] [RFC] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
@ 2021-07-28  7:56 ` Andrew Rybchenko
  2021-07-28  8:20   ` Xueming(Steven) Li
  2021-08-09 11:47 ` [dpdk-dev] [PATCH v1] " Xueming Li
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 266+ messages in thread
From: Andrew Rybchenko @ 2021-07-28  7:56 UTC (permalink / raw)
  To: Xueming Li; +Cc: dev, Viacheslav Ovsiienko, Thomas Monjalon, Ferruh Yigit

On 7/27/21 6:42 AM, Xueming Li wrote:
> In eth PMD driver model, each RX queue was pre-loaded with mbufs for
> saving incoming packets. When number of SF or VF scale out in a switch
> domain, the memory consumption became significant. Most important,
> polling all ports leads to high cache miss, high latency and low
> throughput.
> 
> To save memory and speed up, this patch introduces shared RX queue.
> Ports with same configuration in a switch domain could share RX queue
> set by specifying offloading flag RTE_ETH_RX_OFFLOAD_SHARED_RXQ. Polling
> a member port in shared RX queue receives packets for all member ports.
> Source port is identified by mbuf->port.
> 
> Queue number of ports in shared group should be identical. Queue index
> is 1:1 mapped in shared group.
> 
> Shared RX queue is supposed to be polled on same thread.
> 
> Multiple groups is supported by group ID.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>

It looks like it could be useful for artificial benchmarks, but
absolutely useless for real life. SFs and VFs are used by VMs
(or containers?) to have their own part of the HW. If so, SF or VF
Rx and Tx queues live in a VM and cannot be shared.

Sharing makes sense for representors, but that is not mentioned in
the description.


* Re: [dpdk-dev] [RFC] ethdev: introduce shared Rx queue
  2021-07-28  7:56 ` Andrew Rybchenko
@ 2021-07-28  8:20   ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-07-28  8:20 UTC (permalink / raw)
  To: Andrew Rybchenko
  Cc: dev, Slava Ovsiienko, NBU-Contact-Thomas Monjalon, Ferruh Yigit

Hi Andrew,

> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Sent: Wednesday, July 28, 2021 3:57 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dev@dpdk.org; Slava Ovsiienko <viacheslavo@nvidia.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Ferruh
> Yigit <ferruh.yigit@intel.com>
> Subject: Re: [RFC] ethdev: introduce shared Rx queue
> 
> On 7/27/21 6:42 AM, Xueming Li wrote:
> > In eth PMD driver model, each RX queue was pre-loaded with mbufs for
> > saving incoming packets. When number of SF or VF scale out in a switch
> > domain, the memory consumption became significant. Most important,
> > polling all ports leads to high cache miss, high latency and low
> > throughput.
> >
> > To save memory and speed up, this patch introduces shared RX queue.
> > Ports with same configuration in a switch domain could share RX queue
> > set by specifying offloading flag RTE_ETH_RX_OFFLOAD_SHARED_RXQ.
> > Polling a member port in shared RX queue receives packets for all member ports.
> > Source port is identified by mbuf->port.
> >
> > Queue number of ports in shared group should be identical. Queue index
> > is 1:1 mapped in shared group.
> >
> > Shared RX queue is supposed to be polled on same thread.
> >
> > Multiple groups is supported by group ID.
> >
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> 
> It looks like it could be useful to artificial benchmarks, but absolutely useless for real life. SFs and VFs are used by VMs (or containers?)
> to have its own part of HW. If so, SF or VF Rx and Tx queues live in a VM and cannot be shared.

Thanks for looking at this! Agreed, SF and VF queues can't be shared.

> 
> Sharing makes sense for representors, but it is not mentioned in the description.

Yes, the major target is representors, i.e. ports in the same switch domain. I'll emphasize this in the next version.


* [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
  2021-07-28  7:56 ` Andrew Rybchenko
@ 2021-08-09 11:47 ` Xueming Li
  2021-08-09 13:50   ` Jerin Jacob
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-08-09 11:47 UTC (permalink / raw)
  Cc: dev, xuemingl, Ferruh Yigit, Thomas Monjalon, Andrew Rybchenko

In the current DPDK framework, each Rx queue is pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. More
importantly, polling all ports leads to high cache miss rates, high
latency and low throughput.

This patch introduces a shared Rx queue. Ports with the same
configuration in a switch domain can share an Rx queue set by
specifying the sharing group. Polling any queue using the same shared
Rx queue receives packets from all member ports. The source port is
identified by mbuf->port.

The queue number of all ports in a shared group should be identical,
and queue indexes are 1:1 mapped within the group.

A shared Rx queue is supposed to be polled on the same thread.

Multiple groups are supported via group ID.
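
To make the polling model concrete, a datapath sketch; handle_pkt() is
a hypothetical application callback, the point is only the demux on
mbuf->port:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

extern void handle_pkt(uint16_t src_port, struct rte_mbuf *m);

/* Sketch: poll one member port of the shared group. Packets of all
 * member ports arrive here; the real source port is in mbuf->port.
 */
static void
poll_shared_rxq(uint16_t member_port, uint16_t queue_id)
{
	struct rte_mbuf *pkts[32];
	uint16_t nb, i;

	nb = rte_eth_rx_burst(member_port, queue_id, pkts, 32);
	for (i = 0; i < nb; i++)
		handle_pkt(pkts[i]->port, pkts[i]);
}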

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 doc/guides/nics/features.rst                    | 11 +++++++++++
 doc/guides/nics/features/default.ini            |  1 +
 doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
 lib/ethdev/rte_ethdev.c                         |  1 +
 lib/ethdev/rte_ethdev.h                         |  7 +++++++
 5 files changed, 30 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index a96e12d155..2e2a9b1554 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same switch domain.
+
+* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 754184ddd4..ebeb4c1851 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c80..45bf5a3a10 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- Memory usage of representors is huge when number of representor grows,
+  because PMD always allocate mbuf for each descriptor of Rx queue.
+  Polling the large number of ports brings more CPU load, cache miss and
+  latency. Shared Rx queue can be used to share Rx queue between PF and
+  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
+  is present in Rx offloading capability of device info. Setting the
+  offloading flag in device Rx mode or Rx queue configuration to enable
+  shared Rx queue. Polling any member port of shared Rx queue can return
+  packets of all ports in group, port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 9d95cd11e1..1361ff759a 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -127,6 +127,7 @@ static const struct {
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
 	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index d2b27c351f..a578c9db9d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	uint32_t shared_group; /**< Shared port group index in switch domain. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1373,6 +1374,12 @@ struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
 #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
+/**
+ * Rx queue is shared among ports in same switch domain to save memory,
+ * avoid polling each port. Any port in group can be used to receive packets.
+ * Real source port number saved in mbuf->port field.
+ */
+#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \
-- 
2.25.1



* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-09 11:47 ` [dpdk-dev] [PATCH v1] " Xueming Li
@ 2021-08-09 13:50   ` Jerin Jacob
  2021-08-09 14:16     ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-09 13:50 UTC (permalink / raw)
  To: Xueming Li; +Cc: dpdk-dev, Ferruh Yigit, Thomas Monjalon, Andrew Rybchenko

On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
>
> In current DPDK framework, each RX queue is pre-loaded with mbufs for
> incoming packets. When number of representors scale out in a switch
> domain, the memory consumption became significant. Most important,
> polling all ports leads to high cache miss, high latency and low
> throughput.
>
> This patch introduces shared RX queue. Ports with same configuration in
> a switch domain could share RX queue set by specifying sharing group.
> Polling any queue using same shared RX queue receives packets from all
> member ports. Source port is identified by mbuf->port.
>
> Port queue number in a shared group should be identical. Queue index is
> 1:1 mapped in shared group.
>
> Share RX queue is supposed to be polled on same thread.
>
> Multiple groups is supported by group ID.

Is this offload specific to representors? If so, can the name be
changed to refer to representors specifically?
If it is for the generic case, how will the flow ordering be maintained?

>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> ---
>  doc/guides/nics/features.rst                    | 11 +++++++++++
>  doc/guides/nics/features/default.ini            |  1 +
>  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
>  lib/ethdev/rte_ethdev.c                         |  1 +
>  lib/ethdev/rte_ethdev.h                         |  7 +++++++
>  5 files changed, 30 insertions(+)
>
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index a96e12d155..2e2a9b1554 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
>
>
> +.. _nic_features_shared_rx_queue:
> +
> +Shared Rx queue
> +---------------
> +
> +Supports shared Rx queue for ports in same switch domain.
> +
> +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> +* **[provides] mbuf**: ``mbuf.port``.
> +
> +
>  .. _nic_features_packet_type_parsing:
>
>  Packet type parsing
> diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> index 754184ddd4..ebeb4c1851 100644
> --- a/doc/guides/nics/features/default.ini
> +++ b/doc/guides/nics/features/default.ini
> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
>  Queue start/stop     =
>  Runtime Rx queue setup =
>  Runtime Tx queue setup =
> +Shared Rx queue      =
>  Burst mode info      =
>  Power mgmt address monitor =
>  MTU update           =
> diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> index ff6aa91c80..45bf5a3a10 100644
> --- a/doc/guides/prog_guide/switch_representation.rst
> +++ b/doc/guides/prog_guide/switch_representation.rst
> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
>  .. [1] `Ethernet switch device driver model (switchdev)
>         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
>
> +- Memory usage of representors is huge when number of representor grows,
> +  because PMD always allocate mbuf for each descriptor of Rx queue.
> +  Polling the large number of ports brings more CPU load, cache miss and
> +  latency. Shared Rx queue can be used to share Rx queue between PF and
> +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> +  is present in Rx offloading capability of device info. Setting the
> +  offloading flag in device Rx mode or Rx queue configuration to enable
> +  shared Rx queue. Polling any member port of shared Rx queue can return
> +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> +
>  Basic SR-IOV
>  ------------
>
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 9d95cd11e1..1361ff759a 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -127,6 +127,7 @@ static const struct {
>         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
>         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
>         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
>  };
>
>  #undef RTE_RX_OFFLOAD_BIT2STR
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index d2b27c351f..a578c9db9d 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +       uint32_t shared_group; /**< Shared port group index in switch domain. */
>         /**
>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> +/**
> + * Rx queue is shared among ports in same switch domain to save memory,
> + * avoid polling each port. Any port in group can be used to receive packets.
> + * Real source port number saved in mbuf->port field.
> + */
> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
>
>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> --
> 2.25.1
>


* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-09 13:50   ` Jerin Jacob
@ 2021-08-09 14:16     ` Xueming(Steven) Li
  2021-08-11  8:02       ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-09 14:16 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko

Hi,

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Monday, August 9, 2021 9:51 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> 
> On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> >
> > In current DPDK framework, each RX queue is pre-loaded with mbufs for
> > incoming packets. When number of representors scale out in a switch
> > domain, the memory consumption became significant. Most important,
> > polling all ports leads to high cache miss, high latency and low
> > throughput.
> >
> > This patch introduces shared RX queue. Ports with same configuration
> > in a switch domain could share RX queue set by specifying sharing group.
> > Polling any queue using same shared RX queue receives packets from all
> > member ports. Source port is identified by mbuf->port.
> >
> > Port queue number in a shared group should be identical. Queue index
> > is
> > 1:1 mapped in shared group.
> >
> > Share RX queue is supposed to be polled on same thread.
> >
> > Multiple groups is supported by group ID.
> 
> Is this offload specific to the representor? If so can this name be changed specifically to representor?

Yes, both the PF and representors in a switch domain could take advantage of it.

> If it is for a generic case, how the flow ordering will be maintained?

Not quite sure that I understood your question. The control path is almost the same as before:
PF and representor ports are still needed, and rte_flow is not impacted.
Queues are still needed for each member port; descriptors (mbufs) will be supplied from the shared Rx queue
in my PMD implementation.

> 
> >
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > ---
> >  doc/guides/nics/features.rst                    | 11 +++++++++++
> >  doc/guides/nics/features/default.ini            |  1 +
> >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> >  lib/ethdev/rte_ethdev.c                         |  1 +
> >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> >  5 files changed, 30 insertions(+)
> >
> > diff --git a/doc/guides/nics/features.rst
> > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > --- a/doc/guides/nics/features.rst
> > +++ b/doc/guides/nics/features.rst
> > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> >
> >
> > +.. _nic_features_shared_rx_queue:
> > +
> > +Shared Rx queue
> > +---------------
> > +
> > +Supports shared Rx queue for ports in same switch domain.
> > +
> > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > +* **[provides] mbuf**: ``mbuf.port``.
> > +
> > +
> >  .. _nic_features_packet_type_parsing:
> >
> >  Packet type parsing
> > diff --git a/doc/guides/nics/features/default.ini
> > b/doc/guides/nics/features/default.ini
> > index 754184ddd4..ebeb4c1851 100644
> > --- a/doc/guides/nics/features/default.ini
> > +++ b/doc/guides/nics/features/default.ini
> > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> >  Queue start/stop     =
> >  Runtime Rx queue setup =
> >  Runtime Tx queue setup =
> > +Shared Rx queue      =
> >  Burst mode info      =
> >  Power mgmt address monitor =
> >  MTU update           =
> > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > b/doc/guides/prog_guide/switch_representation.rst
> > index ff6aa91c80..45bf5a3a10 100644
> > --- a/doc/guides/prog_guide/switch_representation.rst
> > +++ b/doc/guides/prog_guide/switch_representation.rst
> > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> >  .. [1] `Ethernet switch device driver model (switchdev)
> >
> > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> >
> > +- Memory usage of representors is huge when number of representor
> > +grows,
> > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > +  Polling the large number of ports brings more CPU load, cache miss
> > +and
> > +  latency. Shared Rx queue can be used to share Rx queue between PF
> > +and
> > +  representors in same switch domain.
> > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > +  is present in Rx offloading capability of device info. Setting the
> > +  offloading flag in device Rx mode or Rx queue configuration to
> > +enable
> > +  shared Rx queue. Polling any member port of shared Rx queue can
> > +return
> > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > +
> >  Basic SR-IOV
> >  ------------
> >
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > 9d95cd11e1..1361ff759a 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -127,6 +127,7 @@ static const struct {
> >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> >  };
> >
> >  #undef RTE_RX_OFFLOAD_BIT2STR
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > d2b27c351f..a578c9db9d 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > */
> > +       uint32_t shared_group; /**< Shared port group index in switch
> > + domain. */
> >         /**
> >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> >          * Only offloads set on rx_queue_offload_capa or
> > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {  #define
> > DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > +/**
> > + * Rx queue is shared among ports in same switch domain to save
> > +memory,
> > + * avoid polling each port. Any port in group can be used to receive packets.
> > + * Real source port number saved in mbuf->port field.
> > + */
> > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> >
> >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > --
> > 2.25.1
> >


* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-09 14:16     ` Xueming(Steven) Li
@ 2021-08-11  8:02       ` Jerin Jacob
  2021-08-11  8:28         ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-11  8:02 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko

On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Monday, August 9, 2021 9:51 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >
> > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > >
> > > In current DPDK framework, each RX queue is pre-loaded with mbufs for
> > > incoming packets. When number of representors scale out in a switch
> > > domain, the memory consumption became significant. Most important,
> > > polling all ports leads to high cache miss, high latency and low
> > > throughput.
> > >
> > > This patch introduces shared RX queue. Ports with same configuration
> > > in a switch domain could share RX queue set by specifying sharing group.
> > > Polling any queue using same shared RX queue receives packets from all
> > > member ports. Source port is identified by mbuf->port.
> > >
> > > Port queue number in a shared group should be identical. Queue index
> > > is
> > > 1:1 mapped in shared group.
> > >
> > > Share RX queue is supposed to be polled on same thread.
> > >
> > > Multiple groups is supported by group ID.
> >
> > Is this offload specific to the representor? If so can this name be changed specifically to representor?
>
> Yes, PF and representor in switch domain could take advantage.
>
> > If it is for a generic case, how the flow ordering will be maintained?
>
> Not quite sure that I understood your question. The control path of is almost same as before,
> PF and representor port still needed, rte flows not impacted.
> Queues still needed for each member port, descriptors(mbuf) will be supplied from shared Rx queue
> in my PMD implementation.

My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ
offload and multiple ethdev receive queues land in the same receive
queue, how is the flow order maintained for the respective receive
queues?
If this offload is only useful for the representor case, can we make
it specific to representors by changing its name and scope?



>
> >
> > >
> > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > ---
> > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > >  doc/guides/nics/features/default.ini            |  1 +
> > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > >  5 files changed, 30 insertions(+)
> > >
> > > diff --git a/doc/guides/nics/features.rst
> > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > --- a/doc/guides/nics/features.rst
> > > +++ b/doc/guides/nics/features.rst
> > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > >
> > >
> > > +.. _nic_features_shared_rx_queue:
> > > +
> > > +Shared Rx queue
> > > +---------------
> > > +
> > > +Supports shared Rx queue for ports in same switch domain.
> > > +
> > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > +* **[provides] mbuf**: ``mbuf.port``.
> > > +
> > > +
> > >  .. _nic_features_packet_type_parsing:
> > >
> > >  Packet type parsing
> > > diff --git a/doc/guides/nics/features/default.ini
> > > b/doc/guides/nics/features/default.ini
> > > index 754184ddd4..ebeb4c1851 100644
> > > --- a/doc/guides/nics/features/default.ini
> > > +++ b/doc/guides/nics/features/default.ini
> > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > >  Queue start/stop     =
> > >  Runtime Rx queue setup =
> > >  Runtime Tx queue setup =
> > > +Shared Rx queue      =
> > >  Burst mode info      =
> > >  Power mgmt address monitor =
> > >  MTU update           =
> > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > b/doc/guides/prog_guide/switch_representation.rst
> > > index ff6aa91c80..45bf5a3a10 100644
> > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > >  .. [1] `Ethernet switch device driver model (switchdev)
> > >
> > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > >
> > > +- Memory usage of representors is huge when number of representor
> > > +grows,
> > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > +  Polling the large number of ports brings more CPU load, cache miss
> > > +and
> > > +  latency. Shared Rx queue can be used to share Rx queue between PF
> > > +and
> > > +  representors in same switch domain.
> > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > +  is present in Rx offloading capability of device info. Setting the
> > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > +enable
> > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > +return
> > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > +
> > >  Basic SR-IOV
> > >  ------------
> > >
> > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > > 9d95cd11e1..1361ff759a 100644
> > > --- a/lib/ethdev/rte_ethdev.c
> > > +++ b/lib/ethdev/rte_ethdev.c
> > > @@ -127,6 +127,7 @@ static const struct {
> > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > >  };
> > >
> > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > d2b27c351f..a578c9db9d 100644
> > > --- a/lib/ethdev/rte_ethdev.h
> > > +++ b/lib/ethdev/rte_ethdev.h
> > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > */
> > > +       uint32_t shared_group; /**< Shared port group index in switch
> > > + domain. */
> > >         /**
> > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > >          * Only offloads set on rx_queue_offload_capa or
> > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {  #define
> > > DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > +/**
> > > + * Rx queue is shared among ports in same switch domain to save
> > > +memory,
> > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > + * Real source port number saved in mbuf->port field.
> > > + */
> > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > >
> > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > --
> > > 2.25.1
> > >


* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-11  8:02       ` Jerin Jacob
@ 2021-08-11  8:28         ` Xueming(Steven) Li
  2021-08-11 12:04           ` Ferruh Yigit
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-11  8:28 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Wednesday, August 11, 2021 4:03 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> 
> On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Monday, August 9, 2021 9:51 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > >
> > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > for incoming packets. When number of representors scale out in a
> > > > switch domain, the memory consumption became significant. Most
> > > > important, polling all ports leads to high cache miss, high
> > > > latency and low throughput.
> > > >
> > > > This patch introduces shared RX queue. Ports with same
> > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > Polling any queue using same shared RX queue receives packets from
> > > > all member ports. Source port is identified by mbuf->port.
> > > >
> > > > Port queue number in a shared group should be identical. Queue
> > > > index is
> > > > 1:1 mapped in shared group.
> > > >
> > > > Share RX queue is supposed to be polled on same thread.
> > > >
> > > > Multiple groups is supported by group ID.
> > >
> > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> >
> > Yes, PF and representor in switch domain could take advantage.
> >
> > > If it is for a generic case, how the flow ordering will be maintained?
> >
> > Not quite sure that I understood your question. The control path of is
> > almost same as before, PF and representor port still needed, rte flows not impacted.
> > Queues still needed for each member port, descriptors(mbuf) will be
> > supplied from shared Rx queue in my PMD implementation.
> 
> My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> receive queue, In that case, how the flow order is maintained for respective receive queues.

I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in the case of a shared rxq:
basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward it to the target fs.
Packets from the same source port could be grouped into a small sub-burst to process; this will accelerate performance if traffic comes
from a limited number of ports. I'll introduce a common API to do shared rxq forwarding, called with a packet-handling callback, so it
suits all forwarding engines. Will send patches soon.
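
Something like the sketch below illustrates the grouping idea
(forward_sub_burst() stands in for the per-stream forwarding callback;
this is not the actual testpmd code):

#include <rte_mbuf.h>

extern void forward_sub_burst(uint16_t src_port,
			      struct rte_mbuf **pkts, uint16_t n);

/* Sketch: split one rx_burst result into per-source-port sub-bursts so
 * each stream handler still receives a (small) burst.
 */
static void
demux_burst(struct rte_mbuf **pkts, uint16_t nb)
{
	uint16_t start = 0, i;

	for (i = 1; i <= nb; i++) {
		/* Flush a sub-burst when the source port changes. */
		if (i == nb || pkts[i]->port != pkts[start]->port) {
			forward_sub_burst(pkts[start]->port,
					  &pkts[start], i - start);
			start = i;
		}
	}
}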

> If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> scope.

It works for both the PF and representors in the same switch domain; for applications like OVS, only a few changes are needed.

> 
> 
> >
> > >
> > > >
> > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > ---
> > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > >  doc/guides/nics/features/default.ini            |  1 +
> > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > >  5 files changed, 30 insertions(+)
> > > >
> > > > diff --git a/doc/guides/nics/features.rst
> > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > --- a/doc/guides/nics/features.rst
> > > > +++ b/doc/guides/nics/features.rst
> > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > >
> > > >
> > > > +.. _nic_features_shared_rx_queue:
> > > > +
> > > > +Shared Rx queue
> > > > +---------------
> > > > +
> > > > +Supports shared Rx queue for ports in same switch domain.
> > > > +
> > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > +
> > > > +
> > > >  .. _nic_features_packet_type_parsing:
> > > >
> > > >  Packet type parsing
> > > > diff --git a/doc/guides/nics/features/default.ini
> > > > b/doc/guides/nics/features/default.ini
> > > > index 754184ddd4..ebeb4c1851 100644
> > > > --- a/doc/guides/nics/features/default.ini
> > > > +++ b/doc/guides/nics/features/default.ini
> > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > >  Queue start/stop     =
> > > >  Runtime Rx queue setup =
> > > >  Runtime Tx queue setup =
> > > > +Shared Rx queue      =
> > > >  Burst mode info      =
> > > >  Power mgmt address monitor =
> > > >  MTU update           =
> > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > index ff6aa91c80..45bf5a3a10 100644
> > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > >
> > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > >`_
> > > >
> > > > +- Memory usage of representors is huge when number of representor
> > > > +grows,
> > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > +  Polling the large number of ports brings more CPU load, cache
> > > > +miss and
> > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > +PF and
> > > > +  representors in same switch domain.
> > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > +  is present in Rx offloading capability of device info. Setting
> > > > +the
> > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > +enable
> > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > +return
> > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > +
> > > >  Basic SR-IOV
> > > >  ------------
> > > >
> > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > index 9d95cd11e1..1361ff759a 100644
> > > > --- a/lib/ethdev/rte_ethdev.c
> > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > @@ -127,6 +127,7 @@ static const struct {
> > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > >  };
> > > >
> > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > index d2b27c351f..a578c9db9d 100644
> > > > --- a/lib/ethdev/rte_ethdev.h
> > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > */
> > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > + switch domain. */
> > > >         /**
> > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > >          * Only offloads set on rx_queue_offload_capa or
> > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > +/**
> > > > + * Rx queue is shared among ports in same switch domain to save
> > > > +memory,
> > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > + * Real source port number saved in mbuf->port field.
> > > > + */
> > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > >
> > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > --
> > > > 2.25.1
> > > >


* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-11  8:28         ` Xueming(Steven) Li
@ 2021-08-11 12:04           ` Ferruh Yigit
  2021-08-11 12:59             ` Xueming(Steven) Li
  2021-09-26  5:35             ` Xueming(Steven) Li
  0 siblings, 2 replies; 266+ messages in thread
From: Ferruh Yigit @ 2021-08-11 12:04 UTC (permalink / raw)
  To: Xueming(Steven) Li, Jerin Jacob
  Cc: dpdk-dev, NBU-Contact-Thomas Monjalon, Andrew Rybchenko

On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> 
> 
>> -----Original Message-----
>> From: Jerin Jacob <jerinjacobk@gmail.com>
>> Sent: Wednesday, August 11, 2021 4:03 PM
>> To: Xueming(Steven) Li <xuemingl@nvidia.com>
>> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
>> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
>> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
>>
>> On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>>>
>>> Hi,
>>>
>>>> -----Original Message-----
>>>> From: Jerin Jacob <jerinjacobk@gmail.com>
>>>> Sent: Monday, August 9, 2021 9:51 PM
>>>> To: Xueming(Steven) Li <xuemingl@nvidia.com>
>>>> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
>>>> NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
>>>> <andrew.rybchenko@oktetlabs.ru>
>>>> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
>>>>
>>>> On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
>>>>>
>>>>> In current DPDK framework, each RX queue is pre-loaded with mbufs
>>>>> for incoming packets. When number of representors scale out in a
>>>>> switch domain, the memory consumption became significant. Most
>>>>> important, polling all ports leads to high cache miss, high
>>>>> latency and low throughput.
>>>>>
>>>>> This patch introduces shared RX queue. Ports with same
>>>>> configuration in a switch domain could share RX queue set by specifying sharing group.
>>>>> Polling any queue using same shared RX queue receives packets from
>>>>> all member ports. Source port is identified by mbuf->port.
>>>>>
>>>>> Port queue number in a shared group should be identical. Queue
>>>>> index is
>>>>> 1:1 mapped in shared group.
>>>>>
>>>>> Share RX queue is supposed to be polled on same thread.
>>>>>
>>>>> Multiple groups is supported by group ID.
>>>>
>>>> Is this offload specific to the representor? If so can this name be changed specifically to representor?
>>>
>>> Yes, PF and representor in switch domain could take advantage.
>>>
>>>> If it is for a generic case, how the flow ordering will be maintained?
>>>
>>> Not quite sure that I understood your question. The control path of is
>>> almost same as before, PF and representor port still needed, rte flows not impacted.
>>> Queues still needed for each member port, descriptors(mbuf) will be
>>> supplied from shared Rx queue in my PMD implementation.
>>
>> My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
>> receive queue, In that case, how the flow order is maintained for respective receive queues.
> 
> I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> all forwarding engine. Will sent patches soon.
> 

All ports will put the packets into the same queue (shared queue), right? Does
this mean only a single core will poll it? What will happen if there are
multiple cores polling, won't it cause problems?

And if this requires specific changes in the application, I am not sure about
the solution; can't this work in a way that is transparent to the application?

Overall, is this for optimizing memory for the port representors? If so, can't
we have a port-representor-specific solution? Reducing the scope can reduce
the complexity it brings.

>> If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
>> scope.
> 
> It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> 
>>
>>
>>>
>>>>
>>>>>
>>>>> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
>>>>> ---
>>>>>  doc/guides/nics/features.rst                    | 11 +++++++++++
>>>>>  doc/guides/nics/features/default.ini            |  1 +
>>>>>  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
>>>>>  lib/ethdev/rte_ethdev.c                         |  1 +
>>>>>  lib/ethdev/rte_ethdev.h                         |  7 +++++++
>>>>>  5 files changed, 30 insertions(+)
>>>>>
>>>>> diff --git a/doc/guides/nics/features.rst
>>>>> b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
>>>>> --- a/doc/guides/nics/features.rst
>>>>> +++ b/doc/guides/nics/features.rst
>>>>> @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
>>>>>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
>>>>>
>>>>>
>>>>> +.. _nic_features_shared_rx_queue:
>>>>> +
>>>>> +Shared Rx queue
>>>>> +---------------
>>>>> +
>>>>> +Supports shared Rx queue for ports in same switch domain.
>>>>> +
>>>>> +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
>>>>> +* **[provides] mbuf**: ``mbuf.port``.
>>>>> +
>>>>> +
>>>>>  .. _nic_features_packet_type_parsing:
>>>>>
>>>>>  Packet type parsing
>>>>> diff --git a/doc/guides/nics/features/default.ini
>>>>> b/doc/guides/nics/features/default.ini
>>>>> index 754184ddd4..ebeb4c1851 100644
>>>>> --- a/doc/guides/nics/features/default.ini
>>>>> +++ b/doc/guides/nics/features/default.ini
>>>>> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
>>>>>  Queue start/stop     =
>>>>>  Runtime Rx queue setup =
>>>>>  Runtime Tx queue setup =
>>>>> +Shared Rx queue      =
>>>>>  Burst mode info      =
>>>>>  Power mgmt address monitor =
>>>>>  MTU update           =
>>>>> diff --git a/doc/guides/prog_guide/switch_representation.rst
>>>>> b/doc/guides/prog_guide/switch_representation.rst
>>>>> index ff6aa91c80..45bf5a3a10 100644
>>>>> --- a/doc/guides/prog_guide/switch_representation.rst
>>>>> +++ b/doc/guides/prog_guide/switch_representation.rst
>>>>> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
>>>>>  .. [1] `Ethernet switch device driver model (switchdev)
>>>>>
>>>>> <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
>>>>>> `_
>>>>>
>>>>> +- Memory usage of representors is huge when number of representor
>>>>> +grows,
>>>>> +  because PMD always allocate mbuf for each descriptor of Rx queue.
>>>>> +  Polling the large number of ports brings more CPU load, cache
>>>>> +miss and
>>>>> +  latency. Shared Rx queue can be used to share Rx queue between
>>>>> +PF and
>>>>> +  representors in same switch domain.
>>>>> +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
>>>>> +  is present in Rx offloading capability of device info. Setting
>>>>> +the
>>>>> +  offloading flag in device Rx mode or Rx queue configuration to
>>>>> +enable
>>>>> +  shared Rx queue. Polling any member port of shared Rx queue can
>>>>> +return
>>>>> +  packets of all ports in group, port ID is saved in ``mbuf.port``.
>>>>> +
>>>>>  Basic SR-IOV
>>>>>  ------------
>>>>>
>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
>>>>> index 9d95cd11e1..1361ff759a 100644
>>>>> --- a/lib/ethdev/rte_ethdev.c
>>>>> +++ b/lib/ethdev/rte_ethdev.c
>>>>> @@ -127,6 +127,7 @@ static const struct {
>>>>>         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
>>>>>         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
>>>>>         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
>>>>> +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
>>>>>  };
>>>>>
>>>>>  #undef RTE_RX_OFFLOAD_BIT2STR
>>>>> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
>>>>> index d2b27c351f..a578c9db9d 100644
>>>>> --- a/lib/ethdev/rte_ethdev.h
>>>>> +++ b/lib/ethdev/rte_ethdev.h
>>>>> @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
>>>>>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>>>>>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>>>>>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
>>>>> */
>>>>> +       uint32_t shared_group; /**< Shared port group index in
>>>>> + switch domain. */
>>>>>         /**
>>>>>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>>>>>          * Only offloads set on rx_queue_offload_capa or
>>>>> rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
>>>>> #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>>>>>  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
>>>>>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
>>>>> +/**
>>>>> + * Rx queue is shared among ports in same switch domain to save
>>>>> +memory,
>>>>> + * avoid polling each port. Any port in group can be used to receive packets.
>>>>> + * Real source port number saved in mbuf->port field.
>>>>> + */
>>>>> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
>>>>>
>>>>>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>>>>>                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
>>>>> --
>>>>> 2.25.1
>>>>>



* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-11 12:04           ` Ferruh Yigit
@ 2021-08-11 12:59             ` Xueming(Steven) Li
  2021-08-12 14:35               ` Xueming(Steven) Li
  2021-09-15 15:34               ` Xueming(Steven) Li
  2021-09-26  5:35             ` Xueming(Steven) Li
  1 sibling, 2 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-11 12:59 UTC (permalink / raw)
  To: Ferruh Yigit, Jerin Jacob
  Cc: dpdk-dev, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> Sent: Wednesday, August 11, 2021 8:04 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>; Jerin Jacob <jerinjacobk@gmail.com>
> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> 
> On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jerin Jacob <jerinjacobk@gmail.com>
> >> Sent: Wednesday, August 11, 2021 4:03 PM
> >> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> >> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> >> NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> >> <andrew.rybchenko@oktetlabs.ru>
> >> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >>
> >> On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>>> -----Original Message-----
> >>>> From: Jerin Jacob <jerinjacobk@gmail.com>
> >>>> Sent: Monday, August 9, 2021 9:51 PM
> >>>> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> >>>> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> >>>> NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> >>>> <andrew.rybchenko@oktetlabs.ru>
> >>>> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx
> >>>> queue
> >>>>
> >>>> On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> >>>>>
> >>>>> In current DPDK framework, each RX queue is pre-loaded with mbufs
> >>>>> for incoming packets. When number of representors scale out in a
> >>>>> switch domain, the memory consumption became significant. Most
> >>>>> important, polling all ports leads to high cache miss, high
> >>>>> latency and low throughput.
> >>>>>
> >>>>> This patch introduces shared RX queue. Ports with same
> >>>>> configuration in a switch domain could share RX queue set by specifying sharing group.
> >>>>> Polling any queue using same shared RX queue receives packets from
> >>>>> all member ports. Source port is identified by mbuf->port.
> >>>>>
> >>>>> Port queue number in a shared group should be identical. Queue
> >>>>> index is
> >>>>> 1:1 mapped in shared group.
> >>>>>
> >>>>> Share RX queue is supposed to be polled on same thread.
> >>>>>
> >>>>> Multiple groups is supported by group ID.
> >>>>
> >>>> Is this offload specific to the representor? If so can this name be changed specifically to representor?
> >>>
> >>> Yes, PF and representor in switch domain could take advantage.
> >>>
> >>>> If it is for a generic case, how the flow ordering will be maintained?
> >>>
> >>> Not quite sure that I understood your question. The control path of
> >>> is almost same as before, PF and representor port still needed, rte flows not impacted.
> >>> Queues still needed for each member port, descriptors(mbuf) will be
> >>> supplied from shared Rx queue in my PMD implementation.
> >>
> >> My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> >> offload, multiple ethdev receive queues land into the same receive queue, In that case, how the flow order is maintained for
> respective receive queues.
> >
> > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq:
> > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward to the target fs.
> > Packets from the same source port could be grouped as a small burst to
> > process, which accelerates performance if traffic comes from a
> > limited number of ports. I'll introduce a common API to do shared rxq forwarding, called with a packet handling callback, so it
> > suits all forwarding engines. Will send patches soon.
> >
> 
> All ports will put the packets into the same queue (shared queue), right? Does this mean only a single core will poll it? What will
> happen if there are multiple cores polling, won't it cause problems?

This has been mentioned in the commit log: the shared rxq is supposed to be polled in a single thread (core) - I think it should be "MUST".
The result is unexpected if there are multiple cores polling, that's why I added a polling schedule check in testpmd.
It is similar for the rx/tx burst functions: a queue can't be polled on multiple threads (cores), and for performance reasons there is no such check in the API.

If users want to utilize multiple cores to distribute workloads, it's possible to define more groups; queues in different groups
could be polled on multiple cores.
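
For example, a minimal sketch of such a two-group setup (the queue/lcore
layout and variable names are illustrative, based on the shared_group field
proposed in this series):

	struct rte_eth_rxconf rxq_conf;

	rte_eth_dev_info_get(port_id, &dev_info);
	rxq_conf = dev_info.default_rxconf;
	rxq_conf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	rxq_conf.shared_group = 0; /* queue 0 of each member port, polled by lcore 1 */
	rte_eth_rx_queue_setup(port_id, 0, nb_rxd, socket_id, &rxq_conf, mbuf_pool);
	rxq_conf.shared_group = 1; /* queue 1 of each member port, polled by lcore 2 */
	rte_eth_rx_queue_setup(port_id, 1, nb_rxd, socket_id, &rxq_conf, mbuf_pool);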

It's possible to poll every member port in a group, but it is not necessary: any port in the group could be polled to get packets for all ports in the group.

If a member port is subject to hot plug/remove, it's possible to create a vdev with the same queue number, copy the rxq object, and poll the vdev
as a dedicated proxy for the group.

> 
> And if this requires specific changes in the application, I am not sure about the solution; can't this work in a way that is
> transparent to the application?

Yes, we considered different options in the design stage. One possible solution is to cache received packets in rings; this could be done at the
eth layer, but I'm afraid it brings fewer benefits, and the user still has to be aware of multiple-core polling.
This could be done as a wrapper PMD later, with more effort.
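
A rough sketch of that ring idea (not implemented; all names are
illustrative): one proxy core demultiplexes the shared queue into per-port
rings, and the other cores consume from the rings instead of calling
rx_burst():

	nb_rx = rte_eth_rx_burst(member_port, queue, pkts, MAX_PKT_BURST);
	for (i = 0; i < nb_rx; i++)
		/* port_rings[] are per-port SP/SC rings from rte_ring_create() */
		if (rte_ring_enqueue(port_rings[pkts[i]->port], pkts[i]) != 0)
			rte_pktmbuf_free(pkts[i]); /* ring full, drop */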

> 
> Overall, is this for optimizing memory for the port representors? If so, can't we have a port representor specific solution? Reducing
> the scope can reduce the complexity it brings.

This feature supports both PF and representors, and yes, the major issue is the memory of representors. Polling all representors also
introduces more CPU cache-miss latency. This feature essentially aggregates all ports in a group as one port.
On the other hand, it's useful for rte_flow to create offloading flows using a representor as a regular port ID.

Any new solution/suggestion would be great, my head is buried in PMD code :)

> 
> >> If this offload is only useful for the representor case, can we make this
> >> offload specific to the representor case by changing its name and scope?
> >
> > It works for both PF and representors in the same switch domain; for an application like OVS, few changes are needed to apply it.
> >
> >>
> >>
> >>>
> >>>>
> >>>>>
> >>>>> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> >>>>> ---
> >>>>>  doc/guides/nics/features.rst                    | 11 +++++++++++
> >>>>>  doc/guides/nics/features/default.ini            |  1 +
> >>>>>  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> >>>>>  lib/ethdev/rte_ethdev.c                         |  1 +
> >>>>>  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> >>>>>  5 files changed, 30 insertions(+)
> >>>>>
> >>>>> diff --git a/doc/guides/nics/features.rst
> >>>>> b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> >>>>> --- a/doc/guides/nics/features.rst
> >>>>> +++ b/doc/guides/nics/features.rst
> >>>>> @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> >>>>>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> >>>>>
> >>>>>
> >>>>> +.. _nic_features_shared_rx_queue:
> >>>>> +
> >>>>> +Shared Rx queue
> >>>>> +---------------
> >>>>> +
> >>>>> +Supports shared Rx queue for ports in same switch domain.
> >>>>> +
> >>>>> +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> >>>>> +* **[provides] mbuf**: ``mbuf.port``.
> >>>>> +
> >>>>> +
> >>>>>  .. _nic_features_packet_type_parsing:
> >>>>>
> >>>>>  Packet type parsing
> >>>>> diff --git a/doc/guides/nics/features/default.ini
> >>>>> b/doc/guides/nics/features/default.ini
> >>>>> index 754184ddd4..ebeb4c1851 100644
> >>>>> --- a/doc/guides/nics/features/default.ini
> >>>>> +++ b/doc/guides/nics/features/default.ini
> >>>>> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> >>>>>  Queue start/stop     =
> >>>>>  Runtime Rx queue setup =
> >>>>>  Runtime Tx queue setup =
> >>>>> +Shared Rx queue      =
> >>>>>  Burst mode info      =
> >>>>>  Power mgmt address monitor =
> >>>>>  MTU update           =
> >>>>> diff --git a/doc/guides/prog_guide/switch_representation.rst
> >>>>> b/doc/guides/prog_guide/switch_representation.rst
> >>>>> index ff6aa91c80..45bf5a3a10 100644
> >>>>> --- a/doc/guides/prog_guide/switch_representation.rst
> >>>>> +++ b/doc/guides/prog_guide/switch_representation.rst
> >>>>> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> >>>>>  .. [1] `Ethernet switch device driver model (switchdev)
> >>>>>
> >>>>> <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> >>>>>> `_
> >>>>>
> >>>>> +- Memory usage of representors is huge when number of representor
> >>>>> +grows,
> >>>>> +  because PMD always allocate mbuf for each descriptor of Rx queue.
> >>>>> +  Polling the large number of ports brings more CPU load, cache
> >>>>> +miss and
> >>>>> +  latency. Shared Rx queue can be used to share Rx queue between
> >>>>> +PF and
> >>>>> +  representors in same switch domain.
> >>>>> +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> >>>>> +  is present in Rx offloading capability of device info. Setting
> >>>>> +the
> >>>>> +  offloading flag in device Rx mode or Rx queue configuration to
> >>>>> +enable
> >>>>> +  shared Rx queue. Polling any member port of shared Rx queue can
> >>>>> +return
> >>>>> +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> >>>>> +
> >>>>>  Basic SR-IOV
> >>>>>  ------------
> >>>>>
> >>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> >>>>> index 9d95cd11e1..1361ff759a 100644
> >>>>> --- a/lib/ethdev/rte_ethdev.c
> >>>>> +++ b/lib/ethdev/rte_ethdev.c
> >>>>> @@ -127,6 +127,7 @@ static const struct {
> >>>>>         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> >>>>>         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> >>>>>         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> >>>>> +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> >>>>>  };
> >>>>>
> >>>>>  #undef RTE_RX_OFFLOAD_BIT2STR
> >>>>> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> >>>>> index d2b27c351f..a578c9db9d 100644
> >>>>> --- a/lib/ethdev/rte_ethdev.h
> >>>>> +++ b/lib/ethdev/rte_ethdev.h
> >>>>> @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> >>>>>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >>>>>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >>>>>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> >>>>> */
> >>>>> +       uint32_t shared_group; /**< Shared port group index in
> >>>>> + switch domain. */
> >>>>>         /**
> >>>>>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> >>>>>          * Only offloads set on rx_queue_offload_capa or
> >>>>> rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> >>>>> #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> >>>>>  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> >>>>>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> >>>>> +/**
> >>>>> + * Rx queue is shared among ports in same switch domain to save
> >>>>> +memory,
> >>>>> + * avoid polling each port. Any port in group can be used to receive packets.
> >>>>> + * Real source port number saved in mbuf->port field.
> >>>>> + */
> >>>>> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> >>>>>
> >>>>>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> >>>>>                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> >>>>> --
> >>>>> 2.25.1
> >>>>>


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
  2021-07-28  7:56 ` Andrew Rybchenko
  2021-08-09 11:47 ` [dpdk-dev] [PATCH v1] " Xueming Li
@ 2021-08-11 14:04 ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 02/15] app/testpmd: dump port and queue info for each packet Xueming Li
                     ` (14 more replies)
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
                   ` (13 subsequent siblings)
  16 siblings, 15 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: dev, xuemingl, Jerin Jacob, Ferruh Yigit, Thomas Monjalon,
	Andrew Rybchenko

In the current DPDK framework, each Rx queue is pre-loaded with mbufs for
incoming packets. When the number of representors scales out in a switch
domain, the memory consumption becomes significant. Most importantly,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces shared Rx queue. Ports with the same configuration in
a switch domain could share an Rx queue set by specifying a sharing group.
Polling any queue using the same shared Rx queue receives packets from all
member ports. The source port is identified by mbuf->port.

The queue number of ports in a shared group should be identical. The queue
index is 1:1 mapped in a shared group.

A shared Rx queue must be polled on a single thread or core.

Multiple groups are supported by group ID.
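
For example, an application could request the offload like below (a minimal
sketch; variable names are illustrative and error handling is omitted):

	rte_eth_dev_info_get(port_id, &dev_info);
	if (dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) {
		port_conf.rxmode.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
		rxq_conf.shared_group = 0; /* same group on every member port */
	}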

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Cc: Jerin Jacob <jerinjacobk@gmail.com>
---
The Rx queue object could be used as a shared Rx queue object; it's important
to review all queue control callback APIs that use the queue object:
  https://mails.dpdk.org/archives/dev/2021-July/215574.html
---
 doc/guides/nics/features.rst                    | 11 +++++++++++
 doc/guides/nics/features/default.ini            |  1 +
 doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
 lib/ethdev/rte_ethdev.c                         |  1 +
 lib/ethdev/rte_ethdev.h                         |  7 +++++++
 5 files changed, 30 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index a96e12d155..2e2a9b1554 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in the same switch domain.
+
+* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 754184ddd4..ebeb4c1851 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c80..45bf5a3a10 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- Memory usage of representors is huge when the number of representors
+  grows, because the PMD always allocates an mbuf for each descriptor of
+  an Rx queue. Polling a large number of ports brings more CPU load,
+  cache misses and latency. A shared Rx queue can be used to share the
+  Rx queue between the PF and representors in the same switch domain.
+  ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ`` is present in the Rx offloading
+  capability of device info. Set the offloading flag in the device Rx
+  mode or Rx queue configuration to enable shared Rx queue. Polling
+  any member port returns packets of all ports in the group; the port ID is in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 193f0d8295..058f5c88d9 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -127,6 +127,7 @@ static const struct {
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
 	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index d2b27c351f..a578c9db9d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	uint32_t shared_group; /**< Shared port group index in switch domain. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1373,6 +1374,12 @@ struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
 #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
+/**
+ * Rx queue is shared among ports in the same switch domain to save memory
+ * and avoid polling each port. Any port in the group can be used to receive
+ * packets. The real source port number is saved in the mbuf->port field.
+ */
+#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 02/15] app/testpmd: dump port and queue info for each packet
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 03/15] app/testpmd: new parameter to enable shared Rx queue Xueming Li
                     ` (13 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: dev, xuemingl, Xiaoyun Li

In the case of shared Rx queue, the port numbers of mbufs returned from one
Rx burst could be different.

To support shared Rx queue, this patch dumps mbuf->port and the queue for
each packet.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/util.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 5dd7157947..1733d5e663 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -100,6 +100,7 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_flow_restore_info info = { 0, };
 
 		mb = pkts[i];
+		MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ", mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 03/15] app/testpmd: new parameter to enable shared Rx queue
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 02/15] app/testpmd: dump port and queue info for each packet Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 04/15] app/testpmd: make sure shared Rx queue polled on same core Xueming Li
                     ` (12 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: dev, xuemingl, Xiaoyun Li

Adds "--rxq-share" parameter to enable shared rxq for each rxq.

Default shared rxq group 0 is used, RX queues in same switch domain
shares same rxq according to queue index.

Shared Rx queue is enabled only if device support offloading flag
RTE_ETH_RX_OFFLOAD_SHARED_RXQ.
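
For example (the PCI address and representor device argument below are
illustrative; the representor syntax depends on the PMD):

	dpdk-testpmd -a 0000:03:00.0,representor=[0,1] -- -i --rxq-share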

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c                 |  6 +++++-
 app/test-pmd/parameters.c             |  4 ++++
 app/test-pmd/testpmd.c                | 14 ++++++++++++++
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  5 +++++
 5 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 31d8ba1b91..bb882a56a4 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2709,7 +2709,11 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+				printf(" share group=%u",
+				       rx_conf->shared_group);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 7c13210f04..a466a20bfb 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -166,6 +166,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share: share rxq between PF and representors\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -602,6 +603,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			0, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1256,6 +1258,8 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share"))
+				rxq_share = 1;
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 6cbe9ba3c8..67fd128862 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -223,6 +223,9 @@ uint8_t  rx_pkt_nb_segs; /**< Number of segments to split */
 uint16_t rx_pkt_seg_offsets[MAX_SEGS_BUFFER_SPLIT];
 uint8_t  rx_pkt_nb_offs; /**< Number of specified offsets */
 
+uint8_t rxq_share;
+/**< Create shared rxq for PF and representors. */
+
 /*
  * Configuration of packet segments used by the "txonly" processing engine.
  */
@@ -1441,6 +1444,11 @@ init_config_port_offloads(portid_t pid, uint32_t socket_id)
 		port->dev_conf.txmode.offloads &=
 			~DEV_TX_OFFLOAD_MBUF_FAST_FREE;
 
+	if (rxq_share &&
+	    (port->dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
+		port->dev_conf.rxmode.offloads |=
+				RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
+
 	/* Apply Rx offloads configuration */
 	for (i = 0; i < port->dev_info.max_rx_queues; i++)
 		port->rx_conf[i].offloads = port->dev_conf.rxmode.offloads;
@@ -3334,6 +3342,12 @@ rxtx_port_config(struct rte_port *port)
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.rx_offload_capa &
+		     RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
+			offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 16a3598e48..f3b1d34e28 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -477,6 +477,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint8_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern uint16_t mb_mempool_cache;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 6061674239..8a9aeeb11f 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -384,6 +384,11 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share``
+
+    Create all queues in shared Rx queue mode if the device supports it.
+    Queues in the same switch domain are shared according to queue ID.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 04/15] app/testpmd: make sure shared Rx queue polled on same core
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 02/15] app/testpmd: dump port and queue info for each packet Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 03/15] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 05/15] app/testpmd: adds common forwarding for shared Rx queue Xueming Li
                     ` (11 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: dev, xuemingl, Xiaoyun Li

Shared rxqs use one Rx queue set internally, so queues must be polled from
one core.

Stop forwarding if a shared rxq is scheduled on multiple cores.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c  | 91 ++++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |  4 +-
 app/test-pmd/testpmd.h |  2 +
 3 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index bb882a56a4..51f7d26045 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2885,6 +2885,97 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared rxq is scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, portid_t src_port,
+			   queueid_t src_rxq, lcoreid_t src_lc)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((rxq_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (fs->rx_queue != src_rxq)
+				continue;
+			printf("Shared RX queue can't be scheduled on different cores:\n");
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("  please use --nb-cores=%hu to limit forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * A shared group must not be scheduled on different cores.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Update lcore info stream being scheduled. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((rxq_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check shared rxq not scheduled on remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, fs->rx_port,
+						       fs->rx_queue, lc_id))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 67fd128862..d941bd982e 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2169,10 +2169,12 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f3b1d34e28..6497c56359 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -144,6 +144,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -785,6 +786,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 05/15] app/testpmd: adds common forwarding for shared Rx queue
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (2 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 04/15] app/testpmd: make sure shared Rx queue polled on same core Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function Xueming Li
                     ` (10 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: dev, xuemingl, Xiaoyun Li

When shared Rx queue is enabled, received packets come from all member ports
in the same shared Rx queue.

This patch adds a common forwarding function for shared Rx queue: it groups
packets by source forwarding stream, looking up local streams on the current
lcore with the packet source port (mbuf->port) and queue, then invokes a
callback to handle the received packets for the source stream.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/testpmd.c | 70 ++++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.h |  4 +++
 2 files changed, 74 insertions(+)

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index d941bd982e..f46bd97948 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2034,6 +2034,76 @@ flush_fwd_rx_queues(void)
 	}
 }
 
+/**
+ * Get packet source stream by source port and queue.
+ * All streams of the same shared Rx queue are located on the same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_by_port(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		struct rte_mbuf **pkts, packet_fwd_cb fwd)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		fwd(fs, nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * The source port of packets is identified by mbuf->port.
+ */
+void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst, packet_fwd_cb fwd)
+{
+	uint16_t i, nb_fs_rx = 1, port;
+
+	/* Locate real source fs according to mbuf->port. */
+	for (i = 0; i < nb_rx; ++i) {
+		if (i + 1 < nb_rx)
+			rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (i + 1 == nb_rx || pkts_burst[i + 1]->port != port) {
+			/* Forward packets with same source port. */
+			forward_by_port(fs, port, nb_fs_rx,
+					&pkts_burst[i + 1 - nb_fs_rx], fwd);
+			nb_fs_rx = 1;
+		} else {
+			nb_fs_rx++;
+		}
+	}
+}
+
 static void
 run_pkt_fwd_on_lcore(struct fwd_lcore *fc, packet_fwd_t pkt_fwd)
 {
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 6497c56359..13141dfed9 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -272,6 +272,8 @@ struct fwd_lcore {
 typedef void (*port_fwd_begin_t)(portid_t pi);
 typedef void (*port_fwd_end_t)(portid_t pi);
 typedef void (*packet_fwd_t)(struct fwd_stream *fs);
+typedef void (*packet_fwd_cb)(struct fwd_stream *fs, uint16_t nb_rx,
+			      struct rte_mbuf **pkts);
 
 struct fwd_engine {
 	const char       *fwd_mode_name; /**< Forwarding mode name. */
@@ -897,6 +899,8 @@ char *list_pkt_forwarding_modes(void);
 char *list_pkt_forwarding_retry_modes(void);
 void set_pkt_forwarding_mode(const char *fwd_mode);
 void start_packet_forwarding(int with_tx_first);
+void forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+			struct rte_mbuf **pkts_burst, packet_fwd_cb fwd);
 void fwd_stats_display(void);
 void fwd_stats_reset(void);
 void stop_packet_forwarding(void);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (3 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 05/15] app/testpmd: adds common forwarding for shared Rx queue Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-17  9:37     ` Jerin Jacob
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 07/15] app/testpmd: support shared Rx queues for IO forwarding Xueming Li
                     ` (9 subsequent siblings)
  14 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Added an inline common wrapper function for all fwd engines
which does the following common steps:

1. get_start_cycles
2. rte_eth_rx_burst(..., nb_pkt_per_burst)
3. if rxq_share is enabled, do forward_shared_rxq(), otherwise do fwd directly
4. get_end_cycles

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 13141dfed9..b685ac48d6 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
 void remove_tx_dynf_callback(portid_t portid);
 int update_jumbo_frame_offload(portid_t portid);
 
+static inline void
+do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	uint16_t nb_rx;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
+
+	/*
+	 * Receive a burst of packets and forward them.
+	 */
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
+			pkts_burst, nb_pkt_per_burst);
+	inc_rx_burst_stats(fs, nb_rx);
+	if (unlikely(nb_rx == 0))
+		return;
+	if (unlikely(rxq_share > 0))
+		forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
+	else
+		(*fwd)(fs, nb_rx, pkts_burst);
+	get_end_cycles(fs, start_tsc);
+}
+
 /*
  * Work-around of a compilation error with ICC on invocations of the
  * rte_be_to_cpu_16() function.
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 07/15] app/testpmd: support shared Rx queues for IO forwarding
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (4 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 08/15] app/testpmd: support shared Rx queue for rxonly forwarding Xueming Li
                     ` (8 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: dev, xuemingl, Xiaoyun Li

Supports shared Rx queue.

If shared Rx queue is enabled, group received packets by stream
according to the mbuf->port value and then forward on a stream basis as
before.

If shared Rx queue is not enabled, just forward on a stream basis.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/iofwd.c | 27 ++++++++++-----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/app/test-pmd/iofwd.c b/app/test-pmd/iofwd.c
index 83d098adcb..316a80d65c 100644
--- a/app/test-pmd/iofwd.c
+++ b/app/test-pmd/iofwd.c
@@ -44,25 +44,11 @@
  * to packets data.
  */
 static void
-pkt_burst_io_forward(struct fwd_stream *fs)
+io_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		  struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint32_t retry;
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
-			pkts_burst, nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-	fs->rx_packets += nb_rx;
 
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue,
 			pkts_burst, nb_rx);
@@ -85,8 +71,15 @@ pkt_burst_io_forward(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
+}
 
-	get_end_cycles(fs, start_tsc);
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+pkt_burst_io_forward(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, io_forward_stream);
 }
 
 struct fwd_engine io_fwd_engine = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 08/15] app/testpmd: support shared Rx queue for rxonly forwarding
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (5 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 07/15] app/testpmd: support shared Rx queues for IO forwarding Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 09/15] app/testpmd: support shared Rx queue for icmpecho fwd Xueming Li
                     ` (7 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: dev, xuemingl, Xiaoyun Li

Supports shared Rx queue.

If shared Rx queue is enabled, group received packets by stream
according to the mbuf->port value and then forward on a stream basis as
before.

If shared Rx queue is not enabled, just forward on a stream basis.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/rxonly.c | 34 +++++++++++++---------------------
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index c78fc4609a..80ae0ecf93 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -41,32 +41,24 @@
 #include "testpmd.h"
 
 /*
- * Received a burst of packets.
+ * Process a burst of received packets from the same stream.
  */
 static void
-pkt_burst_receive(struct fwd_stream *fs)
+rxonly_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		      struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
-	uint16_t nb_rx;
-	uint16_t i;
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
+	RTE_SET_USED(fs);
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
+}
 
-	fs->rx_packets += nb_rx;
-	for (i = 0; i < nb_rx; i++)
-		rte_pktmbuf_free(pkts_burst[i]);
 
-	get_end_cycles(fs, start_tsc);
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+pkt_burst_receive(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, rxonly_forward_stream);
 }
 
 struct fwd_engine rx_only_engine = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 09/15] app/testpmd: support shared Rx queue for icmpecho fwd
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (6 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 08/15] app/testpmd: support shared Rx queue for rxonly forwarding Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 10/15] app/testpmd: support shared Rx queue for csum fwd Xueming Li
                     ` (6 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add support for shared rxq.
If shared rxq is enabled, filter packets by stream according
to the mbuf->port value and then forward them on a stream basis (as before).

If shared rxq is not enabled, just forward them on a stream basis.

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/icmpecho.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/app/test-pmd/icmpecho.c b/app/test-pmd/icmpecho.c
index 8948f28eb5..d6d11a2efb 100644
--- a/app/test-pmd/icmpecho.c
+++ b/app/test-pmd/icmpecho.c
@@ -267,13 +267,13 @@ ipv4_hdr_cksum(struct rte_ipv4_hdr *ip_h)
 	(((rte_be_to_cpu_32((ipv4_addr)) >> 24) & 0x000000FF) == 0xE0)
 
 /*
- * Receive a burst of packets, lookup for ICMP echo requests, and, if any,
- * send back ICMP echo replies.
+ * Look up ICMP echo requests in the received mbufs and, if any,
+ * send back ICMP echo replies to the corresponding Tx port.
  */
 static void
-reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
+reply_to_icmp_echo_rqsts_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
 	struct rte_mbuf *pkt;
 	struct rte_ether_hdr *eth_h;
 	struct rte_vlan_hdr *vlan_h;
@@ -283,7 +283,6 @@ reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 	struct rte_ether_addr eth_addr;
 	uint32_t retry;
 	uint32_t ip_addr;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t nb_replies;
 	uint16_t eth_type;
@@ -291,22 +290,9 @@ reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 	uint16_t arp_op;
 	uint16_t arp_pro;
 	uint32_t cksum;
-	uint8_t  i;
+	uint16_t  i;
 	int l2_len;
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * First, receive a burst of packets.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
 
-	fs->rx_packets += nb_rx;
 	nb_replies = 0;
 	for (i = 0; i < nb_rx; i++) {
 		if (likely(i < nb_rx - 1))
@@ -509,8 +495,15 @@ reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 			} while (++nb_tx < nb_replies);
 		}
 	}
+}
 
-	get_end_cycles(fs, start_tsc);
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, reply_to_icmp_echo_rqsts_stream);
 }
 
 struct fwd_engine icmp_echo_engine = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 10/15] app/testpmd: support shared Rx queue for csum fwd
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (7 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 09/15] app/testpmd: support shared Rx queue for icmpecho fwd Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 11/15] app/testpmd: support shared Rx queue for flowgen Xueming Li
                     ` (5 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add support for shared rxq.
If shared rxq is enabled, filter packets by stream according
to the mbuf->port value and then forward them on a stream basis (as before).

If shared rxq is not enabled, just forward them on a stream basis.

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/csumonly.c | 28 +++++++++++-----------------
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 607c889359..3b7fb35843 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -763,7 +763,7 @@ pkt_copy_split(const struct rte_mbuf *pkt)
 }
 
 /*
- * Receive a burst of packets, and for each packet:
+ * For each packet in the received burst:
  *  - parse packet, and try to recognize a supported packet type (1)
  *  - if it's not a supported packet type, don't touch the packet, else:
  *  - reprocess the checksum of all supported layers. This is done in SW
@@ -792,9 +792,9 @@ pkt_copy_split(const struct rte_mbuf *pkt)
  * OUTER_IP is only useful for tunnel packets.
  */
 static void
-pkt_burst_checksum_forward(struct fwd_stream *fs)
+checksum_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
 	struct rte_mbuf *gso_segments[GSO_MAX_PKT_BURST];
 	struct rte_gso_ctx *gso_ctx;
 	struct rte_mbuf **tx_pkts_burst;
@@ -805,7 +805,6 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 	void **gro_ctx;
 	uint16_t gro_pkts_num;
 	uint8_t gro_enable;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t nb_prep;
 	uint16_t i;
@@ -820,18 +819,6 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 	uint16_t nb_segments = 0;
 	int ret;
 
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/* receive a burst of packet */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
 	rx_bad_ip_csum = 0;
 	rx_bad_l4_csum = 0;
 	rx_bad_outer_l4_csum = 0;
@@ -1139,8 +1126,15 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			rte_pktmbuf_free(tx_pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
+}
 
-	get_end_cycles(fs, start_tsc);
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+pkt_burst_checksum_forward(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, checksum_forward_stream);
 }
 
 struct fwd_engine csum_fwd_engine = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 11/15] app/testpmd: support shared Rx queue for flowgen
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (8 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 10/15] app/testpmd: support shared Rx queue for csum fwd Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 12/15] app/testpmd: support shared Rx queue for MAC fwd Xueming Li
                     ` (4 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add support for shared rxq.
If shared rxq is enabled, filter packets by stream according
to the mbuf->port value and then do stats on a stream basis (as before).

If shared rxq is not enabled, do stats on a stream basis as usual.

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/flowgen.c | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/app/test-pmd/flowgen.c b/app/test-pmd/flowgen.c
index 3bf6e1ce97..d74c302b1c 100644
--- a/app/test-pmd/flowgen.c
+++ b/app/test-pmd/flowgen.c
@@ -83,10 +83,10 @@ ip_sum(const alias_int16_t *hdr, int hdr_len)
  * still do so in order to maintain traffic statistics.
  */
 static void
-pkt_burst_flow_gen(struct fwd_stream *fs)
+flow_gen_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
 	unsigned pkt_size = tx_pkt_length - 4;	/* Adjust FCS */
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_mempool *mbp;
 	struct rte_mbuf  *pkt = NULL;
 	struct rte_ether_hdr *eth_hdr;
@@ -94,23 +94,14 @@ pkt_burst_flow_gen(struct fwd_stream *fs)
 	struct rte_udp_hdr *udp_hdr;
 	uint16_t vlan_tci, vlan_tci_outer;
 	uint64_t ol_flags = 0;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t nb_pkt;
 	uint16_t nb_clones = nb_pkt_flowgen_clones;
 	uint16_t i;
 	uint32_t retry;
 	uint64_t tx_offloads;
-	uint64_t start_tsc = 0;
 	static int next_flow = 0;
 
-	get_start_cycles(&start_tsc);
-
-	/* Receive a burst of packets and discard them. */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	fs->rx_packets += nb_rx;
-
 	for (i = 0; i < nb_rx; i++)
 		rte_pktmbuf_free(pkts_burst[i]);
 
@@ -213,8 +204,15 @@ pkt_burst_flow_gen(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_pkt);
 	}
+}
 
-	get_end_cycles(fs, start_tsc);
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+pkt_burst_flow_gen(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, flow_gen_stream);
 }
 
 struct fwd_engine flow_gen_engine = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 12/15] app/testpmd: support shared Rx queue for MAC fwd
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (9 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 11/15] app/testpmd: support shared Rx queue for flowgen Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 13/15] app/testpmd: support shared Rx queue for macswap fwd Xueming Li
                     ` (3 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add support for shared rxq.
If shared rxq is enabled, filter packets by stream according
to the mbuf->port value and then forward them on a stream basis (as before).

If shared rxq is not enabled, just forward them on a stream basis as usual.

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/macfwd.c | 27 ++++++++++-----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 0568ea794d..75fbea16d4 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -44,32 +44,18 @@
  * before forwarding them.
  */
 static void
-pkt_burst_mac_forward(struct fwd_stream *fs)
+mac_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_port  *txp;
 	struct rte_mbuf  *mb;
 	struct rte_ether_hdr *eth_hdr;
 	uint32_t retry;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t i;
 	uint64_t ol_flags = 0;
 	uint64_t tx_offloads;
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
 
-	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
 	tx_offloads = txp->dev_conf.txmode.offloads;
 	if (tx_offloads	& DEV_TX_OFFLOAD_VLAN_INSERT)
@@ -116,8 +102,15 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
+}
 
-	get_end_cycles(fs, start_tsc);
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+pkt_burst_mac_forward(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, mac_forward_stream);
 }
 
 struct fwd_engine mac_fwd_engine = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 13/15] app/testpmd: support shared Rx queue for macswap fwd
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (10 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 12/15] app/testpmd: support shared Rx queue for MAC fwd Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 14/15] app/testpmd: support shared Rx queue for 5tuple fwd Xueming Li
                     ` (2 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add support for shared rxq.
If shared rxq is enabled, filter packets by stream according
to the mbuf->port value and then forward them on a stream basis (as before).

If shared rxq is not enabled, just forward them on a stream basis as usual.

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/macswap.c | 28 +++++++++++-----------------
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index 310bca06af..daf7170092 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -50,27 +50,13 @@
  * addresses of packets before forwarding them.
  */
 static void
-pkt_burst_mac_swap(struct fwd_stream *fs)
+mac_swap_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_port  *txp;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint32_t retry;
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
 
-	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
 
 	do_macswap(pkts_burst, nb_rx, txp);
@@ -95,7 +81,15 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-	get_end_cycles(fs, start_tsc);
+}
+
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+pkt_burst_mac_swap(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, mac_swap_stream);
 }
 
 struct fwd_engine mac_swap_engine = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 14/15] app/testpmd: support shared Rx queue for 5tuple fwd
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (11 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 13/15] app/testpmd: support shared Rx queue for macswap fwd Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 15/15] app/testpmd: support shared Rx queue for ieee1588 fwd Xueming Li
  2021-08-17  9:33   ` [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue Jerin Jacob
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add support for shared rxq.
If shared rxq is enabled, filter packets by stream according
to the mbuf->port value and then forward them on a stream basis (as before).

If shared rxq is not enabled, just forward them on a stream basis as usual.

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/5tswap.c | 30 +++++++++++-------------------
 1 file changed, 11 insertions(+), 19 deletions(-)

diff --git a/app/test-pmd/5tswap.c b/app/test-pmd/5tswap.c
index e8cef9623b..236a117ee3 100644
--- a/app/test-pmd/5tswap.c
+++ b/app/test-pmd/5tswap.c
@@ -82,18 +82,16 @@ swap_udp(struct rte_udp_hdr *udp_hdr)
  * Parses each layer and swaps it. When the next layer doesn't match it stops.
  */
 static void
-pkt_burst_5tuple_swap(struct fwd_stream *fs)
+_5tuple_swap_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_port  *txp;
 	struct rte_mbuf *mb;
 	uint16_t next_proto;
 	uint64_t ol_flags;
 	uint16_t proto;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint32_t retry;
-
 	int i;
 	union {
 		struct rte_ether_hdr *eth;
@@ -105,20 +103,6 @@ pkt_burst_5tuple_swap(struct fwd_stream *fs)
 		uint8_t *byte;
 	} h;
 
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
 	ol_flags = ol_flags_init(txp->dev_conf.txmode.offloads);
 	vlan_qinq_set(pkts_burst, nb_rx, ol_flags,
@@ -182,7 +166,15 @@ pkt_burst_5tuple_swap(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-	get_end_cycles(fs, start_tsc);
+}
+
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+pkt_burst_5tuple_swap(struct fwd_stream *fs)
+{
+	return do_burst_fwd(fs, _5tuple_swap_stream);
 }
 
 struct fwd_engine five_tuple_swap_fwd_engine = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v2 15/15] app/testpmd: support shared Rx queue for ieee1588 fwd
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (12 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 14/15] app/testpmd: support shared Rx queue for 5tuple fwd Xueming Li
@ 2021-08-11 14:04   ` Xueming Li
  2021-08-17  9:33   ` [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue Jerin Jacob
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-08-11 14:04 UTC (permalink / raw)
  Cc: Xiaoyu Min, dev, xuemingl, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Add support for shared rxq.
If shared rxq is enabled, filter packets by stream according
to the mbuf->port value and then forward them on a stream basis (as before).

If shared rxq is not enabled, just forward them on a stream basis as usual.

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
---
 app/test-pmd/ieee1588fwd.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/app/test-pmd/ieee1588fwd.c b/app/test-pmd/ieee1588fwd.c
index 034f238c34..dc6bf0e39d 100644
--- a/app/test-pmd/ieee1588fwd.c
+++ b/app/test-pmd/ieee1588fwd.c
@@ -90,23 +90,17 @@ port_ieee1588_tx_timestamp_check(portid_t pi)
 }
 
 static void
-ieee1588_packet_fwd(struct fwd_stream *fs)
+ieee1588_fwd_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkt)
 {
-	struct rte_mbuf  *mb;
+	struct rte_mbuf *mb = (*pkt);
 	struct rte_ether_hdr *eth_hdr;
 	struct rte_ether_addr addr;
 	struct ptpv2_msg *ptp_hdr;
 	uint16_t eth_type;
 	uint32_t timesync_index;
 
-	/*
-	 * Receive 1 packet at a time.
-	 */
-	if (rte_eth_rx_burst(fs->rx_port, fs->rx_queue, &mb, 1) == 0)
-		return;
-
-	fs->rx_packets += 1;
-
+	RTE_SET_USED(nb_rx);
 	/*
 	 * Check that the received packet is a PTP packet that was detected
 	 * by the hardware.
@@ -198,6 +192,22 @@ ieee1588_packet_fwd(struct fwd_stream *fs)
 	port_ieee1588_tx_timestamp_check(fs->rx_port);
 }
 
+/*
+ * Wrapper of real fwd engine.
+ */
+static void
+ieee1588_packet_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *mb;
+
+	if (rte_eth_rx_burst(fs->rx_port, fs->rx_queue, &mb, 1) == 0)
+		return;
+	if (unlikely(rxq_share > 0))
+		forward_shared_rxq(fs, 1, &mb, ieee1588_fwd_stream);
+	else
+		ieee1588_fwd_stream(fs, 1, &mb);
+}
+
 static void
 port_ieee1588_fwd_begin(portid_t pi)
 {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-11 12:59             ` Xueming(Steven) Li
@ 2021-08-12 14:35               ` Xueming(Steven) Li
  2021-09-15 15:34               ` Xueming(Steven) Li
  1 sibling, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-12 14:35 UTC (permalink / raw)
  To: Xueming(Steven) Li, Ferruh Yigit, Jerin Jacob
  Cc: dpdk-dev, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Xueming(Steven) Li
> Sent: Wednesday, August 11, 2021 8:59 PM
> To: Ferruh Yigit <ferruh.yigit@intel.com>; Jerin Jacob <jerinjacobk@gmail.com>
> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> 
> 
> 
> > -----Original Message-----
> > From: Ferruh Yigit <ferruh.yigit@intel.com>
> > Sent: Wednesday, August 11, 2021 8:04 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>; Jerin Jacob
> > <jerinjacobk@gmail.com>
> > Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon
> > <thomas@monjalon.net>; Andrew Rybchenko
> > <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >
> > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: Jerin Jacob <jerinjacobk@gmail.com>
> > >> Sent: Wednesday, August 11, 2021 4:03 PM
> > >> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > >> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > >> NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > >> <andrew.rybchenko@oktetlabs.ru>
> > >> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx
> > >> queue
> > >>
> > >> On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Jerin Jacob <jerinjacobk@gmail.com>
> > >>>> Sent: Monday, August 9, 2021 9:51 PM
> > >>>> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > >>>> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > >>>> <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > >>>> <thomas@monjalon.net>; Andrew Rybchenko
> > >>>> <andrew.rybchenko@oktetlabs.ru>
> > >>>> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx
> > >>>> queue
> > >>>>
> > >>>> On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > >>>>>
> > >>>>> In current DPDK framework, each RX queue is pre-loaded with
> > >>>>> mbufs for incoming packets. When number of representors scale
> > >>>>> out in a switch domain, the memory consumption became
> > >>>>> significant. Most important, polling all ports leads to high
> > >>>>> cache miss, high latency and low throughput.
> > >>>>>
> > >>>>> This patch introduces shared RX queue. Ports with same
> > >>>>> configuration in a switch domain could share RX queue set by specifying sharing group.
> > >>>>> Polling any queue using same shared RX queue receives packets
> > >>>>> from all member ports. Source port is identified by mbuf->port.
> > >>>>>
> > >>>>> Port queue number in a shared group should be identical. Queue
> > >>>>> index is
> > >>>>> 1:1 mapped in shared group.
> > >>>>>
> > >>>>> Share RX queue is supposed to be polled on same thread.
> > >>>>>
> > >>>>> Multiple groups is supported by group ID.
> > >>>>
> > >>>> Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > >>>
> > >>> Yes, PF and representor in switch domain could take advantage.
> > >>>
> > >>>> If it is for a generic case, how the flow ordering will be maintained?
> > >>>
> > >>> Not quite sure that I understood your question. The control path
> > >>> of is almost same as before, PF and representor port still needed, rte flows not impacted.
> > >>> Queues still needed for each member port, descriptors(mbuf) will
> > >>> be supplied from shared Rx queue in my PMD implementation.
> > >>
> > >> My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > >> offload, multiple ethdev receive queues land into the same receive
> > >> queue, In that case, how the flow order is maintained for
> > respective receive queues.
> > >
> > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq:
> > > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward to the target fs.
> > > Packets from the same source port could be grouped as a small burst to
> > > process; this will accelerate the performance if traffic comes from a
> > > limited number of ports. I'll introduce some common api to do shared rxq
> > > forwarding, called with a packet handling callback, so it suits
> > > all
> > forwarding engines. Will send patches soon.
> > >
> >
> > All ports will put the packets into the same queue (shared queue),
> > right? Does this mean only a single core will poll? What will happen if there are multiple cores polling, won't it cause a problem?
> 
> This has been mentioned in the commit log: the shared rxq is supposed to be polled in a single thread (core) - I think it should be "MUST".
> The result is unexpected if there are multiple cores polling, that's why I added a polling schedule check in testpmd.

V2 with testpmd code uploaded, please check.

> Similarly for the rx/tx burst functions, a queue can't be polled on multiple threads (cores), and for performance reasons there is
> no such check in the EAL api.
> 
> If users want to utilize multiple cores to distribute workloads, it's possible to define more groups; queues in different groups
> could be polled on multiple cores.
> 
> It's possible to poll every member port in a group, but not necessary: any port in the group could be polled to get packets for all
> ports in the group.
> 
> If a member port is subject to hot plug/remove, it's possible to create a vdev with the same queue number, copy the rxq objects and
> poll the vdev as a dedicated proxy for the group.
> 
> >
> > And if this requires specific changes in the application, I am not
> > sure about the solution, can't this work in a transparent way to the application?
> 
> Yes, we considered different options in the design stage. One possible solution is to cache received packets in rings; this can be
> done on the eth layer, but I'm afraid it brings fewer benefits, and the user still has to be aware of multiple-core polling.
> This can be done as a wrapper PMD later, with more effort.
> 
> >
> > Overall, is this for optimizing memory for the port representors? If
> > so, can't we have a port-representor-specific solution? Reducing scope can reduce the complexity it brings.
> 
> This feature supports both PF and representors, and yes, the major issue is the memory of representors. Polling all representors
> also introduces more core cache-miss latency. This feature essentially aggregates all ports in a group as one port.
> On the other hand, it's useful for rte_flow to create offloading flows using a representor as a regular port ID.
> 
> It's great to hear any new solution/suggestion, my head is buried in PMD code :)
> 
> >
> > >> If this offload is only useful for the representor case, can we make
> > >> this offload specific to the representor case by changing its name and scope.
> > >
> > > It works for both PF and representors in the same switch domain; for applications like OVS, few changes are needed to apply it.
> > >
> > >>
> > >>
> > >>>
> > >>>>
> > >>>>>
> > >>>>> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > >>>>> ---
> > >>>>>  doc/guides/nics/features.rst                    | 11 +++++++++++
> > >>>>>  doc/guides/nics/features/default.ini            |  1 +
> > >>>>>  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > >>>>>  lib/ethdev/rte_ethdev.c                         |  1 +
> > >>>>>  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > >>>>>  5 files changed, 30 insertions(+)
> > >>>>>
> > >>>>> diff --git a/doc/guides/nics/features.rst
> > >>>>> b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554
> > >>>>> 100644
> > >>>>> --- a/doc/guides/nics/features.rst
> > >>>>> +++ b/doc/guides/nics/features.rst
> > >>>>> @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > >>>>>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > >>>>>
> > >>>>>
> > >>>>> +.. _nic_features_shared_rx_queue:
> > >>>>> +
> > >>>>> +Shared Rx queue
> > >>>>> +---------------
> > >>>>> +
> > >>>>> +Supports shared Rx queue for ports in same switch domain.
> > >>>>> +
> > >>>>> +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > >>>>> +* **[provides] mbuf**: ``mbuf.port``.
> > >>>>> +
> > >>>>> +
> > >>>>>  .. _nic_features_packet_type_parsing:
> > >>>>>
> > >>>>>  Packet type parsing
> > >>>>> diff --git a/doc/guides/nics/features/default.ini
> > >>>>> b/doc/guides/nics/features/default.ini
> > >>>>> index 754184ddd4..ebeb4c1851 100644
> > >>>>> --- a/doc/guides/nics/features/default.ini
> > >>>>> +++ b/doc/guides/nics/features/default.ini
> > >>>>> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > >>>>>  Queue start/stop     =
> > >>>>>  Runtime Rx queue setup =
> > >>>>>  Runtime Tx queue setup =
> > >>>>> +Shared Rx queue      =
> > >>>>>  Burst mode info      =
> > >>>>>  Power mgmt address monitor =
> > >>>>>  MTU update           =
> > >>>>> diff --git a/doc/guides/prog_guide/switch_representation.rst
> > >>>>> b/doc/guides/prog_guide/switch_representation.rst
> > >>>>> index ff6aa91c80..45bf5a3a10 100644
> > >>>>> --- a/doc/guides/prog_guide/switch_representation.rst
> > >>>>> +++ b/doc/guides/prog_guide/switch_representation.rst
> > >>>>> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > >>>>>  .. [1] `Ethernet switch device driver model (switchdev)
> > >>>>>
> > >>>>> <https://www.kernel.org/doc/Documentation/networking/switchdev.t
> > >>>>> xt
> > >>>>>> `_
> > >>>>>
> > >>>>> +- Memory usage of representors is huge when number of
> > >>>>> +representor grows,
> > >>>>> +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > >>>>> +  Polling the large number of ports brings more CPU load, cache
> > >>>>> +miss and
> > >>>>> +  latency. Shared Rx queue can be used to share Rx queue
> > >>>>> +between PF and
> > >>>>> +  representors in same switch domain.
> > >>>>> +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > >>>>> +  is present in Rx offloading capability of device info.
> > >>>>> +Setting the
> > >>>>> +  offloading flag in device Rx mode or Rx queue configuration
> > >>>>> +to enable
> > >>>>> +  shared Rx queue. Polling any member port of shared Rx queue
> > >>>>> +can return
> > >>>>> +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > >>>>> +
> > >>>>>  Basic SR-IOV
> > >>>>>  ------------
> > >>>>>
> > >>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > >>>>> index 9d95cd11e1..1361ff759a 100644
> > >>>>> --- a/lib/ethdev/rte_ethdev.c
> > >>>>> +++ b/lib/ethdev/rte_ethdev.c
> > >>>>> @@ -127,6 +127,7 @@ static const struct {
> > >>>>>         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > >>>>>         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > >>>>>         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > >>>>> +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > >>>>>  };
> > >>>>>
> > >>>>>  #undef RTE_RX_OFFLOAD_BIT2STR
> > >>>>> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > >>>>> index d2b27c351f..a578c9db9d 100644
> > >>>>> --- a/lib/ethdev/rte_ethdev.h
> > >>>>> +++ b/lib/ethdev/rte_ethdev.h
> > >>>>> @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > >>>>>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > >>>>>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > >>>>>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > >>>>> */
> > >>>>> +       uint32_t shared_group; /**< Shared port group index in
> > >>>>> + switch domain. */
> > >>>>>         /**
> > >>>>>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > >>>>>          * Only offloads set on rx_queue_offload_capa or
> > >>>>> rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > >>>>> #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > >>>>>  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > >>>>>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > >>>>> +/**
> > >>>>> + * Rx queue is shared among ports in same switch domain to save
> > >>>>> +memory,
> > >>>>> + * avoid polling each port. Any port in group can be used to receive packets.
> > >>>>> + * Real source port number saved in mbuf->port field.
> > >>>>> + */
> > >>>>> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > >>>>>
> > >>>>>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > >>>>>                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > >>>>> --
> > >>>>> 2.25.1
> > >>>>>


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
                     ` (13 preceding siblings ...)
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 15/15] app/testpmd: support shared Rx queue for ieee1588 fwd Xueming Li
@ 2021-08-17  9:33   ` Jerin Jacob
  2021-08-17 11:31     ` Xueming(Steven) Li
  14 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-17  9:33 UTC (permalink / raw)
  To: Xueming Li; +Cc: dpdk-dev, Ferruh Yigit, Thomas Monjalon, Andrew Rybchenko

On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
>
> In current DPDK framework, each RX queue is pre-loaded with mbufs for
> incoming packets. When number of representors scale out in a switch
> domain, the memory consumption became significant. Most important,
> polling all ports leads to high cache miss, high latency and low
> throughput.
>
> This patch introduces shared RX queue. Ports with same configuration in
> a switch domain could share RX queue set by specifying sharing group.
> Polling any queue using same shared RX queue receives packets from all
> member ports. Source port is identified by mbuf->port.
>
> Port queue number in a shared group should be identical. Queue index is
> 1:1 mapped in shared group.
>
> Share RX queue must be polled on single thread or core.
>
> Multiple groups is supported by group ID.
>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Cc: Jerin Jacob <jerinjacobk@gmail.com>
> ---
> Rx queue object could be used as shared Rx queue object, it's important
> to clear all queue control callback api that using queue object:
>   https://mails.dpdk.org/archives/dev/2021-July/215574.html

>  #undef RTE_RX_OFFLOAD_BIT2STR
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index d2b27c351f..a578c9db9d 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +       uint32_t shared_group; /**< Shared port group index in switch domain. */

Not able to see anyone setting/creating this group ID in the test application.
How is this group created?


>         /**
>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> +/**
> + * Rx queue is shared among ports in same switch domain to save memory,
> + * avoid polling each port. Any port in group can be used to receive packets.
> + * Real source port number saved in mbuf->port field.
> + */
> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
>
>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function Xueming Li
@ 2021-08-17  9:37     ` Jerin Jacob
  2021-08-18 11:27       ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-17  9:37 UTC (permalink / raw)
  To: Xueming Li; +Cc: Xiaoyu Min, dpdk-dev, Xiaoyun Li

On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
>
> From: Xiaoyu Min <jackmin@nvidia.com>
>
> Added an inline common wrapper function for all fwd engines
> which do the following in common:
>
> 1. get_start_cycles
> 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly
> 4. get_end_cycle
>
> Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> ---
>  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
>
> diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> index 13141dfed9..b685ac48d6 100644
> --- a/app/test-pmd/testpmd.h
> +++ b/app/test-pmd/testpmd.h
> @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
>  void remove_tx_dynf_callback(portid_t portid);
>  int update_jumbo_frame_offload(portid_t portid);
>
> +static inline void
> +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd)
> +{
> +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> +       uint16_t nb_rx;
> +       uint64_t start_tsc = 0;
> +
> +       get_start_cycles(&start_tsc);
> +
> +       /*
> +        * Receive a burst of packets and forward them.
> +        */
> +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> +                       pkts_burst, nb_pkt_per_burst);
> +       inc_rx_burst_stats(fs, nb_rx);
> +       if (unlikely(nb_rx == 0))
> +               return;
> +       if (unlikely(rxq_share > 0))

See below. It reads a global variable.

> +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> +       else
> +               (*fwd)(fs, nb_rx, pkts_burst);

New function pointer in the fast path.

IMO, we should not create a performance regression for the existing
forward engines.
Can we have a new forward engine just for shared Rx queue testing?

> +       get_end_cycles(fs, start_tsc);
> +}
> +
>  /*
>   * Work-around of a compilation error with ICC on invocations of the
>   * rte_be_to_cpu_16() function.
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-17  9:33   ` [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue Jerin Jacob
@ 2021-08-17 11:31     ` Xueming(Steven) Li
  2021-08-17 15:11       ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-17 11:31 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Tuesday, August 17, 2021 5:33 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> >
> > In current DPDK framework, each RX queue is pre-loaded with mbufs for
> > incoming packets. When number of representors scale out in a switch
> > domain, the memory consumption became significant. Most important,
> > polling all ports leads to high cache miss, high latency and low
> > throughput.
> >
> > This patch introduces shared RX queue. Ports with same configuration
> > in a switch domain could share RX queue set by specifying sharing group.
> > Polling any queue using same shared RX queue receives packets from all
> > member ports. Source port is identified by mbuf->port.
> >
> > Port queue number in a shared group should be identical. Queue index
> > is
> > 1:1 mapped in shared group.
> >
> > Share RX queue must be polled on single thread or core.
> >
> > Multiple groups is supported by group ID.
> >
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > ---
> > Rx queue object could be used as shared Rx queue object, it's
> > important to clear all queue control callback api that using queue object:
> >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> 
> >  #undef RTE_RX_OFFLOAD_BIT2STR
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > d2b27c351f..a578c9db9d 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > */
> > +       uint32_t shared_group; /**< Shared port group index in switch
> > + domain. */
> 
> Not to able to see anyone setting/creating this group ID test application.
> How this group is created?

Nice catch, the initial testpmd version only supports one default group (0).
All ports that support shared-rxq are assigned to the same group.

We should be able to change "--rxq-shared" to "--rxq-shared-group" to support
groups other than the default.

To support more groups simultaneously, we need to consider testpmd forwarding
stream core assignment: all streams in the same group need to stay on the same
core. It's possible to specify how many ports make up a group to increase the
group number, but the user must schedule stream affinity carefully - error prone.

On the other hand, one group should be sufficient for most customers; the doubt
is whether it is valuable to test multiple groups.
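
A minimal per-core polling sketch of what this implies - group_ctx and
handle_pkt are hypothetical names for illustration, only the
rte_eth_rx_burst() call is real ethdev API. One lcore drains one group;
polling any member port (e.g. the PF) returns packets of all member
ports, and mbuf->port identifies the real source:

#include <rte_ethdev.h>

struct group_ctx {			/* hypothetical per-group context */
	volatile int stop;
	uint16_t pf_port_id;		/* any member port of the group */
	uint16_t nb_queues;
};

extern void handle_pkt(uint16_t src_port, struct rte_mbuf *m); /* hypothetical */

static int
shared_rxq_group_loop(void *arg)
{
	struct group_ctx *g = arg;
	struct rte_mbuf *burst[32];
	uint16_t q, i, nb;

	while (!g->stop) {
		for (q = 0; q < g->nb_queues; q++) {
			/* one poll returns packets of all member ports */
			nb = rte_eth_rx_burst(g->pf_port_id, q, burst, 32);
			for (i = 0; i < nb; i++)
				handle_pkt(burst[i]->port, burst[i]);
		}
	}
	return 0;
}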

> 
> 
> >         /**
> >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> >          * Only offloads set on rx_queue_offload_capa or
> > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {  #define
> > DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > +/**
> > + * Rx queue is shared among ports in same switch domain to save
> > +memory,
> > + * avoid polling each port. Any port in group can be used to receive packets.
> > + * Real source port number saved in mbuf->port field.
> > + */
> > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> >
> >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-17 11:31     ` Xueming(Steven) Li
@ 2021-08-17 15:11       ` Jerin Jacob
  2021-08-18 11:14         ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-17 15:11 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko

On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Tuesday, August 17, 2021 5:33 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> >
> > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > >
> > > In current DPDK framework, each RX queue is pre-loaded with mbufs for
> > > incoming packets. When number of representors scale out in a switch
> > > domain, the memory consumption became significant. Most important,
> > > polling all ports leads to high cache miss, high latency and low
> > > throughput.
> > >
> > > This patch introduces shared RX queue. Ports with same configuration
> > > in a switch domain could share RX queue set by specifying sharing group.
> > > Polling any queue using same shared RX queue receives packets from all
> > > member ports. Source port is identified by mbuf->port.
> > >
> > > Port queue number in a shared group should be identical. Queue index
> > > is
> > > 1:1 mapped in shared group.
> > >
> > > Share RX queue must be polled on single thread or core.
> > >
> > > Multiple groups is supported by group ID.
> > >
> > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > ---
> > > Rx queue object could be used as shared Rx queue object, it's
> > > important to clear all queue control callback api that using queue object:
> > >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> >
> > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > d2b27c351f..a578c9db9d 100644
> > > --- a/lib/ethdev/rte_ethdev.h
> > > +++ b/lib/ethdev/rte_ethdev.h
> > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > */
> > > +       uint32_t shared_group; /**< Shared port group index in switch
> > > + domain. */
> >
> > Not to able to see anyone setting/creating this group ID test application.
> > How this group is created?
>
> Nice catch, the initial testpmd version only support one default group(0).
> All ports that supports shared-rxq assigned in same group.
>
> We should be able to change "--rxq-shared" to "--rxq-shared-group" to support
> group other than default.
>
> To support more groups simultaneously, need to consider testpmd forwarding stream
> core assignment, all streams in same group need to stay on same core.
> It's possible to specify how many ports to increase group number, but user must
> schedule stream affinity carefully - error prone.
>
> On the other hand, one group should be sufficient for most customer, the doubt is
> whether it valuable to support multiple groups test.

Ack. One group is enough in testpmd.

My question was more about who creates this group and how. Shouldn't we need an
API to create the shared group? If we do the following, at least, I can see how
it can be implemented in SW or other HW (rough sketch after the list).

- Create an aggregation queue group
- Attach multiple Rx queues to the aggregation queue group
- Pull the packets from the queue group (which internally fetches from
the Rx queues _attached_)

Does the above kind of sequence break your representor use case?
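
A rough sketch of that sequence for reference; every name below is
hypothetical, no such aggregation API exists in ethdev today:

/* 1. create the aggregation queue group */
struct rte_eth_rxq_group *grp = rte_eth_rxq_group_create(socket_id);

/* 2. attach member Rx queues to the group */
rte_eth_rxq_group_attach(grp, pf_port_id, queue_id);
rte_eth_rxq_group_attach(grp, repr_port_id, queue_id);

/* 3. pull packets from the group, which internally fetches from the
 * attached Rx queues; mbuf->port would carry the real source port
 */
uint16_t nb = rte_eth_rxq_group_burst(grp, pkts, MAX_PKT_BURST);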


>
> >
> >
> > >         /**
> > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > >          * Only offloads set on rx_queue_offload_capa or
> > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {  #define
> > > DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > +/**
> > > + * Rx queue is shared among ports in same switch domain to save
> > > +memory,
> > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > + * Real source port number saved in mbuf->port field.
> > > + */
> > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > >
> > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > --
> > > 2.25.1
> > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-17 15:11       ` Jerin Jacob
@ 2021-08-18 11:14         ` Xueming(Steven) Li
  2021-08-19  5:26           ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-18 11:14 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Tuesday, August 17, 2021 11:12 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > >
> > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > for incoming packets. When number of representors scale out in a
> > > > switch domain, the memory consumption became significant. Most
> > > > important, polling all ports leads to high cache miss, high
> > > > latency and low throughput.
> > > >
> > > > This patch introduces shared RX queue. Ports with same
> > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > Polling any queue using same shared RX queue receives packets from
> > > > all member ports. Source port is identified by mbuf->port.
> > > >
> > > > Port queue number in a shared group should be identical. Queue
> > > > index is
> > > > 1:1 mapped in shared group.
> > > >
> > > > Share RX queue must be polled on single thread or core.
> > > >
> > > > Multiple groups is supported by group ID.
> > > >
> > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > ---
> > > > Rx queue object could be used as shared Rx queue object, it's
> > > > important to clear all queue control callback api that using queue object:
> > > >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > >
> > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > index d2b27c351f..a578c9db9d 100644
> > > > --- a/lib/ethdev/rte_ethdev.h
> > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > */
> > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > + switch domain. */
> > >
> > > Not to able to see anyone setting/creating this group ID test application.
> > > How this group is created?
> >
> > Nice catch, the initial testpmd version only support one default group(0).
> > All ports that supports shared-rxq assigned in same group.
> >
> > We should be able to change "--rxq-shared" to "--rxq-shared-group" to
> > support group other than default.
> >
> > To support more groups simultaneously, need to consider testpmd
> > forwarding stream core assignment, all streams in same group need to stay on same core.
> > It's possible to specify how many ports to increase group number, but
> > user must schedule stream affinity carefully - error prone.
> >
> > On the other hand, one group should be sufficient for most customer,
> > the doubt is whether it valuable to support multiple groups test.
> 
> Ack. One group is enough in testpmd.
> 
> My question was more about who and how this group is created, Should n't we need API to create shared_group? If we do the
> following, at least, I can think, how it can be implemented in SW or other HW.
> 
> - Create aggregation queue group
> - Attach multiple  Rx queues to the aggregation queue group
> - Pull the packets from the queue group(which internally fetch from the Rx queues _attached_)
> 
> Does the above kind of sequence, break your representor use case?

Seems more like a set of EAL wrappers. The current API tries to minimize the application effort to adapt to shared-rxq.
- step 1, not sure how important it is to create the group with an API; in rte_flow, a group is created on demand.
- step 2, currently, the attaching is done in rte_eth_rx_queue_setup() by specifying the offload and group in the rx_conf struct
  (see the sketch below).
- step 3, define a dedicated api to receive packets from the shared rxq? It looks clearer to receive packets from the shared rxq.
  Currently, the rxq objects in a share group are the same - the shared rxq - so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n)
  could be used to receive packets from any port in the group, normally the first port (PF) in the group.
  An alternative way is defining a vdev with the same queue number; copying the rxq objects will make the vdev a proxy of
  the shared rxq group - this could be a helper API.

Anyway the wrapper doesn't break the use case; the step 3 api is clearer, need to understand how to implement it efficiently.
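
A minimal sketch of the step-2 attach, using the offload flag and
shared_group field introduced by this patch; port_id, queue_id, nb_rxd
and mb_pool are placeholders supplied by the application:

#include <rte_ethdev.h>

static int
setup_shared_rxq(uint16_t port_id, uint16_t queue_id, uint16_t nb_rxd,
		 struct rte_mempool *mb_pool)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;

	rte_eth_dev_info_get(port_id, &dev_info);
	rxconf = dev_info.default_rxconf;
	rxconf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	rxconf.shared_group = 0;	/* default group in switch domain */

	/* the same queue index on every member port maps to one shared rxq */
	return rte_eth_rx_queue_setup(port_id, queue_id, nb_rxd,
				      rte_eth_dev_socket_id(port_id),
				      &rxconf, mb_pool);
}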

> 
> 
> >
> > >
> > >
> > > >         /**
> > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > >          * Only offloads set on rx_queue_offload_capa or
> > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > +/**
> > > > + * Rx queue is shared among ports in same switch domain to save
> > > > +memory,
> > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > + * Real source port number saved in mbuf->port field.
> > > > + */
> > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > >
> > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > --
> > > > 2.25.1
> > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-17  9:37     ` Jerin Jacob
@ 2021-08-18 11:27       ` Xueming(Steven) Li
  2021-08-18 11:47         ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-18 11:27 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: Jack Min, dpdk-dev, Xiaoyun Li



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Tuesday, August 17, 2021 5:37 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> 
> On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> >
> > From: Xiaoyu Min <jackmin@nvidia.com>
> >
> > Added an inline common wrapper function for all fwd engines which do
> > the following in common:
> >
> > 1. get_start_cycles
> > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > get_end_cycle
> >
> > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > ---
> >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> >  1 file changed, 24 insertions(+)
> >
> > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > 13141dfed9..b685ac48d6 100644
> > --- a/app/test-pmd/testpmd.h
> > +++ b/app/test-pmd/testpmd.h
> > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
> > void remove_tx_dynf_callback(portid_t portid);  int
> > update_jumbo_frame_offload(portid_t portid);
> >
> > +static inline void
> > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > +       uint16_t nb_rx;
> > +       uint64_t start_tsc = 0;
> > +
> > +       get_start_cycles(&start_tsc);
> > +
> > +       /*
> > +        * Receive a burst of packets and forward them.
> > +        */
> > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > +                       pkts_burst, nb_pkt_per_burst);
> > +       inc_rx_burst_stats(fs, nb_rx);
> > +       if (unlikely(nb_rx == 0))
> > +               return;
> > +       if (unlikely(rxq_share > 0))
> 
> See below. It reads a global memory.
> 
> > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > +       else
> > +               (*fwd)(fs, nb_rx, pkts_burst);
> 
> New function pointer in fastpath.
> 
> IMO, We should not create performance regression for the existing forward engine.
> Can we have a new forward engine just for shared memory testing?

Yes, fully aware of the performance concern; the global check could be defined around record_core_cycles to minimize the impact.
Based on test data, the impact is almost invisible in legacy mode.

From a test perspective, it is better to have all forward engines able to verify shared rxq; the test team wants to run the
regression with less impact. Hope to have a solution to utilize all forwarding engines seamlessly.

> 
> > +       get_end_cycles(fs, start_tsc); }
> > +
> >  /*
> >   * Work-around of a compilation error with ICC on invocations of the
> >   * rte_be_to_cpu_16() function.
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-18 11:27       ` Xueming(Steven) Li
@ 2021-08-18 11:47         ` Jerin Jacob
  2021-08-18 14:08           ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-18 11:47 UTC (permalink / raw)
  To: Xueming(Steven) Li; +Cc: Jack Min, dpdk-dev, Xiaoyun Li

On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Tuesday, August 17, 2021 5:37 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> >
> > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > >
> > > From: Xiaoyu Min <jackmin@nvidia.com>
> > >
> > > Added an inline common wrapper function for all fwd engines which do
> > > the following in common:
> > >
> > > 1. get_start_cycles
> > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > get_end_cycle
> > >
> > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > ---
> > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > >  1 file changed, 24 insertions(+)
> > >
> > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > > 13141dfed9..b685ac48d6 100644
> > > --- a/app/test-pmd/testpmd.h
> > > +++ b/app/test-pmd/testpmd.h
> > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
> > > void remove_tx_dynf_callback(portid_t portid);  int
> > > update_jumbo_frame_offload(portid_t portid);
> > >
> > > +static inline void
> > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > +       uint16_t nb_rx;
> > > +       uint64_t start_tsc = 0;
> > > +
> > > +       get_start_cycles(&start_tsc);
> > > +
> > > +       /*
> > > +        * Receive a burst of packets and forward them.
> > > +        */
> > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > +                       pkts_burst, nb_pkt_per_burst);
> > > +       inc_rx_burst_stats(fs, nb_rx);
> > > +       if (unlikely(nb_rx == 0))
> > > +               return;
> > > +       if (unlikely(rxq_share > 0))
> >
> > See below. It reads a global memory.
> >
> > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > +       else
> > > +               (*fwd)(fs, nb_rx, pkts_burst);
> >
> > New function pointer in fastpath.
> >
> > IMO, We should not create performance regression for the existing forward engine.
> > Can we have a new forward engine just for shared memory testing?
>
> Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> Based on test data, the impacts almost invisible in legacy mode.

Are you saying there is zero % regression? If not, could you share the data?

>
> From test perspective, better to have all forward engine to verify shared rxq, test team want to run the
> regression with less impacts. Hope to have a solution to utilize all forwarding engines seamlessly.

Yes, it is a good goal. testpmd forwarding performance is used by
everyone as a synthetic benchmark.
I think we are aligned on not having any regression for the generic
forward engines.

>
> >
> > > +       get_end_cycles(fs, start_tsc); }
> > > +
> > >  /*
> > >   * Work-around of a compilation error with ICC on invocations of the
> > >   * rte_be_to_cpu_16() function.
> > > --
> > > 2.25.1
> > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-18 11:47         ` Jerin Jacob
@ 2021-08-18 14:08           ` Xueming(Steven) Li
  2021-08-26 11:28             ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-18 14:08 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: Jack Min, dpdk-dev, Xiaoyun Li



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Wednesday, August 18, 2021 7:48 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> 
> On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun
> > > Li <xiaoyun.li@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > > wrapper function
> > >
> > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > >
> > > > Added an inline common wrapper function for all fwd engines which
> > > > do the following in common:
> > > >
> > > > 1. get_start_cycles
> > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > > get_end_cycle
> > > >
> > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > ---
> > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > >  1 file changed, 24 insertions(+)
> > > >
> > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > > > 13141dfed9..b685ac48d6 100644
> > > > --- a/app/test-pmd/testpmd.h
> > > > +++ b/app/test-pmd/testpmd.h
> > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
> > > > void remove_tx_dynf_callback(portid_t portid);  int
> > > > update_jumbo_frame_offload(portid_t portid);
> > > >
> > > > +static inline void
> > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > +       uint16_t nb_rx;
> > > > +       uint64_t start_tsc = 0;
> > > > +
> > > > +       get_start_cycles(&start_tsc);
> > > > +
> > > > +       /*
> > > > +        * Receive a burst of packets and forward them.
> > > > +        */
> > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > +       if (unlikely(nb_rx == 0))
> > > > +               return;
> > > > +       if (unlikely(rxq_share > 0))
> > >
> > > See below. It reads a global memory.
> > >
> > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > +       else
> > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > >
> > > New function pointer in fastpath.
> > >
> > > IMO, We should not create performance regression for the existing forward engine.
> > > Can we have a new forward engine just for shared memory testing?
> >
> > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > Based on test data, the impacts almost invisible in legacy mode.
> 
> Are you saying there is zero % regression? If not, could you share the data?

Almost zero; here is a quick single-core result of rxonly with the patch:
	32.2Mpps, 58.9 cycles/packet
Reverting the patch from rxonly.c:
	32.1Mpps, 59.9 cycles/packet
That result doesn't make sense - the reverted code should not be slower - and I
realized the patched version used batch mbuf free; applying it to the reverted
code as well:
	32.2Mpps, 58.9 cycles/packet
There were small digit jumps between testpmd restarts, so I picked the best run.
The results are almost the same; it seems the per-packet cost is small enough.
BTW, I'm testing with the default burst size and queue depth.

> 
> >
> > From test perspective, better to have all forward engine to verify
> > shared rxq, test team want to run the regression with less impacts. Hope to have a solution to utilize all forwarding engines
> seamlessly.
> 
> Yes. it good goal. testpmd forward performance using as synthetic bench everyone.
> I think, we are aligned to not have any regression for the generic forward engine.
> 
> >
> > >
> > > > +       get_end_cycles(fs, start_tsc); }
> > > > +
> > > >  /*
> > > >   * Work-around of a compilation error with ICC on invocations of the
> > > >   * rte_be_to_cpu_16() function.
> > > > --
> > > > 2.25.1
> > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-18 11:14         ` Xueming(Steven) Li
@ 2021-08-19  5:26           ` Jerin Jacob
  2021-08-19 12:09             ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-19  5:26 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko

On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Tuesday, August 17, 2021 11:12 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> >
> > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > >
> > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > for incoming packets. When number of representors scale out in a
> > > > > switch domain, the memory consumption became significant. Most
> > > > > important, polling all ports leads to high cache miss, high
> > > > > latency and low throughput.
> > > > >
> > > > > This patch introduces shared RX queue. Ports with same
> > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > Polling any queue using same shared RX queue receives packets from
> > > > > all member ports. Source port is identified by mbuf->port.
> > > > >
> > > > > Port queue number in a shared group should be identical. Queue
> > > > > index is
> > > > > 1:1 mapped in shared group.
> > > > >
> > > > > Share RX queue must be polled on single thread or core.
> > > > >
> > > > > Multiple groups is supported by group ID.
> > > > >
> > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > ---
> > > > > Rx queue object could be used as shared Rx queue object, it's
> > > > > important to clear all queue control callback api that using queue object:
> > > > >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > >
> > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > index d2b27c351f..a578c9db9d 100644
> > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > */
> > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > + switch domain. */
> > > >
> > > > Not to able to see anyone setting/creating this group ID test application.
> > > > How this group is created?
> > >
> > > Nice catch, the initial testpmd version only support one default group(0).
> > > All ports that supports shared-rxq assigned in same group.
> > >
> > > We should be able to change "--rxq-shared" to "--rxq-shared-group" to
> > > support group other than default.
> > >
> > > To support more groups simultaneously, need to consider testpmd
> > > forwarding stream core assignment, all streams in same group need to stay on same core.
> > > It's possible to specify how many ports to increase group number, but
> > > user must schedule stream affinity carefully - error prone.
> > >
> > > On the other hand, one group should be sufficient for most customer,
> > > the doubt is whether it valuable to support multiple groups test.
> >
> > Ack. One group is enough in testpmd.
> >
> > My question was more about who and how this group is created, Should n't we need API to create shared_group? If we do the
> > following, at least, I can think, how it can be implemented in SW or other HW.
> >
> > - Create aggregation queue group
> > - Attach multiple  Rx queues to the aggregation queue group
> > - Pull the packets from the queue group(which internally fetch from the Rx queues _attached_)
> >
> > Does the above kind of sequence, break your representor use case?
>
> Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.

Which rte_flow pattern/action for this?

> - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
>   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
>   be used to receive packets from any ports in group, normally the first port(PF) in group.
>   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
>   the shared rxq group - this could be an helper API.
>
> Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.

Are you doing this feature based on any HW support, or is it a pure SW
thing? If it is SW, it is better to have
just a new vdev, like drivers/net/bonding/. With this we can help
aggregate multiple Rxqs across the multiple ports
of the same driver.
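
For reference, a rough sketch of that bonding-style SW aggregation with
the existing bonding API; pf_port_id, repr_port_id and socket_id are
placeholders, and a shared-rxq analogue would need its own vdev driver:

#include <rte_eth_bond.h>

/* create a bonding vdev and attach the member ports; polling the
 * bonded port then drains the members' Rx queues in SW
 */
int bond_port = rte_eth_bond_create("net_bonding0",
				    BONDING_MODE_ROUND_ROBIN, socket_id);

rte_eth_bond_slave_add(bond_port, pf_port_id);
rte_eth_bond_slave_add(bond_port, repr_port_id);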


>
> >
> >
> > >
> > > >
> > > >
> > > > >         /**
> > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > +/**
> > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > +memory,
> > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > + * Real source port number saved in mbuf->port field.
> > > > > + */
> > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > >
> > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > --
> > > > > 2.25.1
> > > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-19  5:26           ` Jerin Jacob
@ 2021-08-19 12:09             ` Xueming(Steven) Li
  2021-08-26 11:58               ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-19 12:09 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Thursday, August 19, 2021 1:27 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > >
> > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > >
> > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > In current DPDK framework, each RX queue is pre-loaded with
> > > > > > mbufs for incoming packets. When number of representors scale
> > > > > > out in a switch domain, the memory consumption became
> > > > > > significant. Most important, polling all ports leads to high
> > > > > > cache miss, high latency and low throughput.
> > > > > >
> > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > Polling any queue using same shared RX queue receives packets
> > > > > > from all member ports. Source port is identified by mbuf->port.
> > > > > >
> > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > index is
> > > > > > 1:1 mapped in shared group.
> > > > > >
> > > > > > Share RX queue must be polled on single thread or core.
> > > > > >
> > > > > > Multiple groups is supported by group ID.
> > > > > >
> > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > ---
> > > > > > Rx queue object could be used as shared Rx queue object, it's
> > > > > > important to clear all queue control callback api that using queue object:
> > > > > >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > >
> > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > */
> > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > + switch domain. */
> > > > >
> > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > How this group is created?
> > > >
> > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > All ports that supports shared-rxq assigned in same group.
> > > >
> > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > to support group other than default.
> > > >
> > > > To support more groups simultaneously, need to consider testpmd
> > > > forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > It's possible to specify how many ports to increase group number,
> > > > but user must schedule stream affinity carefully - error prone.
> > > >
> > > > On the other hand, one group should be sufficient for most
> > > > customer, the doubt is whether it valuable to support multiple groups test.
> > >
> > > Ack. One group is enough in testpmd.
> > >
> > > My question was more about who and how this group is created, Should
> > > n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW or other HW.
> > >
> > > - Create aggregation queue group
> > > - Attach multiple  Rx queues to the aggregation queue group
> > > - Pull the packets from the queue group(which internally fetch from
> > > the Rx queues _attached_)
> > >
> > > Does the above kind of sequence, break your representor use case?
> >
> > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> 
> Which rte_flow pattern/action for this?

No rte_flow for this; I just recalled that the group in rte_flow is created on demand, not via an API.
I don't see anything else that needs to be created along with the group, so I doubt whether it is valuable to introduce a new API set to manage groups.

> 
> > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> >   the shared rxq group - this could be an helper API.
> >
> > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> 
> Are you doing this feature based on any HW support or it just pure SW thing, If it is SW, It is better to have just new vdev for like
> drivers/net/bonding/. This we can help aggregate multiple Rxq across the multiple ports of same the driver.

Based on HW support.

Most users might use the PF in a group as the anchor port for rx burst; the current definition should be easy for them to migrate to.
But some users might prefer grouping some hot plugged/unplugged representors; EAL could provide wrappers, or users could do that
themselves since the strategy is not complex. Anyway, any suggestion is welcome.
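
For reference, a minimal sketch of how an application could consume the current
proposal (anchor_port, shared_mp and handle_port_packet are placeholders, error
handling omitted):

        uint16_t port, anchor_port = 0; /* e.g. the PF */
        struct rte_eth_rxconf rxconf = { 0 };

        /* Opt queue 0 of every member port into shared group 0. */
        rxconf.offloads = RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
        rxconf.shared_group = 0;
        RTE_ETH_FOREACH_DEV(port)
                rte_eth_rx_queue_setup(port, 0, 512, SOCKET_ID_ANY,
                                       &rxconf, shared_mp);

        /* Poll only the anchor port; packets from all member ports
         * arrive here and are identified by mbuf->port. */
        nb = rte_eth_rx_burst(anchor_port, 0, pkts, MAX_PKT_BURST);
        for (i = 0; i < nb; i++)
                handle_port_packet(pkts[i]->port, pkts[i]);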

> 
> 
> >
> > >
> > >
> > > >
> > > > >
> > > > >
> > > > > >         /**
> > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > +/**
> > > > > > + * Rx queue is shared among ports in same switch domain to
> > > > > > +save memory,
> > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > + */
> > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > >
> > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > --
> > > > > > 2.25.1
> > > > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-18 14:08           ` Xueming(Steven) Li
@ 2021-08-26 11:28             ` Jerin Jacob
  2021-08-29  7:07               ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-26 11:28 UTC (permalink / raw)
  To: Xueming(Steven) Li; +Cc: Jack Min, dpdk-dev, Xiaoyun Li

On Wed, Aug 18, 2021 at 7:38 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Wednesday, August 18, 2021 7:48 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> >
> > On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun
> > > > Li <xiaoyun.li@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > > > wrapper function
> > > >
> > > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > > >
> > > > > Added an inline common wrapper function for all fwd engines which
> > > > > do the following in common:
> > > > >
> > > > > 1. get_start_cycles
> > > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > > > get_end_cycle
> > > > >
> > > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > > ---
> > > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > > >  1 file changed, 24 insertions(+)
> > > > >
> > > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > > > > 13141dfed9..b685ac48d6 100644
> > > > > --- a/app/test-pmd/testpmd.h
> > > > > +++ b/app/test-pmd/testpmd.h
> > > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t portid);
> > > > > void remove_tx_dynf_callback(portid_t portid);  int
> > > > > update_jumbo_frame_offload(portid_t portid);
> > > > >
> > > > > +static inline void
> > > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > > +       uint16_t nb_rx;
> > > > > +       uint64_t start_tsc = 0;
> > > > > +
> > > > > +       get_start_cycles(&start_tsc);
> > > > > +
> > > > > +       /*
> > > > > +        * Receive a burst of packets and forward them.
> > > > > +        */
> > > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > > +       if (unlikely(nb_rx == 0))
> > > > > +               return;
> > > > > +       if (unlikely(rxq_share > 0))
> > > >
> > > > See below. It reads a global memory.
> > > >
> > > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > > +       else
> > > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > > >
> > > > New function pointer in fastpath.
> > > >
> > > > IMO, We should not create performance regression for the existing forward engine.
> > > > Can we have a new forward engine just for shared memory testing?
> > >
> > > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > > Based on test data, the impacts almost invisible in legacy mode.
> >
> > Are you saying there is zero % regression? If not, could you share the data?
>
> Almost zero, here is a quick single core result of rxonly:
>         32.2Mpps, 58.9cycles/packet
> Revert the patch to rxonly.c:
>         32.1Mpps 59.9cycles/packet
> The result doesn't make sense and I realized that I used batch mbuf free, apply it now:
>         32.2Mpps, 58.9cycles/packet
> There were small digit jumps between testpmd restart, I picked the best one.
> The result is almost same, seems the cost of each packet is small enough.
> BTW, I'm testing with default burst size and queue depth.

I tested this on octeontx2 with iofwd on a single core at 100Gbps.
Without this patch - 73.5 Mpps
With this patch - 72.8 Mpps

We are taking the shared queue runtime option without a separate fwd engine,
and to have zero performance impact and no compile time flag,
I think the only way is to have a function template.
Example change to outline the function template principle:

static inline void
__pkt_burst_io_forward(struct fwd_stream *fs, const uint64_t flag)
{
        /* existing io forwarding code ... */

        /* Introduce the new checks under: */
        if (flag & SHARED_QUEUE) {
                /* shared rxq handling ... */
        }
}

Have two versions of io_fwd_engine.packet_fwd per engine.

- first version
static void
pkt_burst_io_forward(struct fwd_stream *fs)
{
        __pkt_burst_io_forward(fs, 0);
}

- second version
static void
pkt_burst_io_forward_shared_queue(struct fwd_stream *fs)
{
        __pkt_burst_io_forward(fs, SHARED_QUEUE);
}

Update io_fwd_engine.packet_fwd in the slow path to the respective version
based on the offload.

If the shared offload is not selected, pkt_burst_io_forward() will be selected,
and __pkt_burst_io_forward() will be compiled as the !SHARED_QUEUE version,
i.e. the same as the existing code.

>
> >
> > >
> > > From test perspective, better to have all forward engine to verify
> > > shared rxq, test team want to run the regression with less impacts. Hope to have a solution to utilize all forwarding engines
> > seamlessly.
> >
> > Yes. it good goal. testpmd forward performance using as synthetic bench everyone.
> > I think, we are aligned to not have any regression for the generic forward engine.
> >
> > >
> > > >
> > > > > +       get_end_cycles(fs, start_tsc); }
> > > > > +
> > > > >  /*
> > > > >   * Work-around of a compilation error with ICC on invocations of the
> > > > >   * rte_be_to_cpu_16() function.
> > > > > --
> > > > > 2.25.1
> > > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-19 12:09             ` Xueming(Steven) Li
@ 2021-08-26 11:58               ` Jerin Jacob
  2021-08-28 14:16                 ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-08-26 11:58 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko

On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Thursday, August 19, 2021 1:27 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> >
> > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > >
> > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > In current DPDK framework, each RX queue is pre-loaded with
> > > > > > > mbufs for incoming packets. When number of representors scale
> > > > > > > out in a switch domain, the memory consumption became
> > > > > > > significant. Most important, polling all ports leads to high
> > > > > > > cache miss, high latency and low throughput.
> > > > > > >
> > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > Polling any queue using same shared RX queue receives packets
> > > > > > > from all member ports. Source port is identified by mbuf->port.
> > > > > > >
> > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > index is
> > > > > > > 1:1 mapped in shared group.
> > > > > > >
> > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > >
> > > > > > > Multiple groups is supported by group ID.
> > > > > > >
> > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > ---
> > > > > > > Rx queue object could be used as shared Rx queue object, it's
> > > > > > > important to clear all queue control callback api that using queue object:
> > > > > > >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > >
> > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > */
> > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > + switch domain. */
> > > > > >
> > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > How this group is created?
> > > > >
> > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > All ports that supports shared-rxq assigned in same group.
> > > > >
> > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > to support group other than default.
> > > > >
> > > > > To support more groups simultaneously, need to consider testpmd
> > > > > forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > It's possible to specify how many ports to increase group number,
> > > > > but user must schedule stream affinity carefully - error prone.
> > > > >
> > > > > On the other hand, one group should be sufficient for most
> > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > >
> > > > Ack. One group is enough in testpmd.
> > > >
> > > > My question was more about who and how this group is created, Should
> > > > n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW or other HW.
> > > >
> > > > - Create aggregation queue group
> > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > - Pull the packets from the queue group(which internally fetch from
> > > > the Rx queues _attached_)
> > > >
> > > > Does the above kind of sequence, break your representor use case?
> > >
> > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> >
> > Which rte_flow pattern/action for this?
>
> No rte_flow for this; I just recalled that the group in rte_flow is created on demand, not via an API.
> I don't see anything else that needs to be created along with the group, so I doubt whether it is valuable to introduce a new API set to manage groups.

See below.

>
> >
> > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > >   the shared rxq group - this could be an helper API.
> > >
> > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> >
> > Are you doing this feature based on any HW support or it just pure SW thing, If it is SW, It is better to have just new vdev for like
> > drivers/net/bonding/. This we can help aggregate multiple Rxq across the multiple ports of same the driver.
>
> Based on HW support.

In Marvell HW, we have some support; I will outline it here along with some queries.

# We need to create a new HW structure for aggregation
# Connect each Rxq to the new HW structure for aggregation
# Use rx_burst from the new HW structure.

Could you outline your HW support?

Also, I am not able to understand how this will reduce the memory;
at least in our HW we need to create more memory now to deal with this,
as we need to handle the new HW structure.

How does it reduce the memory in your HW? Also, if memory is the
constraint, why NOT reduce the number of queues?

# Also, I was thinking, one way to avoid the fast path or ABI change would be like:

# Driver initializes one more eth_dev_ops in the driver as an aggregator ethdev
# devargs of the new ethdev, or a specific API like
drivers/net/bonding/rte_eth_bond.h, can take the (port, queue)
tuples which need to be aggregated by the new ethdev port
# No change in fastpath or ABI is required in this model.
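
For illustration, a hypothetical slow path API in the spirit of
drivers/net/bonding/rte_eth_bond.h (the rte_eth_agg_* names do not exist,
they only outline the model):

        /* Slow path: create the aggregator ethdev, attach member queues. */
        agg_port = rte_eth_agg_create("net_agg0", SOCKET_ID_ANY);
        for (i = 0; i < nb_members; i++)
                rte_eth_agg_queue_attach(agg_port, 0 /* agg queue */,
                                         member_port[i], 0 /* member queue */);

        /* Fast path: unchanged, the existing API on the aggregator port. */
        nb = rte_eth_rx_burst(agg_port, 0, pkts, MAX_PKT_BURST);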



> Most users might use the PF in a group as the anchor port for rx burst; the current definition should be easy for them to migrate to.
> But some users might prefer grouping some hot plugged/unplugged representors; EAL could provide wrappers, or users could do that
> themselves since the strategy is not complex. Anyway, any suggestion is welcome.
>
> >
> >
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > >         /**
> > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > +/**
> > > > > > > + * Rx queue is shared among ports in same switch domain to
> > > > > > > +save memory,
> > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > + */
> > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > >
> > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-26 11:58               ` Jerin Jacob
@ 2021-08-28 14:16                 ` Xueming(Steven) Li
  2021-08-30  9:31                   ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-28 14:16 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Thursday, August 26, 2021 7:58 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Thursday, August 19, 2021 1:27 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > >
> > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > >
> > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > queue
> > > > > > >
> > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > consumption became significant. Most important, polling
> > > > > > > > all ports leads to high cache miss, high latency and low throughput.
> > > > > > > >
> > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > >
> > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > Queue index is
> > > > > > > > 1:1 mapped in shared group.
> > > > > > > >
> > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > >
> > > > > > > > Multiple groups is supported by group ID.
> > > > > > > >
> > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > ---
> > > > > > > > Rx queue object could be used as shared Rx queue object,
> > > > > > > > it's important to clear all queue control callback api that using queue object:
> > > > > > > >
> > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > >
> > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > */
> > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > + index in switch domain. */
> > > > > > >
> > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > How this group is created?
> > > > > >
> > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > >
> > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > to support group other than default.
> > > > > >
> > > > > > To support more groups simultaneously, need to consider
> > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > It's possible to specify how many ports to increase group
> > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > >
> > > > > > On the other hand, one group should be sufficient for most
> > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > >
> > > > > Ack. One group is enough in testpmd.
> > > > >
> > > > > My question was more about who and how this group is created,
> > > > > Should n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW
> or other HW.
> > > > >
> > > > > - Create aggregation queue group
> > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > - Pull the packets from the queue group(which internally fetch
> > > > > from the Rx queues _attached_)
> > > > >
> > > > > Does the above kind of sequence, break your representor use case?
> > > >
> > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > >
> > > Which rte_flow pattern/action for this?
> >
> > No rte_flow for this; I just recalled that the group in rte_flow is created on demand, not via an API.
> > I don't see anything else that needs to be created along with the group, so I doubt whether it is valuable to introduce a new API set to manage groups.
> 
> See below.
> 
> >
> > >
> > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > >   the shared rxq group - this could be an helper API.
> > > >
> > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > >
> > > Are you doing this feature based on any HW support or it just pure
> > > SW thing, If it is SW, It is better to have just new vdev for like drivers/net/bonding/. This we can help aggregate multiple Rxq across
> the multiple ports of same the driver.
> >
> > Based on HW support.
> 
> In Marvell HW, we have some support; I will outline it here along with some queries.
>
> # We need to create a new HW structure for aggregation
> # Connect each Rxq to the new HW structure for aggregation
> # Use rx_burst from the new HW structure.
>
> Could you outline your HW support?
>
> Also, I am not able to understand how this will reduce the memory;
> at least in our HW we need to create more memory now to deal with this,
> as we need to handle the new HW structure.
>
> How does it reduce the memory in your HW? Also, if memory is the
> constraint, why NOT reduce the number of queues?
> 

Glad to know that Marvell is working on this; what's the status of the driver implementation?

In my PMD implementation it's very similar: a new HW object, a shared memory pool, is created to replace the per-rxq memory pool.
A legacy rxq feeds its queue with as many allocated mbufs as it has descriptors; now shared rxqs share the same pool, so there is no need
to supply mbufs for each rxq, just feed the shared rxq.

So the memory saving is the mbufs per rxq: even with 1000 representors in a shared rxq group, the mbufs consumed amount to a single rxq.
In other words, a new member of a shared rxq doesn't allocate new mbufs to feed its rxq, it just shares the existing shared rxq (HW mempool).
The memory required to set up each rxq doesn't change too much, agreed.
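
To put rough numbers on it (assuming, for illustration, 512 descriptors per rxq
and ~2KB of buffer per mbuf, actual values depend on configuration):

        legacy: 1000 ports * 512 mbufs * 2KB = ~1GB held in rx descriptors
        shared:    1 rxq   * 512 mbufs * 2KB = ~1MB for the whole group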

> # Also, I was thinking, one way to avoid the fast path or ABI change would be like:
>
> # Driver initializes one more eth_dev_ops in the driver as an aggregator ethdev
> # devargs of the new ethdev, or a specific API like
> drivers/net/bonding/rte_eth_bond.h, can take the (port, queue)
> tuples which need to be aggregated by the new ethdev port
> # No change in fastpath or ABI is required in this model.
> 

This could be an option to access the shared rxq. What's the difference with a new PMD?
What's the difference for the PMD driver in creating the new device?

Is it important in your implementation? Does it work with the existing rx_burst API?

> 
> 
> > Most users might use the PF in a group as the anchor port for rx burst; the current definition should be easy for them to migrate to.
> > But some users might prefer grouping some hot plugged/unplugged representors; EAL could provide wrappers, or users could do that
> > themselves since the strategy is not complex. Anyway, any suggestion is welcome.
> >
> > >
> > >
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >         /**
> > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf
> > > > > > > > { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > +/**
> > > > > > > > + * Rx queue is shared among ports in same switch domain
> > > > > > > > +to save memory,
> > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > + */
> > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > >
> > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > | \
> > > > > > > > --
> > > > > > > > 2.25.1
> > > > > > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-26 11:28             ` Jerin Jacob
@ 2021-08-29  7:07               ` Xueming(Steven) Li
  2021-09-01 14:44                 ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-29  7:07 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: Jack Min, dpdk-dev, Xiaoyun Li



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Thursday, August 26, 2021 7:28 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> 
> On Wed, Aug 18, 2021 at 7:38 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Wednesday, August 18, 2021 7:48 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun
> > > Li <xiaoyun.li@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > > wrapper function
> > >
> > > On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common
> > > > > fwd wrapper function
> > > > >
> > > > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > > > >
> > > > > > Added an inline common wrapper function for all fwd engines
> > > > > > which do the following in common:
> > > > > >
> > > > > > 1. get_start_cycles
> > > > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > > > > get_end_cycle
> > > > > >
> > > > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > ---
> > > > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > > > >  1 file changed, 24 insertions(+)
> > > > > >
> > > > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> > > > > > index
> > > > > > 13141dfed9..b685ac48d6 100644
> > > > > > --- a/app/test-pmd/testpmd.h
> > > > > > +++ b/app/test-pmd/testpmd.h
> > > > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t
> > > > > > portid); void remove_tx_dynf_callback(portid_t portid);  int
> > > > > > update_jumbo_frame_offload(portid_t portid);
> > > > > >
> > > > > > +static inline void
> > > > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > > > +       uint16_t nb_rx;
> > > > > > +       uint64_t start_tsc = 0;
> > > > > > +
> > > > > > +       get_start_cycles(&start_tsc);
> > > > > > +
> > > > > > +       /*
> > > > > > +        * Receive a burst of packets and forward them.
> > > > > > +        */
> > > > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > > > +       if (unlikely(nb_rx == 0))
> > > > > > +               return;
> > > > > > +       if (unlikely(rxq_share > 0))
> > > > >
> > > > > See below. It reads a global memory.
> > > > >
> > > > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > > > +       else
> > > > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > > > >
> > > > > New function pointer in fastpath.
> > > > >
> > > > > IMO, We should not create performance regression for the existing forward engine.
> > > > > Can we have a new forward engine just for shared memory testing?
> > > >
> > > > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > > > Based on test data, the impacts almost invisible in legacy mode.
> > >
> > > Are you saying there is zero % regression? If not, could you share the data?
> >
> > Almost zero, here is a quick single core result of rxonly:
> >         32.2Mpps, 58.9cycles/packet
> > Revert the patch to rxonly.c:
> >         32.1Mpps 59.9cycles/packet
> > The result doesn't make sense and I realized that I used batch mbuf free, apply it now:
> >         32.2Mpps, 58.9cycles/packet
> > There were small digit jumps between testpmd restart, I picked the best one.
> > The result is almost same, seems the cost of each packet is small enough.
> > BTW, I'm testing with default burst size and queue depth.
> 
> I tested this on octeontx2 with iofwd on a single core at 100Gbps.
> Without this patch - 73.5 Mpps
> With this patch - 72.8 Mpps
>
> We are taking the shared queue runtime option without a separate fwd engine,
> and to have zero performance impact and no compile time flag,
> I think the only way is to have a function template.
> Example change to outline the function template principle:
>
> static inline void
> __pkt_burst_io_forward(struct fwd_stream *fs, const uint64_t flag)
> {
>         /* existing io forwarding code ... */
>
>         /* Introduce the new checks under: */
>         if (flag & SHARED_QUEUE) {
>                 /* shared rxq handling ... */
>         }
> }
>
> Have two versions of io_fwd_engine.packet_fwd per engine.
>
> - first version
> static void
> pkt_burst_io_forward(struct fwd_stream *fs)
> {
>         __pkt_burst_io_forward(fs, 0);
> }
>
> - second version
> static void
> pkt_burst_io_forward_shared_queue(struct fwd_stream *fs)
> {
>         __pkt_burst_io_forward(fs, SHARED_QUEUE);
> }
>
> Update io_fwd_engine.packet_fwd in the slow path to the respective version
> based on the offload.
>
> If the shared offload is not selected, pkt_burst_io_forward() will be selected,
> and __pkt_burst_io_forward() will be compiled as the !SHARED_QUEUE version,
> i.e. the same as the existing code.

Thanks for the testing and suggestion. So the only difference in the above code is that the access to rxq_share is changed
to a function parameter, right? Have you tested the performance of this? If not, I could verify.
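
So the selection happens once in the slow path; a sketch of what I have in mind,
assuming the global rxq_share option from this series:

        struct fwd_engine io_fwd_engine = {
                .fwd_mode_name = "io",
                .packet_fwd    = pkt_burst_io_forward,
        };

        /* When forwarding starts, after options and offloads are known: */
        if (rxq_share > 0)
                io_fwd_engine.packet_fwd = pkt_burst_io_forward_shared_queue;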

> 
> >
> > >
> > > >
> > > > From test perspective, better to have all forward engine to verify
> > > > shared rxq, test team want to run the regression with less
> > > > impacts. Hope to have a solution to utilize all forwarding engines
> > > seamlessly.
> > >
> > > Yes. it good goal. testpmd forward performance using as synthetic bench everyone.
> > > I think, we are aligned to not have any regression for the generic forward engine.
> > >
> > > >
> > > > >
> > > > > > +       get_end_cycles(fs, start_tsc); }
> > > > > > +
> > > > > >  /*
> > > > > >   * Work-around of a compilation error with ICC on invocations of the
> > > > > >   * rte_be_to_cpu_16() function.
> > > > > > --
> > > > > > 2.25.1
> > > > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-28 14:16                 ` Xueming(Steven) Li
@ 2021-08-30  9:31                   ` Jerin Jacob
  2021-08-30 10:13                     ` Xueming(Steven) Li
  2021-09-15 14:45                     ` Xueming(Steven) Li
  0 siblings, 2 replies; 266+ messages in thread
From: Jerin Jacob @ 2021-08-30  9:31 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko

On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Thursday, August 26, 2021 7:58 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> >
> > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > >
> > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > > queue
> > > > > > > >
> > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > > consumption became significant. Most important, polling
> > > > > > > > > all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > >
> > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > > >
> > > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > > Queue index is
> > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > >
> > > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > > >
> > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > ---
> > > > > > > > > Rx queue object could be used as shared Rx queue object,
> > > > > > > > > it's important to clear all queue control callback api that using queue object:
> > > > > > > > >
> > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > > >
> > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > */
> > > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > > + index in switch domain. */
> > > > > > > >
> > > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > > How this group is created?
> > > > > > >
> > > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > > >
> > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > to support group other than default.
> > > > > > >
> > > > > > > To support more groups simultaneously, need to consider
> > > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > > It's possible to specify how many ports to increase group
> > > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > > >
> > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > > >
> > > > > > Ack. One group is enough in testpmd.
> > > > > >
> > > > > > My question was more about who and how this group is created,
> > > > > > Should n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW
> > or other HW.
> > > > > >
> > > > > > - Create aggregation queue group
> > > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > > - Pull the packets from the queue group(which internally fetch
> > > > > > from the Rx queues _attached_)
> > > > > >
> > > > > > Does the above kind of sequence, break your representor use case?
> > > > >
> > > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > > >
> > > > Which rte_flow pattern/action for this?
> > >
> > > No rte_flow for this; I just recalled that the group in rte_flow is created on demand, not via an API.
> > > I don't see anything else that needs to be created along with the group, so I doubt whether it is valuable to introduce a new API set to manage groups.
> >
> > See below.
> >
> > >
> > > >
> > > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > > >   the shared rxq group - this could be an helper API.
> > > > >
> > > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > > >
> > > > Are you doing this feature based on any HW support or it just pure
> > > > SW thing, If it is SW, It is better to have just new vdev for like drivers/net/bonding/. This we can help aggregate multiple Rxq across
> > the multiple ports of same the driver.
> > >
> > > Based on HW support.
> >
> > In Marvell HW, we have some support; I will outline it here along with some queries.
> >
> > # We need to create a new HW structure for aggregation
> > # Connect each Rxq to the new HW structure for aggregation
> > # Use rx_burst from the new HW structure.
> >
> > Could you outline your HW support?
> >
> > Also, I am not able to understand how this will reduce the memory;
> > at least in our HW we need to create more memory now to deal with this,
> > as we need to handle the new HW structure.
> >
> > How does it reduce the memory in your HW? Also, if memory is the
> > constraint, why NOT reduce the number of queues?
> >
>
> Glad to know that Marvell is working on this; what's the status of the driver implementation?
>
> In my PMD implementation it's very similar: a new HW object, a shared memory pool, is created to replace the per-rxq memory pool.
> A legacy rxq feeds its queue with as many allocated mbufs as it has descriptors; now shared rxqs share the same pool, so there is no need
> to supply mbufs for each rxq, just feed the shared rxq.
>
> So the memory saving is the mbufs per rxq: even with 1000 representors in a shared rxq group, the mbufs consumed amount to a single rxq.
> In other words, a new member of a shared rxq doesn't allocate new mbufs to feed its rxq, it just shares the existing shared rxq (HW mempool).
> The memory required to set up each rxq doesn't change too much, agreed.

We can ask the application to configure the same mempool for multiple
RQs too, right? That is, if the saving is based on sharing the mempool
with multiple RQs.
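
Sharing the pool itself already works with the existing API, e.g. (pool and
ring sizes are arbitrary):

        struct rte_mempool *mp = rte_pktmbuf_pool_create("shared_pool",
                        8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
                        SOCKET_ID_ANY);

        /* Same pool for every port and queue; each rxq still posts its
         * own descriptors from it. */
        RTE_ETH_FOREACH_DEV(port)
                rte_eth_rx_queue_setup(port, 0, 512, SOCKET_ID_ANY,
                                       NULL, mp);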

>
> > # Also, I was thinking, one way to avoid the fast path or ABI change would be like:
> >
> > # Driver initializes one more eth_dev_ops in the driver as an aggregator ethdev
> > # devargs of the new ethdev, or a specific API like
> > drivers/net/bonding/rte_eth_bond.h, can take the (port, queue)
> > tuples which need to be aggregated by the new ethdev port
> > # No change in fastpath or ABI is required in this model.
> >
>
> This could be an option to access the shared rxq. What's the difference with a new PMD?

No ABI and fast path changes are required.

> What's the difference for the PMD driver in creating the new device?
>
> Is it important in your implementation? Does it work with the existing rx_burst API?

Yes. It will work with the existing rx_burst API.

>
> >
> >
> > > Most users might use the PF in a group as the anchor port for rx burst; the current definition should be easy for them to migrate to.
> > > But some users might prefer grouping some hot plugged/unplugged representors; EAL could provide wrappers, or users could do that
> > > themselves since the strategy is not complex. Anyway, any suggestion is welcome.
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >         /**
> > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf
> > > > > > > > > { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > +/**
> > > > > > > > > + * Rx queue is shared among ports in same switch domain
> > > > > > > > > +to save memory,
> > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > + */
> > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > >
> > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > > | \
> > > > > > > > > --
> > > > > > > > > 2.25.1
> > > > > > > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-30  9:31                   ` Jerin Jacob
@ 2021-08-30 10:13                     ` Xueming(Steven) Li
  2021-09-15 14:45                     ` Xueming(Steven) Li
  1 sibling, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-08-30 10:13 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Ferruh Yigit, NBU-Contact-Thomas Monjalon, Andrew Rybchenko



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Monday, August 30, 2021 5:31 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Thursday, August 26, 2021 7:58 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > >
> > > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > >
> > > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > queue
> > > > > > >
> > > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared
> > > > > > > > > Rx queue
> > > > > > > > >
> > > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > > > consumption became significant. Most important,
> > > > > > > > > > polling all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > > >
> > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > > > >
> > > > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > > > Queue index is
> > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > >
> > > > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > > > >
> > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > ---
> > > > > > > > > > Rx queue object could be used as shared Rx queue
> > > > > > > > > > object, it's important to clear all queue control callback api that using queue object:
> > > > > > > > > >
> > > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > > > >
> > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > */
> > > > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > > > + index in switch domain. */
> > > > > > > > >
> > > > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > > > How this group is created?
> > > > > > > >
> > > > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > > > >
> > > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > > to support group other than default.
> > > > > > > >
> > > > > > > > To support more groups simultaneously, need to consider
> > > > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > > > It's possible to specify how many ports to increase group
> > > > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > > > >
> > > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > > > >
> > > > > > > Ack. One group is enough in testpmd.
> > > > > > >
> > > > > > > My question was more about who and how this group is
> > > > > > > created, Should n't we need API to create shared_group? If
> > > > > > > we do the following, at least, I can think, how it can be
> > > > > > > implemented in SW
> > > or other HW.
> > > > > > >
> > > > > > > - Create aggregation queue group
> > > > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > > > - Pull the packets from the queue group(which internally
> > > > > > > fetch from the Rx queues _attached_)
> > > > > > >
> > > > > > > Does the above kind of sequence, break your representor use case?
> > > > > >
> > > > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > > > >
> > > > > Which rte_flow pattern/action for this?
> > > >
> > > > No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> > > > I don’t see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.
> > >
> > > See below.
> > >
> > > >
> > > > >
> > > > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > > > >   the shared rxq group - this could be an helper API.
> > > > > >
> > > > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > > > >
> > > > > Are you doing this feature based on any HW support or it just
> > > > > pure SW thing, If it is SW, It is better to have just new vdev
> > > > > for like drivers/net/bonding/. This we can help aggregate
> > > > > multiple Rxq across
> > > the multiple ports of same the driver.
> > > >
> > > > Based on HW support.
> > >
> > > In Marvel HW, we do some support, I will outline here and some queries on this.
> > >
> > > # We need to create some new HW structure for aggregation
> > > # Connect each Rxq to the new HW structure for aggregation
> > > # Use rx_burst from the new HW structure.
> > >
> > > Could you outline your HW support?
> > >
> > > Also, I am not able to understand how this will reduce the memory;
> > > at least in our HW we need to create more memory now to deal with this, as we need to deal with a new HW structure.
> > >
> > > How does it reduce the memory in your HW? Also, if memory is the constraint, why NOT reduce the number of queues?
> > >
> >
> > Glad to know that Marvel is working on this, what's the status of driver implementation?
> >
> > In my PMD implementation, it's very similar, a new HW object shared memory pool is created to replace per rxq memory pool.
> > Legacy rxq feed queue with allocated mbufs as number of descriptors,
> > now shared rxqs share the same pool, no need to supply mbufs for each rxq, just feed the shared rxq.
> >
> > So the memory saving reflects to mbuf per rxq, even 1000 representors in shared rxq group, the mbufs consumed is one rxq.
> > In other words, new members in shared rxq doesn’t allocate new mbufs to feed rxq, just share with existing shared rxq(HW
> mempool).
> > The memory required to setup each rxq doesn't change too much, agree.
> 
> We can ask the application to configure the same mempool for multiple RQ too. Right? If the saving is based on sharing the mempool
> with multiple RQs.

Yes, sharing the same mempool is fundamental. The difference is how many mbufs are allocated from the pool.
Assuming 512 descriptors per rxq and 4 rxqs per device, that's 2.3KB(mbuf) * 512 * 4 = ~4.6MB / device.
To support 1000 representors, a 4.6GB mempool is needed :)
With shared rxq, only ~4.6MB (one device's worth) of mbufs are allocated from the mempool; they are shared by all rxqs in the group.
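
For reference, a stand-alone sketch of the arithmetic above (the 2.3KB
per-mbuf footprint, 512 descriptors, 4 rxqs and 1000 ports are the
assumed numbers from this mail, not values queried from a device):

	#include <stdio.h>

	int main(void)
	{
		const double mbuf_sz = 2.3 * 1024; /* assumed footprint per mbuf */
		const int nb_desc = 512;           /* descriptors per rxq */
		const int nb_rxq = 4;              /* rxqs per device */
		const int nb_ports = 1000;         /* representor count */
		double per_dev_mb = mbuf_sz * nb_desc * nb_rxq / (1024.0 * 1024);

		printf("per device: %.1f MB\n", per_dev_mb);          /* ~4.6 MB */
		printf("%d dedicated rxq sets: %.1f GB\n",
		       nb_ports, per_dev_mb * nb_ports / 1024);       /* ~4.5 GB */
		/* shared rxq: every port in the group draws from one set */
		printf("shared rxq group: %.1f MB\n", per_dev_mb);    /* ~4.6 MB */
		return 0;
	}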

> 
> >
> > > # Also, I was thinking, one way to avoid the fast path or ABI change would look like:
> > >
> > > # Driver initializes one more eth_dev_ops in driver as aggregator ethdev
> > > # devargs of the new ethdev or a specific API like drivers/net/bonding/rte_eth_bond.h
> > >   can take the (port, queue) tuples which need to be aggregated by the new ethdev port
> > > # No change in fastpath or ABI is required in this model.
> > >
> >
> > This could be an option to access shared rxq. What's the difference of the new PMD?
> 
> No ABI and fast change are required.
> 
> > What's the difference of PMD driver to create the new device?
> >
> > Is it important in your implementation? Does it work with existing rx_burst api?
> 
> Yes . It will work with the existing rx_burst API.
> 
> >
> > >
> > >
> > > > Most user might uses PF in group as the anchor port to rx burst, current definition should be easy for them to migrate.
> > > > but some user might prefer grouping some hot
> > > > plug/unpluggedrepresentors, EAL could provide wrappers, users could do that either due to the strategy not complex enough.
> > > Anyway, welcome any suggestion.
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >         /**
> > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > >          * Only offloads set on rx_queue_offload_capa
> > > > > > > > > > or rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > rte_eth_conf { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > +/**
> > > > > > > > > > + * Rx queue is shared among ports in same switch
> > > > > > > > > > +domain to save memory,
> > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > + */
> > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > >
> > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > >
> > > > > > > > > > DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > > > | \
> > > > > > > > > > --
> > > > > > > > > > 2.25.1
> > > > > > > > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-08-29  7:07               ` Xueming(Steven) Li
@ 2021-09-01 14:44                 ` Xueming(Steven) Li
  2021-09-28  5:54                   ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-01 14:44 UTC (permalink / raw)
  To: Xueming(Steven) Li, Jerin Jacob; +Cc: Jack Min, dpdk-dev, Xiaoyun Li



> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Xueming(Steven) Li
> Sent: Sunday, August 29, 2021 3:08 PM
> To: Jerin Jacob <jerinjacobk@gmail.com>
> Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> 
> 
> 
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Thursday, August 26, 2021 7:28 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li
> > <xiaoyun.li@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > wrapper function
> >
> > On Wed, Aug 18, 2021 at 7:38 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Wednesday, August 18, 2021 7:48 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common
> > > > fwd wrapper function
> > > >
> > > > On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add
> > > > > > common fwd wrapper function
> > > > > >
> > > > > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > >
> > > > > > > Added an inline common wrapper function for all fwd engines
> > > > > > > which do the following in common:
> > > > > > >
> > > > > > > 1. get_start_cycles
> > > > > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > > > > > get_end_cycle
> > > > > > >
> > > > > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > > ---
> > > > > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > > > > >  1 file changed, 24 insertions(+)
> > > > > > >
> > > > > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> > > > > > > index
> > > > > > > 13141dfed9..b685ac48d6 100644
> > > > > > > --- a/app/test-pmd/testpmd.h
> > > > > > > +++ b/app/test-pmd/testpmd.h
> > > > > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t
> > > > > > > portid); void remove_tx_dynf_callback(portid_t portid);  int
> > > > > > > update_jumbo_frame_offload(portid_t portid);
> > > > > > >
> > > > > > > +static inline void
> > > > > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > > > > +       uint16_t nb_rx;
> > > > > > > +       uint64_t start_tsc = 0;
> > > > > > > +
> > > > > > > +       get_start_cycles(&start_tsc);
> > > > > > > +
> > > > > > > +       /*
> > > > > > > +        * Receive a burst of packets and forward them.
> > > > > > > +        */
> > > > > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > > > > +       if (unlikely(nb_rx == 0))
> > > > > > > +               return;
> > > > > > > +       if (unlikely(rxq_share > 0))
> > > > > >
> > > > > > See below. It reads a global memory.
> > > > > >
> > > > > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > > > > +       else
> > > > > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > > > > >
> > > > > > New function pointer in fastpath.
> > > > > >
> > > > > > IMO, We should not create performance regression for the existing forward engine.
> > > > > > Can we have a new forward engine just for shared memory testing?
> > > > >
> > > > > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > > > > Based on test data, the impacts almost invisible in legacy mode.
> > > >
> > > > Are you saying there is zero % regression? If not, could you share the data?
> > >
> > > Almost zero, here is a quick single core result of rxonly:
> > >         32.2Mpps, 58.9cycles/packet
> > > Revert the patch to rxonly.c:
> > >         32.1Mpps 59.9cycles/packet
> > > The result doesn't make sense and I realized that I used batch mbuf free, apply it now:
> > >         32.2Mpps, 58.9cycles/packet
> > > There were small digit jumps between testpmd restart, I picked the best one.
> > > The result is almost same, seems the cost of each packet is small enough.
> > > BTW, I'm testing with default burst size and queue depth.
> >
> > I tested this on octeontx2 with iofwd with a single core at 100Gbps.
> > Without this patch - 73.5 mpps
> > With this patch - 72.8 mpps
> >
> > We are taking the shared queue runtime option without a separate fwd engine,
> > and to have zero performance impact and no compile time flag, then I think the only way is to have a function template.
> > Example change to outline the function template principle:
> >
> > static inline
> > __pkt_burst_io_forward(struct fwd_stream *fs, const u64 flag) {
> >
> > Introduce new checks under
> > if (flags & SHARED_QUEUE)
> >
> >
> > }
> >
> > Have two versions of io_fwd_engine.packet_fwd per engine.
> >
> > - first version
> > static pkt_burst_io_forward(struct fwd_stream *fs) {
> >         return __pkt_burst_io_forward(fs, 0); }
> >
> > - Second version
> > static pkt_burst_io_forward_shared_queue(struct fwd_stream *fs) {
> >         return __pkt_burst_io_forward(fs, SHARED_QUEUE); }
> >
> >
> > Update io_fwd_engine.packet_fwd in slowpath to respective version based on offload.
> >
> > If the shared offload is not selected, pkt_burst_io_forward() will be
> > selected and
> > __pkt_burst_io_forward() will be a compile-time version of !SHARED_QUEUE, aka the same as the existing code.
> 
> Thanks for the testing and suggestion. So the only difference in the code above is that the access to rxq_shared is changed to a function
> parameter, right? Have you tested the performance of this? If not, I could verify.

The performance result looks better by removing this wrapper and hiding the global variable access as you suggested, thanks!
I also tried adding an rxq_share bit field in struct fwd_stream: same result as the static function selection, and it looks like fewer changes.
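
For readers following the thread, a minimal sketch of the function
template idea for the io engine (forward_shared_rxq() and the testpmd
helpers come from this series; the SHARED_RXQ flag value and the
io_fwd() callback name are placeholders for illustration only):

	/* io_fwd() stands in for the engine's per-burst forwarding body. */
	static inline void
	__pkt_burst_io_forward(struct fwd_stream *fs, const uint64_t flags)
	{
		struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
		uint16_t nb_rx;
		uint64_t start_tsc = 0;

		get_start_cycles(&start_tsc);
		nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
					 pkts_burst, nb_pkt_per_burst);
		inc_rx_burst_stats(fs, nb_rx);
		if (unlikely(nb_rx == 0))
			return;
		if (flags & SHARED_RXQ) /* constant, resolved at compile time */
			forward_shared_rxq(fs, nb_rx, pkts_burst, io_fwd);
		else
			io_fwd(fs, nb_rx, pkts_burst);
		get_end_cycles(fs, start_tsc);
	}

	/* Two thin instances; the slow path picks one per the offload. */
	static void
	pkt_burst_io_forward(struct fwd_stream *fs)
	{
		__pkt_burst_io_forward(fs, 0);
	}

	static void
	pkt_burst_io_forward_shared_rxq(struct fwd_stream *fs)
	{
		__pkt_burst_io_forward(fs, SHARED_RXQ);
	}

Since flags is a compile-time constant in each instance, the legacy
path keeps no shared-rxq branch and reads no global variable.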

> 
> >
> > >
> > > >
> > > > >
> > > > > From test perspective, better to have all forward engine to
> > > > > verify shared rxq, test team want to run the regression with
> > > > > less impacts. Hope to have a solution to utilize all forwarding
> > > > > engines
> > > > seamlessly.
> > > >
> > > > Yes. it good goal. testpmd forward performance using as synthetic bench everyone.
> > > > I think, we are aligned to not have any regression for the generic forward engine.
> > > >
> > > > >
> > > > > >
> > > > > > > +       get_end_cycles(fs, start_tsc); }
> > > > > > > +
> > > > > > >  /*
> > > > > > >   * Work-around of a compilation error with ICC on invocations of the
> > > > > > >   * rte_be_to_cpu_16() function.
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-08-30  9:31                   ` Jerin Jacob
  2021-08-30 10:13                     ` Xueming(Steven) Li
@ 2021-09-15 14:45                     ` Xueming(Steven) Li
  2021-09-16  4:16                       ` Jerin Jacob
  1 sibling, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-15 14:45 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

Hi Jerin,

On Mon, 2021-08-30 at 15:01 +0530, Jerin Jacob wrote:
> On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > 
> > 
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Thursday, August 26, 2021 7:58 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > 
> > > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > 
> > > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > > > 
> > > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > > > queue
> > > > > > > > > 
> > > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > > > consumption became significant. Most important, polling
> > > > > > > > > > all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > > > 
> > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > 
> > > > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > > > Queue index is
> > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > 
> > > > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > > > > 
> > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > ---
> > > > > > > > > > Rx queue object could be used as shared Rx queue object,
> > > > > > > > > > it's important to clear all queue control callback api that using queue object:
> > > > > > > > > > 
> > > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > > > > 
> > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > */
> > > > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > > > + index in switch domain. */
> > > > > > > > > 
> > > > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > > > How this group is created?
> > > > > > > > 
> > > > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > > > > 
> > > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > > to support group other than default.
> > > > > > > > 
> > > > > > > > To support more groups simultaneously, need to consider
> > > > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > > > It's possible to specify how many ports to increase group
> > > > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > > > > 
> > > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > > > > 
> > > > > > > Ack. One group is enough in testpmd.
> > > > > > > 
> > > > > > > My question was more about who and how this group is created,
> > > > > > > Should n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW
> > > or other HW.
> > > > > > > 
> > > > > > > - Create aggregation queue group
> > > > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > > > - Pull the packets from the queue group(which internally fetch
> > > > > > > from the Rx queues _attached_)
> > > > > > > 
> > > > > > > Does the above kind of sequence, break your representor use case?
> > > > > > 
> > > > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > > > > 
> > > > > Which rte_flow pattern/action for this?
> > > > 
> > > > No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> > > > I don’t see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.
> > > 
> > > See below.
> > > 
> > > > 
> > > > > 
> > > > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > > > >   the shared rxq group - this could be an helper API.
> > > > > > 
> > > > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > > > > 
> > > > > Are you doing this feature based on any HW support or it just pure
> > > > > SW thing, If it is SW, It is better to have just new vdev for like drivers/net/bonding/. This we can help aggregate multiple Rxq across
> > > the multiple ports of same the driver.
> > > > 
> > > > Based on HW support.
> > > 
> > > In Marvel HW, we do some support, I will outline here and some queries on this.
> > > 
> > > # We need to create some new HW structure for aggregation
> > > # Connect each Rxq to the new HW structure for aggregation
> > > # Use rx_burst from the new HW structure.
> > > 
> > > Could you outline your HW support?
> > > 
> > > Also, I am not able to understand how this will reduce the memory;
> > > at least in our HW we need to create more memory now to deal with this, as we need to deal with a new HW structure.
> > >
> > > How does it reduce the memory in your HW? Also, if memory is the constraint, why NOT reduce the number of queues?
> > > 
> > 
> > Glad to know that Marvel is working on this, what's the status of driver implementation?
> > 
> > In my PMD implementation, it's very similar, a new HW object shared memory pool is created to replace per rxq memory pool.
> > Legacy rxq feed queue with allocated mbufs as number of descriptors, now shared rxqs share the same pool, no need to supply
> > mbufs for each rxq, just feed the shared rxq.
> > 
> > So the memory saving reflects to mbuf per rxq, even 1000 representors in shared rxq group, the mbufs consumed is one rxq.
> > In other words, new members in shared rxq doesn’t allocate new mbufs to feed rxq, just share with existing shared rxq(HW mempool).
> > The memory required to setup each rxq doesn't change too much, agree.
> 
> We can ask the application to configure the same mempool for multiple
> RQ too. Right? If the saving is based on sharing the mempool
> with multiple RQs.
> 
> > 
> > > # Also, I was thinking, one way to avoid the fast path or ABI change would look like:
> > >
> > > # Driver initializes one more eth_dev_ops in driver as aggregator ethdev
> > > # devargs of the new ethdev or a specific API like drivers/net/bonding/rte_eth_bond.h
> > >   can take the (port, queue) tuples which need to be aggregated by the new ethdev port
> > > # No change in fastpath or ABI is required in this model.
> > > 
> > 
> > This could be an option to access shared rxq. What's the difference of the new PMD?
> 
> No ABI and fast change are required.
> 
> > What's the difference of PMD driver to create the new device?
> > 
> > Is it important in your implementation? Does it work with existing rx_burst api?
> 
> Yes . It will work with the existing rx_burst API.
> 

The aggregator ethdev required by the user is a port, so maybe it is
good to add a callback for the PMD to prepare a complete ethdev, just
like creating a representor ethdev - the PMD registers the new port
internally. If the PMD doesn't provide the callback, the ethdev API can
fall back to initializing an empty ethdev by copying the rxq data
(shared) and the rx_burst API from a source port in the share group.
Actually users could do this fallback themselves or with a utility API
(a rough sketch follows).

IIUC, an aggregator ethdev is not a must; do you think we can continue
and leave that design to a later stage?
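
A rough sketch of that fallback, purely for illustration - it pokes at
rte_eth_dev internals (rx_pkt_burst, data->rx_queues) that an
application cannot legitimately touch, so a real solution would need a
proper API; proxy_from_shared_port() is a hypothetical helper name:

	#include <rte_ethdev.h>

	/* Make a "proxy" ethdev poll the shared rxqs of a source port by
	 * copying its rx burst callback and (shared) rxq objects. */
	static void
	proxy_from_shared_port(uint16_t proxy_id, uint16_t src_id,
			       uint16_t nb_rxq)
	{
		struct rte_eth_dev *proxy = &rte_eth_devices[proxy_id];
		struct rte_eth_dev *src = &rte_eth_devices[src_id];
		uint16_t q;

		proxy->rx_pkt_burst = src->rx_pkt_burst;
		for (q = 0; q < nb_rxq; q++)
			proxy->data->rx_queues[q] = src->data->rx_queues[q];
	}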

> > 
> > > 
> > > 
> > > > Most user might uses PF in group as the anchor port to rx burst, current definition should be easy for them to migrate.
> > > > but some user might prefer grouping some hot
> > > > plug/unpluggedrepresentors, EAL could provide wrappers, users could do that either due to the strategy not complex enough.
> > > Anyway, welcome any suggestion.
> > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > >         /**
> > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf
> > > > > > > > > > { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > +/**
> > > > > > > > > > + * Rx queue is shared among ports in same switch domain
> > > > > > > > > > +to save memory,
> > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > + */
> > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > 
> > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > > > > \
> > > > > > > > > > --
> > > > > > > > > > 2.25.1
> > > > > > > > > > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-11 12:59             ` Xueming(Steven) Li
  2021-08-12 14:35               ` Xueming(Steven) Li
@ 2021-09-15 15:34               ` Xueming(Steven) Li
  1 sibling, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-15 15:34 UTC (permalink / raw)
  To: jerinjacobk, ferruh.yigit
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev

On Wed, 2021-08-11 at 12:59 +0000, Xueming(Steven) Li wrote:
> 
> > -----Original Message-----
> > From: Ferruh Yigit <ferruh.yigit@intel.com>
> > Sent: Wednesday, August 11, 2021 8:04 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>; Jerin Jacob <jerinjacobk@gmail.com>
> > Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > 
> > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > 
> > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx
> > > > > > queue
> > > > > > 
> > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > 
> > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > latency and low throughput.
> > > > > > > 
> > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > 
> > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > index is
> > > > > > > 1:1 mapped in shared group.
> > > > > > > 
> > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > 
> > > > > > > Multiple groups is supported by group ID.
> > > > > > 
> > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > 
> > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > 
> > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > 
> > > > > Not quite sure that I understood your question. The control path of
> > > > > is almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > supplied from shared Rx queue in my PMD implementation.
> > > > 
> > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > offload, multiple ethdev receive queues land into the same receive queue, In that case, how the flow order is maintained for
> > respective receive queues.
> > > 
> > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > Packets from same source port could be grouped as a small burst to
> > > process, this will accelerates the performance if traffic come from
> > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for all
> > forwarding engine. Will sent patches soon.
> > > 
> > 
> > All ports will put the packets into the same queue (shared queue), right? Does this mean only a single core will poll, and what will
> > happen if there are multiple cores polling, won't it cause a problem?
> 
> This has been mentioned in the commit log: the shared rxq is supposed to be polled on a single thread (core) - I think it should be a "MUST".
> The result is unexpected if there are multiple cores polling, that's why I added a polling schedule check in testpmd.
> Similar to the rx/tx burst functions, a queue can't be polled on multiple threads (cores), and for performance concerns there is no such check in the eal api.
> 
> If users want to utilize multiple cores to distribute workloads, it's possible to define more groups, queues in different group could be
> could be polled on multiple cores.
> 
> It's possible to poll every member port in group, but not necessary, any port in group could be polled to get packets for all ports in group.
> 
> If the member port subject to hot plug/remove,  it's possible to create a vdev with same queue number, copy rxq object and poll vdev
> as a dedicate proxy for the group.
> 
> > 
> > And if this requires specific changes in the application, I am not sure about the solution, can't this work in a transparent way to the
> > application?
> 
> Yes, we considered different options in the design stage. One possible solution is to cache received packets in rings; this can be done in
> the eth layer, but I'm afraid there is less benefit, as the user still has to be aware of multiple core polling.
> This can be done as a wrapper PMD later, with more effort.

People who want to use shared rxq to save memory need to be conscious
of the core polling rule: dedicate a core for the shared rxq, like the
rule for rxq and txq.

I'm afraid specific changes in the application are a must, but not too
many: polling one port per group is sufficient, as sketched below.
Protections in the data plane will definitely hurt performance :(
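
A minimal sketch of that application-side rule (handle_pkt() and the
choice of anchor port are illustrative; any member port of the group
could be the one that is polled):

	#include <rte_ethdev.h>
	#include <rte_mbuf.h>

	/* handle_pkt(): the application's per-packet handler (placeholder).
	 * One lcore polls a single anchor port per shared-rxq group;
	 * mbuf->port identifies the member port that actually received
	 * each packet. */
	static int
	shared_rxq_poll_loop(void *arg)
	{
		uint16_t anchor_port = *(uint16_t *)arg; /* e.g. the PF */
		struct rte_mbuf *pkts[32];
		uint16_t i, nb;

		for (;;) {
			nb = rte_eth_rx_burst(anchor_port, 0, pkts, 32);
			for (i = 0; i < nb; i++)
				handle_pkt(pkts[i]->port, pkts[i]);
		}
		return 0;
	}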

> 
> > 
> > Overall, is this for optimizing memory for the port representors? If so, can't we have a port-representor-specific solution? Reducing
> > scope can reduce the complexity it brings.
> 
> This feature supports both PF and representor, and yes, the major issue is the memory of representors. Polling all representors also
> introduces more core cache miss latency. This feature essentially aggregates all ports in the group as one port.
> On the other hand, it's useful for rte flow to create offloading flows using a representor as a regular port ID.
> 
As discussed with Jerin below, the major memory consumed by a PF or
representor is the mbufs pre-filled into the rxq. The PMD can't assume
all representors share the same memory pool, or share rxqs internally
in the PMD - the user might schedule representors to different cores.
Defining a shared rxq flag and group looks like a good direction; a
setup sketch follows.
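
A minimal setup sketch with the flag and field added by this series
(port_id, nb_rxd, socket_id and mbuf_pool are assumed to be provided by
the caller; error handling is omitted, and queue 0 must be configured
the same way on every member port of the group):

	#include <rte_ethdev.h>

	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;

	rte_eth_dev_info_get(port_id, &dev_info);
	rxconf = dev_info.default_rxconf;
	if (dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) {
		rxconf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
		rxconf.shared_group = 0; /* all members join group 0 */
	}
	rte_eth_rx_queue_setup(port_id, 0, nb_rxd, socket_id,
			       &rxconf, mbuf_pool);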

> Any new solution/suggestion is welcome - my head is buried in PMD code :)
> 
> > 
> > > > If this offload is only useful for representor case, Can we make this
> > > > offload specific to representor the case by changing its name and scope.
> > > 
> > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > 
> > > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > ---
> > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > 
> > > > > > > 
> > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > +
> > > > > > > +Shared Rx queue
> > > > > > > +---------------
> > > > > > > +
> > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > +
> > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > +
> > > > > > > +
> > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > 
> > > > > > >  Packet type parsing
> > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > >  Queue start/stop     =
> > > > > > >  Runtime Rx queue setup =
> > > > > > >  Runtime Tx queue setup =
> > > > > > > +Shared Rx queue      =
> > > > > > >  Burst mode info      =
> > > > > > >  Power mgmt address monitor =
> > > > > > >  MTU update           =
> > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > 
> > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > `_
> > > > > > > 
> > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > +grows,
> > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > +miss and
> > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > +PF and
> > > > > > > +  representors in same switch domain.
> > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > +the
> > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > +enable
> > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > +return
> > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > +
> > > > > > >  Basic SR-IOV
> > > > > > >  ------------
> > > > > > > 
> > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > >  };
> > > > > > > 
> > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > */
> > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > + switch domain. */
> > > > > > >         /**
> > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > +/**
> > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > +memory,
> > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > + */
> > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > 
> > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > > 
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-09-15 14:45                     ` Xueming(Steven) Li
@ 2021-09-16  4:16                       ` Jerin Jacob
  2021-09-28  5:50                         ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-09-16  4:16 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Wed, Sep 15, 2021 at 8:15 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> Hi Jerin,
>
> On Mon, 2021-08-30 at 15:01 +0530, Jerin Jacob wrote:
> > On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Thursday, August 26, 2021 7:58 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > >
> > > > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > > > >
> > > > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > > > > queue
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > > > > consumption became significant. Most important, polling
> > > > > > > > > > > all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > > > >
> > > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > >
> > > > > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > > > > Queue index is
> > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > >
> > > > > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > > > > >
> > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > ---
> > > > > > > > > > > Rx queue object could be used as shared Rx queue object,
> > > > > > > > > > > it's important to clear all queue control callback api that using queue object:
> > > > > > > > > > >
> > > > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > > > > >
> > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > > */
> > > > > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > > > > + index in switch domain. */
> > > > > > > > > >
> > > > > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > > > > How this group is created?
> > > > > > > > >
> > > > > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > > > > >
> > > > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > > > to support group other than default.
> > > > > > > > >
> > > > > > > > > To support more groups simultaneously, need to consider
> > > > > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > > > > It's possible to specify how many ports to increase group
> > > > > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > > > > >
> > > > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > > > > >
> > > > > > > > Ack. One group is enough in testpmd.
> > > > > > > >
> > > > > > > > My question was more about who and how this group is created,
> > > > > > > > Should n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW
> > > > or other HW.
> > > > > > > >
> > > > > > > > - Create aggregation queue group
> > > > > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > > > > - Pull the packets from the queue group(which internally fetch
> > > > > > > > from the Rx queues _attached_)
> > > > > > > >
> > > > > > > > Does the above kind of sequence, break your representor use case?
> > > > > > >
> > > > > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > > > > >
> > > > > > Which rte_flow pattern/action for this?
> > > > >
> > > > > No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> > > > > I don’t see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.
> > > >
> > > > See below.
> > > >
> > > > >
> > > > > >
> > > > > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > > > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > > > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > > > > >   the shared rxq group - this could be an helper API.
> > > > > > >
> > > > > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > > > > >
> > > > > > Are you doing this feature based on any HW support or it just pure
> > > > > > SW thing, If it is SW, It is better to have just new vdev for like drivers/net/bonding/. This we can help aggregate multiple Rxq across
> > > > the multiple ports of same the driver.
> > > > >
> > > > > Based on HW support.
> > > >
> > > > In Marvel HW, we do some support, I will outline here and some queries on this.
> > > >
> > > > # We need to create some new HW structure for aggregation
> > > > # Connect each Rxq to the new HW structure for aggregation
> > > > # Use rx_burst from the new HW structure.
> > > >
> > > > Could you outline your HW support?
> > > >
> > > > Also, I am not able to understand how this will reduce the memory;
> > > > at least in our HW we need to create more memory now to deal with this, as we need to deal with a new HW structure.
> > > >
> > > > How does it reduce the memory in your HW? Also, if memory is the constraint, why NOT reduce the number of queues?
> > > >
> > >
> > > Glad to know that Marvel is working on this, what's the status of driver implementation?
> > >
> > > In my PMD implementation, it's very similar, a new HW object shared memory pool is created to replace per rxq memory pool.
> > > Legacy rxq feed queue with allocated mbufs as number of descriptors, now shared rxqs share the same pool, no need to supply
> > > mbufs for each rxq, just feed the shared rxq.
> > >
> > > So the memory saving reflects to mbuf per rxq, even 1000 representors in shared rxq group, the mbufs consumed is one rxq.
> > > In other words, new members in shared rxq doesn’t allocate new mbufs to feed rxq, just share with existing shared rxq(HW mempool).
> > > The memory required to setup each rxq doesn't change too much, agree.
> >
> > We can ask the application to configure the same mempool for multiple
> > RQ too. Right? If the saving is based on sharing the mempool
> > with multiple RQs.
> >
> > >
> > > > # Also, I was thinking, one way to avoid the fast path or ABI change would look like:
> > > >
> > > > # Driver initializes one more eth_dev_ops in driver as aggregator ethdev
> > > > # devargs of the new ethdev or a specific API like drivers/net/bonding/rte_eth_bond.h
> > > >   can take the (port, queue) tuples which need to be aggregated by the new ethdev port
> > > > # No change in fastpath or ABI is required in this model.
> > > >
> > >
> > > This could be an option to access a shared rxq. What's the difference with a new PMD?
> >
> > No ABI or fast path change is required.
> >
> > > What's the difference for the PMD driver in creating the new device?
> > >
> > > Is it important in your implementation? Does it work with the existing rx_burst API?
> >
> > Yes. It will work with the existing rx_burst API.
> >
>
> The aggregator ethdev the user requires is a port, so maybe it is good to add
> a callback for the PMD to prepare a complete ethdev, just like creating a
> representor ethdev - the PMD registers the new port internally. If the PMD
> doesn't provide the callback, the ethdev API falls back to initializing an
> empty ethdev by copying the (shared) rxq data and the rx_burst API from the
> source port and share group. Actually, users could do this fallback
> themselves, or with a utility API.
>
> IIUC, an aggregator ethdev is not a must; do you think we can continue and
> leave that design to a later stage?


IMO, an aggregator ethdev reduces the complexity for the application and
hence avoids any change in
test applications etc. I prefer to take that approach. I will leave the
decision to the ethdev maintainers.
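
For illustration, the difference for the application in sketch form,
assuming the rte_eth_shared_rxq_aggregate() API proposed later in this
series; member_port[], queue and group are illustrative names:

    /* Today an application polls every member (representor) port. */
    for (i = 0; i < nb_members; i++)
        nb_rx += rte_eth_rx_burst(member_port[i], queue, pkts, 32);

    /* With an aggregator ethdev, one poll of one port covers the
     * whole share group; mbuf->port still names the real source.
     */
    agg_port = rte_eth_shared_rxq_aggregate(member_port[0], group);
    nb_rx = rte_eth_rx_burst(agg_port, queue, pkts, 32);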


>
> > >
> > > >
> > > >
> > > > > Most users might use the PF in a group as the anchor port for the rx burst; the current definition should make it easy for them to migrate.
> > > > > But some users might prefer grouping some hot
> > > > > plugged/unplugged representors; EAL could provide wrappers, or users could do that themselves since the strategy is not that complex.
> > > > Anyway, any suggestion is welcome.
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >         /**
> > > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf
> > > > > > > > > > > { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > +/**
> > > > > > > > > > > + * Rx queue is shared among ports in same switch domain
> > > > > > > > > > > +to save memory,
> > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > + */
> > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > >
> > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > > > > > \
> > > > > > > > > > > --
> > > > > > > > > > > 2.25.1
> > > > > > > > > > >
>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 0/8] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (2 preceding siblings ...)
  2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
@ 2021-09-17  8:01 ` Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 1/8] " Xueming Li
                     ` (7 more replies)
  2021-09-30 14:55 ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue Xueming Li
                   ` (12 subsequent siblings)
  16 siblings, 8 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit

In the current DPDK framework, all RX queues are pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces the shared RX queue. A PF and representors with
the same configuration in the same switch domain can share an RX queue
set by specifying the shared Rx queue offloading flag and a sharing
group.

All member ports of a shared Rx queue actually share one Rx queue, and
mbufs are pre-loaded only into that one queue, so memory is saved.

Polling any queue that uses the same shared RX queue receives packets
from all member ports. The source port is identified by mbuf->port.

Multiple groups are supported via the group ID. The queue number of the
ports in a shared group should be identical, and queue indexes are 1:1
mapped within the group.
An example of polling two share groups:
  core	group	queue
  0	0	0
  1	0	1
  2	0	2
  3	0	3
  4	1	0
  5	1	1
  6	1	2
  7	1	3

A shared RX queue must be polled on a single thread or core. If both
PF0 and representor0 joined the same share group, pf0rxq0 can't be
polled on core1 while rep0rxq0 is polled on core2. Actually, polling
one port within the share group is sufficient, since polling any port
in the group returns packets for all ports in the group.
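
As a rough sketch (not part of the series), the per-lcore polling loop
matching the table above could look as follows; port_of_group[] and
handle_pkt() are hypothetical application helpers:

    #include <rte_ethdev.h>
    #include <rte_lcore.h>

    /* Hypothetical: anchor member port of each share group. */
    extern uint16_t port_of_group[2];
    extern void handle_pkt(struct rte_mbuf *m);

    static int
    shared_rxq_loop(void *arg __rte_unused)
    {
        unsigned int lc = rte_lcore_id();      /* 0..7 as in the table */
        uint16_t port = port_of_group[lc / 4]; /* share group 0 or 1 */
        uint16_t queue = lc % 4;               /* queue index 0..3 */
        struct rte_mbuf *pkts[32];
        uint16_t i, nb;

        for (;;) {
            nb = rte_eth_rx_burst(port, queue, pkts, 32);
            for (i = 0; i < nb; i++)
                handle_pkt(pkts[i]); /* pkts[i]->port = real source */
        }
        return 0;
    }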

v1:
  - initial version
v2:
  - add testpmd patches
v3:
  - changed the common forwarding API to a macro for performance, thanks Jerin.
  - saved global variables accessed in forwarding into the fwd stream to
    minimize cache misses
  - combined the patches for each forwarding engine
  - support multiple groups in the testpmd "--rxq-share" parameter
  - new API to aggregate a shared rxq group

Xiaoyu Min (1):
  app/testpmd: add common fwd wrapper

Xueming Li (7):
  ethdev: introduce shared Rx queue
  ethdev: new API to aggregate shared Rx queue group
  app/testpmd: dump port and queue info for each packet
  app/testpmd: new parameter to enable shared Rx queue
  app/testpmd: force shared Rx queue polled on same core
  app/testpmd: improve forwarding cache miss
  app/testpmd: support shared Rx queue forwarding

 app/test-pmd/5tswap.c                         |  25 +---
 app/test-pmd/config.c                         | 120 +++++++++++++++++-
 app/test-pmd/csumonly.c                       |  25 +---
 app/test-pmd/flowgen.c                        |  26 ++--
 app/test-pmd/icmpecho.c                       |  30 ++---
 app/test-pmd/ieee1588fwd.c                    |  30 +++--
 app/test-pmd/iofwd.c                          |  24 +---
 app/test-pmd/macfwd.c                         |  24 +---
 app/test-pmd/macswap.c                        |  23 +---
 app/test-pmd/noisy_vnf.c                      |   2 +-
 app/test-pmd/parameters.c                     |  13 ++
 app/test-pmd/rxonly.c                         |  32 ++---
 app/test-pmd/testpmd.c                        |  91 ++++++++++++-
 app/test-pmd/testpmd.h                        |  47 ++++++-
 app/test-pmd/txonly.c                         |   8 +-
 app/test-pmd/util.c                           |   1 +
 doc/guides/nics/features.rst                  |  11 ++
 doc/guides/nics/features/default.ini          |   1 +
 .../prog_guide/switch_representation.rst      |  10 ++
 doc/guides/testpmd_app_ug/run_app.rst         |   5 +
 lib/ethdev/ethdev_driver.h                    |  23 +++-
 lib/ethdev/rte_ethdev.c                       |  23 ++++
 lib/ethdev/rte_ethdev.h                       |  23 ++++
 lib/ethdev/version.map                        |   3 +
 24 files changed, 432 insertions(+), 188 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 1/8] ethdev: introduce shared Rx queue
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  2021-09-27 23:53     ` Ajit Khaparde
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 2/8] ethdev: new API to aggregate shared Rx queue group Xueming Li
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit

In the current DPDK framework, each RX queue is pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. Most
importantly, polling all ports leads to high cache miss rates, high
latency and low throughput.

This patch introduces the shared RX queue. Ports with the same
configuration in a switch domain can share an RX queue set by
specifying a sharing group. Polling any queue that uses the same shared
RX queue receives packets from all member ports. The source port is
identified by mbuf->port.

The queue number of the ports in a shared group should be identical,
and queue indexes are 1:1 mapped within the group.

A shared RX queue must be polled on a single thread or core.

Multiple groups are supported via the group ID.
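
A minimal configuration sketch: the offload flag and the shared_group
field come from this patch, the rest is standard queue setup, and the
capability check assumes a PMD that reports shared Rx queue support:

    struct rte_eth_dev_info dev_info;
    struct rte_eth_rxconf rxconf;

    rte_eth_dev_info_get(port_id, &dev_info);
    rxconf = dev_info.default_rxconf;

    if (dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) {
        rxconf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
        rxconf.shared_group = 0; /* same group on all member ports */
    }

    /* The same queue index must be used on every member port. */
    ret = rte_eth_rx_queue_setup(port_id, queue_id, nb_rxd,
                                 rte_eth_dev_socket_id(port_id),
                                 &rxconf, mbuf_pool);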

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Cc: Jerin Jacob <jerinjacobk@gmail.com>
---
An Rx queue object could be used as a shared Rx queue object; it's
important to sort out all queue control callback APIs that use the
queue object:
  https://mails.dpdk.org/archives/dev/2021-July/215574.html
---
 doc/guides/nics/features.rst                    | 11 +++++++++++
 doc/guides/nics/features/default.ini            |  1 +
 doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
 lib/ethdev/rte_ethdev.c                         |  1 +
 lib/ethdev/rte_ethdev.h                         |  7 +++++++
 5 files changed, 30 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index a96e12d155..2e2a9b1554 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in the same switch domain.
+
+* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 754184ddd4..ebeb4c1851 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c80..45bf5a3a10 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- Memory usage of representors is huge when the number of representors grows,
+  because the PMD always allocates an mbuf for each descriptor of an Rx queue.
+  Polling a large number of ports brings more CPU load, cache misses and
+  latency. A shared Rx queue can be used to share an Rx queue between a PF and
+  representors in the same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
+  is present in the Rx offloading capability of the device info. Set the
+  offloading flag in the device Rx mode or Rx queue configuration to enable
+  the shared Rx queue. Polling any member port of a shared Rx queue returns
+  packets of all ports in the group; the port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index a7c090ce79..b3a58d5e65 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -127,6 +127,7 @@ static const struct {
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
 	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index d2b27c351f..a578c9db9d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	uint32_t shared_group; /**< Shared port group index in switch domain. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1373,6 +1374,12 @@ struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
 #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
+/**
+ * Rx queue is shared among ports in the same switch domain to save memory
+ * and avoid polling each port. Any port in the group can be used to receive
+ * packets. The real source port number is saved in the mbuf->port field.
+ */
+#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 2/8] ethdev: new API to aggregate shared Rx queue group
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 1/8] " Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  2021-09-26 17:54     ` Ajit Khaparde
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 3/8] app/testpmd: dump port and queue info for each packet Xueming Li
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ray Kinsella

This patch introduces a new API to aggregate ports of the same shared
Rx queue group. Only queues with the specified share group are
aggregated. Rx burst and device close are expected to be supported by
the new device.
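
A usage sketch of the new API; the member port, group number and burst
size are illustrative:

    uint16_t agg_port;
    struct rte_mbuf *pkts[32];
    uint16_t nb_rx;

    /* Aggregate share group 0, starting from any member port. */
    agg_port = rte_eth_shared_rxq_aggregate(member_port_id, 0);
    if (agg_port == UINT16_MAX)
        rte_exit(EXIT_FAILURE, "shared rxq aggregation failed\n");

    /* Per the contract above, only Rx burst and close are expected. */
    nb_rx = rte_eth_rx_burst(agg_port, 0, pkts, 32);
    rte_eth_dev_close(agg_port);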

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 lib/ethdev/ethdev_driver.h | 23 ++++++++++++++++++++++-
 lib/ethdev/rte_ethdev.c    | 22 ++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h    | 16 ++++++++++++++++
 lib/ethdev/version.map     |  3 +++
 4 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 524757cf6f..72156a4153 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -786,10 +786,28 @@ typedef int (*eth_get_monitor_addr_t)(void *rxq,
  * @return
  *   Negative errno value on error, number of info entries otherwise.
  */
-
 typedef int (*eth_representor_info_get_t)(struct rte_eth_dev *dev,
 	struct rte_eth_representor_info *info);
 
+/**
+ * @internal
+ * Aggregate shared Rx queue.
+ *
+ * Create a new port used for shared Rx queue polling.
+ *
+ * Only queues with specified share group are aggregated.
+ * At least Rx burst and device close should be supported.
+ *
+ * @param dev
+ *   Ethdev handle of port.
+ * @param group
+ *   Shared Rx queue group to aggregate.
+ * @return
+ *   UINT16_MAX if failed, otherwise aggregated port number.
+ */
+typedef int (*eth_shared_rxq_aggregate_t)(struct rte_eth_dev *dev,
+					  uint32_t group);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -950,6 +968,9 @@ struct eth_dev_ops {
 
 	eth_representor_info_get_t representor_info_get;
 	/**< Get representor info. */
+
+	eth_shared_rxq_aggregate_t shared_rxq_aggregate;
+	/**< Aggregate shared Rx queue. */
 };
 
 /**
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index b3a58d5e65..9f2ef58309 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -6301,6 +6301,28 @@ rte_eth_representor_info_get(uint16_t port_id,
 	return eth_err(port_id, (*dev->dev_ops->representor_info_get)(dev, info));
 }
 
+uint16_t
+rte_eth_shared_rxq_aggregate(uint16_t port_id, uint32_t group)
+{
+	struct rte_eth_dev *dev;
+	uint64_t offloads;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->shared_rxq_aggregate,
+				UINT16_MAX);
+
+	offloads = dev->data->dev_conf.rxmode.offloads;
+	if ((offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) == 0) {
+		RTE_ETHDEV_LOG(ERR, "port_id=%u doesn't support Rx offload\n",
+			       port_id);
+		return UINT16_MAX;
+	}
+
+	return (*dev->dev_ops->shared_rxq_aggregate)(dev, group);
+}
+
 RTE_LOG_REGISTER_DEFAULT(rte_eth_dev_logtype, INFO);
 
 RTE_INIT(ethdev_init_telemetry)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index a578c9db9d..f15d2142b2 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -4895,6 +4895,22 @@ __rte_experimental
 int rte_eth_representor_info_get(uint16_t port_id,
 				 struct rte_eth_representor_info *info);
 
+/**
+ * Aggregate shared Rx queue ports to one port for polling.
+ *
+ * Only queues with the specified share group are aggregated.
+ * Any operation besides Rx burst and device close is unexpected.
+ *
+ * @param port_id
+ *   The port identifier of the device from shared Rx queue group.
+ * @param group
+ *   Shared Rx queue group to aggregate.
+ * @return
+ *   UINT16_MAX if failed, otherwise aggregated port number.
+ */
+__rte_experimental
+uint16_t rte_eth_shared_rxq_aggregate(uint16_t port_id, uint32_t group);
+
 #include <rte_ethdev_core.h>
 
 /**
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 3eece75b72..97a2233508 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -249,6 +249,9 @@ EXPERIMENTAL {
 	rte_mtr_meter_policy_delete;
 	rte_mtr_meter_policy_update;
 	rte_mtr_meter_policy_validate;
+
+	# added in 21.11
+	rte_eth_shared_rxq_aggregate;
 };
 
 INTERNAL {
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 3/8] app/testpmd: dump port and queue info for each packet
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 1/8] " Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 2/8] ethdev: new API to aggregate shared Rx queue group Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 4/8] app/testpmd: new parameter to enable shared Rx queue Xueming Li
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Xiaoyun Li

In case of a shared Rx queue, the source port of the mbufs returned
from one Rx burst can differ.

To support the shared Rx queue, this patch dumps the source port
(mbuf->port) along with the queue for each packet.
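
With verbose mode on, each dumped line then begins with the source port
of the mbuf; an abbreviated, illustrative example (field values made
up, remaining fields elided):

    port 1, src=AA:BB:CC:DD:EE:FF - dst=11:22:33:44:55:66 - type=0x0800 - length=60 - ...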

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/util.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 14a9a251fb..b85fbf75a5 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -100,6 +100,7 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_flow_restore_info info = { 0, };
 
 		mb = pkts[i];
+		MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ", mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 4/8] app/testpmd: new parameter to enable shared Rx queue
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
                     ` (2 preceding siblings ...)
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 3/8] app/testpmd: dump port and queue info for each packet Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 5/8] app/testpmd: force shared Rx queue polled on same core Xueming Li
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Xiaoyun Li

Adds "--rxq-share" parameter to enable shared rxq for each rxq.

Default shared rxq group 0 is used, RX queues in same switch domain
shares same rxq according to queue index.

Shared Rx queue is enabled only if device support offloading flag
RTE_ETH_RX_OFFLOAD_SHARED_RXQ.
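
An illustrative invocation; the PCI address and representor devargs are
examples and depend on the PMD:

    dpdk-testpmd -l 0-3 -a 0000:03:00.0,representor=[0-3] -- -i --rxq-share --rxq=4 --txq=4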

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c                 |  6 +++++-
 app/test-pmd/parameters.c             | 13 +++++++++++++
 app/test-pmd/testpmd.c                | 18 ++++++++++++++++++
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  5 +++++
 5 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index f5765b34f7..8ec5f87ef3 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2707,7 +2707,11 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+				printf(" share group=%u",
+				       rx_conf->shared_group);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 3f94a82e32..de0f1d28cc 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -167,6 +167,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share: number of ports per shared rxq groups\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -607,6 +608,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			0, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1271,6 +1273,17 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
+				if (optarg == NULL) {
+					rxq_share = UINT32_MAX;
+				} else {
+					n = atoi(optarg);
+					if (n >= 0)
+						rxq_share = (uint32_t)n;
+					else
+						rte_exit(EXIT_FAILURE, "rxq-share must be >= 0\n");
+				}
+			}
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 97ae52e17e..417e92ade1 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -498,6 +498,11 @@ uint8_t record_core_cycles;
  */
 uint8_t record_burst_stats;
 
+/*
+ * Number of ports per shared Rx queue group, 0 disable.
+ */
+uint32_t rxq_share;
+
 unsigned int num_sockets = 0;
 unsigned int socket_ids[RTE_MAX_NUMA_NODES];
 
@@ -1506,6 +1511,11 @@ init_config_port_offloads(portid_t pid, uint32_t socket_id)
 		port->dev_conf.txmode.offloads &=
 			~DEV_TX_OFFLOAD_MBUF_FAST_FREE;
 
+	if (rxq_share > 0 &&
+	    (port->dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
+		port->dev_conf.rxmode.offloads |=
+				RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
+
 	/* Apply Rx offloads configuration */
 	for (i = 0; i < port->dev_info.max_rx_queues; i++)
 		port->rx_conf[i].offloads = port->dev_conf.rxmode.offloads;
@@ -3401,6 +3411,14 @@ rxtx_port_config(struct rte_port *port)
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.rx_offload_capa &
+		     RTE_ETH_RX_OFFLOAD_SHARED_RXQ)) {
+			offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
+			port->rx_conf[qid].shared_group = nb_ports / rxq_share;
+		}
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 5863b2f43f..3dfaaad94c 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -477,6 +477,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint32_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern int nb_flows_flowgen;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 640eadeff7..1b9f715608 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -389,6 +389,11 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share=[X]``
+
+    Create all queues in shared RX queue mode if the device supports it.
+    The group number grows every X ports; default is group 0 if X is not specified.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 5/8] app/testpmd: force shared Rx queue polled on same core
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
                     ` (3 preceding siblings ...)
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 4/8] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 6/8] app/testpmd: add common fwd wrapper Xueming Li
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Xiaoyun Li

Shared rxqs share one set of Rx queues within group zero. A shared Rx
queue must be polled from one core.

This patch checks the configuration and stops forwarding if a shared
rxq is scheduled on multiple cores.
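
When the check fails, forwarding is not started and the messages added
below are printed; the values here are illustrative:

    Shared RX queue group 0 can't be scheduled on different cores:
      lcore 1 Port 0 queue 0
      lcore 2 Port 1 queue 0
      please use --nb-cores=1 to limit forwarding cores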

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c  | 96 ++++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |  4 +-
 app/test-pmd/testpmd.h |  2 +
 3 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 8ec5f87ef3..035247c33f 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2883,6 +2883,102 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared rxq scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, portid_t src_port,
+			   queueid_t src_rxq, lcoreid_t src_lc,
+			   uint32_t shared_group)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((rxq_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (fs->rx_queue != src_rxq)
+				continue;
+			if (rxq_conf->shared_group != shared_group)
+				continue;
+			printf("Shared RX queue group %u can't be scheduled on different cores:\n",
+			       shared_group);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("  please use --nb-cores=%hu to limit forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * Shared group must not being scheduled on different core.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Update lcore info stream being scheduled. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((rxq_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check shared rxq not scheduled on remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, fs->rx_port,
+						       fs->rx_queue, lc_id,
+						       rxq_conf->shared_group))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 417e92ade1..cab4b36b04 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2241,10 +2241,12 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 3dfaaad94c..f121a2da90 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -144,6 +144,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -795,6 +796,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 6/8] app/testpmd: add common fwd wrapper
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
                     ` (4 preceding siblings ...)
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 5/8] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  2021-09-17 11:24     ` Jerin Jacob
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 7/8] app/testpmd: improve forwarding cache miss Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 8/8] app/testpmd: support shared Rx queue forwarding Xueming Li
  7 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: Xiaoyu Min, xuemingl, Jerin Jacob, Ferruh Yigit,
	Andrew Rybchenko, Viacheslav Ovsiienko, Thomas Monjalon,
	Lior Margalit, Xiaoyun Li

From: Xiaoyu Min <jackmin@nvidia.com>

Added a common forwarding wrapper function for all fwd engines,
which does the following common work:

- record core cycles
- call rte_eth_rx_burst(..., nb_pkt_per_burst)
- update received packet count
- handle received mbufs with a callback function

For better performance, the function is defined as a macro; a
distilled sketch of the resulting pattern follows below.
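
Distilled from the diff, the conversion pattern for one engine (iofwd
shown) becomes:

    /* Engine-specific part: handle one already-received burst. */
    static void
    io_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
                      struct rte_mbuf **pkts_burst)
    {
        /* ... forward pkts_burst[0..nb_rx-1] ... */
    }

    /* Expands to pkt_burst_fwd(): cycle accounting, Rx burst,
     * stats update, then the engine callback.
     */
    PKT_BURST_FWD(io_forward_stream);

    struct fwd_engine io_fwd_engine = {
        .fwd_mode_name  = "io",
        .port_fwd_begin = NULL,
        .port_fwd_end   = NULL,
        .packet_fwd     = pkt_burst_fwd,
    };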

Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/5tswap.c   | 25 +++++--------------------
 app/test-pmd/csumonly.c | 25 ++++++-------------------
 app/test-pmd/flowgen.c  | 20 +++++---------------
 app/test-pmd/icmpecho.c | 30 ++++++++----------------------
 app/test-pmd/iofwd.c    | 24 +++++-------------------
 app/test-pmd/macfwd.c   | 24 +++++-------------------
 app/test-pmd/macswap.c  | 23 +++++------------------
 app/test-pmd/rxonly.c   | 32 ++++++++------------------------
 app/test-pmd/testpmd.h  | 19 +++++++++++++++++++
 9 files changed, 66 insertions(+), 156 deletions(-)

diff --git a/app/test-pmd/5tswap.c b/app/test-pmd/5tswap.c
index e8cef9623b..8fe940294f 100644
--- a/app/test-pmd/5tswap.c
+++ b/app/test-pmd/5tswap.c
@@ -82,18 +82,16 @@ swap_udp(struct rte_udp_hdr *udp_hdr)
  * Parses each layer and swaps it. When the next layer doesn't match it stops.
  */
 static void
-pkt_burst_5tuple_swap(struct fwd_stream *fs)
+_5tuple_swap_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_port  *txp;
 	struct rte_mbuf *mb;
 	uint16_t next_proto;
 	uint64_t ol_flags;
 	uint16_t proto;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint32_t retry;
-
 	int i;
 	union {
 		struct rte_ether_hdr *eth;
@@ -105,20 +103,6 @@ pkt_burst_5tuple_swap(struct fwd_stream *fs)
 		uint8_t *byte;
 	} h;
 
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
 	ol_flags = ol_flags_init(txp->dev_conf.txmode.offloads);
 	vlan_qinq_set(pkts_burst, nb_rx, ol_flags,
@@ -182,12 +166,13 @@ pkt_burst_5tuple_swap(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-	get_end_cycles(fs, start_tsc);
 }
 
+PKT_BURST_FWD(_5tuple_swap_stream);
+
 struct fwd_engine five_tuple_swap_fwd_engine = {
 	.fwd_mode_name  = "5tswap",
 	.port_fwd_begin = NULL,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = pkt_burst_5tuple_swap,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 38cc256533..9bfc7d10dc 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -763,7 +763,7 @@ pkt_copy_split(const struct rte_mbuf *pkt)
 }
 
 /*
- * Receive a burst of packets, and for each packet:
+ * For each packet in received mbuf:
  *  - parse packet, and try to recognize a supported packet type (1)
  *  - if it's not a supported packet type, don't touch the packet, else:
  *  - reprocess the checksum of all supported layers. This is done in SW
@@ -792,9 +792,9 @@ pkt_copy_split(const struct rte_mbuf *pkt)
  * OUTER_IP is only useful for tunnel packets.
  */
 static void
-pkt_burst_checksum_forward(struct fwd_stream *fs)
+checksum_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
 	struct rte_mbuf *gso_segments[GSO_MAX_PKT_BURST];
 	struct rte_gso_ctx *gso_ctx;
 	struct rte_mbuf **tx_pkts_burst;
@@ -805,7 +805,6 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 	void **gro_ctx;
 	uint16_t gro_pkts_num;
 	uint8_t gro_enable;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t nb_prep;
 	uint16_t i;
@@ -820,18 +819,6 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 	uint16_t nb_segments = 0;
 	int ret;
 
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/* receive a burst of packet */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
 	rx_bad_ip_csum = 0;
 	rx_bad_l4_csum = 0;
 	rx_bad_outer_l4_csum = 0;
@@ -1138,13 +1125,13 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			rte_pktmbuf_free(tx_pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-
-	get_end_cycles(fs, start_tsc);
 }
 
+PKT_BURST_FWD(checksum_forward_stream);
+
 struct fwd_engine csum_fwd_engine = {
 	.fwd_mode_name  = "csum",
 	.port_fwd_begin = NULL,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = pkt_burst_checksum_forward,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/flowgen.c b/app/test-pmd/flowgen.c
index 0d3664a64d..aa45948b4c 100644
--- a/app/test-pmd/flowgen.c
+++ b/app/test-pmd/flowgen.c
@@ -61,10 +61,10 @@ RTE_DEFINE_PER_LCORE(int, _next_flow);
  * still do so in order to maintain traffic statistics.
  */
 static void
-pkt_burst_flow_gen(struct fwd_stream *fs)
+flow_gen_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
 	unsigned pkt_size = tx_pkt_length - 4;	/* Adjust FCS */
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_mempool *mbp;
 	struct rte_mbuf  *pkt = NULL;
 	struct rte_ether_hdr *eth_hdr;
@@ -72,7 +72,6 @@ pkt_burst_flow_gen(struct fwd_stream *fs)
 	struct rte_udp_hdr *udp_hdr;
 	uint16_t vlan_tci, vlan_tci_outer;
 	uint64_t ol_flags = 0;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t nb_dropped;
 	uint16_t nb_pkt;
@@ -80,17 +79,9 @@ pkt_burst_flow_gen(struct fwd_stream *fs)
 	uint16_t i;
 	uint32_t retry;
 	uint64_t tx_offloads;
-	uint64_t start_tsc = 0;
 	int next_flow = RTE_PER_LCORE(_next_flow);
 
-	get_start_cycles(&start_tsc);
-
-	/* Receive a burst of packets and discard them. */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
 	inc_rx_burst_stats(fs, nb_rx);
-	fs->rx_packets += nb_rx;
-
 	for (i = 0; i < nb_rx; i++)
 		rte_pktmbuf_free(pkts_burst[i]);
 
@@ -195,12 +186,11 @@ pkt_burst_flow_gen(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_pkt);
 	}
-
 	RTE_PER_LCORE(_next_flow) = next_flow;
-
-	get_end_cycles(fs, start_tsc);
 }
 
+PKT_BURST_FWD(flow_gen_stream);
+
 static void
 flowgen_begin(portid_t pi)
 {
@@ -211,5 +201,5 @@ struct fwd_engine flow_gen_engine = {
 	.fwd_mode_name  = "flowgen",
 	.port_fwd_begin = flowgen_begin,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = pkt_burst_flow_gen,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/icmpecho.c b/app/test-pmd/icmpecho.c
index 8948f28eb5..467ba330aa 100644
--- a/app/test-pmd/icmpecho.c
+++ b/app/test-pmd/icmpecho.c
@@ -267,13 +267,13 @@ ipv4_hdr_cksum(struct rte_ipv4_hdr *ip_h)
 	(((rte_be_to_cpu_32((ipv4_addr)) >> 24) & 0x000000FF) == 0xE0)
 
 /*
- * Receive a burst of packets, lookup for ICMP echo requests, and, if any,
- * send back ICMP echo replies.
+ * Lookup for ICMP echo requests in received mbuf and, if any,
+ * send back ICMP echo replies to corresponding Tx port.
  */
 static void
-reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
+reply_to_icmp_echo_rqsts_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
 	struct rte_mbuf *pkt;
 	struct rte_ether_hdr *eth_h;
 	struct rte_vlan_hdr *vlan_h;
@@ -283,7 +283,6 @@ reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 	struct rte_ether_addr eth_addr;
 	uint32_t retry;
 	uint32_t ip_addr;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t nb_replies;
 	uint16_t eth_type;
@@ -291,22 +290,9 @@ reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 	uint16_t arp_op;
 	uint16_t arp_pro;
 	uint32_t cksum;
-	uint8_t  i;
+	uint16_t  i;
 	int l2_len;
-	uint64_t start_tsc = 0;
 
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * First, receive a burst of packets.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
 	nb_replies = 0;
 	for (i = 0; i < nb_rx; i++) {
 		if (likely(i < nb_rx - 1))
@@ -509,13 +495,13 @@ reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 			} while (++nb_tx < nb_replies);
 		}
 	}
-
-	get_end_cycles(fs, start_tsc);
 }
 
+PKT_BURST_FWD(reply_to_icmp_echo_rqsts_stream);
+
 struct fwd_engine icmp_echo_engine = {
 	.fwd_mode_name  = "icmpecho",
 	.port_fwd_begin = NULL,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = reply_to_icmp_echo_rqsts,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/iofwd.c b/app/test-pmd/iofwd.c
index 83d098adcb..dbd78167b4 100644
--- a/app/test-pmd/iofwd.c
+++ b/app/test-pmd/iofwd.c
@@ -44,25 +44,11 @@
  * to packets data.
  */
 static void
-pkt_burst_io_forward(struct fwd_stream *fs)
+io_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		  struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint32_t retry;
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
-			pkts_burst, nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-	fs->rx_packets += nb_rx;
 
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue,
 			pkts_burst, nb_rx);
@@ -85,13 +71,13 @@ pkt_burst_io_forward(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-
-	get_end_cycles(fs, start_tsc);
 }
 
+PKT_BURST_FWD(io_forward_stream);
+
 struct fwd_engine io_fwd_engine = {
 	.fwd_mode_name  = "io",
 	.port_fwd_begin = NULL,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = pkt_burst_io_forward,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 0568ea794d..b0728c7597 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -44,32 +44,18 @@
  * before forwarding them.
  */
 static void
-pkt_burst_mac_forward(struct fwd_stream *fs)
+mac_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_port  *txp;
 	struct rte_mbuf  *mb;
 	struct rte_ether_hdr *eth_hdr;
 	uint32_t retry;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t i;
 	uint64_t ol_flags = 0;
 	uint64_t tx_offloads;
-	uint64_t start_tsc = 0;
 
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
 	tx_offloads = txp->dev_conf.txmode.offloads;
 	if (tx_offloads	& DEV_TX_OFFLOAD_VLAN_INSERT)
@@ -116,13 +102,13 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-
-	get_end_cycles(fs, start_tsc);
 }
 
+PKT_BURST_FWD(mac_forward_stream);
+
 struct fwd_engine mac_fwd_engine = {
 	.fwd_mode_name  = "mac",
 	.port_fwd_begin = NULL,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = pkt_burst_mac_forward,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index 310bca06af..cc208944d7 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -50,27 +50,13 @@
  * addresses of packets before forwarding them.
  */
 static void
-pkt_burst_mac_swap(struct fwd_stream *fs)
+mac_swap_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
 	struct rte_port  *txp;
-	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint32_t retry;
-	uint64_t start_tsc = 0;
 
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets and forward them.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
 	txp = &ports[fs->tx_port];
 
 	do_macswap(pkts_burst, nb_rx, txp);
@@ -95,12 +81,13 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 			rte_pktmbuf_free(pkts_burst[nb_tx]);
 		} while (++nb_tx < nb_rx);
 	}
-	get_end_cycles(fs, start_tsc);
 }
 
+PKT_BURST_FWD(mac_swap_stream);
+
 struct fwd_engine mac_swap_engine = {
 	.fwd_mode_name  = "macswap",
 	.port_fwd_begin = NULL,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = pkt_burst_mac_swap,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index c78fc4609a..a7354596b5 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -41,37 +41,21 @@
 #include "testpmd.h"
 
 /*
- * Received a burst of packets.
+ * Process a burst of received packets from same stream.
  */
 static void
-pkt_burst_receive(struct fwd_stream *fs)
+rxonly_forward_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		      struct rte_mbuf **pkts_burst)
 {
-	struct rte_mbuf  *pkts_burst[MAX_PKT_BURST];
-	uint16_t nb_rx;
-	uint16_t i;
-	uint64_t start_tsc = 0;
-
-	get_start_cycles(&start_tsc);
-
-	/*
-	 * Receive a burst of packets.
-	 */
-	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
-				 nb_pkt_per_burst);
-	inc_rx_burst_stats(fs, nb_rx);
-	if (unlikely(nb_rx == 0))
-		return;
-
-	fs->rx_packets += nb_rx;
-	for (i = 0; i < nb_rx; i++)
-		rte_pktmbuf_free(pkts_burst[i]);
-
-	get_end_cycles(fs, start_tsc);
+	RTE_SET_USED(fs);
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
 }
 
+PKT_BURST_FWD(rxonly_forward_stream)
+
 struct fwd_engine rx_only_engine = {
 	.fwd_mode_name  = "rxonly",
 	.port_fwd_begin = NULL,
 	.port_fwd_end   = NULL,
-	.packet_fwd     = pkt_burst_receive,
+	.packet_fwd     = pkt_burst_fwd,
 };
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f121a2da90..4792bef03b 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -1028,6 +1028,25 @@ void add_tx_dynf_callback(portid_t portid);
 void remove_tx_dynf_callback(portid_t portid);
 int update_jumbo_frame_offload(portid_t portid);
 
+#define PKT_BURST_FWD(cb)                                       \
+static void                                                     \
+pkt_burst_fwd(struct fwd_stream *fs)                            \
+{                                                               \
+	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];          \
+	uint16_t nb_rx;                                         \
+	uint64_t start_tsc = 0;                                 \
+								\
+	get_start_cycles(&start_tsc);                           \
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,     \
+			pkts_burst, nb_pkt_per_burst);          \
+	inc_rx_burst_stats(fs, nb_rx);                          \
+	if (unlikely(nb_rx == 0))                               \
+		return;                                         \
+	fs->rx_packets += nb_rx;                                \
+	cb(fs, nb_rx, pkts_burst);                              \
+	get_end_cycles(fs, start_tsc);                          \
+}
+
 /*
  * Work-around of a compilation error with ICC on invocations of the
  * rte_be_to_cpu_16() function.
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 7/8] app/testpmd: improve forwarding cache miss
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
                     ` (5 preceding siblings ...)
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 6/8] app/testpmd: add common fwd wrapper Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 8/8] app/testpmd: support shared Rx queue forwarding Xueming Li
  7 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Xiaoyun Li

To minimize cache misses, this patch adds the flags and burst size used
in forwarding to the stream struct, and moves the condition tests in
forwarding onto per-stream flags.
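
In sketch form, the hot-path effect of the change (taken from the diff
below):

    /* Before: every test dereferences a global variable. */
    if (record_burst_stats)
        fs->rx_burst_stats.pkt_burst_spread[nb_rx]++;

    /* After: the flag is a bitfield in the fwd_stream that the hot
     * path is already touching.
     */
    if (unlikely(fs->record_burst_stats))
        fs->rx_burst_stats.pkt_burst_spread[nb_rx]++;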

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c    | 18 ++++++++++++++----
 app/test-pmd/flowgen.c   |  6 +++---
 app/test-pmd/noisy_vnf.c |  2 +-
 app/test-pmd/testpmd.h   | 21 ++++++++++++---------
 app/test-pmd/txonly.c    |  8 ++++----
 5 files changed, 34 insertions(+), 21 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 035247c33f..5cdf8fa082 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -3050,6 +3050,16 @@ fwd_topology_tx_port_get(portid_t rxp)
 	}
 }
 
+static void
+fwd_stream_set_common(struct fwd_stream *fs)
+{
+	fs->nb_pkt_per_burst = nb_pkt_per_burst;
+	fs->record_burst_stats = !!record_burst_stats;
+	fs->record_core_cycles = !!record_core_cycles;
+	fs->retry_enabled = !!retry_enabled;
+	fs->rxq_share = !!rxq_share;
+}
+
 static void
 simple_fwd_config_setup(void)
 {
@@ -3079,7 +3089,7 @@ simple_fwd_config_setup(void)
 				fwd_ports_ids[fwd_topology_tx_port_get(i)];
 		fwd_streams[i]->tx_queue  = 0;
 		fwd_streams[i]->peer_addr = fwd_streams[i]->tx_port;
-		fwd_streams[i]->retry_enabled = retry_enabled;
+		fwd_stream_set_common(fwd_streams[i]);
 	}
 }
 
@@ -3140,7 +3150,7 @@ rss_fwd_config_setup(void)
 		fs->tx_port = fwd_ports_ids[txp];
 		fs->tx_queue = rxq;
 		fs->peer_addr = fs->tx_port;
-		fs->retry_enabled = retry_enabled;
+		fwd_stream_set_common(fs);
 		rxp++;
 		if (rxp < nb_fwd_ports)
 			continue;
@@ -3255,7 +3265,7 @@ dcb_fwd_config_setup(void)
 				fs->tx_port = fwd_ports_ids[txp];
 				fs->tx_queue = txq + j % nb_tx_queue;
 				fs->peer_addr = fs->tx_port;
-				fs->retry_enabled = retry_enabled;
+				fwd_stream_set_common(fs);
 			}
 			fwd_lcores[lc_id]->stream_nb +=
 				rxp_dcb_info.tc_queue.tc_rxq[i][tc].nb_queue;
@@ -3326,7 +3336,7 @@ icmp_echo_config_setup(void)
 			fs->tx_port = fs->rx_port;
 			fs->tx_queue = rxq;
 			fs->peer_addr = fs->tx_port;
-			fs->retry_enabled = retry_enabled;
+			fwd_stream_set_common(fs);
 			if (verbose_level > 0)
 				printf("  stream=%d port=%d rxq=%d txq=%d\n",
 				       sm_id, fs->rx_port, fs->rx_queue,
diff --git a/app/test-pmd/flowgen.c b/app/test-pmd/flowgen.c
index aa45948b4c..c282f3bcb1 100644
--- a/app/test-pmd/flowgen.c
+++ b/app/test-pmd/flowgen.c
@@ -97,12 +97,12 @@ flow_gen_stream(struct fwd_stream *fs, uint16_t nb_rx,
 	if (tx_offloads	& DEV_TX_OFFLOAD_MACSEC_INSERT)
 		ol_flags |= PKT_TX_MACSEC;
 
-	for (nb_pkt = 0; nb_pkt < nb_pkt_per_burst; nb_pkt++) {
+	for (nb_pkt = 0; nb_pkt < fs->nb_pkt_per_burst; nb_pkt++) {
 		if (!nb_pkt || !nb_clones) {
 			nb_clones = nb_pkt_flowgen_clones;
 			/* Logic limitation */
-			if (nb_clones > nb_pkt_per_burst)
-				nb_clones = nb_pkt_per_burst;
+			if (nb_clones > fs->nb_pkt_per_burst)
+				nb_clones = fs->nb_pkt_per_burst;
 
 			pkt = rte_mbuf_raw_alloc(mbp);
 			if (!pkt)
diff --git a/app/test-pmd/noisy_vnf.c b/app/test-pmd/noisy_vnf.c
index 382a4c2aae..56bf6a4e70 100644
--- a/app/test-pmd/noisy_vnf.c
+++ b/app/test-pmd/noisy_vnf.c
@@ -153,7 +153,7 @@ pkt_burst_noisy_vnf(struct fwd_stream *fs)
 	uint64_t now;
 
 	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
-			pkts_burst, nb_pkt_per_burst);
+			pkts_burst, fs->nb_pkt_per_burst);
 	inc_rx_burst_stats(fs, nb_rx);
 	if (unlikely(nb_rx == 0))
 		goto flush;
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 4792bef03b..3b8796a7a5 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -128,12 +128,17 @@ struct fwd_stream {
 	queueid_t  tx_queue;  /**< TX queue to send forwarded packets */
 	streamid_t peer_addr; /**< index of peer ethernet address of packets */
 
-	unsigned int retry_enabled;
+	uint16_t nb_pkt_per_burst;
+	unsigned int record_burst_stats:1;
+	unsigned int record_core_cycles:1;
+	unsigned int retry_enabled:1;
+	unsigned int rxq_share:1;
 
 	/* "read-write" results */
 	uint64_t rx_packets;  /**< received packets */
 	uint64_t tx_packets;  /**< received packets transmitted */
 	uint64_t fwd_dropped; /**< received packets not forwarded */
+	uint64_t core_cycles; /**< used for RX and TX processing */
 	uint64_t rx_bad_ip_csum ; /**< received packets has bad ip checksum */
 	uint64_t rx_bad_l4_csum ; /**< received packets has bad l4 checksum */
 	uint64_t rx_bad_outer_l4_csum;
@@ -141,7 +146,6 @@ struct fwd_stream {
 	uint64_t rx_bad_outer_ip_csum;
 	/**< received packets having bad outer ip checksum */
 	unsigned int gro_times;	/**< GRO operation times */
-	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
 	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
@@ -750,28 +754,27 @@ port_pci_reg_write(struct rte_port *port, uint32_t reg_off, uint32_t reg_v)
 static inline void
 get_start_cycles(uint64_t *start_tsc)
 {
-	if (record_core_cycles)
-		*start_tsc = rte_rdtsc();
+	*start_tsc = rte_rdtsc();
 }
 
 static inline void
 get_end_cycles(struct fwd_stream *fs, uint64_t start_tsc)
 {
-	if (record_core_cycles)
+	if (unlikely(fs->record_core_cycles))
 		fs->core_cycles += rte_rdtsc() - start_tsc;
 }
 
 static inline void
 inc_rx_burst_stats(struct fwd_stream *fs, uint16_t nb_rx)
 {
-	if (record_burst_stats)
+	if (unlikely(fs->record_burst_stats))
 		fs->rx_burst_stats.pkt_burst_spread[nb_rx]++;
 }
 
 static inline void
 inc_tx_burst_stats(struct fwd_stream *fs, uint16_t nb_tx)
 {
-	if (record_burst_stats)
+	if (unlikely(fs->record_burst_stats))
 		fs->tx_burst_stats.pkt_burst_spread[nb_tx]++;
 }
 
@@ -1032,13 +1035,13 @@ int update_jumbo_frame_offload(portid_t portid);
 static void                                                     \
 pkt_burst_fwd(struct fwd_stream *fs)                            \
 {                                                               \
-	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];          \
+	struct rte_mbuf *pkts_burst[fs->nb_pkt_per_burst];      \
 	uint16_t nb_rx;                                         \
 	uint64_t start_tsc = 0;                                 \
 								\
 	get_start_cycles(&start_tsc);                           \
 	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,     \
-			pkts_burst, nb_pkt_per_burst);          \
+			pkts_burst, fs->nb_pkt_per_burst);      \
 	inc_rx_burst_stats(fs, nb_rx);                          \
 	if (unlikely(nb_rx == 0))                               \
 		return;                                         \
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index aed820f5d3..db6130421c 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -367,8 +367,8 @@ pkt_burst_transmit(struct fwd_stream *fs)
 	eth_hdr.ether_type = rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4);
 
 	if (rte_mempool_get_bulk(mbp, (void **)pkts_burst,
-				nb_pkt_per_burst) == 0) {
-		for (nb_pkt = 0; nb_pkt < nb_pkt_per_burst; nb_pkt++) {
+				fs->nb_pkt_per_burst) == 0) {
+		for (nb_pkt = 0; nb_pkt < fs->nb_pkt_per_burst; nb_pkt++) {
 			if (unlikely(!pkt_burst_prepare(pkts_burst[nb_pkt], mbp,
 							&eth_hdr, vlan_tci,
 							vlan_tci_outer,
@@ -376,12 +376,12 @@ pkt_burst_transmit(struct fwd_stream *fs)
 							nb_pkt, fs))) {
 				rte_mempool_put_bulk(mbp,
 						(void **)&pkts_burst[nb_pkt],
-						nb_pkt_per_burst - nb_pkt);
+						fs->nb_pkt_per_burst - nb_pkt);
 				break;
 			}
 		}
 	} else {
-		for (nb_pkt = 0; nb_pkt < nb_pkt_per_burst; nb_pkt++) {
+		for (nb_pkt = 0; nb_pkt < fs->nb_pkt_per_burst; nb_pkt++) {
 			pkt = rte_mbuf_raw_alloc(mbp);
 			if (pkt == NULL)
 				break;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 8/8] app/testpmd: support shared Rx queue forwarding
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
                     ` (6 preceding siblings ...)
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 7/8] app/testpmd: improve forwarding cache miss Xueming Li
@ 2021-09-17  8:01   ` Xueming Li
  7 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-17  8:01 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Xiaoyun Li

With the shared Rx queue enabled, received packets come from all member
ports of the same shared Rx queue.

This patch adds a common forwarding function for the shared Rx queue.
It resolves the source forwarding stream by matching the packet source
port (mbuf->port) and queue against the local streams on the current
lcore, then invokes a callback to handle the received packets of each
source stream.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/ieee1588fwd.c | 30 +++++++++++------
 app/test-pmd/testpmd.c     | 69 ++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.h     |  9 ++++-
 3 files changed, 97 insertions(+), 11 deletions(-)

diff --git a/app/test-pmd/ieee1588fwd.c b/app/test-pmd/ieee1588fwd.c
index 034f238c34..0151d6de74 100644
--- a/app/test-pmd/ieee1588fwd.c
+++ b/app/test-pmd/ieee1588fwd.c
@@ -90,23 +90,17 @@ port_ieee1588_tx_timestamp_check(portid_t pi)
 }
 
 static void
-ieee1588_packet_fwd(struct fwd_stream *fs)
+ieee1588_fwd_stream(struct fwd_stream *fs, uint16_t nb_rx,
+		struct rte_mbuf **pkt)
 {
-	struct rte_mbuf  *mb;
+	struct rte_mbuf *mb = (*pkt);
 	struct rte_ether_hdr *eth_hdr;
 	struct rte_ether_addr addr;
 	struct ptpv2_msg *ptp_hdr;
 	uint16_t eth_type;
 	uint32_t timesync_index;
 
-	/*
-	 * Receive 1 packet at a time.
-	 */
-	if (rte_eth_rx_burst(fs->rx_port, fs->rx_queue, &mb, 1) == 0)
-		return;
-
-	fs->rx_packets += 1;
-
+	RTE_SET_USED(nb_rx);
 	/*
 	 * Check that the received packet is a PTP packet that was detected
 	 * by the hardware.
@@ -198,6 +192,22 @@ ieee1588_packet_fwd(struct fwd_stream *fs)
 	port_ieee1588_tx_timestamp_check(fs->rx_port);
 }
 
+/*
+ * Wrapper of the real fwd engine.
+ */
+static void
+ieee1588_packet_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *mb;
+
+	if (rte_eth_rx_burst(fs->rx_port, fs->rx_queue, &mb, 1) == 0)
+		return;
+	if (unlikely(fs->rxq_share > 0))
+		forward_shared_rxq(fs, 1, &mb, ieee1588_fwd_stream);
+	else
+		ieee1588_fwd_stream(fs, 1, &mb);
+}
+
 static void
 port_ieee1588_fwd_begin(portid_t pi)
 {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index cab4b36b04..1d82397831 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2106,6 +2106,75 @@ flush_fwd_rx_queues(void)
 	}
 }
 
+/**
+ * Get packet source stream by source port and queue.
+ * All streams of the same shared Rx queue are located on the same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_by_port(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		struct rte_mbuf **pkts, packet_fwd_cb fwd)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		fwd(fs, nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * The source port of each packet is identified by mbuf->port.
+ */
+void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst, packet_fwd_cb fwd)
+{
+	uint16_t i, nb_fs_rx = 1, port;
+
+	/* Locate real source fs according to mbuf->port. */
+	for (i = 0; i < nb_rx; ++i) {
+		if (i + 1 < nb_rx) rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (i + 1 == nb_rx || pkts_burst[i + 1]->port != port) {
+			/* Forward packets with same source port. */
+			forward_by_port(fs, port, nb_fs_rx,
+					&pkts_burst[i + 1 - nb_fs_rx], fwd);
+			nb_fs_rx = 1;
+		} else {
+			nb_fs_rx++;
+		}
+	}
+}
+
 static void
 run_pkt_fwd_on_lcore(struct fwd_lcore *fc, packet_fwd_t pkt_fwd)
 {
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 3b8796a7a5..7869f61f74 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -276,6 +276,8 @@ struct fwd_lcore {
 typedef void (*port_fwd_begin_t)(portid_t pi);
 typedef void (*port_fwd_end_t)(portid_t pi);
 typedef void (*packet_fwd_t)(struct fwd_stream *fs);
+typedef void (*packet_fwd_cb)(struct fwd_stream *fs, uint16_t nb_rx,
+			      struct rte_mbuf **pkts);
 
 struct fwd_engine {
 	const char       *fwd_mode_name; /**< Forwarding mode name. */
@@ -910,6 +912,8 @@ char *list_pkt_forwarding_modes(void);
 char *list_pkt_forwarding_retry_modes(void);
 void set_pkt_forwarding_mode(const char *fwd_mode);
 void start_packet_forwarding(int with_tx_first);
+void forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+			struct rte_mbuf **pkts_burst, packet_fwd_cb fwd);
 void fwd_stats_display(void);
 void fwd_stats_reset(void);
 void stop_packet_forwarding(void);
@@ -1046,7 +1050,10 @@ pkt_burst_fwd(struct fwd_stream *fs)                            \
 	if (unlikely(nb_rx == 0))                               \
 		return;                                         \
 	fs->rx_packets += nb_rx;                                \
-	cb(fs, nb_rx, pkts_burst);                              \
+	if (fs->rxq_share)                                      \
+		forward_shared_rxq(fs, nb_rx, pkts_burst, cb);  \
+	else                                                    \
+		cb(fs, nb_rx, pkts_burst);                      \
 	get_end_cycles(fs, start_tsc);                          \
 }
 
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v3 6/8] app/testpmd: add common fwd wrapper
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 6/8] app/testpmd: add common fwd wrapper Xueming Li
@ 2021-09-17 11:24     ` Jerin Jacob
  0 siblings, 0 replies; 266+ messages in thread
From: Jerin Jacob @ 2021-09-17 11:24 UTC (permalink / raw)
  To: Xueming Li
  Cc: dpdk-dev, Xiaoyu Min, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Xiaoyun Li

On Fri, Sep 17, 2021 at 1:33 PM Xueming Li <xuemingl@nvidia.com> wrote:
>
> From: Xiaoyu Min <jackmin@nvidia.com>
>
> Added common forwarding wrapper function for all fwd engines
> which do the following in common:
>
> - record core cycles
> - call rte_eth_rx_burst(...,nb_pkt_per_burst)
> - update received packets
> - handle received mbufs with callback function
>
> For better performance, the function is defined as macro.
>
> Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> ---
>  app/test-pmd/5tswap.c   | 25 +++++--------------------
>  app/test-pmd/csumonly.c | 25 ++++++-------------------
>  app/test-pmd/flowgen.c  | 20 +++++---------------
>  app/test-pmd/icmpecho.c | 30 ++++++++----------------------
>  app/test-pmd/iofwd.c    | 24 +++++-------------------
>  app/test-pmd/macfwd.c   | 24 +++++-------------------
>  app/test-pmd/macswap.c  | 23 +++++------------------
>  app/test-pmd/rxonly.c   | 32 ++++++++------------------------
>  app/test-pmd/testpmd.h  | 19 +++++++++++++++++++
>  9 files changed, 66 insertions(+), 156 deletions(-)
>
> diff --git a/app/test-pmd/5tswap.c b/app/test-pmd/5tswap.c
> index e8cef9623b..8fe940294f 100644
> --- a/app/test-pmd/5tswap.c
> +++ b/app/test-pmd/5tswap.c
> @@ -82,18 +82,16 @@ swap_udp(struct rte_udp_hdr *udp_hdr)
>   * Parses each layer and swaps it. When the next layer doesn't match it stops.
>   */

> +PKT_BURST_FWD(_5tuple_swap_stream);

Please make _5tuple_swap_stream (aka "cb") an inline function to make sure
the compiler doesn't generate yet another function pointer.

>  struct fwd_engine mac_swap_engine = {
>         .fwd_mode_name  = "macswap",
>         .port_fwd_begin = NULL,
>         .port_fwd_end   = NULL,
> -       .packet_fwd     = pkt_burst_mac_swap,

See below

> +       .packet_fwd     = pkt_burst_fwd,
>
> +#define PKT_BURST_FWD(cb)                                       \

It could probably take a prefix too, like PKT_BURST_FWD(cb, prefix),
to generate a uniquely named function, invoked as
PKT_BURST_FWD(_5tuple_swap_stream, mac_swap) for better readability,
which would also avoid the diff in the section above.


> +static void                                                     \
> +pkt_burst_fwd(struct fwd_stream *fs)                            \

pkt_burst_fwd##prefix(struct fwd_stream *fs)
> +{                                                               \
> +       struct rte_mbuf *pkts_burst[nb_pkt_per_burst];          \
> +       uint16_t nb_rx;                                         \
> +       uint64_t start_tsc = 0;                                 \
> +                                                               \
> +       get_start_cycles(&start_tsc);                           \
> +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,     \
> +                       pkts_burst, nb_pkt_per_burst);          \
> +       inc_rx_burst_stats(fs, nb_rx);                          \
> +       if (unlikely(nb_rx == 0))                               \
> +               return;                                         \
> +       fs->rx_packets += nb_rx;                                \
> +       cb(fs, nb_rx, pkts_burst);                              \
> +       get_end_cycles(fs, start_tsc);                          \
> +}
> +
>  /*
>   * Work-around of a compilation error with ICC on invocations of the
>   * rte_be_to_cpu_16() function.
> --
> 2.33.0
>
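To illustrate both suggestions, here is a rough sketch of the macro
with an inline callback and a name prefix (illustrative only, not the
code that was eventually merged; MAX_PKT_BURST sizes the local array):

#define PKT_BURST_FWD(cb, prefix)                               \
static void                                                     \
pkt_burst_fwd_ ## prefix(struct fwd_stream *fs)                 \
{                                                               \
	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];             \
	uint16_t nb_rx;                                         \
	uint64_t start_tsc = 0;                                 \
								\
	get_start_cycles(&start_tsc);                           \
	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,     \
			pkts_burst, nb_pkt_per_burst);          \
	inc_rx_burst_stats(fs, nb_rx);                          \
	if (unlikely(nb_rx == 0))                               \
		return;                                         \
	fs->rx_packets += nb_rx;                                \
	cb(fs, nb_rx, pkts_burst);                              \
	get_end_cycles(fs, start_tsc);                          \
}

/* 'cb' should be a static inline function so the call is inlined;
 * each engine then instantiates a uniquely named burst function:
 */
PKT_BURST_FWD(_5tuple_swap_stream, 5tswap)

and points .packet_fwd at the generated pkt_burst_fwd_5tswap symbol.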

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-08-11 12:04           ` Ferruh Yigit
  2021-08-11 12:59             ` Xueming(Steven) Li
@ 2021-09-26  5:35             ` Xueming(Steven) Li
  2021-09-28  9:35               ` Jerin Jacob
  1 sibling, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-26  5:35 UTC (permalink / raw)
  To: jerinjacobk, ferruh.yigit
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev

On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > 
> > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > 
> > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > for incoming packets. When number of representors scale out in a
> > > > > > switch domain, the memory consumption became significant. Most
> > > > > > important, polling all ports leads to high cache miss, high
> > > > > > latency and low throughput.
> > > > > > 
> > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > 
> > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > index is
> > > > > > 1:1 mapped in shared group.
> > > > > > 
> > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > 
> > > > > > Multiple groups is supported by group ID.
> > > > > 
> > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > 
> > > > Yes, PF and representor in switch domain could take advantage.
> > > > 
> > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > 
> > > > Not quite sure that I understood your question. The control path of is
> > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > supplied from shared Rx queue in my PMD implementation.
> > > 
> > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > 
> > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq:
> > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward it to the target fs.
> > Packets from the same source port can be grouped as a small burst to process; this accelerates the performance if traffic comes from
> > limited ports. I'll introduce some common api to do shared rxq forwarding, called with a packet handling callback, so it suits
> > all forwarding engines. Will send patches soon.
> > 
> 
> All ports will put the packets into the same queue (shared queue), right?
> Does this mean only a single core will poll it? What will happen if there
> are multiple cores polling; won't it cause a problem?
> 
> And if this requires specific changes in the application, I am not sure about
> the solution, can't this work in a transparent way to the application?

As discussed with Jerin, a new API is introduced in v3 2/8 that
aggregates the ports in the same group into one new port. Users can
schedule polling on the aggregated port instead of all member ports.
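
For reference, a minimal usage sketch of that aggregation API (as
defined in patch v3 2/8 later in this thread; the 'member_port' and
'pkts_burst' variables are placeholders):

	uint16_t agg_port, nb_rx;

	/* 'member_port' is any port whose Rx queues were configured
	 * with RTE_ETH_RX_OFFLOAD_SHARED_RXQ in share group 0.
	 */
	agg_port = rte_eth_shared_rxq_aggregate(member_port, 0);
	if (agg_port == UINT16_MAX)
		rte_exit(EXIT_FAILURE, "aggregating group 0 failed\n");

	/* Poll only the aggregated port; the real source port of
	 * each packet is available in mbuf->port.
	 */
	nb_rx = rte_eth_rx_burst(agg_port, 0, pkts_burst, MAX_PKT_BURST);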

> 
> Overall, is this for optimizing memory for the port represontors? If so can't we
> have a port representor specific solution, reducing scope can reduce the
> complexity it brings?
> 
> > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > scope.
> > 
> > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > 
> > > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > ---
> > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > >  5 files changed, 30 insertions(+)
> > > > > > 
> > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > --- a/doc/guides/nics/features.rst
> > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > 
> > > > > > 
> > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > +
> > > > > > +Shared Rx queue
> > > > > > +---------------
> > > > > > +
> > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > +
> > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > +
> > > > > > +
> > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > 
> > > > > >  Packet type parsing
> > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > b/doc/guides/nics/features/default.ini
> > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > >  Queue start/stop     =
> > > > > >  Runtime Rx queue setup =
> > > > > >  Runtime Tx queue setup =
> > > > > > +Shared Rx queue      =
> > > > > >  Burst mode info      =
> > > > > >  Power mgmt address monitor =
> > > > > >  MTU update           =
> > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > 
> > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > `_
> > > > > > 
> > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > +grows,
> > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > +miss and
> > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > +PF and
> > > > > > +  representors in same switch domain.
> > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > +the
> > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > +enable
> > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > +return
> > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > +
> > > > > >  Basic SR-IOV
> > > > > >  ------------
> > > > > > 
> > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > >  };
> > > > > > 
> > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > */
> > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > + switch domain. */
> > > > > >         /**
> > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > +/**
> > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > +memory,
> > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > + */
> > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > 
> > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > --
> > > > > > 2.25.1
> > > > > > 
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/8] ethdev: new API to aggregate shared Rx queue group
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 2/8] ethdev: new API to aggregate shared Rx queue group Xueming Li
@ 2021-09-26 17:54     ` Ajit Khaparde
  0 siblings, 0 replies; 266+ messages in thread
From: Ajit Khaparde @ 2021-09-26 17:54 UTC (permalink / raw)
  To: Xueming Li
  Cc: dpdk-dev, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ray Kinsella

[-- Attachment #1: Type: text/plain, Size: 4651 bytes --]

On Fri, Sep 17, 2021 at 1:02 AM Xueming Li <xuemingl@nvidia.com> wrote:
>
> This patch introduces new api to aggreated ports among same shared Rx
s/aggregated/aggregate

> queue group.  Only queues with specified share group is aggregated.
s/is/are

> Rx burst and device close are expected to be supported by new device.
>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Minor nits - typos actually!

> ---
>  lib/ethdev/ethdev_driver.h | 23 ++++++++++++++++++++++-
>  lib/ethdev/rte_ethdev.c    | 22 ++++++++++++++++++++++
>  lib/ethdev/rte_ethdev.h    | 16 ++++++++++++++++
>  lib/ethdev/version.map     |  3 +++
>  4 files changed, 63 insertions(+), 1 deletion(-)
>
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 524757cf6f..72156a4153 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -786,10 +786,28 @@ typedef int (*eth_get_monitor_addr_t)(void *rxq,
>   * @return
>   *   Negative errno value on error, number of info entries otherwise.
>   */
> -
>  typedef int (*eth_representor_info_get_t)(struct rte_eth_dev *dev,
>         struct rte_eth_representor_info *info);
>
> +/**
> + * @internal
> + * Aggregate shared Rx queue.
> + *
> + * Create a new port used for shared Rx queue polling.
> + *
> + * Only queues with specified share group are aggregated.
> + * At least Rx burst and device close should be supported.
> + *
> + * @param dev
> + *   Ethdev handle of port.
> + * @param group
> + *   Shared Rx queue group to aggregate.
> + * @return
> + *   UINT16_MAX if failed, otherwise aggregated port number.
> + */
> +typedef int (*eth_shared_rxq_aggregate_t)(struct rte_eth_dev *dev,
> +                                         uint32_t group);
> +
>  /**
>   * @internal A structure containing the functions exported by an Ethernet driver.
>   */
> @@ -950,6 +968,9 @@ struct eth_dev_ops {
>
>         eth_representor_info_get_t representor_info_get;
>         /**< Get representor info. */
> +
> +       eth_shared_rxq_aggregate_t shared_rxq_aggregate;
> +       /**< Aggregate shared Rx queue. */
>  };
>
>  /**
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index b3a58d5e65..9f2ef58309 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -6301,6 +6301,28 @@ rte_eth_representor_info_get(uint16_t port_id,
>         return eth_err(port_id, (*dev->dev_ops->representor_info_get)(dev, info));
>  }
>
> +uint16_t
> +rte_eth_shared_rxq_aggregate(uint16_t port_id, uint32_t group)
> +{
> +       struct rte_eth_dev *dev;
> +       uint64_t offloads;
> +
> +       RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +       dev = &rte_eth_devices[port_id];
> +
> +       RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->shared_rxq_aggregate,
> +                               UINT16_MAX);
> +
> +       offloads = dev->data->dev_conf.rxmode.offloads;
> +       if ((offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) == 0) {
> +               RTE_ETHDEV_LOG(ERR, "port_id=%u doesn't support Rx offload\n",
> +                              port_id);
> +               return UINT16_MAX;
> +       }
> +
> +       return (*dev->dev_ops->shared_rxq_aggregate)(dev, group);
> +}
> +
>  RTE_LOG_REGISTER_DEFAULT(rte_eth_dev_logtype, INFO);
>
>  RTE_INIT(ethdev_init_telemetry)
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index a578c9db9d..f15d2142b2 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -4895,6 +4895,22 @@ __rte_experimental
>  int rte_eth_representor_info_get(uint16_t port_id,
>                                  struct rte_eth_representor_info *info);
>
> +/**
> + * Aggregate shared Rx queue ports to one port for polling.
> + *
> + * Only queues with specified share group is aggregated.
s/is/are

> + * Any operation besides Rx burst and device close is unexpected.
> + *
> + * @param port_id
> + *   The port identifier of the device from shared Rx queue group.
> + * @param group
> + *   Shared Rx queue group to aggregate.
> + * @return
> + *   UINT16_MAX if failed, otherwise aggregated port number.
> + */
> +__rte_experimental
> +uint16_t rte_eth_shared_rxq_aggregate(uint16_t port_id, uint32_t group);
> +
>  #include <rte_ethdev_core.h>
>
>  /**
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 3eece75b72..97a2233508 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -249,6 +249,9 @@ EXPERIMENTAL {
>         rte_mtr_meter_policy_delete;
>         rte_mtr_meter_policy_update;
>         rte_mtr_meter_policy_validate;
> +
> +       # added in 21.11
> +       rte_eth_shared_rxq_aggregate;
>  };
>
>  INTERNAL {
> --
> 2.33.0
>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/8] ethdev: introduce shared Rx queue
  2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 1/8] " Xueming Li
@ 2021-09-27 23:53     ` Ajit Khaparde
  2021-09-28 14:24       ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Ajit Khaparde @ 2021-09-27 23:53 UTC (permalink / raw)
  To: Xueming Li
  Cc: dpdk-dev, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit

[-- Attachment #1: Type: text/plain, Size: 5910 bytes --]

On Fri, Sep 17, 2021 at 1:02 AM Xueming Li <xuemingl@nvidia.com> wrote:
>
> In current DPDK framework, each RX queue is pre-loaded with mbufs for
> incoming packets. When number of representors scale out in a switch
> domain, the memory consumption became significant. Most important,
> polling all ports leads to high cache miss, high latency and low
> throughput.
>
> This patch introduces shared RX queue. Ports with same configuration in
> a switch domain could share RX queue set by specifying sharing group.
> Polling any queue using same shared RX queue receives packets from all
> member ports. Source port is identified by mbuf->port.
>
> Port queue number in a shared group should be identical. Queue index is
> 1:1 mapped in shared group.
>
> Share RX queue must be polled on single thread or core.
>
> Multiple groups is supported by group ID.
Can you clarify this a little more?

Apologies if this was already covered:
* Can't we do this for Tx also?

Couple of nits inline. Thanks

>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Cc: Jerin Jacob <jerinjacobk@gmail.com>
> ---
> Rx queue object could be used as shared Rx queue object, it's important
> to clear all queue control callback api that using queue object:
>   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> ---
>  doc/guides/nics/features.rst                    | 11 +++++++++++
>  doc/guides/nics/features/default.ini            |  1 +
>  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
>  lib/ethdev/rte_ethdev.c                         |  1 +
>  lib/ethdev/rte_ethdev.h                         |  7 +++++++
>  5 files changed, 30 insertions(+)
>
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index a96e12d155..2e2a9b1554 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
>
>
> +.. _nic_features_shared_rx_queue:
> +
> +Shared Rx queue
> +---------------
> +
> +Supports shared Rx queue for ports in same switch domain.
> +
> +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> +* **[provides] mbuf**: ``mbuf.port``.
> +
> +
>  .. _nic_features_packet_type_parsing:
>
>  Packet type parsing
> diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> index 754184ddd4..ebeb4c1851 100644
> --- a/doc/guides/nics/features/default.ini
> +++ b/doc/guides/nics/features/default.ini
> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
>  Queue start/stop     =
>  Runtime Rx queue setup =
>  Runtime Tx queue setup =
> +Shared Rx queue      =
>  Burst mode info      =
>  Power mgmt address monitor =
>  MTU update           =
> diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> index ff6aa91c80..45bf5a3a10 100644
> --- a/doc/guides/prog_guide/switch_representation.rst
> +++ b/doc/guides/prog_guide/switch_representation.rst
> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
>  .. [1] `Ethernet switch device driver model (switchdev)
>         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
>
> +- Memory usage of representors is huge when number of representor grows,
> +  because PMD always allocate mbuf for each descriptor of Rx queue.
> +  Polling the large number of ports brings more CPU load, cache miss and
> +  latency. Shared Rx queue can be used to share Rx queue between PF and
> +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``

"in the same switch"

> +  is present in Rx offloading capability of device info. Setting the
> +  offloading flag in device Rx mode or Rx queue configuration to enable
> +  shared Rx queue. Polling any member port of shared Rx queue can return
"of the shared Rx queue.."

> +  packets of all ports in group, port ID is saved in ``mbuf.port``.

"ports in the group, "

> +
>  Basic SR-IOV
>  ------------
>
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index a7c090ce79..b3a58d5e65 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -127,6 +127,7 @@ static const struct {
>         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
>         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
>         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
>  };
>
>  #undef RTE_RX_OFFLOAD_BIT2STR
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index d2b27c351f..a578c9db9d 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +       uint32_t shared_group; /**< Shared port group index in switch domain. */
>         /**
>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> +/**
> + * Rx queue is shared among ports in same switch domain to save memory,
> + * avoid polling each port. Any port in group can be used to receive packets.

"Any port in the group can"


> + * Real source port number saved in mbuf->port field.
> + */
> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
>
>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> --
> 2.33.0
>
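For context, configuring the proposed offload would look roughly like
this (illustrative sketch; the port, descriptor, socket and mempool
variables are assumed):

	struct rte_eth_rxconf rxq_conf = {
		.offloads = RTE_ETH_RX_OFFLOAD_SHARED_RXQ,
		.shared_group = 0, /* members join share group 0 */
	};

	/* The same queue index on each member port maps to one shared
	 * Rx queue backed by a single set of mbufs.
	 */
	rte_eth_rx_queue_setup(pf_port, 0, nb_rxd, socket_id,
			       &rxq_conf, mb_pool);
	rte_eth_rx_queue_setup(rep_port, 0, nb_rxd, socket_id,
			       &rxq_conf, mb_pool);

	/* Polling queue 0 of either port may then return packets of
	 * both members; the originating port is found in mbuf->port.
	 */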

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue
  2021-09-16  4:16                       ` Jerin Jacob
@ 2021-09-28  5:50                         ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28  5:50 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Thu, 2021-09-16 at 09:46 +0530, Jerin Jacob wrote:
> On Wed, Sep 15, 2021 at 8:15 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > Hi Jerin,
> > 
> > On Mon, 2021-08-30 at 15:01 +0530, Jerin Jacob wrote:
> > > On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Thursday, August 26, 2021 7:58 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > 
> > > > > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > > > 
> > > > > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > > > > > 
> > > > > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > > > > > queue
> > > > > > > > > > > 
> > > > > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > > > > > consumption became significant. Most important, polling
> > > > > > > > > > > > all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > > > > > 
> > > > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > > > 
> > > > > > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > > > > > Queue index is
> > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > 
> > > > > > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > > > > > > 
> > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > > Rx queue object could be used as shared Rx queue object,
> > > > > > > > > > > > it's important to clear all queue control callback api that using queue object:
> > > > > > > > > > > > 
> > > > > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > > > > > > 
> > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > > > */
> > > > > > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > > > > > + index in switch domain. */
> > > > > > > > > > > 
> > > > > > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > > > > > How this group is created?
> > > > > > > > > > 
> > > > > > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > > > > > > 
> > > > > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > > > > to support group other than default.
> > > > > > > > > > 
> > > > > > > > > > To support more groups simultaneously, need to consider
> > > > > > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > > > > > It's possible to specify how many ports to increase group
> > > > > > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > > > > > > 
> > > > > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > > > > > > 
> > > > > > > > > Ack. One group is enough in testpmd.
> > > > > > > > > 
> > > > > > > > > My question was more about who and how this group is created,
> > > > > > > > > Should n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW
> > > > > or other HW.
> > > > > > > > > 
> > > > > > > > > - Create aggregation queue group
> > > > > > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > > > > > - Pull the packets from the queue group(which internally fetch
> > > > > > > > > from the Rx queues _attached_)
> > > > > > > > > 
> > > > > > > > > Does the above kind of sequence, break your representor use case?
> > > > > > > > 
> > > > > > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > > > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > > > > > > 
> > > > > > > Which rte_flow pattern/action for this?
> > > > > > 
> > > > > > No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> > > > > > I don’t see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.
> > > > > 
> > > > > See below.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > > > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > > > > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > > > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > > > > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > > > > > >   the shared rxq group - this could be an helper API.
> > > > > > > > 
> > > > > > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > > > > > > 
> > > > > > > Are you doing this feature based on any HW support or it just pure
> > > > > > > SW thing, If it is SW, It is better to have just new vdev for like drivers/net/bonding/. This we can help aggregate multiple Rxq across
> > > > > the multiple ports of same the driver.
> > > > > > 
> > > > > > Based on HW support.
> > > > > 
> > > > > In Marvel HW, we do some support, I will outline here and some queries on this.
> > > > > 
> > > > > # We need to create some new HW structure for aggregation # Connect each Rxq to the new HW structure for aggregation # Use
> > > > > rx_burst from the new HW structure.
> > > > > 
> > > > > Could you outline your HW support?
> > > > > 
> > > > > Also, I am not able to understand how this will reduce the memory, atleast in our HW need creating more memory now to deal this as
> > > > > we need to deal new HW structure.
> > > > > 
> > > > > How is in your HW it reduces the memory? Also, if memory is the constraint, why NOT reduce the number of queues.
> > > > > 
> > > > 
> > > > Glad to know that Marvel is working on this, what's the status of driver implementation?
> > > > 
> > > > In my PMD implementation, it's very similar, a new HW object shared memory pool is created to replace per rxq memory pool.
> > > > Legacy rxq feed queue with allocated mbufs as number of descriptors, now shared rxqs share the same pool, no need to supply
> > > > mbufs for each rxq, just feed the shared rxq.
> > > > 
> > > > So the memory saving reflects to mbuf per rxq, even 1000 representors in shared rxq group, the mbufs consumed is one rxq.
> > > > In other words, new members in shared rxq doesn’t allocate new mbufs to feed rxq, just share with existing shared rxq(HW mempool).
> > > > The memory required to setup each rxq doesn't change too much, agree.
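
To put rough numbers on the saving (illustrative figures, not measured
in this thread): with 1024 descriptors per Rx queue and 2 KB mbufs,
1000 per-port queues pre-load about 1000 * 1024 * 2 KB ~= 2 GB of mbuf
memory, while one shared Rx queue of the same depth pre-loads only
1024 * 2 KB = 2 MB, independent of the member port count.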
> > > 
> > > We can ask the application to configure the same mempool for multiple
> > > RQ too. RIght? If the saving is based on sharing the mempool
> > > with multiple RQs.
> > > 
> > > > 
> > > > > # Also, I was thinking, one way to avoid the fast path or ABI change would like.
> > > > > 
> > > > > # Driver Initializes one more eth_dev_ops in driver as aggregator ethdev # devargs of new ethdev or specific API like
> > > > > drivers/net/bonding/rte_eth_bond.h can take the argument (port, queue) tuples which needs to aggregate by new ethdev port # No
> > > > > change in fastpath or ABI is required in this model.
> > > > > 
> > > > 
> > > > This could be an option to access shared rxq. What's the difference of the new PMD?
> > > 
> > > No ABI and fast change are required.
> > > 
> > > > What's the difference of PMD driver to create the new device?
> > > > 
> > > > Is it important in your implementation? Does it work with existing rx_burst api?
> > > 
> > > Yes . It will work with the existing rx_burst API.
> > > 
> > 
> > The aggregator ethdev required by the user is a port, so maybe it is
> > good to add a callback for the PMD to prepare a complete ethdev, just
> > like creating a representor ethdev - the PMD registers the new port
> > internally. If the PMD doesn't provide the callback, the ethdev API
> > falls back to initializing an empty ethdev by copying the (shared)
> > rxq data and the rx_burst API from the source port and share group.
> > Actually users can do this fallback themselves or with a util API.
> > 
> > IIUC, an aggregator ethdev is not a must; do you think we can continue
> > and leave that design to a later stage?
> 
> 
> IMO the aggregator ethdev reduces the complexity for the application
> and hence avoids any change in the test application etc. I prefer to
> take that. I will leave the decision to ethdev maintainers.

Hi Jerin, a new API is added for the aggregator, the last one in v3, thanks!

> 
> 
> > 
> > > > 
> > > > > 
> > > > > 
> > > > > > Most user might uses PF in group as the anchor port to rx burst, current definition should be easy for them to migrate.
> > > > > > but some user might prefer grouping some hot
> > > > > > plug/unpluggedrepresentors, EAL could provide wrappers, users could do that either due to the strategy not complex enough.
> > > > > Anyway, welcome any suggestion.
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > >         /**
> > > > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf
> > > > > > > > > > > > { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > +/**
> > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain
> > > > > > > > > > > > +to save memory,
> > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > + */
> > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > > 
> > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > > > > > > \
> > > > > > > > > > > > --
> > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > 
> > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
  2021-09-01 14:44                 ` Xueming(Steven) Li
@ 2021-09-28  5:54                   ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28  5:54 UTC (permalink / raw)
  To: jerinjacobk; +Cc: xiaoyun.li, Jack Min, dev

On Wed, 2021-09-01 at 14:44 +0000, Xueming(Steven) Li wrote:
> 
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Xueming(Steven) Li
> > Sent: Sunday, August 29, 2021 3:08 PM
> > To: Jerin Jacob <jerinjacobk@gmail.com>
> > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li <xiaoyun.li@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function
> > 
> > 
> > 
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Thursday, August 26, 2021 7:28 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>; Xiaoyun Li
> > > <xiaoyun.li@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd
> > > wrapper function
> > > 
> > > On Wed, Aug 18, 2021 at 7:38 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Wednesday, August 18, 2021 7:48 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common
> > > > > fwd wrapper function
> > > > > 
> > > > > On Wed, Aug 18, 2021 at 4:57 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Tuesday, August 17, 2021 5:37 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: Jack Min <jackmin@nvidia.com>; dpdk-dev <dev@dpdk.org>;
> > > > > > > Xiaoyun Li <xiaoyun.li@intel.com>
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v2 06/15] app/testpmd: add
> > > > > > > common fwd wrapper function
> > > > > > > 
> > > > > > > On Wed, Aug 11, 2021 at 7:35 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > 
> > > > > > > > From: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > > > 
> > > > > > > > Added an inline common wrapper function for all fwd engines
> > > > > > > > which do the following in common:
> > > > > > > > 
> > > > > > > > 1. get_start_cycles
> > > > > > > > 2. rte_eth_rx_burst(...,nb_pkt_per_burst)
> > > > > > > > 3. if rxq_share do forward_shared_rxq(), otherwise do fwd directly 4.
> > > > > > > > get_end_cycle
> > > > > > > > 
> > > > > > > > Signed-off-by: Xiaoyu Min <jackmin@nvidia.com>
> > > > > > > > ---
> > > > > > > >  app/test-pmd/testpmd.h | 24 ++++++++++++++++++++++++
> > > > > > > >  1 file changed, 24 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> > > > > > > > index
> > > > > > > > 13141dfed9..b685ac48d6 100644
> > > > > > > > --- a/app/test-pmd/testpmd.h
> > > > > > > > +++ b/app/test-pmd/testpmd.h
> > > > > > > > @@ -1022,6 +1022,30 @@ void add_tx_dynf_callback(portid_t
> > > > > > > > portid); void remove_tx_dynf_callback(portid_t portid);  int
> > > > > > > > update_jumbo_frame_offload(portid_t portid);
> > > > > > > > 
> > > > > > > > +static inline void
> > > > > > > > +do_burst_fwd(struct fwd_stream *fs, packet_fwd_cb fwd) {
> > > > > > > > +       struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> > > > > > > > +       uint16_t nb_rx;
> > > > > > > > +       uint64_t start_tsc = 0;
> > > > > > > > +
> > > > > > > > +       get_start_cycles(&start_tsc);
> > > > > > > > +
> > > > > > > > +       /*
> > > > > > > > +        * Receive a burst of packets and forward them.
> > > > > > > > +        */
> > > > > > > > +       nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
> > > > > > > > +                       pkts_burst, nb_pkt_per_burst);
> > > > > > > > +       inc_rx_burst_stats(fs, nb_rx);
> > > > > > > > +       if (unlikely(nb_rx == 0))
> > > > > > > > +               return;
> > > > > > > > +       if (unlikely(rxq_share > 0))
> > > > > > > 
> > > > > > > See below. It reads a global memory.
> > > > > > > 
> > > > > > > > +               forward_shared_rxq(fs, nb_rx, pkts_burst, fwd);
> > > > > > > > +       else
> > > > > > > > +               (*fwd)(fs, nb_rx, pkts_burst);
> > > > > > > 
> > > > > > > New function pointer in fastpath.
> > > > > > > 
> > > > > > > IMO, We should not create performance regression for the existing forward engine.
> > > > > > > Can we have a new forward engine just for shared memory testing?
> > > > > > 
> > > > > > Yes, fully aware of the performance concern, the global could be defined around record_core_cycles to minimize the impacts.
> > > > > > Based on test data, the impacts almost invisible in legacy mode.
> > > > > 
> > > > > Are you saying there is zero % regression? If not, could you share the data?
> > > > 
> > > > Almost zero, here is a quick single core result of rxonly:
> > > >         32.2Mpps, 58.9cycles/packet
> > > > Revert the patch to rxonly.c:
> > > >         32.1Mpps 59.9cycles/packet
> > > > The result doesn't make sense and I realized that I used batch mbuf free, apply it now:
> > > >         32.2Mpps, 58.9cycles/packet
> > > > There were small digit jumps between testpmd restart, I picked the best one.
> > > > The result is almost same, seems the cost of each packet is small enough.
> > > > BTW, I'm testing with default burst size and queue depth.
> > > 
> > > I tested this on octeontx2 with iofwd with single core with 100Gbps
> > > Without this patch - 73.5mpps With this patch - 72.8 mpps
> > > 
> > > We are taking the shared queue runtime option without a separate fwd engine.
> > > and to have zero performance impact and no compile time flag Then I think, only way to have a function template .
> > > Example change to outline function template principle.
> > > 
> > > static inline
> > > __pkt_burst_io_forward(struct fwd_stream *fs, const u64 flag) {
> > > 
> > > Introduce new checks under
> > > if (flags & SHARED_QUEUE)
> > > 
> > > 
> > > }
> > > 
> > > Have two versions of io_fwd_engine.packet_fwd per engine.
> > > 
> > > - first version
> > > static pkt_burst_io_forward(struct fwd_stream *fs) {
> > >         return __pkt_burst_io_forward(fs, 0); }
> > > 
> > > - Second version
> > > static pkt_burst_io_forward_shared_queue(struct fwd_stream *fs) {
> > >         return __pkt_burst_io_forward(fs, SHARED_QUEUE); }
> > > 
> > > 
> > > Update io_fwd_engine.packet_fwd in slowpath to respective version based on offload.
> > > 
> > > If shared offoad is not selected, pkt_burst_io_forward() will be
> > > selected and
> > > __pkt_burst_io_forward() will be a compile time version of !SHARED_QUEUE aka same as existing coe.
> > 
> > Thanks for testing and suggestion. So the only difference here in above code is access to rxq_shared changed to function parameter,
> > right? Have you tested this performance? If not, I could verify.
> 
> Performance result looks better by removing this wrapper and hide global variable access like you suggested, thanks!
> Tried to add rxq_share bit field  in struct fwd_stream, same result as the static function selection, looks less changes.

The changes reflected in v3, also consolidated patches for each
forwarding engine into one, please check.

> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > From test perspective, better to have all forward engine to
> > > > > > verify shared rxq, test team want to run the regression with
> > > > > > less impacts. Hope to have a solution to utilize all forwarding
> > > > > > engines
> > > > > seamlessly.
> > > > > 
> > > > > Yes. it good goal. testpmd forward performance using as synthetic bench everyone.
> > > > > I think, we are aligned to not have any regression for the generic forward engine.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > +       get_end_cycles(fs, start_tsc); }
> > > > > > > > +
> > > > > > > >  /*
> > > > > > > >   * Work-around of a compilation error with ICC on invocations of the
> > > > > > > >   * rte_be_to_cpu_16() function.
> > > > > > > > --
> > > > > > > > 2.25.1
> > > > > > > > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-26  5:35             ` Xueming(Steven) Li
@ 2021-09-28  9:35               ` Jerin Jacob
  2021-09-28 11:36                 ` Xueming(Steven) Li
                                   ` (3 more replies)
  0 siblings, 4 replies; 266+ messages in thread
From: Jerin Jacob @ 2021-09-28  9:35 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: ferruh.yigit, NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev

On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > >
> > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > In the current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > for incoming packets. When the number of representors scales out in a
> > > > > > > switch domain, the memory consumption becomes significant. Most
> > > > > > > importantly, polling all ports leads to high cache miss rates, high
> > > > > > > latency and low throughput.
> > > > > > > 
> > > > > > > This patch introduces shared RX queue. Ports with the same
> > > > > > > configuration in a switch domain could share an RX queue set by specifying a sharing group.
> > > > > > > Polling any queue using the same shared RX queue receives packets from
> > > > > > > all member ports. The source port is identified by mbuf->port.
> > > > > > > 
> > > > > > > Port queue numbers in a shared group should be identical. The queue
> > > > > > > index is 1:1 mapped in the shared group.
> > > > > > > 
> > > > > > > A shared RX queue is supposed to be polled on the same thread.
> > > > > > > 
> > > > > > > Multiple groups are supported by group ID.
> > > > > >
> > > > > > Is this offload specific to the representor? If so, can its name be changed to be representor-specific?
> > > > > 
> > > > > Yes, PF and representors in a switch domain could take advantage.
> > > > > 
> > > > > > If it is for a generic case, how will the flow ordering be maintained?
> > > > > 
> > > > > Not quite sure that I understood your question. The control path is
> > > > > almost the same as before; PF and representor ports are still needed, rte flows are not impacted.
> > > > > Queues are still needed for each member port; descriptors (mbufs) will be
> > > > > supplied from the shared Rx queue in my PMD implementation.
> > > >
> > > > My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land in the same
> > > > receive queue. In that case, how is the flow order maintained for the respective receive queues?
> > >
> > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in the case of shared rxq:
> > > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward it to the target fs.
> > > Packets from the same source port could be grouped into a small burst to process; this accelerates performance if traffic comes from
> > > a limited number of ports. I'll introduce a common API to do shared rxq forwarding and call it with a packet-handling callback, so it suits
> > > all forwarding engines (a rough sketch follows below). Will send patches soon.
> > >
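> > > A rough sketch of that lookup-and-group dispatch (a sketch only:
> > > fwd_stream_by_port() is a hypothetical helper standing in for the
> > > common API, and the callback signature is illustrative):
> > > 
> > > /* Split one shared-rxq burst into per-source-port sub-bursts and
> > >  * hand each sub-burst to the engine's forwarding callback. */
> > > static void
> > > shared_rxq_dispatch(uint16_t nb_rx, struct rte_mbuf **pkts,
> > > 		    void (*fwd_cb)(struct fwd_stream *fs, uint16_t nb,
> > > 				   struct rte_mbuf **burst))
> > > {
> > > 	uint16_t i, start = 0;
> > > 
> > > 	for (i = 1; i <= nb_rx; i++) {
> > > 		/* Close a sub-burst at the end or when the source port changes. */
> > > 		if (i == nb_rx || pkts[i]->port != pkts[start]->port) {
> > > 			struct fwd_stream *src_fs =
> > > 				fwd_stream_by_port(pkts[start]->port);
> > > 			fwd_cb(src_fs, i - start, &pkts[start]);
> > > 			start = i;
> > > 		}
> > > 	}
> > > }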
> >
> > All ports will put the packets into the same queue (shared queue), right? Does
> > this mean only a single core will poll it? What will happen if there are
> > multiple cores polling, won't it cause a problem?
> > 
> > And if this requires specific changes in the application, I am not sure about
> > the solution; can't this work in a way that is transparent to the application?
>
> As discussed with Jerin, a new API is introduced in v3 2/8 that aggregates ports
> in the same group into one new port. Users could schedule polling on the
> aggregated port instead of all member ports, e.g. as sketched below.
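> 
> A hypothetical usage sketch of that model (agg_port_id stands for
> whatever the aggregation API returns; handle_pkt() is a placeholder):
> 
> struct rte_mbuf *pkts[32];
> uint16_t i, nb;
> 
> /* One poll on the aggregated port returns traffic of the whole group;
>  * the real source port of each packet is in mbuf->port. */
> nb = rte_eth_rx_burst(agg_port_id, 0, pkts, 32);
> for (i = 0; i < nb; i++)
> 	handle_pkt(pkts[i]->port, pkts[i]);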

The v3 still has testpmd changes in the fastpath, right? IMO, for this
feature we should not change the fastpath of the testpmd
application. Instead, testpmd can probably use aggregated ports as a
separate fwd_engine to show how to use this feature.

>
> >
> > Overall, is this for optimizing memory for the port representors? If so, can't we
> > have a port-representor-specific solution? Reducing the scope can reduce the
> > complexity it brings.
> > 
> > > > If this offload is only useful for the representor case, can we make it specific to the representor case by changing its name and
> > > > scope?
> > > 
> > > It works for both PF and representors in the same switch domain; for an application like OVS, few changes are needed.
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > ---
> > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > >  5 files changed, 30 insertions(+)
> > > > > > >
> > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > >
> > > > > > >
> > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > +
> > > > > > > +Shared Rx queue
> > > > > > > +---------------
> > > > > > > +
> > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > +
> > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > +
> > > > > > > +
> > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > >
> > > > > > >  Packet type parsing
> > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > >  Queue start/stop     =
> > > > > > >  Runtime Rx queue setup =
> > > > > > >  Runtime Tx queue setup =
> > > > > > > +Shared Rx queue      =
> > > > > > >  Burst mode info      =
> > > > > > >  Power mgmt address monitor =
> > > > > > >  MTU update           =
> > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > >
> > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > `_
> > > > > > >
> > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > +grows,
> > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > +miss and
> > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > +PF and
> > > > > > > +  representors in same switch domain.
> > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > +the
> > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > +enable
> > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > +return
> > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > +
> > > > > > >  Basic SR-IOV
> > > > > > >  ------------
> > > > > > >
> > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > >  };
> > > > > > >
> > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > */
> > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > + switch domain. */
> > > > > > >         /**
> > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > +/**
> > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > +memory,
> > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > + */
> > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > >
> > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > >
> >
>
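
For context, enabling the proposed offload would look roughly as below at
queue setup time (a sketch assuming the RFC's RTE_ETH_RX_OFFLOAD_SHARED_RXQ
flag and shared_group field; error handling omitted):

struct rte_eth_rxconf rxconf = dev_info.default_rxconf;

rxconf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
rxconf.shared_group = 0; /* ports configured with the same group share */
rte_eth_rx_queue_setup(port_id, 0 /* queue id */, nb_rxd, socket_id,
		       &rxconf, mbuf_pool);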

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28  9:35               ` Jerin Jacob
  2021-09-28 11:36                 ` Xueming(Steven) Li
  2021-09-28 11:37                 ` Xueming(Steven) Li
@ 2021-09-28 11:37                 ` Xueming(Steven) Li
  2021-09-28 12:58                   ` Jerin Jacob
  2021-09-28 12:59                 ` Xueming(Steven) Li
  3 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28 11:37 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > 
> > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > 
> > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > 
> > > > > > > > In the current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > for incoming packets. When the number of representors scales out in a
> > > > > > > > switch domain, the memory consumption becomes significant. Most
> > > > > > > > importantly, polling all ports leads to high cache miss rates, high
> > > > > > > > latency and low throughput.
> > > > > > > > 
> > > > > > > > This patch introduces shared RX queue. Ports with the same
> > > > > > > > configuration in a switch domain could share an RX queue set by specifying a sharing group.
> > > > > > > > Polling any queue using the same shared RX queue receives packets from
> > > > > > > > all member ports. The source port is identified by mbuf->port.
> > > > > > > > 
> > > > > > > > Port queue numbers in a shared group should be identical. The queue
> > > > > > > > index is 1:1 mapped in the shared group.
> > > > > > > > 
> > > > > > > > A shared RX queue is supposed to be polled on the same thread.
> > > > > > > > 
> > > > > > > > Multiple groups are supported by group ID.
> > > > > > > 
> > > > > > > Is this offload specific to the representor? If so, can its name be changed to be representor-specific?
> > > > > > 
> > > > > > Yes, PF and representors in a switch domain could take advantage.
> > > > > > 
> > > > > > > If it is for a generic case, how will the flow ordering be maintained?
> > > > > > 
> > > > > > Not quite sure that I understood your question. The control path is
> > > > > > almost the same as before; PF and representor ports are still needed, rte flows are not impacted.
> > > > > > Queues are still needed for each member port; descriptors (mbufs) will be
> > > > > > supplied from the shared Rx queue in my PMD implementation.
> > > > > 
> > > > > My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land in the same
> > > > > receive queue. In that case, how is the flow order maintained for the respective receive queues?
> > > > 
> > > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in the case of shared rxq:
> > > > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward it to the target fs.
> > > > Packets from the same source port could be grouped into a small burst to process; this accelerates performance if traffic comes from
> > > > a limited number of ports. I'll introduce a common API to do shared rxq forwarding and call it with a packet-handling callback, so it suits
> > > > all forwarding engines. Will send patches soon.
> > > > 
> > > 
> > > All ports will put the packets into the same queue (shared queue), right? Does
> > > this mean only a single core will poll it? What will happen if there are
> > > multiple cores polling, won't it cause a problem?
> > > 
> > > And if this requires specific changes in the application, I am not sure about
> > > the solution; can't this work in a way that is transparent to the application?
> > 
> > As discussed with Jerin, a new API is introduced in v3 2/8 that aggregates ports
> > in the same group into one new port. Users could schedule polling on the
> > aggregated port instead of all member ports.
> 
> The v3 still has testpmd changes in the fastpath, right? IMO, for this
> feature we should not change the fastpath of the testpmd
> application. Instead, testpmd can probably use aggregated ports as a
> separate fwd_engine to show how to use this feature.

Good point to discuss :) There are two strategies for polling a shared
Rxq (see the sketch after this list):
1. polling each member port
   All forwarding engines can be reused to work as before.
   My testpmd patches are efforts towards this direction.
   Does your PMD support this?
2. polling the aggregated port
   Besides the forwarding engine, this needs more work to demo it.
   This is an optional API, not supported by my PMD yet.
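
A hedged sketch contrasting the two strategies (member_ports[], agg_port_id
and process_burst() are placeholders, not a real API):

#include <rte_ethdev.h>

#define BURST 32

static void process_burst(struct rte_mbuf **pkts, uint16_t nb); /* placeholder */

/* Strategy 1: poll every member port; with shared rxq, any member may
 * return packets of the whole group, real source port in mbuf->port. */
static void
poll_member_ports(const uint16_t *member_ports, uint16_t nb_members)
{
	struct rte_mbuf *pkts[BURST];
	uint16_t i, nb;

	for (i = 0; i < nb_members; i++) {
		nb = rte_eth_rx_burst(member_ports[i], 0, pkts, BURST);
		process_burst(pkts, nb);
	}
}

/* Strategy 2: a single poll on the aggregated port covers the group. */
static void
poll_aggregated_port(uint16_t agg_port_id)
{
	struct rte_mbuf *pkts[BURST];
	uint16_t nb;

	nb = rte_eth_rx_burst(agg_port_id, 0, pkts, BURST);
	process_burst(pkts, nb);
}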


> 
> > 
> > > 
> > > Overall, is this for optimizing memory for the port representors? If so, can't we
> > > have a port-representor-specific solution? Reducing the scope can reduce the
> > > complexity it brings.
> > > 
> > > > > If this offload is only useful for the representor case, can we make it specific to the representor case by changing its name and
> > > > > scope?
> > > > 
> > > > It works for both PF and representors in the same switch domain; for an application like OVS, few changes are needed.
> > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > ---
> > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > +
> > > > > > > > +Shared Rx queue
> > > > > > > > +---------------
> > > > > > > > +
> > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > +
> > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > +
> > > > > > > > +
> > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > 
> > > > > > > >  Packet type parsing
> > > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > >  Queue start/stop     =
> > > > > > > >  Runtime Rx queue setup =
> > > > > > > >  Runtime Tx queue setup =
> > > > > > > > +Shared Rx queue      =
> > > > > > > >  Burst mode info      =
> > > > > > > >  Power mgmt address monitor =
> > > > > > > >  MTU update           =
> > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > 
> > > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > > `_
> > > > > > > > 
> > > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > > +grows,
> > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > > +miss and
> > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > > +PF and
> > > > > > > > +  representors in same switch domain.
> > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > > +the
> > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > > +enable
> > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > > +return
> > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > +
> > > > > > > >  Basic SR-IOV
> > > > > > > >  ------------
> > > > > > > > 
> > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > >  };
> > > > > > > > 
> > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > */
> > > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > > + switch domain. */
> > > > > > > >         /**
> > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > +/**
> > > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > > +memory,
> > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > + */
> > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > 
> > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > --
> > > > > > > > 2.25.1
> > > > > > > > 
> > > 
> > 



^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 11:37                 ` Xueming(Steven) Li
@ 2021-09-28 12:58                   ` Jerin Jacob
  2021-09-28 13:25                     ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-09-28 12:58 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > >
> > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > In the current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > for incoming packets. When the number of representors scales out in a
> > > > > > > > > switch domain, the memory consumption becomes significant. Most
> > > > > > > > > importantly, polling all ports leads to high cache miss rates, high
> > > > > > > > > latency and low throughput.
> > > > > > > > > 
> > > > > > > > > This patch introduces shared RX queue. Ports with the same
> > > > > > > > > configuration in a switch domain could share an RX queue set by specifying a sharing group.
> > > > > > > > > Polling any queue using the same shared RX queue receives packets from
> > > > > > > > > all member ports. The source port is identified by mbuf->port.
> > > > > > > > > 
> > > > > > > > > Port queue numbers in a shared group should be identical. The queue
> > > > > > > > > index is 1:1 mapped in the shared group.
> > > > > > > > > 
> > > > > > > > > A shared RX queue is supposed to be polled on the same thread.
> > > > > > > > > 
> > > > > > > > > Multiple groups are supported by group ID.
> > > > > > > >
> > > > > > > > Is this offload specific to the representor? If so, can its name be changed to be representor-specific?
> > > > > > > 
> > > > > > > Yes, PF and representors in a switch domain could take advantage.
> > > > > > > 
> > > > > > > > If it is for a generic case, how will the flow ordering be maintained?
> > > > > > > 
> > > > > > > Not quite sure that I understood your question. The control path is
> > > > > > > almost the same as before; PF and representor ports are still needed, rte flows are not impacted.
> > > > > > > Queues are still needed for each member port; descriptors (mbufs) will be
> > > > > > > supplied from the shared Rx queue in my PMD implementation.
> > > > > >
> > > > > > My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land in the same
> > > > > > receive queue. In that case, how is the flow order maintained for the respective receive queues?
> > > > >
> > > > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in the case of shared rxq:
> > > > > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward it to the target fs.
> > > > > Packets from the same source port could be grouped into a small burst to process; this accelerates performance if traffic comes from
> > > > > a limited number of ports. I'll introduce a common API to do shared rxq forwarding and call it with a packet-handling callback, so it suits
> > > > > all forwarding engines. Will send patches soon.
> > > > >
> > > >
> > > > All ports will put the packets into the same queue (shared queue), right? Does
> > > > this mean only a single core will poll it? What will happen if there are
> > > > multiple cores polling, won't it cause a problem?
> > > > 
> > > > And if this requires specific changes in the application, I am not sure about
> > > > the solution; can't this work in a way that is transparent to the application?
> > >
> > > As discussed with Jerin, a new API is introduced in v3 2/8 that aggregates ports
> > > in the same group into one new port. Users could schedule polling on the
> > > aggregated port instead of all member ports.
> >
> > The v3 still has testpmd changes in the fastpath, right? IMO, for this
> > feature we should not change the fastpath of the testpmd
> > application. Instead, testpmd can probably use aggregated ports as a
> > separate fwd_engine to show how to use this feature.
>
> Good point to discuss :) There are two strategies for polling a shared
> Rxq:
> 1. polling each member port
>    All forwarding engines can be reused to work as before.
>    My testpmd patches are efforts towards this direction.
>    Does your PMD support this?

Unfortunately not. More than that, every application would need to change
to support this model.

> 2. polling the aggregated port
>    Besides the forwarding engine, this needs more work to demo it.
>    This is an optional API, not supported by my PMD yet.

We are thinking of implementing this in the PMD when it comes to it, i.e.
without application changes in the fastpath logic.

>
>
> >
> > >
> > > >
> > > > Overall, is this for optimizing memory for the port representors? If so, can't we
> > > > have a port-representor-specific solution? Reducing the scope can reduce the
> > > > complexity it brings.
> > > > 
> > > > > > If this offload is only useful for the representor case, can we make it specific to the representor case by changing its name and
> > > > > > scope?
> > > > > 
> > > > > It works for both PF and representors in the same switch domain; for an application like OVS, few changes are needed.
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > ---
> > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > +
> > > > > > > > > +Shared Rx queue
> > > > > > > > > +---------------
> > > > > > > > > +
> > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > +
> > > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > +
> > > > > > > > > +
> > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > >
> > > > > > > > >  Packet type parsing
> > > > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > >  Queue start/stop     =
> > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > +Shared Rx queue      =
> > > > > > > > >  Burst mode info      =
> > > > > > > > >  Power mgmt address monitor =
> > > > > > > > >  MTU update           =
> > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > >
> > > > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > > > `_
> > > > > > > > >
> > > > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > > > +grows,
> > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > > > +miss and
> > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > > > +PF and
> > > > > > > > > +  representors in same switch domain.
> > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > > > +the
> > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > > > +enable
> > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > > > +return
> > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > +
> > > > > > > > >  Basic SR-IOV
> > > > > > > > >  ------------
> > > > > > > > >
> > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > >  };
> > > > > > > > >
> > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > */
> > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > > > + switch domain. */
> > > > > > > > >         /**
> > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > +/**
> > > > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > > > +memory,
> > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > + */
> > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > >
> > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > --
> > > > > > > > > 2.25.1
> > > > > > > > >
> > > >
> > >
>
>
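
As a side note, a minimal sketch of how an application might enable the
proposed offload, based on the rxconf field and capability flag quoted
above (the descriptor count and socket ID are illustrative assumptions):

#include <errno.h>
#include <rte_ethdev.h>

/* Enable the proposed shared Rx queue on one member port. Assumes the
 * RFC's rte_eth_rxconf layout with the shared_group field and the
 * RTE_ETH_RX_OFFLOAD_SHARED_RXQ capability bit. */
static int
setup_shared_rxq(uint16_t port_id, uint16_t queue_id, uint32_t group,
		 struct rte_mempool *mp)
{
	struct rte_eth_dev_info info;
	struct rte_eth_rxconf rxconf;
	int ret;

	ret = rte_eth_dev_info_get(port_id, &info);
	if (ret != 0)
		return ret;
	if (!(info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
		return -ENOTSUP; /* PMD cannot share this queue */

	rxconf = info.default_rxconf;
	rxconf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	rxconf.shared_group = group; /* same group on all member ports */

	/* 512 descriptors on socket 0 are illustrative values */
	return rte_eth_rx_queue_setup(port_id, queue_id, 512, 0,
				      &rxconf, mp);
}

Calling this with the same queue_id and group for every port in the
switch domain matches the 1:1 queue mapping rule described above.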

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28  9:35               ` Jerin Jacob
                                   ` (2 preceding siblings ...)
  2021-09-28 11:37                 ` Xueming(Steven) Li
@ 2021-09-28 12:59                 ` Xueming(Steven) Li
  3 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28 12:59 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > 
> > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > 
> > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > 
> > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > latency and low throughput.
> > > > > > > > 
> > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > 
> > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > index is
> > > > > > > > 1:1 mapped in shared group.
> > > > > > > > 
> > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > 
> > > > > > > > Multiple groups is supported by group ID.
> > > > > > > 
> > > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > > 
> > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > > 
> > > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > > 
> > > > > > Not quite sure that I understood your question. The control path of is
> > > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > > supplied from shared Rx queue in my PMD implementation.
> > > > > 
> > > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > > > 
> > > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > > Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> > > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> > > > all forwarding engine. Will sent patches soon.
> > > > 
> > > 
> > > All ports will put the packets in to the same queue (share queue), right? Does
> > > this means only single core will poll only, what will happen if there are
> > > multiple cores polling, won't it cause problem?
> > > 
> > > And if this requires specific changes in the application, I am not sure about
> > > the solution, can't this work in a transparent way to the application?
> > 
> > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > in same group into one new port. Users could schedule polling on the
> > aggregated port instead of all member ports.
> 
> The v3 still has testpmd changes in fastpath. Right? IMO, For this
> feature, we should not change fastpath of testpmd
> application. Instead, testpmd can use aggregated ports probably as
> separate fwd_engine to show how to use this feature.

Good point to discuss :) There are two strategies for polling a shared
Rxq:
1. polling each member port
   All forwarding engines can be reused to work as before.
   My testpmd patches are efforts towards this direction (see the
   sketch below).
   Does your PMD support this?
2. polling aggregated port
   Besides the forwarding engine, more work is needed to demo it.
   This is an optional API, not supported by my PMD yet.

> 
> > 
> > > 
> > > Overall, is this for optimizing memory for the port represontors? If so can't we
> > > have a port representor specific solution, reducing scope can reduce the
> > > complexity it brings?
> > > 
> > > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > > > scope.
> > > > 
> > > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > ---
> > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > +
> > > > > > > > +Shared Rx queue
> > > > > > > > +---------------
> > > > > > > > +
> > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > +
> > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > +
> > > > > > > > +
> > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > 
> > > > > > > >  Packet type parsing
> > > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > >  Queue start/stop     =
> > > > > > > >  Runtime Rx queue setup =
> > > > > > > >  Runtime Tx queue setup =
> > > > > > > > +Shared Rx queue      =
> > > > > > > >  Burst mode info      =
> > > > > > > >  Power mgmt address monitor =
> > > > > > > >  MTU update           =
> > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > 
> > > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > > `_
> > > > > > > > 
> > > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > > +grows,
> > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > > +miss and
> > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > > +PF and
> > > > > > > > +  representors in same switch domain.
> > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > > +the
> > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > > +enable
> > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > > +return
> > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > +
> > > > > > > >  Basic SR-IOV
> > > > > > > >  ------------
> > > > > > > > 
> > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > >  };
> > > > > > > > 
> > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > */
> > > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > > + switch domain. */
> > > > > > > >         /**
> > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > +/**
> > > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > > +memory,
> > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > + */
> > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > 
> > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > --
> > > > > > > > 2.25.1
> > > > > > > > 
> > > 
> > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 12:58                   ` Jerin Jacob
@ 2021-09-28 13:25                     ` Xueming(Steven) Li
  2021-09-28 13:38                       ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28 13:25 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > 
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > 
> > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > 
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > > 
> > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > > > latency and low throughput.
> > > > > > > > > > 
> > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > 
> > > > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > > > index is
> > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > 
> > > > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > > > 
> > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > 
> > > > > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > > > > 
> > > > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > > > > 
> > > > > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > > > > 
> > > > > > > > Not quite sure that I understood your question. The control path of is
> > > > > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > > > > supplied from shared Rx queue in my PMD implementation.
> > > > > > > 
> > > > > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > > > > > 
> > > > > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > > > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > > > > Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> > > > > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> > > > > > all forwarding engine. Will sent patches soon.
> > > > > > 
> > > > > 
> > > > > All ports will put the packets in to the same queue (share queue), right? Does
> > > > > this means only single core will poll only, what will happen if there are
> > > > > multiple cores polling, won't it cause problem?
> > > > > 
> > > > > And if this requires specific changes in the application, I am not sure about
> > > > > the solution, can't this work in a transparent way to the application?
> > > > 
> > > > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > > > in same group into one new port. Users could schedule polling on the
> > > > aggregated port instead of all member ports.
> > > 
> > > The v3 still has testpmd changes in fastpath. Right? IMO, For this
> > > feature, we should not change fastpath of testpmd
> > > application. Instead, testpmd can use aggregated ports probably as
> > > separate fwd_engine to show how to use this feature.
> > 
> > Good point to discuss :) There are two strategies for polling a shared
> > Rxq:
> > 1. polling each member port
> >    All forwarding engines can be reused to work as before.
> >    My testpmd patches are efforts towards this direction.
> >    Does your PMD support this?
> 
> Unfortunately not. More than that, every application needs to change
> to support this model.

Both strategies need the user application to resolve the port ID from
the mbuf and process packets accordingly.
This one doesn't demand an aggregated port and requires no polling
schedule change.

> 
> > 2. polling aggregated port
> >    Besides the forwarding engine, more work is needed to demo it.
> >    This is an optional API, not supported by my PMD yet.
> 
> We are thinking of implementing this PMD when it comes to it, i.e.
> without application change in fastpath
> logic.

Fastpath has to resolve the port ID anyway and forward according to
its logic. Forwarding engines need to adapt to support shared Rxq.
Fortunately, in testpmd, this can be done with an abstract API (see
the sketch below).
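
A rough illustration of such an abstract helper (names are
hypothetical, not the actual testpmd patches): the demultiplexing is
hidden behind a callback, so each forwarding engine only supplies its
per-burst handler.

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hypothetical callback: process a sub-burst that all came from the
 * same source port. */
typedef void (*shared_rxq_cb_t)(uint16_t src_port,
				struct rte_mbuf **pkts, uint16_t n,
				void *ctx);

/* Hypothetical helper: receive on one member port of a shared Rxq and
 * hand sub-bursts, split on mbuf->port, to the engine's callback;
 * consecutive packets of one source port are grouped as a small burst. */
static uint16_t
shared_rxq_forward(uint16_t port_id, uint16_t queue_id,
		   shared_rxq_cb_t cb, void *ctx)
{
	struct rte_mbuf *pkts[32];
	uint16_t n, i = 0;

	n = rte_eth_rx_burst(port_id, queue_id, pkts, 32);
	while (i < n) {
		uint16_t src = pkts[i]->port;
		uint16_t j = i + 1;

		while (j < n && pkts[j]->port == src)
			j++;
		cb(src, &pkts[i], (uint16_t)(j - i), ctx);
		i = j;
	}
	return n;
}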

Let's defer part 2 until some PMD really supports it and it has been
tested. What do you think?

> 
> > 
> > 
> > > 
> > > > 
> > > > > 
> > > > > Overall, is this for optimizing memory for the port represontors? If so can't we
> > > > > have a port representor specific solution, reducing scope can reduce the
> > > > > complexity it brings?
> > > > > 
> > > > > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > > > > > scope.
> > > > > > 
> > > > > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > ---
> > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > +
> > > > > > > > > > +Shared Rx queue
> > > > > > > > > > +---------------
> > > > > > > > > > +
> > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > +
> > > > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > +
> > > > > > > > > > +
> > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > 
> > > > > > > > > >  Packet type parsing
> > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > >  Queue start/stop     =
> > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > >  Burst mode info      =
> > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > >  MTU update           =
> > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > 
> > > > > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > > > > `_
> > > > > > > > > > 
> > > > > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > > > > +grows,
> > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > > > > +miss and
> > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > > > > +PF and
> > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > > > > +the
> > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > > > > +enable
> > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > > > > +return
> > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > +
> > > > > > > > > >  Basic SR-IOV
> > > > > > > > > >  ------------
> > > > > > > > > > 
> > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > >  };
> > > > > > > > > > 
> > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > */
> > > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > > > > + switch domain. */
> > > > > > > > > >         /**
> > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > +/**
> > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > > > > +memory,
> > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > + */
> > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > 
> > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > --
> > > > > > > > > > 2.25.1
> > > > > > > > > > 
> > > > > 
> > > > 
> > 
> > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 13:25                     ` Xueming(Steven) Li
@ 2021-09-28 13:38                       ` Jerin Jacob
  2021-09-28 13:59                         ` Ananyev, Konstantin
  2021-09-28 14:51                         ` Xueming(Steven) Li
  0 siblings, 2 replies; 266+ messages in thread
From: Jerin Jacob @ 2021-09-28 13:38 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > >
> > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > > >
> > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > > > > latency and low throughput.
> > > > > > > > > > >
> > > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > >
> > > > > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > > > > index is
> > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > >
> > > > > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > > > >
> > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > >
> > > > > > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > > > > >
> > > > > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > > > > >
> > > > > > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > > > > >
> > > > > > > > > Not quite sure that I understood your question. The control path of is
> > > > > > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > > > > > supplied from shared Rx queue in my PMD implementation.
> > > > > > > >
> > > > > > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > > > > > >
> > > > > > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > > > > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > > > > > Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> > > > > > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > >
> > > > > >
> > > > > > All ports will put the packets in to the same queue (share queue), right? Does
> > > > > > this means only single core will poll only, what will happen if there are
> > > > > > multiple cores polling, won't it cause problem?
> > > > > >
> > > > > > And if this requires specific changes in the application, I am not sure about
> > > > > > the solution, can't this work in a transparent way to the application?
> > > > >
> > > > > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > > > > in same group into one new port. Users could schedule polling on the
> > > > > aggregated port instead of all member ports.
> > > >
> > > > The v3 still has testpmd changes in fastpath. Right? IMO, For this
> > > > feature, we should not change fastpath of testpmd
> > > > application. Instead, testpmd can use aggregated ports probably as
> > > > separate fwd_engine to show how to use this feature.
> > >
> > > Good point to discuss :) There are two strategies for polling a shared
> > > Rxq:
> > > 1. polling each member port
> > >    All forwarding engines can be reused to work as before.
> > >    My testpmd patches are efforts towards this direction.
> > >    Does your PMD support this?
> >
> > Unfortunately not. More than that, every application needs to change
> > to support this model.
>
> Both strategies need the user application to resolve the port ID from
> the mbuf and process packets accordingly.
> This one doesn't demand an aggregated port and requires no polling
> schedule change.

I was thinking the mbuf will be updated by the driver/aggregator port as
it comes to the application, roughly as sketched below.
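
In that model the fastpath polls only the aggregated port; the
aggregated port ID is assumed to come from the v3 aggregation API and
the dispatch step is left as a stub:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Strategy 2 sketch: poll only the aggregated port; the PMD is
 * expected to fill mbuf->port with the real member port. */
static void
poll_aggregated(uint16_t agg_port, uint16_t queue_id)
{
	struct rte_mbuf *pkts[32];
	uint16_t i, n;

	n = rte_eth_rx_burst(agg_port, queue_id, pkts, 32);
	for (i = 0; i < n; i++) {
		uint16_t src = pkts[i]->port; /* real source port */

		(void)src; /* application-specific dispatch goes here */
		rte_pktmbuf_free(pkts[i]);
	}
}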

>
> >
> > > 2. polling aggregated port
> > >    Besides the forwarding engine, more work is needed to demo it.
> > >    This is an optional API, not supported by my PMD yet.
> >
> > We are thinking of implementing this PMD when it comes to it, i.e.
> > without application change in fastpath
> > logic.
>
> Fastpath has to resolve the port ID anyway and forward according to
> its logic. Forwarding engines need to adapt to support shared Rxq.
> Fortunately, in testpmd, this can be done with an abstract API.
>
> Let's defer part 2 until some PMD really supports it and it has been
> tested. What do you think?

We are not planning to use this feature, so either way it is OK to me.
I leave it to the ethdev maintainers to decide between 1 and 2.

I do have a strong opinion against changing the testpmd basic forward
engines for this feature. I would like to keep them simple and fastpath
optimized, and would like to add a separate forwarding engine as a means
to verify this feature; a skeleton is sketched below.
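
For reference, a skeleton of such a dedicated engine, following the
usual testpmd fwd_engine shape (types come from app/test-pmd/testpmd.h;
the engine name is hypothetical and the forward function body is only a
stub):

/* Hypothetical dedicated forwarding engine for shared Rxq. */
static void
shared_rxq_fwd(struct fwd_stream *fs)
{
	/* Poll the Rx queue of fs, split the burst on mbuf->port and
	 * transmit each sub-burst on the stream matching its source
	 * port, leaving the other engines untouched. */
}

struct fwd_engine shared_rxq_engine = {
	.fwd_mode_name  = "shared_rxq",
	.port_fwd_begin = NULL,
	.port_fwd_end   = NULL,
	.packet_fwd     = shared_rxq_fwd,
};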



>
> >
> > >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > Overall, is this for optimizing memory for the port represontors? If so can't we
> > > > > > have a port representor specific solution, reducing scope can reduce the
> > > > > > complexity it brings?
> > > > > >
> > > > > > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > > > > > > scope.
> > > > > > >
> > > > > > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > +
> > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > +---------------
> > > > > > > > > > > +
> > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > +
> > > > > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > +
> > > > > > > > > > > +
> > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > >
> > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > >  MTU update           =
> > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > >
> > > > > > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > > > > > `_
> > > > > > > > > > >
> > > > > > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > > > > > +grows,
> > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > > > > > +miss and
> > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > > > > > +PF and
> > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > > > > > +the
> > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > > > > > +enable
> > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > > > > > +return
> > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > +
> > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > >  ------------
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > >  };
> > > > > > > > > > >
> > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > > */
> > > > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > > > > > + switch domain. */
> > > > > > > > > > >         /**
> > > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > +/**
> > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > > > > > +memory,
> > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > + */
> > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > >
> > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > --
> > > > > > > > > > > 2.25.1
> > > > > > > > > > >
> > > > > >
> > > > >
> > >
> > >
>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 13:38                       ` Jerin Jacob
@ 2021-09-28 13:59                         ` Ananyev, Konstantin
  2021-09-28 14:40                           ` Xueming(Steven) Li
  2021-09-28 14:51                         ` Xueming(Steven) Li
  1 sibling, 1 reply; 266+ messages in thread
From: Ananyev, Konstantin @ 2021-09-28 13:59 UTC (permalink / raw)
  To: Jerin Jacob, Xueming(Steven) Li
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, Yigit, Ferruh

> 
> On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> <thomas@monjalon.net>;
> > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > >
> > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > >
> > > > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > > >
> > > > > > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > > > > > index is
> > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > >
> > > > > > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > > > > >
> > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > >
> > > > > > > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > > > > > >
> > > > > > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > > > > > >
> > > > > > > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > > > > > >
> > > > > > > > > > Not quite sure that I understood your question. The control path of is
> > > > > > > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > > > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > > > > > > supplied from shared Rx queue in my PMD implementation.
> > > > > > > > >
> > > > > > > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into
> the same
> > > > > > > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > > > > > > >
> > > > > > > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > > > > > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > > > > > > Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic
> come from
> > > > > > > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > >
> > > > > > >
> > > > > > > All ports will put the packets in to the same queue (share queue), right? Does
> > > > > > > this means only single core will poll only, what will happen if there are
> > > > > > > multiple cores polling, won't it cause problem?
> > > > > > >
> > > > > > > And if this requires specific changes in the application, I am not sure about
> > > > > > > the solution, can't this work in a transparent way to the application?
> > > > > >
> > > > > > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > > > > > in same group into one new port. Users could schedule polling on the
> > > > > > aggregated port instead of all member ports.
> > > > >
> > > > > The v3 still has testpmd changes in fastpath. Right? IMO, For this
> > > > > feature, we should not change fastpath of testpmd
> > > > > application. Instead, testpmd can use aggregated ports probably as
> > > > > separate fwd_engine to show how to use this feature.
> > > >
> > > > Good point to discuss :) There are two strategies for polling a shared
> > > > Rxq:
> > > > 1. polling each member port
> > > >    All forwarding engines can be reused to work as before.
> > > >    My testpmd patches are efforts towards this direction.
> > > >    Does your PMD support this?
> > >
> > > Unfortunately not. More than that, every application needs to change
> > > to support this model.
> >
> > Both strategies need the user application to resolve the port ID from
> > the mbuf and process packets accordingly.
> > This one doesn't demand an aggregated port and requires no polling
> > schedule change.
> 
> I was thinking the mbuf will be updated by the driver/aggregator port as
> it comes to the application.
> 
> >
> > >
> > > > 2. polling aggregated port
> > > >    Besides the forwarding engine, more work is needed to demo it.
> > > >    This is an optional API, not supported by my PMD yet.
> > >
> > We are thinking of implementing this PMD when it comes to it, i.e.
> > without application change in fastpath
> > logic.
> >
> > Fastpath has to resolve the port ID anyway and forward according to
> > its logic. Forwarding engines need to adapt to support shared Rxq.
> > Fortunately, in testpmd, this can be done with an abstract API.
> >
> > Let's defer part 2 until some PMD really supports it and it has been
> > tested. What do you think?
> 
> We are not planning to use this feature, so either way it is OK to me.
> I leave it to the ethdev maintainers to decide between 1 and 2.
>
> I do have a strong opinion against changing the testpmd basic forward
> engines for this feature. I would like to keep them simple and fastpath
> optimized, and would like to add a separate forwarding engine as a means
> to verify this feature.

+1 to that.
I don't think it is a 'common' feature.
So a separate FWD mode seems like the best choice to me.

> 
> 
> 
> >
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Overall, is this for optimizing memory for the port represontors? If so can't we
> > > > > > > have a port representor specific solution, reducing scope can reduce the
> > > > > > > complexity it brings?
> > > > > > >
> > > > > > > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its
> name and
> > > > > > > > > scope.
> > > > > > > >
> > > > > > > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > +
> > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > +---------------
> > > > > > > > > > > > +
> > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > +
> > > > > > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > +
> > > > > > > > > > > > +
> > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > >
> > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > >
> > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > > > > > > `_
> > > > > > > > > > > >
> > > > > > > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > > > > > > +grows,
> > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > > > > > > +miss and
> > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > > > > > > +PF and
> > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > > > > > > +the
> > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > > > > > > +enable
> > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > > > > > > +return
> > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > +
> > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > >  ------------
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > >  };
> > > > > > > > > > > >
> > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > > > */
> > > > > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > >         /**
> > > > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > +/**
> > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > > > > > > +memory,
> > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > + */
> > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > >
> > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > --
> > > > > > > > > > > > 2.25.1
> > > > > > > > > > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/8] ethdev: introduce shared Rx queue
  2021-09-27 23:53     ` Ajit Khaparde
@ 2021-09-28 14:24       ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28 14:24 UTC (permalink / raw)
  To: ajit.khaparde
  Cc: jerinjacobk, NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev,
	Slava Ovsiienko, ferruh.yigit, Lior Margalit

On Mon, 2021-09-27 at 16:53 -0700, Ajit Khaparde wrote:
> On Fri, Sep 17, 2021 at 1:02 AM Xueming Li <xuemingl@nvidia.com> wrote:
> > 
> > In current DPDK framework, each RX queue is pre-loaded with mbufs for
> > incoming packets. When number of representors scale out in a switch
> > domain, the memory consumption became significant. Most important,
> > polling all ports leads to high cache miss, high latency and low
> > throughput.
> > 
> > This patch introduces shared RX queue. Ports with same configuration in
> > a switch domain could share RX queue set by specifying sharing group.
> > Polling any queue using same shared RX queue receives packets from all
> > member ports. Source port is identified by mbuf->port.
> > 
> > Port queue number in a shared group should be identical. Queue index is
> > 1:1 mapped in shared group.
> > 
> > Share RX queue must be polled on single thread or core.
> > 
> > Multiple groups is supported by group ID.
> Can you clarify this a little more?

Thanks for the review!

By using the group ID, a user can specify, for example:
 group 0: ports 0-3, 2 queues per port, polled on cores 0 and 1
 group 1: ports 4-127, 1 queue per port, polled on core 1.
This is normally used for QoS and load balancing.
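
For clarity, here is a minimal configuration sketch for group 0 above,
using the RTE_ETH_RX_OFFLOAD_SHARED_RXQ flag and the rte_eth_rxconf
shared_group field proposed in this series; the descriptor count, mempool
and error handling are arbitrary simplifications:

#include <rte_ethdev.h>

/* Sketch: ports 0-3 join share group 0, with 2 Rx queues per port. */
static int
setup_share_group0(struct rte_mempool *mp)
{
	struct rte_eth_conf dev_conf = { 0 };
	struct rte_eth_rxconf rxq_conf = { 0 };
	uint16_t port, q;

	dev_conf.rxmode.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	rxq_conf.offloads = RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	rxq_conf.shared_group = 0;

	for (port = 0; port <= 3; port++) {
		if (rte_eth_dev_configure(port, 2, 2, &dev_conf) < 0)
			return -1;
		/* Queue index is 1:1 mapped inside the group. */
		for (q = 0; q < 2; q++)
			if (rte_eth_rx_queue_setup(port, q, 1024,
					rte_eth_dev_socket_id(port),
					&rxq_conf, mp) < 0)
				return -1;
	}
	return 0;
}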

> 
> Apologies if this was already covered:
> * Can't we do this for Tx also?

Each Rx queue is pre-filled with mbufs, which consumes a huge number of
mbufs by default even though most queues see little traffic; saving
memory is the primary motivation for this feature.
A Tx queue doesn't consume any mbufs until transmission actually starts,
so there is no strong reason to share Tx queues so far.
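
For a rough, purely illustrative sense of scale: 256 representors x 2 Rx
queues x 1024 descriptors x ~2 KB per mbuf pins about 1 GB of mbuf memory
up front, most of it idle, while a shared Rx queue set only needs the
group's own 2 x 1024 x ~2 KB, around 4 MB.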

> 
> Couple of nits inline. Thanks
> 
> > 
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > ---
> > Rx queue object could be used as shared Rx queue object, it's important
> > to clear all queue control callback api that using queue object:
> >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > ---
> >  doc/guides/nics/features.rst                    | 11 +++++++++++
> >  doc/guides/nics/features/default.ini            |  1 +
> >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> >  lib/ethdev/rte_ethdev.c                         |  1 +
> >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> >  5 files changed, 30 insertions(+)
> > 
> > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > index a96e12d155..2e2a9b1554 100644
> > --- a/doc/guides/nics/features.rst
> > +++ b/doc/guides/nics/features.rst
> > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > 
> > 
> > +.. _nic_features_shared_rx_queue:
> > +
> > +Shared Rx queue
> > +---------------
> > +
> > +Supports shared Rx queue for ports in same switch domain.
> > +
> > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > +* **[provides] mbuf**: ``mbuf.port``.
> > +
> > +
> >  .. _nic_features_packet_type_parsing:
> > 
> >  Packet type parsing
> > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > index 754184ddd4..ebeb4c1851 100644
> > --- a/doc/guides/nics/features/default.ini
> > +++ b/doc/guides/nics/features/default.ini
> > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> >  Queue start/stop     =
> >  Runtime Rx queue setup =
> >  Runtime Tx queue setup =
> > +Shared Rx queue      =
> >  Burst mode info      =
> >  Power mgmt address monitor =
> >  MTU update           =
> > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > index ff6aa91c80..45bf5a3a10 100644
> > --- a/doc/guides/prog_guide/switch_representation.rst
> > +++ b/doc/guides/prog_guide/switch_representation.rst
> > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> >  .. [1] `Ethernet switch device driver model (switchdev)
> >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > 
> > +- Memory usage of representors is huge when number of representor grows,
> > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > +  Polling the large number of ports brings more CPU load, cache miss and
> > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> 
> "in the same switch"
> 
> > +  is present in Rx offloading capability of device info. Setting the
> > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > +  shared Rx queue. Polling any member port of shared Rx queue can return
> "of the shared Rx queue.."
> 
> > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> 
> "ports in the group, "
> 
> > +
> >  Basic SR-IOV
> >  ------------
> > 
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > index a7c090ce79..b3a58d5e65 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -127,6 +127,7 @@ static const struct {
> >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> >  };
> > 
> >  #undef RTE_RX_OFFLOAD_BIT2STR
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > index d2b27c351f..a578c9db9d 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > +       uint32_t shared_group; /**< Shared port group index in switch domain. */
> >         /**
> >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> >          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > +/**
> > + * Rx queue is shared among ports in same switch domain to save memory,
> > + * avoid polling each port. Any port in group can be used to receive packets.
> 
> "Any port in the group can"
> 
> 
> > + * Real source port number saved in mbuf->port field.
> > + */
> > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > 
> >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > --
> > 2.33.0
> > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 13:59                         ` Ananyev, Konstantin
@ 2021-09-28 14:40                           ` Xueming(Steven) Li
  2021-09-28 14:59                             ` Jerin Jacob
  2021-09-29  0:26                             ` Ananyev, Konstantin
  0 siblings, 2 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28 14:40 UTC (permalink / raw)
  To: jerinjacobk, konstantin.ananyev
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > 
> > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > <xuemingl@nvidia.com> wrote:
> > > 
> > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > <xuemingl@nvidia.com> wrote:
> > > > > 
> > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > 
> > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > Monjalon
> > <thomas@monjalon.net>;
> > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > 
> > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > Hi,
> > > > > > > > > > > 
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > 
> > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > with same
> > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > index is
> > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > 
> > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > representor?
> > > > > > > > > > > 
> > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > take advantage.
> > > > > > > > > > > 
> > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > 
> > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > The control path of is
> > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > implementation.
> > > > > > > > > > 
> > > > > > > > > > My question was if create a generic
> > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > ethdev receive queues land into
> > the same
> > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > 
> > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > of shared rxq.
> > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > target fs.
> > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > performance if traffic
> > come from
> > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > callback, so it suites for
> > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > (share queue), right? Does
> > > > > > > > this means only single core will poll only, what will
> > > > > > > > happen if there are
> > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > 
> > > > > > > > And if this requires specific changes in the
> > > > > > > > application, I am not sure about
> > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > the application?
> > > > > > > 
> > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > aggregate ports
> > > > > > > in same group into one new port. Users could schedule
> > > > > > > polling on the
> > > > > > > aggregated port instead of all member ports.
> > > > > > 
> > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > For this
> > > > > > feature, we should not change fastpath of testpmd
> > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > probably as
> > > > > > separate fwd_engine to show how to use this feature.
> > > > > 
> > > > > Good point to discuss :) There are two strategies to polling
> > > > > a shared
> > > > > Rxq:
> > > > > 1. polling each member port
> > > > >    All forwarding engines can be reused to work as before.
> > > > >    My testpmd patches are efforts towards this direction.
> > > > >    Does your PMD support this?
> > > > 
> > > > Not unfortunately. More than that, every application needs to
> > > > change
> > > > to support this model.
> > > 
> > > Both strategies need user application to resolve port ID from
> > > mbuf and
> > > process accordingly.
> > > This one doesn't demand aggregated port, no polling schedule
> > > change.
> > 
> > I was thinking, mbuf will be updated from driver/aggregator port as
> > when it
> > comes to application.
> > 
> > > 
> > > > 
> > > > > 2. polling aggregated port
> > > > >    Besides forwarding engine, need more work to to demo it.
> > > > >    This is an optional API, not supported by my PMD yet.
> > > > 
> > > > We are thinking of implementing this PMD when it comes to it,
> > > > ie.
> > > > without application change in fastpath
> > > > logic.
> > > 
> > > Fastpath have to resolve port ID anyway and forwarding according
> > > to
> > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > Fortunately, in testpmd, this can be done with an abstract API.
> > > 
> > > Let's defer part 2 until some PMD really support it and tested,
> > > how do
> > > you think?
> > 
> > We are not planning to use this feature so either way it is OK to
> > me.
> > I leave to ethdev maintainers decide between 1 vs 2.
> > 
> > I do have a strong opinion not changing the testpmd basic forward
> > engines
> > for this feature.I would like to keep it simple as fastpath
> > optimized and would
> > like to add a separate Forwarding engine as means to verify this
> > feature.
> 
> +1 to that.
> I don't think it's a 'common' feature.
> So a separate FWD mode seems like the best choice to me.

-1 :)
There was an internal requirement from the test team: they need to verify
that all features (packet content, RSS, VLAN, checksum, rte_flow...)
keep working on top of a shared Rx queue. Based on the patch, I believe
the impact has been minimized.
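
To make that concrete, here is a rough sketch of the per-burst
demultiplexing step described earlier in the thread; lookup_stream() and
the handler callback are placeholders for testpmd internals, not actual
patch code:

#include <rte_mbuf.h>

struct fwd_stream; /* testpmd per-stream context */
struct fwd_stream *lookup_stream(uint16_t src_port); /* placeholder */

/* Sketch: split one shared-Rxq burst into sub-bursts of consecutive
 * mbufs with the same source port, then hand each sub-burst to the
 * stream owning that port. */
static void
shared_rxq_demux(struct rte_mbuf **pkts, uint16_t nb_rx,
		 void (*handler)(struct fwd_stream *fs,
				 struct rte_mbuf **burst, uint16_t n))
{
	uint16_t i, start = 0;

	for (i = 1; i <= nb_rx; i++) {
		if (i == nb_rx || pkts[i]->port != pkts[start]->port) {
			handler(lookup_stream(pkts[start]->port),
				&pkts[start], (uint16_t)(i - start));
			start = i;
		}
	}
}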

> 
> > 
> > 
> > 
> > > 
> > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > represontors? If so can't we
> > > > > > > > have a port representor specific solution, reducing
> > > > > > > > scope can reduce the
> > > > > > > > complexity it brings?
> > > > > > > > 
> > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > representor the case by changing its
> > name and
> > > > > > > > > > scope.
> > > > > > > > > 
> > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > apply.
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >  doc/guides/nics/features.rst               
> > > > > > > > > > > > > | 11 +++++++++++
> > > > > > > > > > > > >  doc/guides/nics/features/default.ini       
> > > > > > > > > > > > > |  1 +
> > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                    
> > > > > > > > > > > > > |  1 +
> > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                    
> > > > > > > > > > > > > |  7 +++++++
> > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > checksum.
> > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +* **[uses]    
> > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +
> > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > 
> > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > .rst
> > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > .rst
> > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > ---
> > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > .rst
> > > > > > > > > > > > > +++
> > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > .rst
> > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > `_
> > > > > > > > > > > > > 
> > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > +  because PMD always allocate mbuf for each
> > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > +  Polling the large number of ports brings
> > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > +  latency. Shared Rx queue can be used to
> > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > +  is present in Rx offloading capability of
> > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > +the
> > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > +enable
> > > > > > > > > > > > > +  shared Rx queue. Polling any member port
> > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > +return
> > > > > > > > > > > > > +  packets of all ports in group, port ID is
> > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > +
> > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > 
> > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > M),
> > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > IT),
> > > > > > > > > > > > > +      
> > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > >  };
> > > > > > > > > > > > > 
> > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > */
> > > > > > > > > > > > > +       uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > >         /**
> > > > > > > > > > > > >          * Per-queue Rx offloads to be set
> > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > >          * Only offloads set on
> > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM 
> > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH            
> > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > +/**
> > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > >port field.
> > > > > > > > > > > > > + */
> > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ  
> > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > 
> > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > >                                  DEV_RX_OFFLO
> > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > --
> > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 13:38                       ` Jerin Jacob
  2021-09-28 13:59                         ` Ananyev, Konstantin
@ 2021-09-28 14:51                         ` Xueming(Steven) Li
  1 sibling, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-28 14:51 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Tue, 2021-09-28 at 19:08 +0530, Jerin Jacob wrote:
> On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > > 
> > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > Hi,
> > > > > > > > > > 
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > > > > 
> > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > 
> > > > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > > > 
> > > > > > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > > > > > index is
> > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > 
> > > > > > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > > > > > 
> > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > 
> > > > > > > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > > > > > > 
> > > > > > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > > > > > > 
> > > > > > > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > > > > > > 
> > > > > > > > > > Not quite sure that I understood your question. The control path of is
> > > > > > > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > > > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > > > > > > supplied from shared Rx queue in my PMD implementation.
> > > > > > > > > 
> > > > > > > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > > > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > > > > > > > 
> > > > > > > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > > > > > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > > > > > > Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> > > > > > > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > 
> > > > > > > 
> > > > > > > All ports will put the packets in to the same queue (share queue), right? Does
> > > > > > > this means only single core will poll only, what will happen if there are
> > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > 
> > > > > > > And if this requires specific changes in the application, I am not sure about
> > > > > > > the solution, can't this work in a transparent way to the application?
> > > > > > 
> > > > > > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > > > > > in same group into one new port. Users could schedule polling on the
> > > > > > aggregated port instead of all member ports.
> > > > > 
> > > > > The v3 still has testpmd changes in fastpath. Right? IMO, For this
> > > > > feature, we should not change fastpath of testpmd
> > > > > application. Instead, testpmd can use aggregated ports probably as
> > > > > separate fwd_engine to show how to use this feature.
> > > > 
> > > > Good point to discuss :) There are two strategies to polling a shared
> > > > Rxq:
> > > > 1. polling each member port
> > > >    All forwarding engines can be reused to work as before.
> > > >    My testpmd patches are efforts towards this direction.
> > > >    Does your PMD support this?
> > > 
> > > Not unfortunately. More than that, every application needs to change
> > > to support this model.
> > 
> > Both strategies need user application to resolve port ID from mbuf and
> > process accordingly.
> > This one doesn't demand aggregated port, no polling schedule change.
> 
> I was thinking, mbuf will be updated from driver/aggregator port as when it
> comes to application.
> 
> > 
> > > 
> > > > 2. polling aggregated port
> > > >    Besides forwarding engine, need more work to to demo it.
> > > >    This is an optional API, not supported by my PMD yet.
> > > 
> > > We are thinking of implementing this PMD when it comes to it, ie.
> > > without application change in fastpath
> > > logic.
> > 
> > Fastpath have to resolve port ID anyway and forwarding according to
> > logic. Forwarding engine need to adapt to support shard Rxq.
> > Fortunately, in testpmd, this can be done with an abstract API.
> > 
> > Let's defer part 2 until some PMD really support it and tested, how do
> > you think?
> 
> We are not planning to use this feature so either way it is OK to me.
> I leave to ethdev maintainers decide between 1 vs 2.

A better driver should support both, but a specific driver could select
either one. Option 1 brings fewer changes to the application; option 2
brings better performance at the cost of additional steps.
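
A compressed sketch of the two strategies side by side; process_pkt() is
a placeholder, and agg_port stands for the port created by the
aggregation API in v3 2/8 (its exact name is not spelled out here):

#include <stdbool.h>
#include <rte_ethdev.h>

#define BURST 32

void process_pkt(uint16_t src_port, struct rte_mbuf *m); /* placeholder */

static void
poll_shared_group(const uint16_t *members, uint16_t nb_members,
		  uint16_t agg_port, uint16_t qid, bool use_aggregated)
{
	struct rte_mbuf *pkts[BURST];
	uint16_t nb, i, p;

	if (!use_aggregated) {
		/* Strategy 1: poll each member port; any burst may carry
		 * packets of every port in the group. */
		for (p = 0; p < nb_members; p++) {
			nb = rte_eth_rx_burst(members[p], qid, pkts, BURST);
			for (i = 0; i < nb; i++)
				process_pkt(pkts[i]->port, pkts[i]);
		}
	} else {
		/* Strategy 2: poll only the aggregated port. */
		nb = rte_eth_rx_burst(agg_port, qid, pkts, BURST);
		for (i = 0; i < nb; i++)
			process_pkt(pkts[i]->port, pkts[i]);
	}
}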

> 
> I do have a strong opinion not changing the testpmd basic forward engines
> for this feature.I would like to keep it simple as fastpath optimized and would
> like to add a separate Forwarding engine as means to verify this feature.
> 
> 
> 
> > 
> > > 
> > > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > Overall, is this for optimizing memory for the port represontors? If so can't we
> > > > > > > have a port representor specific solution, reducing scope can reduce the
> > > > > > > complexity it brings?
> > > > > > > 
> > > > > > > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > > > > > > > scope.
> > > > > > > > 
> > > > > > > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > +
> > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > +---------------
> > > > > > > > > > > > +
> > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > +
> > > > > > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > +
> > > > > > > > > > > > +
> > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > 
> > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > > 
> > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > > > > > > `_
> > > > > > > > > > > > 
> > > > > > > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > > > > > > +grows,
> > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache
> > > > > > > > > > > > +miss and
> > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > > > > > > +PF and
> > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting
> > > > > > > > > > > > +the
> > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > > > > > > +enable
> > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > > > > > > +return
> > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > +
> > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > >  ------------
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > >  };
> > > > > > > > > > > > 
> > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > > > */
> > > > > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > >         /**
> > > > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > +/**
> > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > > > > > > +memory,
> > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > + */
> > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > > 
> > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > --
> > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 14:40                           ` Xueming(Steven) Li
@ 2021-09-28 14:59                             ` Jerin Jacob
  2021-09-29  7:41                               ` Xueming(Steven) Li
  2021-09-29  0:26                             ` Ananyev, Konstantin
  1 sibling, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-09-28 14:59 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: konstantin.ananyev, NBU-Contact-Thomas Monjalon,
	andrew.rybchenko, dev, ferruh.yigit

On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > >
> > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > <xuemingl@nvidia.com> wrote:
> > > >
> > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > Monjalon
> > > <thomas@monjalon.net>;
> > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > representor?
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > take advantage.
> > > > > > > > > > > >
> > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > >
> > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > The control path of is
> > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > implementation.
> > > > > > > > > > >
> > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > ethdev receive queues land into
> > > the same
> > > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > >
> > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > of shared rxq.
> > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > target fs.
> > > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > performance if traffic
> > > come from
> > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > > callback, so it suites for
> > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > (share queue), right? Does
> > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > happen if there are
> > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > >
> > > > > > > > > And if this requires specific changes in the
> > > > > > > > > application, I am not sure about
> > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > the application?
> > > > > > > >
> > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > aggregate ports
> > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > polling on the
> > > > > > > > aggregated port instead of all member ports.
> > > > > > >
> > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > For this
> > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > probably as
> > > > > > > separate fwd_engine to show how to use this feature.
> > > > > >
> > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > a shared
> > > > > > Rxq:
> > > > > > 1. polling each member port
> > > > > >    All forwarding engines can be reused to work as before.
> > > > > >    My testpmd patches are efforts towards this direction.
> > > > > >    Does your PMD support this?
> > > > >
> > > > > Not unfortunately. More than that, every application needs to
> > > > > change
> > > > > to support this model.
> > > >
> > > > Both strategies need user application to resolve port ID from
> > > > mbuf and
> > > > process accordingly.
> > > > This one doesn't demand aggregated port, no polling schedule
> > > > change.
> > >
> > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > when it
> > > comes to application.
> > >
> > > >
> > > > >
> > > > > > 2. polling aggregated port
> > > > > >    Besides forwarding engine, need more work to to demo it.
> > > > > >    This is an optional API, not supported by my PMD yet.
> > > > >
> > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > ie.
> > > > > without application change in fastpath
> > > > > logic.
> > > >
> > > > The fastpath has to resolve the port ID anyway and forward according
> > > > to
> > > > logic. Forwarding engines need to adapt to support shared Rxq.
> > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > >
> > > > Let's defer part 2 until some PMD really supports it and it is
> > > > tested; what do
> > > > you think?
> > >
> > > We are not planning to use this feature, so either way it is OK to
> > > me.
> > > I leave it to the ethdev maintainers to decide between 1 vs 2.
> > >
> > > I do have a strong opinion against changing the testpmd basic forward
> > > engines
> > > for this feature. I would like to keep them simple and fastpath
> > > optimized, and would
> > > like to add a separate forwarding engine as a means to verify this
> > > feature.
> >
> > +1 to that.
> > I don't think it is a 'common' feature.
> > So a separate FWD mode seems like the best choice to me.
>
> -1 :)
> There was some internal requirement from the test team; they need to verify

Internal QA requirements may not be the driving factor :-)

> all features like packet content, rss, vlan, checksum, rte_flow... to
> be working based on the shared rx queue. Based on the patch, I believe the
> impact has been minimized.


>
> >
> > >
> > >
> > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > representors? If so, can't we
> > > > > > > > > have a port-representor-specific solution? Reducing the
> > > > > > > > > scope can reduce the
> > > > > > > > > complexity it brings.
> > > > > > > > >
> > > > > > > > > > > If this offload is only useful for the representor
> > > > > > > > > > > case, can we make this offload specific to
> > > > > > > > > > > the representor case by changing its name and
> > > > > > > > > > > scope?
> > > > > > > > > >
> > > > > > > > > > It works for both PF and representors in the same switch
> > > > > > > > > > domain; for applications like OVS, few changes are needed
> > > > > > > > > > to apply it.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > >  doc/guides/nics/features.rst
> > > > > > > > > > > > > > | 11 +++++++++++
> > > > > > > > > > > > > >  doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > |  1 +
> > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > |  1 +
> > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > |  7 +++++++
> > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > +  because PMD always allocate mbuf for each
> > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > +  Polling the large number of ports brings
> > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > +  is present in Rx offloading capability of
> > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > +  shared Rx queue. Polling any member port
> > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > +  packets of all ports in group, port ID is
> > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > M),
> > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > >  };
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > */
> > > > > > > > > > > > > > +       uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > >         /**
> > > > > > > > > > > > > >          * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > >          * Only offloads set on
> > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > >port field.
> > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > >                                  DEV_RX_OFFLO
> > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > >
>
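For context, requesting the proposed offload from an application would look
roughly like this (a minimal sketch against the RFC API above; only
RTE_ETH_RX_OFFLOAD_SHARED_RXQ and the shared_group field come from the
patch, the rest is ordinary ethdev queue setup with assumed local variables):

	struct rte_eth_rxconf rxq_conf = dev_info.default_rxconf;

	if (dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) {
		rxq_conf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
		rxq_conf.shared_group = 0; /* same group on every member port */
	}
	ret = rte_eth_rx_queue_setup(port_id, queue_id, nb_rxd,
				     rte_eth_dev_socket_id(port_id),
				     &rxq_conf, mbuf_pool);

Polling then stays per-port rte_eth_rx_burst(); the source port of each
received packet is recovered from mbuf->port.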

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 14:40                           ` Xueming(Steven) Li
  2021-09-28 14:59                             ` Jerin Jacob
@ 2021-09-29  0:26                             ` Ananyev, Konstantin
  2021-09-29  8:40                               ` Xueming(Steven) Li
  2021-09-29  9:12                               ` Xueming(Steven) Li
  1 sibling, 2 replies; 266+ messages in thread
From: Ananyev, Konstantin @ 2021-09-29  0:26 UTC (permalink / raw)
  To: Xueming(Steven) Li, jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, Yigit, Ferruh


> > > > > > > > > > > > > > In the current DPDK framework, each RX queue is
> > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > for incoming packets. When the number of
> > > > > > > > > > > > > > representors scales out in a
> > > > > > > > > > > > > > switch domain, the memory consumption becomes
> > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > importantly, polling all ports leads to high
> > > > > > > > > > > > > > cache miss rates, high
> > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This patch introduces the shared RX queue. Ports
> > > > > > > > > > > > > > with the same
> > > > > > > > > > > > > > configuration in a switch domain could share an
> > > > > > > > > > > > > > RX queue set by specifying a sharing group.
> > > > > > > > > > > > > > Polling any queue using the same shared RX queue
> > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > all member ports. The source port is identified
> > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The port queue number in a shared group should be
> > > > > > > > > > > > > > identical. The queue
> > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > 1:1 mapped in the shared group.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > A shared RX queue is supposed to be polled on the
> > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Multiple groups are supported by group ID.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > so, can this name be changed to refer specifically
> > > > > > > > > > > > > to the representor?
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, PF and representors in a switch domain could
> > > > > > > > > > > > take advantage of it.
> > > > > > > > > > > >
> > > > > > > > > > > > > If it is for a generic case, how will the flow
> > > > > > > > > > > > > ordering be maintained?
> > > > > > > > > > > >
> > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > The control path is
> > > > > > > > > > > > almost the same as before; PF and representor ports
> > > > > > > > > > > > are still needed, and rte flows are not impacted.
> > > > > > > > > > > > Queues are still needed for each member port;
> > > > > > > > > > > > descriptors (mbufs) will be
> > > > > > > > > > > > supplied from the shared Rx queue in my PMD
> > > > > > > > > > > > implementation.
> > > > > > > > > > >
> > > > > > > > > > > My question was: if we create a generic
> > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > ethdev receive queues land into the same
> > > > > > > > > > > receive queue. In that case, how is the flow order
> > > > > > > > > > > maintained for the respective receive queues?
> > > > > > > > > >
> > > > > > > > > > I guess the question is about the testpmd forward stream? The
> > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > of shared rxq.
> > > > > > > > > > Basically, for each packet in the rx_burst result, look up the
> > > > > > > > > > source stream according to mbuf->port and forward to the
> > > > > > > > > > target fs.
> > > > > > > > > > Packets from the same source port could be grouped as a
> > > > > > > > > > small burst to process; this accelerates the
> > > > > > > > > > performance if traffic comes from a
> > > > > > > > > > limited set of ports. I'll introduce a common api to do
> > > > > > > > > > shared rxq forwarding, called with a packet handling
> > > > > > > > > > callback, so it suits
> > > > > > > > > > all forwarding engines. Will send patches soon.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > All ports will put the packets into the same queue
> > > > > > > > > (shared queue), right? Does
> > > > > > > > > this mean only a single core will poll? What will
> > > > > > > > > happen if there are
> > > > > > > > > multiple cores polling, won't it cause problems?
> > > > > > > > >
> > > > > > > > > And if this requires specific changes in the
> > > > > > > > > application, I am not sure about
> > > > > > > > > the solution; can't this work in a way transparent to
> > > > > > > > > the application?
> > > > > > > >
> > > > > > > > Discussed with Jerin, a new API introduced in v3 2/8
> > > > > > > > aggregates ports
> > > > > > > > in the same group into one new port. Users could schedule
> > > > > > > > polling on the
> > > > > > > > aggregated port instead of all member ports.
> > > > > > >
> > > > > > > The v3 still has testpmd changes in the fastpath. Right? IMO,
> > > > > > > for this
> > > > > > > feature, we should not change the fastpath of the testpmd
> > > > > > > application. Instead, testpmd can use aggregated ports,
> > > > > > > probably as a
> > > > > > > separate fwd_engine, to show how to use this feature.
> > > > > >
> > > > > > Good point to discuss :) There are two strategies for polling
> > > > > > a shared
> > > > > > Rxq:
> > > > > > 1. polling each member port
> > > > > >    All forwarding engines can be reused to work as before.
> > > > > >    My testpmd patches are efforts towards this direction.
> > > > > >    Does your PMD support this?
> > > > >
> > > > > Unfortunately not. More than that, every application needs to
> > > > > change
> > > > > to support this model.
> > > >
> > > > Both strategies need the user application to resolve the port ID
> > > > from the mbuf and
> > > > process accordingly.
> > > > This one doesn't demand an aggregated port and needs no polling
> > > > schedule change.
> > >
> > > I was thinking the mbuf will be updated from the driver/aggregator port
> > > as it
> > > comes to the application.
> > >
> > > >
> > > > >
> > > > > > 2. polling aggregated port
> > > > > >    Besides the forwarding engine, more work is needed to demo it.
> > > > > >    This is an optional API, not supported by my PMD yet.
> > > > >
> > > > > We are thinking of implementing this in the PMD when it comes to
> > > > > it, i.e.
> > > > > without application changes in the fastpath
> > > > > logic.
> > > >
> > > > The fastpath has to resolve the port ID anyway and forward according
> > > > to
> > > > logic. Forwarding engines need to adapt to support shared Rxq.
> > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > >
> > > > Let's defer part 2 until some PMD really supports it and it is
> > > > tested; what do
> > > > you think?
> > >
> > > We are not planning to use this feature, so either way it is OK to
> > > me.
> > > I leave it to the ethdev maintainers to decide between 1 vs 2.
> > >
> > > I do have a strong opinion against changing the testpmd basic forward
> > > engines
> > > for this feature. I would like to keep them simple and fastpath
> > > optimized, and would
> > > like to add a separate forwarding engine as a means to verify this
> > > feature.
> >
> > +1 to that.
> > I don't think it is a 'common' feature.
> > So a separate FWD mode seems like the best choice to me.
> 
> -1 :)
> There was some internal requirement from the test team; they need to verify
> all features like packet content, rss, vlan, checksum, rte_flow... to
> be working based on the shared rx queue.

Then I suppose you'll need to write a really comprehensive fwd-engine
to satisfy your test team :)
Speaking seriously, I still don't understand why you need all
available fwd-engines to verify this feature.
From what I understand, the main purpose of your changes to testpmd is to
allow forwarding packets through a different fwd_stream (TX through a different HW queue).
In theory, if implemented in a generic and extendable way, that
might be a useful add-on to testpmd fwd functionality.
But the current implementation looks very case specific.
And as I don't think it is a common case, I don't see much point in polluting
the basic fwd cases with it.

BTW, as a side note, the code below looks bogus to me:
+void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst, packet_fwd_cb fwd)	
+{
+	uint16_t i, nb_fs_rx = 1, port;
+
+	/* Locate real source fs according to mbuf->port. */
+	for (i = 0; i < nb_rx; ++i) {
+		rte_prefetch0(pkts_burst[i + 1]);

this accesses pkts_burst[] beyond the array boundary on the last iteration,
and it asks the CPU to prefetch an unknown and possibly invalid address.
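A bounds-safe variant that also groups consecutive packets from the same
source port into sub-bursts could look like the following (just a sketch,
not the posted patch; lookup_fwd_stream() and packet_fwd_cb are placeholder
names for the stream lookup and the per-engine callback):

void
forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
		   struct rte_mbuf **pkts_burst, packet_fwd_cb fwd)
{
	uint16_t i, first = 0;

	RTE_SET_USED(fs); /* kept to match the posted signature */
	for (i = 1; i <= nb_rx; ++i) {
		/* Prefetch only while a next mbuf actually exists. */
		if (i < nb_rx)
			rte_prefetch0(pkts_burst[i]);
		/* Flush the sub-burst once the source port changes. */
		if (i == nb_rx ||
		    pkts_burst[i]->port != pkts_burst[first]->port) {
			fwd(lookup_fwd_stream(pkts_burst[first]->port),
			    i - first, &pkts_burst[first]);
			first = i;
		}
	}
}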

> Based on the patch, I believe the
> impact has been minimized.
> 
> >
> > >
> > >
> > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > representors? If so, can't we
> > > > > > > > > have a port-representor-specific solution? Reducing the
> > > > > > > > > scope can reduce the
> > > > > > > > > complexity it brings.
> > > > > > > > >
> > > > > > > > > > > If this offload is only useful for the representor
> > > > > > > > > > > case, can we make this offload specific to
> > > > > > > > > > > the representor case by changing its name and
> > > > > > > > > > > scope?
> > > > > > > > > >
> > > > > > > > > > It works for both PF and representors in the same switch
> > > > > > > > > > domain; for applications like OVS, few changes are needed
> > > > > > > > > > to apply it.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > [snip: quoted patch unchanged]
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > >


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-28 14:59                             ` Jerin Jacob
@ 2021-09-29  7:41                               ` Xueming(Steven) Li
  2021-09-29  8:05                                 ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-29  7:41 UTC (permalink / raw)
  To: jerinjacobk, Raslan Darawsheh
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko,
	konstantin.ananyev, dev, ferruh.yigit

On Tue, 2021-09-28 at 20:29 +0530, Jerin Jacob wrote:
> On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > > > 
> > > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > > <xuemingl@nvidia.com> wrote:
> > > > > 
> > > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > 
> > > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > 
> > > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > > Monjalon <thomas@monjalon.net>;
> > > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > 
> > > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > In the current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > for incoming packets. When the number of
> > > > > > > > > > > > > > > representors scales out in a
> > > > > > > > > > > > > > > switch domain, the memory consumption becomes
> > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > importantly, polling all ports leads to high
> > > > > > > > > > > > > > > cache miss rates, high
> > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This patch introduces the shared RX queue. Ports
> > > > > > > > > > > > > > > with the same
> > > > > > > > > > > > > > > configuration in a switch domain could share an
> > > > > > > > > > > > > > > RX queue set by specifying a sharing group.
> > > > > > > > > > > > > > > Polling any queue using the same shared RX queue
> > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > all member ports. The source port is identified
> > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The port queue number in a shared group should be
> > > > > > > > > > > > > > > identical. The queue
> > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > 1:1 mapped in the shared group.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > A shared RX queue is supposed to be polled on the
> > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Multiple groups are supported by group ID.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > > so, can this name be changed to refer specifically
> > > > > > > > > > > > > > to the representor?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, PF and representors in a switch domain could
> > > > > > > > > > > > > take advantage of it.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > If it is for a generic case, how will the flow
> > > > > > > > > > > > > > ordering be maintained?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > The control path is
> > > > > > > > > > > > > almost the same as before; PF and representor ports
> > > > > > > > > > > > > are still needed, and rte flows are not impacted.
> > > > > > > > > > > > > Queues are still needed for each member port;
> > > > > > > > > > > > > descriptors (mbufs) will be
> > > > > > > > > > > > > supplied from the shared Rx queue in my PMD
> > > > > > > > > > > > > implementation.
> > > > > > > > > > > > 
> > > > > > > > > > > > My question was: if we create a generic
> > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > ethdev receive queues land into the same
> > > > > > > > > > > > receive queue. In that case, how is the flow order
> > > > > > > > > > > > maintained for the respective receive queues?
> > > > > > > > > > >
> > > > > > > > > > > I guess the question is about the testpmd forward stream? The
> > > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > > of shared rxq.
> > > > > > > > > > > Basically, for each packet in the rx_burst result, look up the
> > > > > > > > > > > source stream according to mbuf->port and forward to the
> > > > > > > > > > > target fs.
> > > > > > > > > > > Packets from the same source port could be grouped as a
> > > > > > > > > > > small burst to process; this accelerates the
> > > > > > > > > > > performance if traffic comes from a
> > > > > > > > > > > limited set of ports. I'll introduce a common api to do
> > > > > > > > > > > shared rxq forwarding, called with a packet handling
> > > > > > > > > > > callback, so it suits
> > > > > > > > > > > all forwarding engines. Will send patches soon.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > All ports will put the packets into the same queue
> > > > > > > > > > (shared queue), right? Does
> > > > > > > > > > this mean only a single core will poll? What will
> > > > > > > > > > happen if there are
> > > > > > > > > > multiple cores polling, won't it cause problems?
> > > > > > > > > >
> > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > application, I am not sure about
> > > > > > > > > > the solution; can't this work in a way transparent to
> > > > > > > > > > the application?
> > > > > > > > > 
> > > > > > > > > Discussed with Jerin, a new API introduced in v3 2/8
> > > > > > > > > aggregates ports
> > > > > > > > > in the same group into one new port. Users could schedule
> > > > > > > > > polling on the
> > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > 
> > > > > > > > The v3 still has testpmd changes in the fastpath. Right? IMO,
> > > > > > > > for this
> > > > > > > > feature, we should not change the fastpath of the testpmd
> > > > > > > > application. Instead, testpmd can use aggregated ports,
> > > > > > > > probably as a
> > > > > > > > separate fwd_engine, to show how to use this feature.
> > > > > > > 
> > > > > > > Good point to discuss :) There are two strategies for polling
> > > > > > > a shared
> > > > > > > Rxq:
> > > > > > > 1. polling each member port
> > > > > > >    All forwarding engines can be reused to work as before.
> > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > >    Does your PMD support this?
> > > > > > 
> > > > > > Unfortunately not. More than that, every application needs to
> > > > > > change
> > > > > > to support this model.
> > > > > 
> > > > > Both strategies need the user application to resolve the port ID
> > > > > from the mbuf and
> > > > > process accordingly.
> > > > > This one doesn't demand an aggregated port and needs no polling
> > > > > schedule change.
> > > > 
> > > > I was thinking the mbuf will be updated from the driver/aggregator port
> > > > as it
> > > > comes to the application.
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 2. polling aggregated port
> > > > > > >    Besides the forwarding engine, more work is needed to demo it.
> > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > 
> > > > > > We are thinking of implementing this in the PMD when it comes to
> > > > > > it, i.e.
> > > > > > without application changes in the fastpath
> > > > > > logic.
> > > > > 
> > > > > The fastpath has to resolve the port ID anyway and forward according
> > > > > to
> > > > > logic. Forwarding engines need to adapt to support shared Rxq.
> > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > >
> > > > > Let's defer part 2 until some PMD really supports it and it is
> > > > > tested; what do
> > > > > you think?
> > > > 
> > > > We are not planning to use this feature, so either way it is OK to
> > > > me.
> > > > I leave it to the ethdev maintainers to decide between 1 vs 2.
> > > >
> > > > I do have a strong opinion against changing the testpmd basic forward
> > > > engines
> > > > for this feature. I would like to keep them simple and fastpath
> > > > optimized, and would
> > > > like to add a separate forwarding engine as a means to verify this
> > > > feature.
> > > 
> > > +1 to that.
> > > I don't think it is a 'common' feature.
> > > So a separate FWD mode seems like the best choice to me.
> > 
> > -1 :)
> > There was some internal requirement from the test team; they need to verify
> 
> Internal QA requirements may not be the driving factor :-)

It will be a test requirement for any driver to face, not just an internal
one. The performance difference is almost zero in v3: only an "unlikely if"
test on each burst. Shared Rxq is a low-level feature; reusing all current FWD
engines to verify high-level driver features is important IMHO.
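For reference, that per-burst check amounts to roughly the following in a
forwarding engine's receive path (a condensed sketch, not the literal v3
code; the shared_rxq flag and both callback names are placeholders, with
forward_shared_rxq() as sketched earlier in the thread):

	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
				 pkts_burst, nb_pkt_per_burst);
	if (nb_rx == 0)
		return;
	if (unlikely(fs->shared_rxq))
		/* Re-group packets by mbuf->port and forward per stream. */
		forward_shared_rxq(fs, nb_rx, pkts_burst, pkt_burst_fwd);
	else
		pkt_burst_fwd(fs, nb_rx, pkts_burst);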

> 
> > all features like packet content, rss, vlan, checksum, rte_flow... to
> > be working based on the shared rx queue. Based on the patch, I believe the
> > impact has been minimized.
> 
> 
> > 
> > > 
> > > > 
> > > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > representors? If so, can't we
> > > > > > > > > > have a port-representor-specific solution? Reducing the
> > > > > > > > > > scope can reduce the
> > > > > > > > > > complexity it brings.
> > > > > > > > > > 
> > > > > > > > > > > > If this offload is only useful for the representor
> > > > > > > > > > > > case, can we make this offload specific to
> > > > > > > > > > > > the representor case by changing its name and
> > > > > > > > > > > > scope?
> > > > > > > > > > > 
> > > > > > > > > > > It works for both PF and representors in the same switch
> > > > > > > > > > > domain; for applications like OVS, few changes are needed
> > > > > > > > > > > to apply it.
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > [snip: quoted patch unchanged]
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > 
> > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29  7:41                               ` Xueming(Steven) Li
@ 2021-09-29  8:05                                 ` Jerin Jacob
  2021-10-08  8:26                                   ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-09-29  8:05 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: Raslan Darawsheh, NBU-Contact-Thomas Monjalon, andrew.rybchenko,
	konstantin.ananyev, dev, ferruh.yigit

On Wed, Sep 29, 2021 at 1:11 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Tue, 2021-09-28 at 20:29 +0530, Jerin Jacob wrote:
> > On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > > On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > > > >
> > > > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > > > <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > > > Monjalon <thomas@monjalon.net>;
> > > > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > In the current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > for incoming packets. When the number of
> > > > > > > > > > > > > > > > representors scales out in a
> > > > > > > > > > > > > > > > switch domain, the memory consumption becomes
> > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > importantly, polling all ports leads to high
> > > > > > > > > > > > > > > > cache miss rates, high
> > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This patch introduces the shared RX queue. Ports
> > > > > > > > > > > > > > > > with the same
> > > > > > > > > > > > > > > > configuration in a switch domain could share an
> > > > > > > > > > > > > > > > RX queue set by specifying a sharing group.
> > > > > > > > > > > > > > > > Polling any queue using the same shared RX queue
> > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > all member ports. The source port is identified
> > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The port queue number in a shared group should be
> > > > > > > > > > > > > > > > identical. The queue
> > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > 1:1 mapped in the shared group.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > A shared RX queue is supposed to be polled on the
> > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Multiple groups are supported by group ID.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > >
> > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > ethdev receive queues land into
> > > > > the same
> > > > > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > >
> > > > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > > > target fs.
> > > > > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > > performance if traffic
> > > > > come from
> > > > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > > > happen if there are
> > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > >
> > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > > > the application?
> > > > > > > > > >
> > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > > > aggregate ports
> > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > polling on the
> > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > >
> > > > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > > > For this
> > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > probably as
> > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > >
> > > > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > > > a shared
> > > > > > > > Rxq:
> > > > > > > > 1. polling each member port
> > > > > > > >    All forwarding engines can be reused to work as before.
> > > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > > >    Does your PMD support this?
> > > > > > >
> > > > > > > Not unfortunately. More than that, every application needs to
> > > > > > > change
> > > > > > > to support this model.
> > > > > >
> > > > > > Both strategies need user application to resolve port ID from
> > > > > > mbuf and
> > > > > > process accordingly.
> > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > change.
> > > > >
> > > > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > > > when it
> > > > > comes to application.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > 2. polling aggregated port
> > > > > > > >    Besides forwarding engine, need more work to to demo it.
> > > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > >
> > > > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > > > ie.
> > > > > > > without application change in fastpath
> > > > > > > logic.
> > > > > >
> > > > > > Fastpath have to resolve port ID anyway and forwarding according
> > > > > > to
> > > > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > >
> > > > > > Let's defer part 2 until some PMD really support it and tested,
> > > > > > how do
> > > > > > you think?
> > > > >
> > > > > We are not planning to use this feature so either way it is OK to
> > > > > me.
> > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > >
> > > > > I do have a strong opinion not changing the testpmd basic forward
> > > > > engines
> > > > > for this feature.I would like to keep it simple as fastpath
> > > > > optimized and would
> > > > > like to add a separate Forwarding engine as means to verify this
> > > > > feature.
> > > >
> > > > +1 to that.
> > > > I don't think it a 'common' feature.
> > > > So separate FWD mode seems like a best choice to me.
> > >
> > > -1 :)
> > > There was some internal requirement from test team, they need to verify
> >


> > Internal QA requirements may not be the driving factor :-)
>
> It will be a test requirement any driver has to face, not an internal
> one. The performance difference is almost zero in v3, only an
> "unlikely if" test on each burst. Shared Rxq is a low-level feature;
> reusing all current FWD engines to verify a driver's high-level
> features is important IMHO.

In addition to the extra "if" check, the real concern is polluting the
common forwarding engines with a feature that is not common.

If you really want to reuse the existing applications without any
application change, I think you need to hook this into eventdev:
http://code.dpdk.org/dpdk/latest/source/lib/eventdev/rte_eventdev.h#L34

Eventdev drivers already do this in addition to other features, i.e. an
eventdev has ports (which act as a kind of aggregator) and can receive
packets from any queue, with mbuf->port set to the port the packet
actually arrived on.
That is, in terms of mapping:
- the event queue will be a dummy, the same as the Rx queue
- the Rx adapter will also be a dummy
- event ports aggregate multiple queues and connect to a core via an event port
- on Rx, mbuf->port will be the actual port the packet was received on.
app/test-eventdev is written to use this model.
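
A minimal sketch of that polling model, assuming the Rx adapter and the
event port were set up beforehand; the device/port IDs and handle_pkt()
are illustrative only, not part of any proposal:

#include <rte_eventdev.h>
#include <rte_mbuf.h>

#define EV_DEV  0	/* assumed eventdev id */
#define EV_PORT 0	/* assumed event port aggregating the Rx queues */

static void
handle_pkt(uint16_t src_port, struct rte_mbuf *m)
{
	(void)src_port;		/* real application logic goes here */
	rte_pktmbuf_free(m);
}

static void
poll_aggregated_port(void)
{
	struct rte_event ev[32];
	uint16_t i, nb;

	/* One dequeue returns packets from every member Rx queue. */
	nb = rte_event_dequeue_burst(EV_DEV, EV_PORT, ev, 32, 0);
	for (i = 0; i < nb; i++)
		/* mbuf->port identifies the ethdev the packet arrived on. */
		handle_pkt(ev[i].mbuf->port, ev[i].mbuf);
}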



>
> >
> > > all features like packet content, rss, vlan, checksum, rte_flow... to
> > > be working based on shared rx queue. Based on the patch, I believe the
> > > impact has been minimized.
> >
> >
> > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > represontors? If so can't we
> > > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > > scope can reduce the
> > > > > > > > > > > complexity it brings?
> > > > > > > > > > >
> > > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > > representor the case by changing its
> > > > > name and
> > > > > > > > > > > > > scope.
> > > > > > > > > > > >
> > > > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > > apply.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > >  doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > >  7 +++++++
> > > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each
> > > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > > +  Polling the large number of ports brings
> > > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > +  is present in Rx offloading capability of
> > > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port
> > > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > > +  packets of all ports in group, port ID is
> > > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > > +       uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > > >         /**
> > > > > > > > > > > > > > > >          * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > >          * Only offloads set on
> > > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > >                                  DEV_RX_OFFLO
> > > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > >
> > >
>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29  0:26                             ` Ananyev, Konstantin
@ 2021-09-29  8:40                               ` Xueming(Steven) Li
  2021-09-29 10:20                                 ` Ananyev, Konstantin
  2021-09-29  9:12                               ` Xueming(Steven) Li
  1 sibling, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-29  8:40 UTC (permalink / raw)
  To: jerinjacobk, Raslan Darawsheh, konstantin.ananyev
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > In current DPDK framework, each RX queue
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This patch introduces shared RX queue.
> > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > configuration in a switch domain could
> > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > Polling any queue using same shared RX
> > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Share RX queue is supposed to be polled
> > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > could
> > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > question.
> > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > port
> > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > implementation.
> > > > > > > > > > > > 
> > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > ethdev receive queues land into
> > > > the same
> > > > > > > > > > > > receive queue, In that case, how the flow order
> > > > > > > > > > > > is
> > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > 
> > > > > > > > > > > I guess the question is testpmd forward stream?
> > > > > > > > > > > The
> > > > > > > > > > > forwarding logic has to be changed slightly in
> > > > > > > > > > > case
> > > > > > > > > > > of shared rxq.
> > > > > > > > > > > basically for each packet in rx_burst result,
> > > > > > > > > > > lookup
> > > > > > > > > > > source stream according to mbuf->port, forwarding
> > > > > > > > > > > to
> > > > > > > > > > > target fs.
> > > > > > > > > > > Packets from same source port could be grouped as
> > > > > > > > > > > a
> > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > performance if traffic
> > > > come from
> > > > > > > > > > > limited ports. I'll introduce some common api to
> > > > > > > > > > > do
> > > > > > > > > > > shard rxq forwarding, call it with packets
> > > > > > > > > > > handling
> > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > (share queue), right? Does
> > > > > > > > > > this means only single core will poll only, what
> > > > > > > > > > will
> > > > > > > > > > happen if there are
> > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > 
> > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > application, I am not sure about
> > > > > > > > > > the solution, can't this work in a transparent way
> > > > > > > > > > to
> > > > > > > > > > the application?
> > > > > > > > > 
> > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8
> > > > > > > > > that
> > > > > > > > > aggregate ports
> > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > polling on the
> > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > 
> > > > > > > > The v3 still has testpmd changes in fastpath. Right?
> > > > > > > > IMO,
> > > > > > > > For this
> > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > probably as
> > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > 
> > > > > > > Good point to discuss :) There are two strategies to
> > > > > > > polling
> > > > > > > a shared
> > > > > > > Rxq:
> > > > > > > 1. polling each member port
> > > > > > >    All forwarding engines can be reused to work as
> > > > > > > before.
> > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > >    Does your PMD support this?
> > > > > > 
> > > > > > Not unfortunately. More than that, every application needs
> > > > > > to
> > > > > > change
> > > > > > to support this model.
> > > > > 
> > > > > Both strategies need user application to resolve port ID from
> > > > > mbuf and
> > > > > process accordingly.
> > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > change.
> > > > 
> > > > I was thinking, mbuf will be updated from driver/aggregator
> > > > port as
> > > > when it
> > > > comes to application.
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 2. polling aggregated port
> > > > > > >    Besides forwarding engine, need more work to to demo
> > > > > > > it.
> > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > 
> > > > > > We are thinking of implementing this PMD when it comes to
> > > > > > it,
> > > > > > ie.
> > > > > > without application change in fastpath
> > > > > > logic.
> > > > > 
> > > > > Fastpath have to resolve port ID anyway and forwarding
> > > > > according
> > > > > to
> > > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > > Fortunately, in testpmd, this can be done with an abstract
> > > > > API.
> > > > > 
> > > > > Let's defer part 2 until some PMD really support it and
> > > > > tested,
> > > > > how do
> > > > > you think?
> > > > 
> > > > We are not planning to use this feature so either way it is OK
> > > > to
> > > > me.
> > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > 
> > > > I do have a strong opinion not changing the testpmd basic
> > > > forward
> > > > engines
> > > > for this feature.I would like to keep it simple as fastpath
> > > > optimized and would
> > > > like to add a separate Forwarding engine as means to verify
> > > > this
> > > > feature.
> > > 
> > > +1 to that.
> > > I don't think it a 'common' feature.
> > > So separate FWD mode seems like a best choice to me.
> > 
> > -1 :)
> > There was some internal requirement from test team, they need to
> > verify
> > all features like packet content, rss, vlan, checksum, rte_flow...
> > to
> > be working based on shared rx queue.
> 
> Then I suppose you'll need to write really comprehensive fwd-engine 
> to satisfy your test team :)
> Speaking seriously, I still don't understand why do you need all
> available fwd-engines to verify this feature.

The shared Rxq is a low-level feature, so we need to make sure the
driver's higher-level features keep working on top of it. fwd-engines
like csum check the input packet and enable L3/L4 checksum and tunnel
offloads accordingly; the other engines do their own feature
verification. All existing test automation could be reused once these
engines support shared Rxq seamlessly.

> From what I understand, main purpose of your changes to test-pmd:
> allow to fwd packet though different fwd_stream (TX through different
> HW queue).

Yes, each mbuf in the burst may come from a different port. testpmd's
current fwd-engines rely heavily on the source forwarding stream, which
is why the patch divides the burst result into per-port sub-bursts and
hands each one to the original fwd-engine callback. How the packets are
handled is not changed.
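
For reference, a minimal sketch of that split, with the prefetch bounded
so pkts_burst[] is never read past nb_rx; the callback signature and the
stream-lookup helper are assumptions here, not the exact testpmd code:

/* Assumed callback signature and lookup helper (testpmd context). */
typedef void (*packet_fwd_cb)(struct fwd_stream *fs, uint16_t nb_rx,
			      struct rte_mbuf **pkts);
struct fwd_stream *forward_stream_by_port(struct fwd_stream *fs,
					  uint16_t port);

static void
forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
		   struct rte_mbuf **pkts_burst, packet_fwd_cb fwd)
{
	uint16_t i, start = 0, port;

	if (nb_rx == 0)
		return;
	port = pkts_burst[0]->port;
	for (i = 0; i < nb_rx; i++) {
		/* Prefetch the next mbuf only while it is in bounds. */
		if (i + 1 < nb_rx)
			rte_prefetch0(pkts_burst[i + 1]);
		if (pkts_burst[i]->port != port) {
			/* Flush the sub-burst of the previous source port. */
			fwd(forward_stream_by_port(fs, port), i - start,
			    &pkts_burst[start]);
			start = i;
			port = pkts_burst[i]->port;
		}
	}
	/* Flush the tail sub-burst. */
	fwd(forward_stream_by_port(fs, port), nb_rx - start,
	    &pkts_burst[start]);
}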

> In theory, if implemented in generic and extendable way - that
> might be a useful add-on to tespmd fwd functionality.
> But current implementation looks very case specific.
> And as I don't think it is a common case, I don't see much point to
> pollute
> basic fwd cases with it.

Shared Rxq is an ethdev feature that impacts how packets get handled.
It's natural to update the forwarding engines so they don't break. The
new macro is introduced to minimize the performance impact; I'm also
wondering whether there is a more elegant solution :) The current
performance penalty is one "if unlikely" check per burst.

Think of it in the reverse direction: if we don't update the
fwd-engines here, they all malfunction when shared Rxq is enabled and
users can't verify driver features with them. Is that what you are
expecting?
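
To be concrete, the guard being discussed is roughly the following;
fs->rxq_share and do_forward() are assumed names standing in for the
testpmd internals:

static void
pkt_burst_recv_and_fwd(struct fwd_stream *fs)
{
	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
	uint16_t nb_rx;

	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue,
				 pkts_burst, MAX_PKT_BURST);
	if (nb_rx == 0)
		return;
	if (unlikely(fs->rxq_share)) {
		/* Shared Rxq: regroup by mbuf->port, then reuse the
		 * engine's normal callback per sub-burst. */
		forward_shared_rxq(fs, nb_rx, pkts_burst, do_forward);
		return;
	}
	do_forward(fs, nb_rx, pkts_burst);	/* unchanged fast path */
}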

> 
> BTW, as a side note, the code below looks bogus to me:
> +void
> +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> +		   struct rte_mbuf **pkts_burst, packet_fwd_cb
> fwd)	
> +{
> +	uint16_t i, nb_fs_rx = 1, port;
> +
> +	/* Locate real source fs according to mbuf->port. */
> +	for (i = 0; i < nb_rx; ++i) {
> +		rte_prefetch0(pkts_burst[i + 1]);
> 
> you access pkt_burst[] beyond array boundaries,
> also you ask cpu to prefetch some unknown and possibly invalid
> address.
> 
> > Based on the patch, I believe the
> > impact has been minimized.
> > 
> > > 
> > > > 
> > > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > represontors? If so can't we
> > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > scope can reduce the
> > > > > > > > > > complexity it brings?
> > > > > > > > > > 
> > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > representor the case by changing its
> > > > name and
> > > > > > > > > > > > scope.
> > > > > > > > > > > 
> > > > > > > > > > > It works for both PF and representors in same
> > > > > > > > > > > switch
> > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > apply.
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > >  doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representat
> > > > > > > > > > > > > > > ion.
> > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > >  7 +++++++
> > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner
> > > > > > > > > > > > > > > packet L4
> > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_cap
> > > > > > > > > > > > > > > a:DE
> > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Supports shared Rx queue for ports in
> > > > > > > > > > > > > > > same
> > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ`
> > > > > > > > > > > > > > > `.
> > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand
> > > > > > > > > > > > > > > =
> > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a
> > > > > > > > > > > > > > > software
> > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver
> > > > > > > > > > > > > > > model
> > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation
> > > > > > > > > > > > > > > /net
> > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +- Memory usage of representors is huge
> > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > +  because PMD always allocate mbuf for
> > > > > > > > > > > > > > > each
> > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > +  Polling the large number of ports
> > > > > > > > > > > > > > > brings
> > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > +  is present in Rx offloading capability
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > +  offloading flag in device Rx mode or
> > > > > > > > > > > > > > > Rx
> > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > +  shared Rx queue. Polling any member
> > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > +  packets of all ports in group, port ID
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_
> > > > > > > > > > > > > > > CKSU
> > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER
> > > > > > > > > > > > > > > _SPL
> > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct
> > > > > > > > > > > > > > > rte_eth_rxconf {
> > > > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop
> > > > > > > > > > > > > > > packets
> > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > >         uint8_t rx_deferred_start; /**<
> > > > > > > > > > > > > > > Do
> > > > > > > > > > > > > > > not start queue with rte_eth_dev_start().
> > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > +       uint32_t shared_group; /**<
> > > > > > > > > > > > > > > Shared
> > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > >         /**
> > > > > > > > > > > > > > >          * Per-queue Rx offloads to be
> > > > > > > > > > > > > > > set
> > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > >          * Only offloads set on
> > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@
> > > > > > > > > > > > > > > struct
> > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > + * Rx queue is shared among ports in
> > > > > > > > > > > > > > > same
> > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > + * Real source port number saved in
> > > > > > > > > > > > > > > mbuf-
> > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > >                                  DEV_RX_O
> > > > > > > > > > > > > > > FFLO
> > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > 
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29  0:26                             ` Ananyev, Konstantin
  2021-09-29  8:40                               ` Xueming(Steven) Li
@ 2021-09-29  9:12                               ` Xueming(Steven) Li
  2021-09-29  9:52                                 ` Ananyev, Konstantin
  1 sibling, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-29  9:12 UTC (permalink / raw)
  To: jerinjacobk, konstantin.ananyev
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > In current DPDK framework, each RX queue
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This patch introduces shared RX queue.
> > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > configuration in a switch domain could
> > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > Polling any queue using same shared RX
> > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Share RX queue is supposed to be polled
> > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > could
> > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > question.
> > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > port
> > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > implementation.
> > > > > > > > > > > > 
> > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > ethdev receive queues land into
> > > > the same
> > > > > > > > > > > > receive queue, In that case, how the flow order
> > > > > > > > > > > > is
> > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > 
> > > > > > > > > > > I guess the question is testpmd forward stream?
> > > > > > > > > > > The
> > > > > > > > > > > forwarding logic has to be changed slightly in
> > > > > > > > > > > case
> > > > > > > > > > > of shared rxq.
> > > > > > > > > > > basically for each packet in rx_burst result,
> > > > > > > > > > > lookup
> > > > > > > > > > > source stream according to mbuf->port, forwarding
> > > > > > > > > > > to
> > > > > > > > > > > target fs.
> > > > > > > > > > > Packets from same source port could be grouped as
> > > > > > > > > > > a
> > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > performance if traffic
> > > > come from
> > > > > > > > > > > limited ports. I'll introduce some common api to
> > > > > > > > > > > do
> > > > > > > > > > > shard rxq forwarding, call it with packets
> > > > > > > > > > > handling
> > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > (share queue), right? Does
> > > > > > > > > > this means only single core will poll only, what
> > > > > > > > > > will
> > > > > > > > > > happen if there are
> > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > 
> > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > application, I am not sure about
> > > > > > > > > > the solution, can't this work in a transparent way
> > > > > > > > > > to
> > > > > > > > > > the application?
> > > > > > > > > 
> > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8
> > > > > > > > > that
> > > > > > > > > aggregate ports
> > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > polling on the
> > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > 
> > > > > > > > The v3 still has testpmd changes in fastpath. Right?
> > > > > > > > IMO,
> > > > > > > > For this
> > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > probably as
> > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > 
> > > > > > > Good point to discuss :) There are two strategies to
> > > > > > > polling
> > > > > > > a shared
> > > > > > > Rxq:
> > > > > > > 1. polling each member port
> > > > > > >    All forwarding engines can be reused to work as
> > > > > > > before.
> > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > >    Does your PMD support this?
> > > > > > 
> > > > > > Not unfortunately. More than that, every application needs
> > > > > > to
> > > > > > change
> > > > > > to support this model.
> > > > > 
> > > > > Both strategies need user application to resolve port ID from
> > > > > mbuf and
> > > > > process accordingly.
> > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > change.
> > > > 
> > > > I was thinking, mbuf will be updated from driver/aggregator
> > > > port as
> > > > when it
> > > > comes to application.
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 2. polling aggregated port
> > > > > > >    Besides forwarding engine, need more work to to demo
> > > > > > > it.
> > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > 
> > > > > > We are thinking of implementing this PMD when it comes to
> > > > > > it,
> > > > > > ie.
> > > > > > without application change in fastpath
> > > > > > logic.
> > > > > 
> > > > > Fastpath have to resolve port ID anyway and forwarding
> > > > > according
> > > > > to
> > > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > > Fortunately, in testpmd, this can be done with an abstract
> > > > > API.
> > > > > 
> > > > > Let's defer part 2 until some PMD really support it and
> > > > > tested,
> > > > > how do
> > > > > you think?
> > > > 
> > > > We are not planning to use this feature so either way it is OK
> > > > to
> > > > me.
> > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > 
> > > > I do have a strong opinion not changing the testpmd basic
> > > > forward
> > > > engines
> > > > for this feature.I would like to keep it simple as fastpath
> > > > optimized and would
> > > > like to add a separate Forwarding engine as means to verify
> > > > this
> > > > feature.
> > > 
> > > +1 to that.
> > > I don't think it a 'common' feature.
> > > So separate FWD mode seems like a best choice to me.
> > 
> > -1 :)
> > There was some internal requirement from test team, they need to
> > verify
> > all features like packet content, rss, vlan, checksum, rte_flow...
> > to
> > be working based on shared rx queue.
> 
> Then I suppose you'll need to write really comprehensive fwd-engine 
> to satisfy your test team :)
> Speaking seriously, I still don't understand why do you need all
> available fwd-engines to verify this feature.
> From what I understand, main purpose of your changes to test-pmd:
> allow to fwd packet though different fwd_stream (TX through different
> HW queue).
> In theory, if implemented in generic and extendable way - that
> might be a useful add-on to tespmd fwd functionality.
> But current implementation looks very case specific.
> And as I don't think it is a common case, I don't see much point to
> pollute
> basic fwd cases with it.
> 
> BTW, as a side note, the code below looks bogus to me:
> +void
> +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> +		   struct rte_mbuf **pkts_burst, packet_fwd_cb
> fwd)	
> +{
> +	uint16_t i, nb_fs_rx = 1, port;
> +
> +	/* Locate real source fs according to mbuf->port. */
> +	for (i = 0; i < nb_rx; ++i) {
> +		rte_prefetch0(pkts_burst[i + 1]);
> 
> you access pkt_burst[] beyond array boundaries,
> also you ask cpu to prefetch some unknown and possibly invalid
> address.

Sorry, I forgot this topic. It's too late to prefetch the current
packet, so prefetching the next one is better. Prefetching a possibly
invalid address at the end of a loop doesn't hurt; it's a common
pattern in DPDK.
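
That said, since the concern above is also the read of pkts_burst[]
itself going out of bounds, a bounds check keeps that access valid at
negligible cost, e.g. (process_pkt() is an assumed per-packet handler):

	for (i = 0; i < nb_rx; i++) {
		/* Prefetch ahead, never reading past pkts_burst[]. */
		if (i + 1 < nb_rx)
			rte_prefetch0(pkts_burst[i + 1]);
		process_pkt(pkts_burst[i]);
	}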

> 
> > Based on the patch, I believe the
> > impact has been minimized.
> > 
> > > 
> > > > 
> > > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > represontors? If so can't we
> > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > scope can reduce the
> > > > > > > > > > complexity it brings?
> > > > > > > > > > 
> > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > representor the case by changing its
> > > > name and
> > > > > > > > > > > > scope.
> > > > > > > > > > > 
> > > > > > > > > > > It works for both PF and representors in same
> > > > > > > > > > > switch
> > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > apply.
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > >  doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representat
> > > > > > > > > > > > > > > ion.
> > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > >  7 +++++++
> > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > 
<snip: quoted patch diff>


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29  9:12                               ` Xueming(Steven) Li
@ 2021-09-29  9:52                                 ` Ananyev, Konstantin
  2021-09-29 11:07                                   ` Bruce Richardson
  2021-09-29 12:08                                   ` Xueming(Steven) Li
  0 siblings, 2 replies; 266+ messages in thread
From: Ananyev, Konstantin @ 2021-09-29  9:52 UTC (permalink / raw)
  To: Xueming(Steven) Li, jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, Yigit, Ferruh



> -----Original Message-----
> From: Xueming(Steven) Li <xuemingl@nvidia.com>
> Sent: Wednesday, September 29, 2021 10:13 AM
> To: jerinjacobk@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
> <ferruh.yigit@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> 
> On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > > In current DPDK framework, each RX queue
> > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This patch introduces shared RX queue.
> > > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > configuration in a switch domain could
> > > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > Polling any queue using same shared RX
> > > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Share RX queue is supposed to be polled
> > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > > could
> > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > > The control path is almost the same as before: PF
> > > > > > > > > > > > > > and representor ports are still needed, and rte
> > > > > > > > > > > > > > flows are not impacted.
> > > > > > > > > > > > > > Queues are still needed for each member port;
> > > > > > > > > > > > > > descriptors (mbufs) will be supplied from the
> > > > > > > > > > > > > > shared Rx queue in my PMD implementation.
> > > > > > > > > > > > >
> > > > > > > > > > > > > My question was: if we create a generic
> > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload and multiple
> > > > > > > > > > > > > ethdev receive queues land into the same receive
> > > > > > > > > > > > > queue, how is the flow order maintained for the
> > > > > > > > > > > > > respective receive queues?
> > > > > > > > > > > >
> > > > > > > > > > > > I guess the question is about the testpmd forward
> > > > > > > > > > > > stream? The forwarding logic has to be changed slightly
> > > > > > > > > > > > in case of shared rxq: basically, for each packet in the
> > > > > > > > > > > > rx_burst result, look up the source stream according to
> > > > > > > > > > > > mbuf->port and forward to the target fs.
> > > > > > > > > > > > Packets from the same source port could be grouped into
> > > > > > > > > > > > a small burst to process; this accelerates performance
> > > > > > > > > > > > if traffic comes from a limited number of ports. I'll
> > > > > > > > > > > > introduce some common API to do shared rxq forwarding,
> > > > > > > > > > > > called with a packet-handling callback, so it suits all
> > > > > > > > > > > > forwarding engines. Will send patches soon.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > All ports will put the packets into the same queue
> > > > > > > > > > > (shared queue), right? Does this mean only a single core
> > > > > > > > > > > will poll? What will happen if there are multiple cores
> > > > > > > > > > > polling - won't it cause a problem?
> > > > > > > > > > >
> > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > application, I am not sure about the solution; can't
> > > > > > > > > > > this work in a way transparent to the application?
> > > > > > > > > >
> > > > > > > > > > As discussed with Jerin, a new API is introduced in v3 2/8
> > > > > > > > > > that aggregates ports in the same group into one new port.
> > > > > > > > > > Users could schedule polling on the aggregated port instead
> > > > > > > > > > of all member ports.
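For illustration, a minimal sketch of polling such an aggregated port,
assuming it behaves as a regular ethdev port and that the real source port
is reported in mbuf->port (handle_pkt_from_port() is a hypothetical
application handler):

	#include <rte_ethdev.h>
	#include <rte_mbuf.h>

	#define BURST_SIZE 32

	/* One poll of the aggregated port returns packets from all member
	 * ports of the shared Rx queue group; demultiplex by mbuf->port. */
	static void
	poll_aggregated_port(uint16_t agg_port_id, uint16_t queue_id)
	{
		struct rte_mbuf *pkts[BURST_SIZE];
		uint16_t i, nb_rx;

		nb_rx = rte_eth_rx_burst(agg_port_id, queue_id, pkts,
					 BURST_SIZE);
		for (i = 0; i < nb_rx; i++)
			handle_pkt_from_port(pkts[i]->port, pkts[i]);
	}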
> > > > > > > > >
> > > > > > > > > The v3 still has testpmd changes in the fastpath, right?
> > > > > > > > > IMO, for this feature we should not change the fastpath of
> > > > > > > > > the testpmd application. Instead, testpmd can probably use
> > > > > > > > > aggregated ports as a separate fwd_engine to show how to use
> > > > > > > > > this feature.
> > > > > > > >
> > > > > > > > Good point to discuss :) There are two strategies for polling
> > > > > > > > a shared Rxq:
> > > > > > > > 1. polling each member port
> > > > > > > >    All forwarding engines can be reused to work as
> > > > > > > > before.
> > > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > > >    Does your PMD support this?
> > > > > > >
> > > > > > > Unfortunately not. More than that, every application needs to
> > > > > > > change to support this model.
> > > > > >
> > > > > > Both strategies need user application to resolve port ID from
> > > > > > mbuf and
> > > > > > process accordingly.
> > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > change.
> > > > >
> > > > > I was thinking the mbuf will be updated by the driver/aggregator
> > > > > port by the time it comes to the application.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > 2. polling aggregated port
> > > > > > > >    Besides the forwarding engine, more work is needed to demo
> > > > > > > > it.
> > > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > >
> > > > > > > We are thinking of implementing this PMD when it comes to
> > > > > > > it,
> > > > > > > ie.
> > > > > > > without application change in fastpath
> > > > > > > logic.
> > > > > >
> > > > > > The fastpath has to resolve the port ID anyway and forward
> > > > > > according to its logic. Forwarding engines need to adapt to
> > > > > > support shared Rxq.
> > > > > > Fortunately, in testpmd, this can be done with an abstract
> > > > > > API.
> > > > > >
> > > > > > Let's defer part 2 until some PMD really supports it and it has
> > > > > > been tested. What do you think?
> > > > >
> > > > > We are not planning to use this feature, so either way it is OK to
> > > > > me. I'll leave it to the ethdev maintainers to decide between 1 and 2.
> > > > >
> > > > > I do have a strong opinion about not changing the testpmd basic
> > > > > forward engines for this feature. I would like to keep them simple
> > > > > and fastpath-optimized, and would like to add a separate forwarding
> > > > > engine as a means to verify this feature.
> > > >
> > > > +1 to that.
> > > > I don't think it is a 'common' feature.
> > > > So a separate FWD mode seems like the best choice to me.
> > >
> > > -1 :)
> > > There was some internal requirement from test team, they need to
> > > verify
> > > all features like packet content, rss, vlan, checksum, rte_flow...
> > > to
> > > be working based on shared rx queue.
> >
> > Then I suppose you'll need to write a really comprehensive fwd-engine
> > to satisfy your test team :)
> > Speaking seriously, I still don't understand why you need all
> > available fwd-engines to verify this feature.
> > From what I understand, the main purpose of your changes to test-pmd is
> > to allow forwarding a packet through a different fwd_stream (TX through
> > a different HW queue).
> > In theory, if implemented in a generic and extendable way, that
> > might be a useful add-on to testpmd fwd functionality.
> > But the current implementation looks very case specific.
> > And as I don't think it is a common case, I don't see much point in
> > polluting the basic fwd cases with it.
> >
> > BTW, as a side note, the code below looks bogus to me:
> > +void
> > +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> > +		   struct rte_mbuf **pkts_burst, packet_fwd_cb
> > fwd)
> > +{
> > +	uint16_t i, nb_fs_rx = 1, port;
> > +
> > +	/* Locate real source fs according to mbuf->port. */
> > +	for (i = 0; i < nb_rx; ++i) {
> > +		rte_prefetch0(pkts_burst[i + 1]);
> >
> > You access pkt_burst[] beyond the array boundaries, and also ask the
> > CPU to prefetch some unknown and possibly invalid address.
> 
> Sorry, I forgot this topic. It's too late to prefetch the current packet,
> so prefetching the next one is better. Prefetching an invalid address at
> the end of a loop doesn't hurt; it's common in DPDK.

First of all, it is never 'OK' to access an array beyond its bounds.
Second, prefetching an invalid address *does* hurt performance badly on many
CPUs (TLB misses, consumed memory bandwidth, etc.).
As a reference: https://lwn.net/Articles/444346/
If some existing DPDK code really does that, then I believe it is an issue
that has to be addressed.
More importantly, it is a really bad attitude to submit bogus code to the
DPDK community and pretend that it is 'OK'.
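For reference, a minimal sketch of the quoted loop with the out-of-bounds
access removed - the next-element prefetch is simply guarded so that
pkts_burst[nb_rx] is never read:

	uint16_t i;

	for (i = 0; i < nb_rx; ++i) {
		/* Prefetch the next mbuf only while it is still inside the
		 * burst; never touch pkts_burst[nb_rx]. */
		if (i + 1 < nb_rx)
			rte_prefetch0(pkts_burst[i + 1]);
		/* ... locate the source fs from pkts_burst[i]->port ... */
	}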

> 
> >
> > > Based on the patch, I believe the
> > > impact has been minimized.
> > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > representors? If so, can't we have a port-representor-
> > > > > > > > > > > specific solution? Reducing the scope can reduce the
> > > > > > > > > > > complexity it brings.
> > > > > > > > > > >
> > > > > > > > > > > > > If this offload is only useful for the representor
> > > > > > > > > > > > > case, can we make this offload specific to the
> > > > > > > > > > > > > representor case by changing its name and scope?
> > > > > > > > > > > >
> > > > > > > > > > > > It works for both PF and representors in the same
> > > > > > > > > > > > switch domain; for an application like OVS, few
> > > > > > > > > > > > changes are needed to apply it.
<snip: quoted patch diff>


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29  8:40                               ` Xueming(Steven) Li
@ 2021-09-29 10:20                                 ` Ananyev, Konstantin
  2021-09-29 13:25                                   ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Ananyev, Konstantin @ 2021-09-29 10:20 UTC (permalink / raw)
  To: Xueming(Steven) Li, jerinjacobk, Raslan Darawsheh
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, Yigit, Ferruh

> > > > > > > > > > > > > > > > In current DPDK framework, each RX queue
> > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This patch introduces shared RX queue.
> > > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > configuration in a switch domain could
> > > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > Polling any queue using same shared RX
> > > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Share RX queue is supposed to be polled
> > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > > could
> > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > > The control path is almost the same as before: PF
> > > > > > > > > > > > > > and representor ports are still needed, and rte
> > > > > > > > > > > > > > flows are not impacted.
> > > > > > > > > > > > > > Queues are still needed for each member port;
> > > > > > > > > > > > > > descriptors (mbufs) will be supplied from the
> > > > > > > > > > > > > > shared Rx queue in my PMD implementation.
> > > > > > > > > > > > >
> > > > > > > > > > > > > My question was: if we create a generic
> > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload and multiple
> > > > > > > > > > > > > ethdev receive queues land into the same receive
> > > > > > > > > > > > > queue, how is the flow order maintained for the
> > > > > > > > > > > > > respective receive queues?
> > > > > > > > > > > >
> > > > > > > > > > > > I guess the question is about the testpmd forward
> > > > > > > > > > > > stream? The forwarding logic has to be changed slightly
> > > > > > > > > > > > in case of shared rxq: basically, for each packet in the
> > > > > > > > > > > > rx_burst result, look up the source stream according to
> > > > > > > > > > > > mbuf->port and forward to the target fs.
> > > > > > > > > > > > Packets from the same source port could be grouped into
> > > > > > > > > > > > a small burst to process; this accelerates performance
> > > > > > > > > > > > if traffic comes from a limited number of ports. I'll
> > > > > > > > > > > > introduce some common API to do shared rxq forwarding,
> > > > > > > > > > > > called with a packet-handling callback, so it suits all
> > > > > > > > > > > > forwarding engines. Will send patches soon.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > All ports will put the packets into the same queue
> > > > > > > > > > > (shared queue), right? Does this mean only a single core
> > > > > > > > > > > will poll? What will happen if there are multiple cores
> > > > > > > > > > > polling - won't it cause a problem?
> > > > > > > > > > >
> > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > application, I am not sure about the solution; can't
> > > > > > > > > > > this work in a way transparent to the application?
> > > > > > > > > >
> > > > > > > > > > As discussed with Jerin, a new API is introduced in v3 2/8
> > > > > > > > > > that aggregates ports in the same group into one new port.
> > > > > > > > > > Users could schedule polling on the aggregated port instead
> > > > > > > > > > of all member ports.
> > > > > > > > >
> > > > > > > > > The v3 still has testpmd changes in the fastpath, right?
> > > > > > > > > IMO, for this feature we should not change the fastpath of
> > > > > > > > > the testpmd application. Instead, testpmd can probably use
> > > > > > > > > aggregated ports as a separate fwd_engine to show how to use
> > > > > > > > > this feature.
> > > > > > > >
> > > > > > > > Good point to discuss :) There are two strategies for polling
> > > > > > > > a shared Rxq:
> > > > > > > > 1. polling each member port
> > > > > > > >    All forwarding engines can be reused to work as
> > > > > > > > before.
> > > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > > >    Does your PMD support this?
> > > > > > >
> > > > > > > Unfortunately not. More than that, every application needs to
> > > > > > > change to support this model.
> > > > > >
> > > > > > Both strategies need user application to resolve port ID from
> > > > > > mbuf and
> > > > > > process accordingly.
> > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > change.
> > > > >
> > > > > I was thinking the mbuf will be updated by the driver/aggregator
> > > > > port by the time it comes to the application.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > 2. polling aggregated port
> > > > > > > >    Besides the forwarding engine, more work is needed to demo
> > > > > > > > it.
> > > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > >
> > > > > > > We are thinking of implementing this PMD when it comes to
> > > > > > > it,
> > > > > > > ie.
> > > > > > > without application change in fastpath
> > > > > > > logic.
> > > > > >
> > > > > > The fastpath has to resolve the port ID anyway and forward
> > > > > > according to its logic. Forwarding engines need to adapt to
> > > > > > support shared Rxq.
> > > > > > Fortunately, in testpmd, this can be done with an abstract
> > > > > > API.
> > > > > >
> > > > > > Let's defer part 2 until some PMD really supports it and it has
> > > > > > been tested. What do you think?
> > > > >
> > > > > We are not planning to use this feature, so either way it is OK to
> > > > > me. I'll leave it to the ethdev maintainers to decide between 1 and 2.
> > > > >
> > > > > I do have a strong opinion about not changing the testpmd basic
> > > > > forward engines for this feature. I would like to keep them simple
> > > > > and fastpath-optimized, and would like to add a separate forwarding
> > > > > engine as a means to verify this feature.
> > > >
> > > > +1 to that.
> > > > I don't think it is a 'common' feature.
> > > > So a separate FWD mode seems like the best choice to me.
> > >
> > > -1 :)
> > > There was some internal requirement from test team, they need to
> > > verify
> > > all features like packet content, rss, vlan, checksum, rte_flow...
> > > to
> > > be working based on shared rx queue.
> >
> > Then I suppose you'll need to write a really comprehensive fwd-engine
> > to satisfy your test team :)
> > Speaking seriously, I still don't understand why you need all
> > available fwd-engines to verify this feature.
> 
> The shared Rxq is a low-level feature; we need to make sure the driver's
> higher-level features keep working properly. fwd-engines like csum check
> the input packet and enable L3/L4 checksum and tunnel offloads
> accordingly; other engines do their own feature verification. All test
> automation could be reused if these engines support shared Rxq seamlessly.
> 
> > From what I understand, the main purpose of your changes to test-pmd is
> > to allow forwarding a packet through a different fwd_stream (TX through
> > a different HW queue).
> 
> Yes, each mbuf in the burst may come from a different port. testpmd's
> current fwd-engines rely heavily on the source forwarding stream; that's
> why the patch divides the burst result mbufs into sub-bursts and uses the
> original fwd-engine callback to handle them. How they are handled is not
> changed.
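For illustration, a minimal sketch of that sub-burst division, assuming a
hypothetical stream_by_port() lookup and a callback taking (stream,
packets, count) - the actual patch's callback signature may differ:

	#include <rte_mbuf.h>

	/* packet_fwd_cb is the callback type from the patch; the argument
	 * order used here is an assumption. */
	static void
	forward_by_source_port(struct rte_mbuf **pkts, uint16_t nb_rx,
			       packet_fwd_cb fwd)
	{
		uint16_t i = 0;

		while (i < nb_rx) {
			uint16_t port = pkts[i]->port;
			uint16_t start = i;

			/* Grow the sub-burst while consecutive mbufs come
			 * from the same source port. */
			while (i < nb_rx && pkts[i]->port == port)
				i++;
			/* stream_by_port() is a hypothetical lookup of the
			 * fwd_stream owning this source port; the unchanged
			 * fwd-engine callback then handles the sub-burst. */
			fwd(stream_by_port(port), &pkts[start],
			    (uint16_t)(i - start));
		}
	}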
> 
> > In theory, if implemented in a generic and extendable way, that
> > might be a useful add-on to testpmd fwd functionality.
> > But the current implementation looks very case specific.
> > And as I don't think it is a common case, I don't see much point in
> > polluting the basic fwd cases with it.
> 
> Shared Rxq is an ethdev feature that impacts how packets get handled.
> It's natural to update the forwarding engines to avoid breaking them.

Why is that?
All it affects is the way you RX the packets.
So why do *all* FWD engines have to be updated?
Say, what specifically are you going to test with macswap vs macfwd mode
for that feature?
I still think one specific FWD engine is enough to cover the majority of
test cases.

> The new macro is introduced to minimize the performance impact; I'm also
> wondering whether there is a more elegant solution :)

I think Jerin suggested a good alternative with eventdev.
As another approach, we might consider adding an RX callback that
will return packets only for one particular port (while keeping packets
for other ports cached internally).
As a 'wild' thought: change the testpmd fwd logic to allow multiple TX
queues per fwd_stream and add a function to do the TX switching logic.
But that's probably quite a big change that needs a lot of work.
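For illustration, a rough sketch of that RX-callback idea; the demux
context and stash layout here are assumptions, and the path that later
returns stashed packets when their own port is polled is omitted for
brevity:

	#include <rte_ethdev.h>
	#include <rte_mbuf.h>

	#define STASH_DEPTH 32

	struct rxq_demux {
		struct rte_mbuf *stash[RTE_MAX_ETHPORTS][STASH_DEPTH];
		uint16_t stash_cnt[RTE_MAX_ETHPORTS];
	};

	/* Keep only packets whose source matches the polled port; stash
	 * the rest for delivery when their own port is polled. */
	static uint16_t
	demux_cb(uint16_t port_id, uint16_t queue __rte_unused,
		 struct rte_mbuf **pkts, uint16_t nb_pkts,
		 uint16_t max_pkts __rte_unused, void *user_param)
	{
		struct rxq_demux *dmx = user_param;
		uint16_t i, kept = 0;

		for (i = 0; i < nb_pkts; i++) {
			uint16_t src = pkts[i]->port;

			if (src == port_id)
				pkts[kept++] = pkts[i];
			else if (dmx->stash_cnt[src] < STASH_DEPTH)
				dmx->stash[src][dmx->stash_cnt[src]++] =
					pkts[i];
			else
				rte_pktmbuf_free(pkts[i]); /* stash full */
		}
		return kept; /* burst now holds only this port's packets */
	}

Registration would then be something like
rte_eth_add_rx_callback(port_id, queue_id, demux_cb, &dmx) for each member
port of the shared Rx queue group.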

> Current performance penalty
> is one "if unlikely" per burst.

It is not only about performance impact.
It is about keeping test-pmd code simple and maintainable.

> 
> Think in the reverse direction: if we don't update the fwd-engines here,
> they all malfunction when shared rxq is enabled and users can't verify
> driver features. Is that what you expect?

I expect developers not to rewrite the whole test-pmd fwd code for each new
ethdev feature, especially for a feature that is not widely used.

> 
> >
> > BTW, as a side note, the code below looks bogus to me:
> > +void
> > +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> > +		   struct rte_mbuf **pkts_burst, packet_fwd_cb
> > fwd)
> > +{
> > +	uint16_t i, nb_fs_rx = 1, port;
> > +
> > +	/* Locate real source fs according to mbuf->port. */
> > +	for (i = 0; i < nb_rx; ++i) {
> > +		rte_prefetch0(pkts_burst[i + 1]);
> >
> > You access pkt_burst[] beyond the array boundaries, and also ask the
> > CPU to prefetch some unknown and possibly invalid address.
> >
> > > Based on the patch, I believe the
> > > impact has been minimized.
> > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > representors? If so, can't we have a port-representor-
> > > > > > > > > > > specific solution? Reducing the scope can reduce the
> > > > > > > > > > > complexity it brings.
> > > > > > > > > > >
> > > > > > > > > > > > > If this offload is only useful for the representor
> > > > > > > > > > > > > case, can we make this offload specific to the
> > > > > > > > > > > > > representor case by changing its name and scope?
> > > > > > > > > > > >
> > > > > > > > > > > > It works for both PF and representors in the same
> > > > > > > > > > > > switch domain; for an application like OVS, few
> > > > > > > > > > > > changes are needed to apply it.
<snip: quoted patch diff>


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29  9:52                                 ` Ananyev, Konstantin
@ 2021-09-29 11:07                                   ` Bruce Richardson
  2021-09-29 11:46                                     ` Ananyev, Konstantin
  2021-09-29 12:08                                   ` Xueming(Steven) Li
  1 sibling, 1 reply; 266+ messages in thread
From: Bruce Richardson @ 2021-09-29 11:07 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Xueming(Steven) Li, jerinjacobk, NBU-Contact-Thomas Monjalon,
	andrew.rybchenko, dev, Yigit, Ferruh

On Wed, Sep 29, 2021 at 09:52:20AM +0000, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Sent: Wednesday, September 29, 2021 10:13 AM
<snip>
> > > +	/* Locate real source fs according to mbuf->port. */
> > > +	for (i = 0; i < nb_rx; ++i) {
> > > +		rte_prefetch0(pkts_burst[i + 1]);
> > >
> > > You access pkt_burst[] beyond the array boundaries, and also ask the
> > > CPU to prefetch some unknown and possibly invalid address.
> > 
> > Sorry, I forgot this topic. It's too late to prefetch the current packet,
> > so prefetching the next one is better. Prefetching an invalid address at
> > the end of a loop doesn't hurt; it's common in DPDK.
> 
> First of all, it is never 'OK' to access an array beyond its bounds.
> Second, prefetching an invalid address *does* hurt performance badly on
> many CPUs (TLB misses, consumed memory bandwidth, etc.).
> As a reference: https://lwn.net/Articles/444346/
> If some existing DPDK code really does that, then I believe it is an issue
> that has to be addressed.
> More importantly, it is a really bad attitude to submit bogus code to the
> DPDK community and pretend that it is 'OK'.
>
 
The main point we need to take from all this is that when
prefetching you need to measure perf impact of it.

In terms of the specific case of prefetching one past the end of the array,
I would take the view that this is harmless in almost all cases. Unlike any
prefetch of "NULL" as in the referenced mail, reading one past the end (or
other small number of elements past the end) is far less likely to cause a
TLB miss - and it's basically just reproducing behaviour we would expect
off a HW prefetcher (though those my explicitly never cross page
boundaries). However, if you feel it's just cleaner to put in an
additional condition to remove the prefetch for the end case, that's ok
also - again so long as it doesn't affect performance. [Since prefetch is a
hint, I'm not sure if compilers or CPUs may be legally allowed to skip the
branch and blindly prefetch in all cases?]
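
For illustration, that extra end-case condition could look something
like this (a sketch only, reusing the pkts_burst/nb_rx names from the
patch under discussion):

	for (i = 0; i < nb_rx; i++) {
		/* Skip the hint on the last element so that
		 * pkts_burst[nb_rx], which is out of bounds,
		 * is never read. */
		if (i + 1 < nb_rx)
			rte_prefetch0(pkts_burst[i + 1]);
		/* ... process pkts_burst[i] ... */
	}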

/Bruce

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29 11:07                                   ` Bruce Richardson
@ 2021-09-29 11:46                                     ` Ananyev, Konstantin
  2021-09-29 12:17                                       ` Bruce Richardson
  0 siblings, 1 reply; 266+ messages in thread
From: Ananyev, Konstantin @ 2021-09-29 11:46 UTC (permalink / raw)
  To: Richardson, Bruce
  Cc: Xueming(Steven) Li, jerinjacobk, NBU-Contact-Thomas Monjalon,
	andrew.rybchenko, dev, Yigit, Ferruh



> -----Original Message-----
> From: Richardson, Bruce <bruce.richardson@intel.com>
> Sent: Wednesday, September 29, 2021 12:08 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Xueming(Steven) Li <xuemingl@nvidia.com>; jerinjacobk@gmail.com; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh <ferruh.yigit@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> 
> On Wed, Sep 29, 2021 at 09:52:20AM +0000, Ananyev, Konstantin wrote:
> >
> >
> > > -----Original Message-----
> > > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Sent: Wednesday, September 29, 2021 10:13 AM
> <snip>
> > > > +	/* Locate real source fs according to mbuf->port. */
> > > > +	for (i = 0; i < nb_rx; ++i) {
> > > > +		rte_prefetch0(pkts_burst[i + 1]);
> > > >
> > > > you access pkt_burst[] beyond array boundaries,
> > > > also you ask cpu to prefetch some unknown and possibly invalid
> > > > address.
> > >
> > > Sorry I forgot this topic. It's too late to prefetch the current packet,
> > > so prefetching the next one is better. Prefetching an invalid address at
> > > the end of a loop doesn't hurt; it's common in DPDK.
> >
> > First of all, it is usually never 'OK' to access an array beyond its bounds.
> > Second, prefetching an invalid address *does* hurt performance badly on many CPUs
> > (TLB misses, consumed memory bandwidth etc.).
> > As a reference:  https://lwn.net/Articles/444346/
> > If some existing DPDK code really does that - then I believe it is an issue and has to be addressed.
> > More importantly - it is a really bad attitude to submit bogus code to the DPDK community
> > and pretend that it is 'OK'.
> >
> 
> The main point we need to take from all this is that when
> prefetching you need to measure the perf impact of it.
> 
> In terms of the specific case of prefetching one past the end of the array,
> I would take the view that this is harmless in almost all cases. Unlike any
> prefetch of "NULL" as in the referenced mail, reading one past the end (or
> other small number of elements past the end) is far less likely to cause a
> TLB miss - and it's basically just reproducing behaviour we would expect
> > off a HW prefetcher (though those may explicitly never cross page
> boundaries). However, if you feel it's just cleaner to put in an
> additional condition to remove the prefetch for the end case, that's ok
> also - again so long as it doesn't affect performance. [Since prefetch is a
> hint, I'm not sure if compilers or CPUs may be legally allowed to skip the
> branch and blindly prefetch in all cases?]

Please look at the code.
It doesn't prefetch the next element beyond the array boundaries.
It first reads an address from the element that is beyond the array boundaries (which is a bug by itself).
Then it prefetches that bogus address.
We simply don't know whether this address is valid or where it points to.

In other words, it doesn't do:
rte_prefetch0(&pkts_burst[i + 1]);
 
It does:
rte_prefetch0(pkts_burst[i + 1]);


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29  9:52                                 ` Ananyev, Konstantin
  2021-09-29 11:07                                   ` Bruce Richardson
@ 2021-09-29 12:08                                   ` Xueming(Steven) Li
  2021-09-29 12:35                                     ` Ananyev, Konstantin
  1 sibling, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-29 12:08 UTC (permalink / raw)
  To: jerinjacobk, konstantin.ananyev
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Wed, 2021-09-29 at 09:52 +0000, Ananyev, Konstantin wrote:
> 
> > -----Original Message-----
> > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Sent: Wednesday, September 29, 2021 10:13 AM
> > To: jerinjacobk@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
> > <ferruh.yigit@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > 
> > On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue
> > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > This patch introduces shared RX queue.
> > > > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > configuration in a switch domain could
> > > > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > > Polling any queue using same shared RX
> > > > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled
> > > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > > > question.
> > > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > > ethdev receive queues land into
> > > > > > the same
> > > > > > > > > > > > > > receive queue, In that case, how the flow order
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I guess the question is testpmd forward stream?
> > > > > > > > > > > > > The
> > > > > > > > > > > > > forwarding logic has to be changed slightly in
> > > > > > > > > > > > > case
> > > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > > basically for each packet in rx_burst result,
> > > > > > > > > > > > > lookup
> > > > > > > > > > > > > source stream according to mbuf->port, forwarding
> > > > > > > > > > > > > to
> > > > > > > > > > > > > target fs.
> > > > > > > > > > > > > Packets from same source port could be grouped as
> > > > > > > > > > > > > a
> > > > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > > > performance if traffic
> > > > > > come from
> > > > > > > > > > > > > limited ports. I'll introduce some common api to
> > > > > > > > > > > > > do
> > > > > > > > > > > > > shard rxq forwarding, call it with packets
> > > > > > > > > > > > > handling
> > > > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > > this means only single core will poll only, what
> > > > > > > > > > > > will
> > > > > > > > > > > > happen if there are
> > > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > > > 
> > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > the solution, can't this work in a transparent way
> > > > > > > > > > > > to
> > > > > > > > > > > > the application?
> > > > > > > > > > > 
> > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8
> > > > > > > > > > > that
> > > > > > > > > > > aggregate ports
> > > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > > polling on the
> > > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > > > 
> > > > > > > > > > The v3 still has testpmd changes in fastpath. Right?
> > > > > > > > > > IMO,
> > > > > > > > > > For this
> > > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > > probably as
> > > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > > > 
> > > > > > > > > Good point to discuss :) There are two strategies to
> > > > > > > > > polling
> > > > > > > > > a shared
> > > > > > > > > Rxq:
> > > > > > > > > 1. polling each member port
> > > > > > > > >    All forwarding engines can be reused to work as
> > > > > > > > > before.
> > > > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > > > >    Does your PMD support this?
> > > > > > > > 
> > > > > > > > Not unfortunately. More than that, every application needs
> > > > > > > > to
> > > > > > > > change
> > > > > > > > to support this model.
> > > > > > > 
> > > > > > > Both strategies need user application to resolve port ID from
> > > > > > > mbuf and
> > > > > > > process accordingly.
> > > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > > change.
> > > > > > 
> > > > > > I was thinking, mbuf will be updated from driver/aggregator
> > > > > > port as
> > > > > > when it
> > > > > > comes to application.
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 2. polling aggregated port
> > > > > > > > >    Besides forwarding engine, need more work to to demo
> > > > > > > > > it.
> > > > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > > > 
> > > > > > > > We are thinking of implementing this PMD when it comes to
> > > > > > > > it,
> > > > > > > > ie.
> > > > > > > > without application change in fastpath
> > > > > > > > logic.
> > > > > > > 
> > > > > > > Fastpath have to resolve port ID anyway and forwarding
> > > > > > > according
> > > > > > > to
> > > > > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > > > > Fortunately, in testpmd, this can be done with an abstract
> > > > > > > API.
> > > > > > > 
> > > > > > > Let's defer part 2 until some PMD really support it and
> > > > > > > tested,
> > > > > > > how do
> > > > > > > you think?
> > > > > > 
> > > > > > We are not planning to use this feature so either way it is OK
> > > > > > to
> > > > > > me.
> > > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > > > 
> > > > > > I do have a strong opinion not changing the testpmd basic
> > > > > > forward
> > > > > > engines
> > > > > > for this feature.I would like to keep it simple as fastpath
> > > > > > optimized and would
> > > > > > like to add a separate Forwarding engine as means to verify
> > > > > > this
> > > > > > feature.
> > > > > 
> > > > > +1 to that.
> > > > > I don't think it a 'common' feature.
> > > > > So separate FWD mode seems like a best choice to me.
> > > > 
> > > > -1 :)
> > > > There was some internal requirement from test team, they need to
> > > > verify
> > > > all features like packet content, rss, vlan, checksum, rte_flow...
> > > > to
> > > > be working based on shared rx queue.
> > > 
> > > Then I suppose you'll need to write really comprehensive fwd-engine
> > > to satisfy your test team :)
> > > Speaking seriously, I still don't understand why do you need all
> > > available fwd-engines to verify this feature.
> > > From what I understand, main purpose of your changes to test-pmd:
> > > allow to fwd packet though different fwd_stream (TX through different
> > > HW queue).
> > > In theory, if implemented in generic and extendable way - that
> > > might be a useful add-on to testpmd fwd functionality.
> > > But current implementation looks very case specific.
> > > And as I don't think it is a common case, I don't see much point to
> > > pollute
> > > basic fwd cases with it.
> > > 
> > > BTW, as a side note, the code below looks bogus to me:
> > > +void
> > > +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> > > +		   struct rte_mbuf **pkts_burst, packet_fwd_cb
> > > fwd)
> > > +{
> > > +	uint16_t i, nb_fs_rx = 1, port;
> > > +
> > > +	/* Locate real source fs according to mbuf->port. */
> > > +	for (i = 0; i < nb_rx; ++i) {
> > > +		rte_prefetch0(pkts_burst[i + 1]);
> > > 
> > > you access pkt_burst[] beyond array boundaries,
> > > also you ask cpu to prefetch some unknown and possibly invalid
> > > address.
> > 
> > Sorry I forgot this topic. It's too late to prefetch the current packet,
> > so prefetching the next one is better. Prefetching an invalid address at
> > the end of a loop doesn't hurt; it's common in DPDK.
> 
> First of all, it is usually never 'OK' to access an array beyond its bounds.
> Second, prefetching an invalid address *does* hurt performance badly on many CPUs
> (TLB misses, consumed memory bandwidth etc.).
> As a reference:  https://lwn.net/Articles/444346/
> If some existing DPDK code really does that - then I believe it is an issue and has to be addressed.
> More importantly - it is a really bad attitude to submit bogus code to the DPDK community
> and pretend that it is 'OK'.

Thanks for the link!
From the instruction spec, "The PREFETCHh instruction is merely a hint and
does not affect program behavior."
There are 3 choices here:
1: no prefetch. A D$ miss will happen on each packet; the time cost depends
on where the data sits (close or far) and the burst size.
2: prefetch with a loop-end check to avoid a random address. Pro: free of
TLB misses per burst. Con: an "if" instruction per packet. Cost depends
on the burst size.
3: brute-force prefetch. Cost is a TLB miss, but no additional
instructions per packet. Not sure how random the last address could be
in testpmd and how many TLB misses could happen.

Based on my experience of performance optimization, IIRC, option 3 has
the best performance. But for this case, the result depends on how many
sub-bursts are inside and how each sub-burst gets processed; the
callback may or may not flush the prefetched data completely. So it's
hard to reach a conclusion; what I meant is that such code in a PMD
driver should have a reason.

On the other hand, the latency and throughput savings of this feature
on multiple ports are huge, so I prefer to downplay this prefetch
discussion if you agree.


> 
> > 
> > > 
> > > > Based on the patch, I believe the
> > > > impact has been minimized.
> > > > 
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > > represontors? If so can't we
> > > > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > > > scope can reduce the
> > > > > > > > > > > > complexity it brings?
> > > > > > > > > > > > 
> > > > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > > > representor the case by changing its
> > > > > > name and
> > > > > > > > > > > > > > scope.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > It works for both PF and representors in same
> > > > > > > > > > > > > switch
> > > > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > > > apply.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > >  doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representat
> > > > > > > > > > > > > > > > > ion.
> > > > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > >  7 +++++++
> > > > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner
> > > > > > > > > > > > > > > > > packet L4
> > > > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_cap
> > > > > > > > > > > > > > > > > a:DE
> > > > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in
> > > > > > > > > > > > > > > > > same
> > > > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ`
> > > > > > > > > > > > > > > > > `.
> > > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand
> > > > > > > > > > > > > > > > > =
> > > > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a
> > > > > > > > > > > > > > > > > software
> > > > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver
> > > > > > > > > > > > > > > > > model
> > > > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation
> > > > > > > > > > > > > > > > > /net
> > > > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > +- Memory usage of representors is huge
> > > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > > > +  because PMD always allocate mbuf for
> > > > > > > > > > > > > > > > > each
> > > > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > > > +  Polling the large number of ports
> > > > > > > > > > > > > > > > > brings
> > > > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used
> > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > > +  is present in Rx offloading capability
> > > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > > > +  offloading flag in device Rx mode or
> > > > > > > > > > > > > > > > > Rx
> > > > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > > > +  shared Rx queue. Polling any member
> > > > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > > > +  packets of all ports in group, port ID
> > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_
> > > > > > > > > > > > > > > > > CKSU
> > > > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER
> > > > > > > > > > > > > > > > > _SPL
> > > > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct
> > > > > > > > > > > > > > > > > rte_eth_rxconf {
> > > > > > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop
> > > > > > > > > > > > > > > > > packets
> > > > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > > > >         uint8_t rx_deferred_start; /**<
> > > > > > > > > > > > > > > > > Do
> > > > > > > > > > > > > > > > > not start queue with rte_eth_dev_start().
> > > > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > > > +       uint32_t shared_group; /**<
> > > > > > > > > > > > > > > > > Shared
> > > > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > > > >         /**
> > > > > > > > > > > > > > > > >          * Per-queue Rx offloads to be
> > > > > > > > > > > > > > > > > set
> > > > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > > >          * Only offloads set on
> > > > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@
> > > > > > > > > > > > > > > > > struct
> > > > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > > + * Rx queue is shared among ports in
> > > > > > > > > > > > > > > > > same
> > > > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > > > + * Real source port number saved in
> > > > > > > > > > > > > > > > > mbuf-
> > > > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > > >                                  DEV_RX_O
> > > > > > > > > > > > > > > > > FFLO
> > > > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > 
> > > 
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29 11:46                                     ` Ananyev, Konstantin
@ 2021-09-29 12:17                                       ` Bruce Richardson
  0 siblings, 0 replies; 266+ messages in thread
From: Bruce Richardson @ 2021-09-29 12:17 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Xueming(Steven) Li, jerinjacobk, NBU-Contact-Thomas Monjalon,
	andrew.rybchenko, dev, Yigit, Ferruh

On Wed, Sep 29, 2021 at 12:46:51PM +0100, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: Richardson, Bruce <bruce.richardson@intel.com>
> > Sent: Wednesday, September 29, 2021 12:08 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: Xueming(Steven) Li <xuemingl@nvidia.com>; jerinjacobk@gmail.com; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh <ferruh.yigit@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >
> > On Wed, Sep 29, 2021 at 09:52:20AM +0000, Ananyev, Konstantin wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Sent: Wednesday, September 29, 2021 10:13 AM
> > <snip>
> > > > > +       /* Locate real source fs according to mbuf->port. */
> > > > > +       for (i = 0; i < nb_rx; ++i) {
> > > > > +               rte_prefetch0(pkts_burst[i + 1]);
> > > > >
> > > > > you access pkt_burst[] beyond array boundaries,
> > > > > also you ask cpu to prefetch some unknown and possibly invalid
> > > > > address.
> > > >
> > > > Sorry I forgot this topic. It's too late to prefetch the current packet,
> > > > so prefetching the next one is better. Prefetching an invalid address at
> > > > the end of a loop doesn't hurt; it's common in DPDK.
> > >
> > > First of all, it is usually never 'OK' to access an array beyond its bounds.
> > > Second, prefetching an invalid address *does* hurt performance badly on many CPUs
> > > (TLB misses, consumed memory bandwidth etc.).
> > > As a reference:  https://lwn.net/Articles/444346/
> > > If some existing DPDK code really does that - then I believe it is an issue and has to be addressed.
> > > More importantly - it is a really bad attitude to submit bogus code to the DPDK community
> > > and pretend that it is 'OK'.
> > >
> >
> > The main point we need to take from all this is that when
> > prefetching you need to measure the perf impact of it.
> >
> > In terms of the specific case of prefetching one past the end of the array,
> > I would take the view that this is harmless in almost all cases. Unlike any
> > prefetch of "NULL" as in the referenced mail, reading one past the end (or
> > other small number of elements past the end) is far less likely to cause a
> > TLB miss - and it's basically just reproducing behaviour we would expect
> > off a HW prefetcher (though those may explicitly never cross page
> > boundaries). However, if you feel it's just cleaner to put in an
> > additional condition to remove the prefetch for the end case, that's ok
> > also - again so long as it doesn't affect performance. [Since prefetch is a
> > hint, I'm not sure if compilers or CPUs may be legally allowed to skip the
> > branch and blindly prefetch in all cases?]
> 
> Please look at the code.
> It doesn't prefetch the next element beyond the array boundaries.
> It first reads an address from the element that is beyond the array boundaries (which is a bug by itself).
> Then it prefetches that bogus address.
> We simply don't know whether this address is valid or where it points to.
> 
> In other words, it doesn't do:
> rte_prefetch0(&pkts_burst[i + 1]);
> 
> It does:
> rte_prefetch0(pkts_burst[i + 1]);
>
Apologies, yes, you are right, and that is a bug.

/Bruce

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29 12:08                                   ` Xueming(Steven) Li
@ 2021-09-29 12:35                                     ` Ananyev, Konstantin
  2021-09-29 14:54                                       ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Ananyev, Konstantin @ 2021-09-29 12:35 UTC (permalink / raw)
  To: Xueming(Steven) Li, jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, Yigit, Ferruh



> -----Original Message-----
> From: Xueming(Steven) Li <xuemingl@nvidia.com>
> Sent: Wednesday, September 29, 2021 1:09 PM
> To: jerinjacobk@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
> <ferruh.yigit@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> 
> On Wed, 2021-09-29 at 09:52 +0000, Ananyev, Konstantin wrote:
> >
> > > -----Original Message-----
> > > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Sent: Wednesday, September 29, 2021 10:13 AM
> > > To: jerinjacobk@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
> > > <ferruh.yigit@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > >
> > > On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue
> > > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > This patch introduces shared RX queue.
> > > > > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > > configuration in a switch domain could
> > > > > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > > > Polling any queue using same shared RX
> > > > > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled
> > > > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > > > > question.
> > > > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > > > ethdev receive queues land into
> > > > > > > the same
> > > > > > > > > > > > > > > receive queue, In that case, how the flow order
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I guess the question is testpmd forward stream?
> > > > > > > > > > > > > > The
> > > > > > > > > > > > > > forwarding logic has to be changed slightly in
> > > > > > > > > > > > > > case
> > > > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > > > basically for each packet in rx_burst result,
> > > > > > > > > > > > > > lookup
> > > > > > > > > > > > > > source stream according to mbuf->port, forwarding
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > target fs.
> > > > > > > > > > > > > > Packets from same source port could be grouped as
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > > > > performance if traffic
> > > > > > > come from
> > > > > > > > > > > > > > limited ports. I'll introduce some common api to
> > > > > > > > > > > > > > do
> > > > > > > > > > > > > > shard rxq forwarding, call it with packets
> > > > > > > > > > > > > > handling
> > > > > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > > > this means only single core will poll only, what
> > > > > > > > > > > > > will
> > > > > > > > > > > > > happen if there are
> > > > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > > > >
> > > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > > the solution, can't this work in a transparent way
> > > > > > > > > > > > > to
> > > > > > > > > > > > > the application?
> > > > > > > > > > > >
> > > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8
> > > > > > > > > > > > that
> > > > > > > > > > > > aggregate ports
> > > > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > > > polling on the
> > > > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > > > >
> > > > > > > > > > > The v3 still has testpmd changes in fastpath. Right?
> > > > > > > > > > > IMO,
> > > > > > > > > > > For this
> > > > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > > > probably as
> > > > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > > > >
> > > > > > > > > > Good point to discuss :) There are two strategies to
> > > > > > > > > > polling
> > > > > > > > > > a shared
> > > > > > > > > > Rxq:
> > > > > > > > > > 1. polling each member port
> > > > > > > > > >    All forwarding engines can be reused to work as
> > > > > > > > > > before.
> > > > > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > > > > >    Does your PMD support this?
> > > > > > > > >
> > > > > > > > > Not unfortunately. More than that, every application needs
> > > > > > > > > to
> > > > > > > > > change
> > > > > > > > > to support this model.
> > > > > > > >
> > > > > > > > Both strategies need user application to resolve port ID from
> > > > > > > > mbuf and
> > > > > > > > process accordingly.
> > > > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > > > change.
> > > > > > >
> > > > > > > I was thinking, mbuf will be updated from driver/aggregator
> > > > > > > port as
> > > > > > > when it
> > > > > > > comes to application.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > 2. polling aggregated port
> > > > > > > > > >    Besides forwarding engine, need more work to to demo
> > > > > > > > > > it.
> > > > > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > > > >
> > > > > > > > > We are thinking of implementing this PMD when it comes to
> > > > > > > > > it,
> > > > > > > > > ie.
> > > > > > > > > without application change in fastpath
> > > > > > > > > logic.
> > > > > > > >
> > > > > > > > Fastpath have to resolve port ID anyway and forwarding
> > > > > > > > according
> > > > > > > > to
> > > > > > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > > > > > Fortunately, in testpmd, this can be done with an abstract
> > > > > > > > API.
> > > > > > > >
> > > > > > > > Let's defer part 2 until some PMD really support it and
> > > > > > > > tested,
> > > > > > > > how do
> > > > > > > > you think?
> > > > > > >
> > > > > > > We are not planning to use this feature so either way it is OK
> > > > > > > to
> > > > > > > me.
> > > > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > > > >
> > > > > > > I do have a strong opinion not changing the testpmd basic
> > > > > > > forward
> > > > > > > engines
> > > > > > > for this feature.I would like to keep it simple as fastpath
> > > > > > > optimized and would
> > > > > > > like to add a separate Forwarding engine as means to verify
> > > > > > > this
> > > > > > > feature.
> > > > > >
> > > > > > +1 to that.
> > > > > > I don't think it a 'common' feature.
> > > > > > So separate FWD mode seems like a best choice to me.
> > > > >
> > > > > -1 :)
> > > > > There was some internal requirement from test team, they need to
> > > > > verify
> > > > > all features like packet content, rss, vlan, checksum, rte_flow...
> > > > > to
> > > > > be working based on shared rx queue.
> > > >
> > > > Then I suppose you'll need to write really comprehensive fwd-engine
> > > > to satisfy your test team :)
> > > > Speaking seriously, I still don't understand why do you need all
> > > > available fwd-engines to verify this feature.
> > > > From what I understand, main purpose of your changes to test-pmd:
> > > > allow to fwd packet though different fwd_stream (TX through different
> > > > HW queue).
> > > > In theory, if implemented in generic and extendable way - that
> > > > might be a useful add-on to testpmd fwd functionality.
> > > > But current implementation looks very case specific.
> > > > And as I don't think it is a common case, I don't see much point to
> > > > pollute
> > > > basic fwd cases with it.
> > > >
> > > > BTW, as a side note, the code below looks bogus to me:
> > > > +void
> > > > +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> > > > +		   struct rte_mbuf **pkts_burst, packet_fwd_cb
> > > > fwd)
> > > > +{
> > > > +	uint16_t i, nb_fs_rx = 1, port;
> > > > +
> > > > +	/* Locate real source fs according to mbuf->port. */
> > > > +	for (i = 0; i < nb_rx; ++i) {
> > > > +		rte_prefetch0(pkts_burst[i + 1]);
> > > >
> > > > you access pkt_burst[] beyond array boundaries,
> > > > also you ask cpu to prefetch some unknown and possibly invalid
> > > > address.
> > >
> > > Sorry I forgot this topic. It's too late to prefetch the current packet,
> > > so prefetching the next one is better. Prefetching an invalid address at
> > > the end of a loop doesn't hurt; it's common in DPDK.
> >
> > First of all, it is usually never 'OK' to access an array beyond its bounds.
> > Second, prefetching an invalid address *does* hurt performance badly on many CPUs
> > (TLB misses, consumed memory bandwidth etc.).
> > As a reference:  https://lwn.net/Articles/444346/
> > If some existing DPDK code really does that - then I believe it is an issue and has to be addressed.
> > More importantly - it is a really bad attitude to submit bogus code to the DPDK community
> > and pretend that it is 'OK'.
> 
> Thanks for the link!
> From the instruction spec, "The PREFETCHh instruction is merely a hint and
> does not affect program behavior."
> There are 3 choices here:
> 1: no prefetch. A D$ miss will happen on each packet; the time cost depends
> on where the data sits (close or far) and the burst size.
> 2: prefetch with a loop-end check to avoid a random address. Pro: free of
> TLB misses per burst. Con: an "if" instruction per packet. Cost depends
> on the burst size.
> 3: brute-force prefetch. Cost is a TLB miss, but no additional
> instructions per packet. Not sure how random the last address could be
> in testpmd and how many TLB misses could happen.

There are plenty of standard techniques to avoid that issue while keeping
prefetch() in place.
Probably the easiest one:

for (i = 0; i < nb_rx - 1; i++) {
    prefetch(pkt[i + 1]);
    /* do your stuff with pkt[i] here */
}

/* do your stuff with pkt[nb_rx - 1] here */
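
Peeling the last iteration like this needs no per-packet branch and
never reads pkt[nb_rx].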
 
> Based on my experience of performance optimization, IIRC, option 3 has
> the best performance. But for this case, the result depends on how many
> sub-bursts are inside and how each sub-burst gets processed; the
> callback may or may not flush the prefetched data completely. So it's
> hard to reach a conclusion; what I meant is that such code in a PMD
> driver should have a reason.
> 
> On the other hand, the latency and throughput savings of this feature
> on multiple ports are huge, so I prefer to downplay this prefetch
> discussion if you agree.
> 

Honestly, I don't know how else to explain to you that there is a bug in that piece of code.
From my perspective it is a trivial bug, with a trivial fix.
But you simply keep ignoring the arguments.
Till it gets fixed and the other comments addressed - my vote is NACK for this series.
I don't think we need bogus code in testpmd.



^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29 10:20                                 ` Ananyev, Konstantin
@ 2021-09-29 13:25                                   ` Xueming(Steven) Li
  2021-09-30  9:59                                     ` Ananyev, Konstantin
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-29 13:25 UTC (permalink / raw)
  To: jerinjacobk, Raslan Darawsheh, konstantin.ananyev
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Wed, 2021-09-29 at 10:20 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > > > In current DPDK framework, each RX
> > > > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > This patch introduces shared RX
> > > > > > > > > > > > > > > > > queue.
> > > > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > configuration in a switch domain
> > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > > > RX queue set by specifying sharing
> > > > > > > > > > > > > > > > > group.
> > > > > > > > > > > > > > > > > Polling any queue using same shared
> > > > > > > > > > > > > > > > > RX
> > > > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Share RX queue is supposed to be
> > > > > > > > > > > > > > > > > polled
> > > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Multiple groups is supported by group
> > > > > > > > > > > > > > > > > ID.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > > > so can this name be changed
> > > > > > > > > > > > > > > > specifically to
> > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > If it is for a generic case, how the
> > > > > > > > > > > > > > > > flow
> > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > > > question.
> > > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload,
> > > > > > > > > > > > > > multiple
> > > > > > > > > > > > > > ethdev receive queues land into
> > > > > > the same
> > > > > > > > > > > > > > receive queue, In that case, how the flow
> > > > > > > > > > > > > > order
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I guess the question is testpmd forward
> > > > > > > > > > > > > stream?
> > > > > > > > > > > > > The
> > > > > > > > > > > > > forwarding logic has to be changed slightly
> > > > > > > > > > > > > in
> > > > > > > > > > > > > case
> > > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > > > > target fs. Packets from same source port could be
> > > > > > > > > > > > > grouped as a small burst to process, this will
> > > > > > > > > > > > > accelerate the performance if traffic comes from
> > > > > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > > > > shared rxq forwarding, call it with a packet handling
> > > > > > > > > > > > > callback, so it suits all forwarding engines. Will
> > > > > > > > > > > > > send patches soon.
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > All ports will put the packets into the same queue
> > > > > > > > > > > > (share queue), right? Does this mean only a single core
> > > > > > > > > > > > will poll; what will happen if there are multiple cores
> > > > > > > > > > > > polling, won't it cause problems?
> > > > > > > > > > > > 
> > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > application, I am not sure about the solution, can't
> > > > > > > > > > > > this work in a transparent way to the application?
> > > > > > > > > > > 
> > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > > > > aggregates ports in same group into one new port. Users
> > > > > > > > > > > could schedule polling on the aggregated port instead of
> > > > > > > > > > > all member ports.
> > > > > > > > > > 
> > > > > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > > > > for this feature, we should not change fastpath of testpmd
> > > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > > probably as separate fwd_engine to show how to use this
> > > > > > > > > > feature.
> > > > > > > > > 
> > > > > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > > > > a shared Rxq:
> > > > > > > > > 1. polling each member port
> > > > > > > > >    All forwarding engines can be reused to work as before.
> > > > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > > > >    Does your PMD support this?
> > > > > > > > 
> > > > > > > > Not unfortunately. More than that, every application needs to
> > > > > > > > change to support this model.
> > > > > > > 
> > > > > > > Both strategies need the user application to resolve port ID
> > > > > > > from mbuf and process accordingly.
> > > > > > > This one doesn't demand an aggregated port, no polling schedule
> > > > > > > change.
> > > > > > 
> > > > > > I was thinking, mbuf will be updated from driver/aggregator port
> > > > > > as when it comes to application.
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 2. polling aggregated port
> > > > > > > > >    Besides forwarding engine, need more work to demo it.
> > > > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > > > 
> > > > > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > > > > ie. without application change in fastpath logic.
> > > > > > > 
> > > > > > > Fastpath has to resolve port ID anyway and forward according to
> > > > > > > logic. Forwarding engines need to adapt to support shared Rxq.
> > > > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > > > 
> > > > > > > Let's defer part 2 until some PMD really supports it and is
> > > > > > > tested, what do you think?
> > > > > > 
> > > > > > We are not planning to use this feature so either way it is OK to
> > > > > > me. I leave it to ethdev maintainers to decide between 1 vs 2.
> > > > > > 
> > > > > > I do have a strong opinion not changing the testpmd basic forward
> > > > > > engines for this feature. I would like to keep it simple as
> > > > > > fastpath optimized and would like to add a separate forwarding
> > > > > > engine as means to verify this feature.
> > > > > 
> > > > > +1 to that.
> > > > > I don't think it a 'common' feature.
> > > > > So separate FWD mode seems like a best choice to me.
> > > > 
> > > > -1 :)
> > > > There was some internal requirement from test team, they need to
> > > > verify all features like packet content, rss, vlan, checksum,
> > > > rte_flow... to be working based on shared rx queue.
> > > 
> > > Then I suppose you'll need to write a really comprehensive fwd-engine
> > > to satisfy your test team :)
> > > Speaking seriously, I still don't understand why you need all
> > > available fwd-engines to verify this feature.
> > 
> > The shared Rxq is a low level feature; we need to make sure the
> > driver's higher level features work properly. fwd-engines like csum
> > check the input packet and enable L3/L4 checksum and tunnel offloads
> > accordingly, other engines do their own feature verification. All test
> > automation could be reused with these engines supported seamlessly.
> > 
> > > From what I understand, the main purpose of your changes to test-pmd:
> > > allow to fwd packets through different fwd_streams (TX through
> > > different HW queues).
> > 
> > Yes, each mbuf in a burst may come from a different port. Testpmd's
> > current fwd-engines rely heavily on the source forwarding stream,
> > that's why the patch divides burst result mbufs into sub-bursts and
> > uses the original fwd-engine callback to handle them. How to handle is
> > not changed.
> > 
> > > In theory, if implemented in a generic and extendable way - that
> > > might be a useful add-on to testpmd fwd functionality.
> > > But the current implementation looks very case specific.
> > > And as I don't think it is a common case, I don't see much point to
> > > pollute basic fwd cases with it.
> > 
> > Shared Rxq is an ethdev feature that impacts how packets get handled.
> > It's natural to update forwarding engines to avoid breakage.
> 
> Why is that?
> All it affects is the way you RX the packets.
> So why do *all* FWD engines have to be updated?

People will ask why some FWD engines can't work?

> Let's say, what specifically are you going to test with macswap vs
> macfwd mode for that feature?

If people want to test the NIC with a real switch, or make sure the L2
layer does not get corrupted.

> I still think one specific FWD engine is enough to cover the majority
> of test cases.

Yes, rxonly should be sufficient to verify the fundamentals, but to
verify csum, timing, others are needed. Some back-to-back test systems
need io forwarding, real switch deployments need macswap...


> 
> > The new macro is introduced to minimize performance impact, I'm also
> > wondering whether there is an elegant solution :)
> 
> I think Jerin suggested a good alternative with eventdev.
> As another approach - one might consider adding an RX callback that
> will return packets only for one particular port (while keeping
> packets for other ports cached internally).

This and the aggregate port API could be options in the ethdev layer
later. It can't be the fundamental approach due to performance loss and
potential cache misses.
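
For the record, a rough sketch of that callback idea could look like
below, assuming the existing rte_eth_add_rx_callback() hook; the
shared_cache structure and its stash/drain helpers are hypothetical,
not existing DPDK APIs:

/* Hypothetical per-group cache of packets for other member ports. */
struct shared_cache;
void shared_cache_stash(struct shared_cache *c, struct rte_mbuf *m);
uint16_t shared_cache_drain(struct shared_cache *c, uint16_t port_id,
			    struct rte_mbuf **pkts, uint16_t n);

static uint16_t
demux_rx_cb(uint16_t port_id, uint16_t queue_id __rte_unused,
	    struct rte_mbuf **pkts, uint16_t nb_pkts, uint16_t max_pkts,
	    void *user_param)
{
	struct shared_cache *cache = user_param;
	uint16_t i, kept = 0;

	/* Keep only packets of this port, stash the rest. */
	for (i = 0; i < nb_pkts; i++) {
		if (pkts[i]->port == port_id)
			pkts[kept++] = pkts[i];
		else
			shared_cache_stash(cache, pkts[i]);
	}
	/* Top up with packets previously stashed for this port. */
	kept += shared_cache_drain(cache, port_id, &pkts[kept],
				   max_pkts - kept);
	return kept;
}

/* Registered once per member port/queue:
 * rte_eth_add_rx_callback(port_id, queue_id, demux_rx_cb, cache);
 */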

> As a 'wild' thought - change testpmd fwd logic to allow multiple TX
> queues
> per fwd_stream and add a function to do TX switching logic.
> But that's probably quite a big change that needs a lot of work. 
> 
> > Current performance penalty
> > is one "if unlikely" per burst.
> 
> It is not only about performance impact.
> It is about keeping test-pmd code simple and maintainable.
> > 
> > Think in the reverse direction: if we don't update fwd-engines here,
> > they all malfunction when shared rxq is enabled and users can't
> > verify driver features, are you expecting this?
> 
> I expect developers not to rewrite the whole test-pmd fwd code for
> each new ethdev feature.

Here we just abstract duplicated code from fwd-engines, an improvement
that keeps test-pmd code simple and maintainable.

> Especially for a feature that is not widely used.

Based on the huge memory saving, performance and latency gains, it will
be popular with users.

But test-pmd is not critical to this feature; I'm OK to drop the
fwd-engine support if you agree.

> 
> > 
> > > 
> > > BTW, as a side note, the code below looks bogus to me:
> > > +void
> > > +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> > > +		   struct rte_mbuf **pkts_burst, packet_fwd_cb
> > > fwd)
> > > +{
> > > +	uint16_t i, nb_fs_rx = 1, port;
> > > +
> > > +	/* Locate real source fs according to mbuf->port. */
> > > +	for (i = 0; i < nb_rx; ++i) {
> > > +		rte_prefetch0(pkts_burst[i + 1]);
> > > 
> > > you access pkt_burst[] beyond array boundaries,
> > > also you ask cpu to prefetch some unknown and possibly invalid
> > > address.
> > > 
> > > > Based on the patch, I believe the
> > > > impact has been minimized.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > > > representors? If so can't we have a port representor
> > > > > > > > > > > > > specific solution, reducing scope can reduce the
> > > > > > > > > > > > > complexity it brings?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > If this offload is only useful for the representor
> > > > > > > > > > > > > > case, can we make this offload specific to the
> > > > > > > > > > > > > > representor case by changing its name and scope.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > > > apply.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > > > > > > >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > > >  	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > > > > > > >  	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > > >  	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > > > > > > +	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > > >  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > > > > > > >  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > > >  	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > > > > > > > > +	uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > > > > > > > > >  	/**
> > > > > > > > > > > > > > > > >  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > > >  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
> > > > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > > >  				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > > -- 
> > > > > > > > > > > > > > > > > 2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29 12:35                                     ` Ananyev, Konstantin
@ 2021-09-29 14:54                                       ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-09-29 14:54 UTC (permalink / raw)
  To: jerinjacobk, konstantin.ananyev
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Wed, 2021-09-29 at 12:35 +0000, Ananyev, Konstantin wrote:
> 
> > -----Original Message-----
> > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Sent: Wednesday, September 29, 2021 1:09 PM
> > To: jerinjacobk@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
> > <ferruh.yigit@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue> 
> > On Wed, 2021-09-29 at 09:52 +0000, Ananyev, Konstantin wrote:
> > > 
> > > > -----Original Message-----
> > > > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Sent: Wednesday, September 29, 2021 10:13 AM
> > > > To: jerinjacobk@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > > Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
> > > > <ferruh.yigit@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > 
> > > > On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue
> > > > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > This patch introduces shared RX queue.
> > > > > > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > > > configuration in a switch domain could
> > > > > > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > > > > Polling any queue using same shared RX
> > > > > > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled
> > > > > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > > > > > question.
> > > > > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > > > > ethdev receive queues land into
> > > > > > > > the same
> > > > > > > > > > > > > > > > receive queue, In that case, how the flow order
> > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > > > > > > target fs. Packets from same source port could be
> > > > > > > > > > > > > > > grouped as a small burst to process, this will
> > > > > > > > > > > > > > > accelerate the performance if traffic comes from
> > > > > > > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > > > > > > shared rxq forwarding, call it with a packet handling
> > > > > > > > > > > > > > > callback, so it suits all forwarding engines. Will
> > > > > > > > > > > > > > > send patches soon.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > > > > this means only single core will poll only, what
> > > > > > > > > > > > > > will
> > > > > > > > > > > > > > happen if there are
> > > > > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > > > the solution, can't this work in a transparent way
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > the application?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8
> > > > > > > > > > > > > that
> > > > > > > > > > > > > aggregate ports
> > > > > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > > > > polling on the
> > > > > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > > > > > 
> > > > > > > > > > > > The v3 still has testpmd changes in fastpath. Right?
> > > > > > > > > > > > IMO,
> > > > > > > > > > > > For this
> > > > > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > > > > probably as
> > > > > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > > > > > 
> > > > > > > > > > > Good point to discuss :) There are two strategies to
> > > > > > > > > > > polling
> > > > > > > > > > > a shared
> > > > > > > > > > > Rxq:
> > > > > > > > > > > 1. polling each member port
> > > > > > > > > > >    All forwarding engines can be reused to work as
> > > > > > > > > > > before.
> > > > > > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > > > > > >    Does your PMD support this?
> > > > > > > > > > 
> > > > > > > > > > Not unfortunately. More than that, every application needs
> > > > > > > > > > to
> > > > > > > > > > change
> > > > > > > > > > to support this model.
> > > > > > > > > 
> > > > > > > > > Both strategies need user application to resolve port ID from
> > > > > > > > > mbuf and
> > > > > > > > > process accordingly.
> > > > > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > > > > change.
> > > > > > > > 
> > > > > > > > I was thinking, mbuf will be updated from driver/aggregator
> > > > > > > > port as
> > > > > > > > when it
> > > > > > > > comes to application.
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 2. polling aggregated port
> > > > > > > > > > >    Besides forwarding engine, need more work to demo
> > > > > > > > > > > it.
> > > > > > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > > > > > 
> > > > > > > > > > We are thinking of implementing this PMD when it comes to
> > > > > > > > > > it,
> > > > > > > > > > ie.
> > > > > > > > > > without application change in fastpath
> > > > > > > > > > logic.
> > > > > > > > > 
> > > > > > > > > Fastpath has to resolve port ID anyway and forward according
> > > > > > > > > to logic. Forwarding engines need to adapt to support shared
> > > > > > > > > Rxq. Fortunately, in testpmd, this can be done with an
> > > > > > > > > abstract API.
> > > > > > > > > 
> > > > > > > > > Let's defer part 2 until some PMD really support it and
> > > > > > > > > tested,
> > > > > > > > > how do
> > > > > > > > > you think?
> > > > > > > > 
> > > > > > > > We are not planning to use this feature so either way it is OK
> > > > > > > > to me.
> > > > > > > > I leave it to ethdev maintainers to decide between 1 vs 2.
> > > > > > > > 
> > > > > > > > I do have a strong opinion not changing the testpmd basic
> > > > > > > > forward engines for this feature. I would like to keep it
> > > > > > > > simple as fastpath optimized and would like to add a separate
> > > > > > > > forwarding engine as means to verify this feature.
> > > > > > > 
> > > > > > > +1 to that.
> > > > > > > I don't think it a 'common' feature.
> > > > > > > So separate FWD mode seems like a best choice to me.
> > > > > > 
> > > > > > -1 :)
> > > > > > There was some internal requirement from test team, they need to
> > > > > > verify
> > > > > > all features like packet content, rss, vlan, checksum, rte_flow...
> > > > > > to
> > > > > > be working based on shared rx queue.
> > > > > 
> > > > > Then I suppose you'll need to write really comprehensive fwd-engine
> > > > > to satisfy your test team :)
> > > > > Speaking seriously, I still don't understand why do you need all
> > > > > available fwd-engines to verify this feature.
> > > > > From what I understand, main purpose of your changes to test-pmd:
> > > > > allow to fwd packet though different fwd_stream (TX through different
> > > > > HW queue).
> > > > > In theory, if implemented in generic and extendable way - that
> > > > > might be a useful add-on to tespmd fwd functionality.
> > > > > But current implementation looks very case specific.
> > > > > And as I don't think it is a common case, I don't see much point to
> > > > > pollute
> > > > > basic fwd cases with it.
> > > > > 
> > > > > BTW, as a side note, the code below looks bogus to me:
> > > > > +void
> > > > > +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> > > > > +		   struct rte_mbuf **pkts_burst, packet_fwd_cb
> > > > > fwd)
> > > > > +{
> > > > > +	uint16_t i, nb_fs_rx = 1, port;
> > > > > +
> > > > > +	/* Locate real source fs according to mbuf->port. */
> > > > > +	for (i = 0; i < nb_rx; ++i) {
> > > > > +		rte_prefetch0(pkts_burst[i + 1]);
> > > > > 
> > > > > you access pkt_burst[] beyond array boundaries,
> > > > > also you ask cpu to prefetch some unknown and possibly invalid
> > > > > address.
> > > > 
> > > > Sorry I forgot this topic. It's too late to prefetch the current
> > > > packet, so prefetching the next one is better. Prefetching an invalid
> > > > address at the end of a loop doesn't hurt, it's common in DPDK.
> > > 
> First of all, it is usually never 'OK' to access an array beyond its bounds.
> Second, prefetching an invalid address *does* hurt performance badly on many CPUs
> (TLB misses, consumed memory bandwidth etc.).
> As a reference:  https://lwn.net/Articles/444346/
> If some existing DPDK code really does that - then I believe it is an issue and has to be addressed.
> More important - it is a really bad attitude to submit bogus code to DPDK community
> and pretend that it is 'OK'.
> > 
> > Thanks for the link!
> > From instruction spec, "The PREFETCHh instruction is merely a hint and
> > does not affect program behavior."
> > There are 3 choices here:
> > 1: no prefetch. D$ miss will happen on each packet, time cost depends
> > on where data sits (close or far) and burst size.
> > 2: prefetch with loop end check to avoid a random address. Pro is free
> > of TLB misses per burst, con is an "if" instruction per packet. Cost
> > depends on burst size.
> > 3: brute force prefetch, cost is a TLB miss, but no additional
> > instructions per packet. Not sure how random the last address could be
> > in testpmd and how many TLB misses could happen.
> 
> There are plenty of standard techniques to avoid that issue while keeping
> prefetch() in place.
> Probably the easiest one:
> 
> for (i = 0; i < nb_rx - 1; i++) {
>     prefetch(pkt[i + 1]);
>     /* do your stuff with pkt[i] here */
> }
> 
> /* do your stuff with pkt[nb_rx - 1] here */

Thanks, will update in next version.
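
Applied to forward_shared_rxq(), the fixed loop would look roughly like
this (a sketch only; the per-port fs lookup and forwarding are elided,
and the caller is assumed to guarantee nb_rx > 0):

	/* Prefetch the next mbuf only while one is known to exist. */
	for (i = 0; i < nb_rx - 1; ++i) {
		rte_prefetch0(pkts_burst[i + 1]);
		/* ... locate fs for pkts_burst[i]->port and forward ... */
	}
	/* Last packet: nothing left to prefetch. */
	/* ... locate fs for pkts_burst[nb_rx - 1]->port and forward ... */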

>  
> > Based on my experience of performance optimization, IIRC, option 3 has
> > the best performance. But for this case, the result depends on how many
> > sub-bursts are inside and how each sub-burst gets processed, maybe the
> > callback will flush prefetched data completely or not. So it's hard to
> > get a conclusion; what I said is that the code in a PMD driver should
> > have a reason.
> > 
> > On the other hand, the latency and throughput saving of this feature on
> > multiple ports is huge, I prefer to downplay this prefetch discussion
> > if you agree.
> > 
> 
> Honestly, I don't know how else to explain to you that there is a bug in that piece of code.
> From my perspective it is a trivial bug, with a trivial fix.
> But you simply keep ignoring the arguments.
> Till it gets fixed and other comments addressed - my vote is NACK for this series.
> I don't think we need bogus code in testpmd.
> 
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29 13:25                                   ` Xueming(Steven) Li
@ 2021-09-30  9:59                                     ` Ananyev, Konstantin
  2021-10-06  7:54                                       ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Ananyev, Konstantin @ 2021-09-30  9:59 UTC (permalink / raw)
  To: Xueming(Steven) Li, jerinjacobk, Raslan Darawsheh
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, Yigit, Ferruh



> On Wed, 2021-09-29 at 10:20 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > > > > In current DPDK framework, each RX
> > > > > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > This patch introduces shared RX
> > > > > > > > > > > > > > > > > > queue.
> > > > > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > > configuration in a switch domain
> > > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > > > > RX queue set by specifying sharing
> > > > > > > > > > > > > > > > > > group.
> > > > > > > > > > > > > > > > > > Polling any queue using same shared
> > > > > > > > > > > > > > > > > > RX
> > > > > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Share RX queue is supposed to be
> > > > > > > > > > > > > > > > > > polled
> > > > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Multiple groups is supported by group
> > > > > > > > > > > > > > > > > > ID.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > > > > so can this name be changed
> > > > > > > > > > > > > > > > > specifically to
> > > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If it is for a generic case, how the
> > > > > > > > > > > > > > > > > flow
> > > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > > > > question.
> > > > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload,
> > > > > > > > > > > > > > > multiple
> > > > > > > > > > > > > > > ethdev receive queues land into
> > > > > > > the same
> > > > > > > > > > > > > > > receive queue, In that case, how the flow
> > > > > > > > > > > > > > > order
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I guess the question is testpmd forward stream?
> > > > > > > > > > > > > > The forwarding logic has to be changed slightly
> > > > > > > > > > > > > > in case of shared rxq.
> > > > > > > > > > > > > > basically for each packet in rx_burst result,
> > > > > > > > > > > > > > lookup source stream according to mbuf->port,
> > > > > > > > > > > > > > forwarding to target fs.
> > > > > > > > > > > > > > Packets from same source port could be grouped as
> > > > > > > > > > > > > > a small burst to process, this will accelerate
> > > > > > > > > > > > > > the performance if traffic comes from limited
> > > > > > > > > > > > > > ports. I'll introduce some common api to do
> > > > > > > > > > > > > > shared rxq forwarding, call it with a packet
> > > > > > > > > > > > > > handling callback, so it suits all forwarding
> > > > > > > > > > > > > > engines. Will send patches soon.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > All ports will put the packets in to the same
> > > > > > > > > > > > > queue
> > > > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > > > this means only single core will poll only,
> > > > > > > > > > > > > what
> > > > > > > > > > > > > will
> > > > > > > > > > > > > happen if there are
> > > > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > > > >
> > > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > > the solution, can't this work in a transparent
> > > > > > > > > > > > > way
> > > > > > > > > > > > > to
> > > > > > > > > > > > > the application?
> > > > > > > > > > > >
> > > > > > > > > > > > Discussed with Jerin, new API introduced in v3
> > > > > > > > > > > > 2/8
> > > > > > > > > > > > that
> > > > > > > > > > > > aggregate ports
> > > > > > > > > > > > in same group into one new port. Users could
> > > > > > > > > > > > schedule
> > > > > > > > > > > > polling on the
> > > > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > > > >
> > > > > > > > > > > The v3 still has testpmd changes in fastpath.
> > > > > > > > > > > Right?
> > > > > > > > > > > IMO,
> > > > > > > > > > > For this
> > > > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > > > application. Instead, testpmd can use aggregated
> > > > > > > > > > > ports
> > > > > > > > > > > probably as
> > > > > > > > > > > separate fwd_engine to show how to use this
> > > > > > > > > > > feature.
> > > > > > > > > >
> > > > > > > > > > Good point to discuss :) There are two strategies to
> > > > > > > > > > polling
> > > > > > > > > > a shared
> > > > > > > > > > Rxq:
> > > > > > > > > > 1. polling each member port
> > > > > > > > > >    All forwarding engines can be reused to work as
> > > > > > > > > > before.
> > > > > > > > > >    My testpmd patches are efforts towards this
> > > > > > > > > > direction.
> > > > > > > > > >    Does your PMD support this?
> > > > > > > > >
> > > > > > > > > Not unfortunately. More than that, every application
> > > > > > > > > needs
> > > > > > > > > to
> > > > > > > > > change
> > > > > > > > > to support this model.
> > > > > > > >
> > > > > > > > Both strategies need user application to resolve port ID
> > > > > > > > from
> > > > > > > > mbuf and
> > > > > > > > process accordingly.
> > > > > > > > This one doesn't demand aggregated port, no polling
> > > > > > > > schedule
> > > > > > > > change.
> > > > > > >
> > > > > > > I was thinking, mbuf will be updated from driver/aggregator
> > > > > > > port as
> > > > > > > when it
> > > > > > > comes to application.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > 2. polling aggregated port
> > > > > > > > > >    Besides forwarding engine, need more work to demo it.
> > > > > > > > > >    This is an optional API, not supported by my PMD
> > > > > > > > > > yet.
> > > > > > > > >
> > > > > > > > > We are thinking of implementing this PMD when it comes
> > > > > > > > > to
> > > > > > > > > it,
> > > > > > > > > ie.
> > > > > > > > > without application change in fastpath
> > > > > > > > > logic.
> > > > > > > >
> > > > > > > > Fastpath has to resolve port ID anyway and forward according
> > > > > > > > to logic. Forwarding engines need to adapt to support shared
> > > > > > > > Rxq. Fortunately, in testpmd, this can be done with an
> > > > > > > > abstract API.
> > > > > > > >
> > > > > > > > Let's defer part 2 until some PMD really support it and
> > > > > > > > tested,
> > > > > > > > how do
> > > > > > > > you think?
> > > > > > >
> > > > > > > We are not planning to use this feature so either way it is
> > > > > > > OK to me.
> > > > > > > I leave it to ethdev maintainers to decide between 1 vs 2.
> > > > > > >
> > > > > > > I do have a strong opinion not changing the testpmd basic
> > > > > > > forward engines for this feature. I would like to keep it
> > > > > > > simple as fastpath optimized and would like to add a separate
> > > > > > > forwarding engine as means to verify this feature.
> > > > > >
> > > > > > +1 to that.
> > > > > > I don't think it a 'common' feature.
> > > > > > So separate FWD mode seems like a best choice to me.
> > > > >
> > > > > -1 :)
> > > > > There was some internal requirement from test team, they need
> > > > > to
> > > > > verify
> > > > > all features like packet content, rss, vlan, checksum,
> > > > > rte_flow...
> > > > > to
> > > > > be working based on shared rx queue.
> > > >
> > > > Then I suppose you'll need to write a really comprehensive
> > > > fwd-engine to satisfy your test team :)
> > > > Speaking seriously, I still don't understand why you need all
> > > > available fwd-engines to verify this feature.
> > >
> > > The shared Rxq is a low level feature; we need to make sure the
> > > driver's higher level features work properly. fwd-engines like csum
> > > check the input packet and enable L3/L4 checksum and tunnel offloads
> > > accordingly, other engines do their own feature verification. All test
> > > automation could be reused with these engines supported seamlessly.
> > >
> > > > From what I understand, the main purpose of your changes to test-pmd:
> > > > allow to fwd packets through different fwd_streams (TX through
> > > > different HW queues).
> > >
> > > Yes, each mbuf in a burst may come from a different port. Testpmd's
> > > current fwd-engines rely heavily on the source forwarding stream,
> > > that's why the patch divides burst result mbufs into sub-bursts and
> > > uses the original fwd-engine callback to handle them. How to handle
> > > is not changed.
> > >
> > > > In theory, if implemented in a generic and extendable way - that
> > > > might be a useful add-on to testpmd fwd functionality.
> > > > But the current implementation looks very case specific.
> > > > And as I don't think it is a common case, I don't see much point to
> > > > pollute basic fwd cases with it.
> > >
> > > Shared Rxq is an ethdev feature that impacts how packets get
> > > handled. It's natural to update forwarding engines to avoid breakage.
> >
> > Why is that?
> > All it affects is the way you RX the packets.
> > So why do *all* FWD engines have to be updated?
> 
> People will ask why some FWD engines can't work?

It can be documented: which fwd engines are supposed to work properly
with this feature, and which are not.
BTW, as I understand, as long as RX queues are properly assigned to lcores,
any fwd engine will continue to work.
Just for engines that are not aware of this feature, packets can be sent
out via the wrong TX queue.
 
> > Let's say, what specifically are you going to test with macswap vs
> > macfwd mode for that feature?
> 
> If people want to test the NIC with a real switch, or make sure the L2
> layer does not get corrupted.

I understand that; what I am saying is that you probably don't need both to test this feature.

> > I still think one specific FWD engine is enough to cover the majority
> > of test cases.
> 
> Yes, rxonly should be sufficient to verify the fundamentals, but to
> verify csum, timing, others are needed. Some back-to-back test systems
> need io forwarding, real switch deployments need macswap...

Ok, but nothing stops you from picking, for your purposes, a FWD engine with the most
comprehensive functionality: macswap, 5tswap, even 5tswap + csum update ....

> 
> 
> >
> > > The new macro is introduced to minimize performance impact, I'm also
> > > wondering whether there is an elegant solution :)
> >
> > I think Jerin suggested a good alternative with eventdev.
> > As another approach - one might consider adding an RX callback that
> > will return packets only for one particular port (while keeping
> > packets for other ports cached internally).
> 
> This and the aggregate port API could be options in the ethdev layer
> later. It can't be the fundamental approach due to performance loss and
> potential cache misses.
> 
> > As a 'wild' thought - change testpmd fwd logic to allow multiple TX
> > queues
> > per fwd_stream and add a function to do TX switching logic.
> > But that's probably quite a big change that needs a lot of work.
> >
> > > Current performance penalty
> > > is one "if unlikely" per burst.
> >
> > It is not only about performance impact.
> > It is about keeping test-pmd code simple and maintainable.
> > >
> > > Think in the reverse direction: if we don't update fwd-engines here,
> > > they all malfunction when shared rxq is enabled and users can't
> > > verify driver features, are you expecting this?
> >
> > I expect developers not to rewrite the whole test-pmd fwd code for
> > each new ethdev feature.
> 
> Here we just abstract duplicated code from fwd-engines, an improvement
> that keeps test-pmd code simple and maintainable.
> 
> > Especially for a feature that is not widely used.
> 
> Based on the huge memory saving, performance and latency gains, it will
> be popular with users.
> 
> But test-pmd is not critical to this feature; I'm OK to drop the
> fwd-engine support if you agree.

Not sure I fully understand you here...
If you are saying that you decided to demonstrate this feature via some
other app/example and prefer to abandon these changes in testpmd -
then, yes, I don't see any problems with that.

^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (3 preceding siblings ...)
  2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
@ 2021-09-30 14:55 ` Xueming Li
  2021-09-30 14:55   ` [dpdk-dev] [PATCH v4 1/6] " Xueming Li
                     ` (6 more replies)
  2021-10-11 12:37 ` [dpdk-dev] [PATCH v5 0/5] " Xueming Li
                   ` (11 subsequent siblings)
  16 siblings, 7 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-30 14:55 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin

In the current DPDK framework, all RX queues are pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss, high latency and low
throughput.

This patch introduces shared RX queue. PF and representors with the same
configuration in the same switch domain could share an RX queue set by
specifying the shared Rx queue offloading flag and sharing group.

All member ports of a shared Rx queue actually share one Rx queue, and
mbufs are pre-loaded only to that one queue, so memory is saved.

Polling any queue using the same shared RX queue receives packets from
all member ports. The source port is identified by mbuf->port.

Multiple groups are supported by group ID. Port queue numbers in a shared
group should be identical. Queue index is 1:1 mapped in a shared group.
An example of polling two share groups:
  core	group	queue
  0	0	0
  1	0	1
  2	0	2
  3	0	3
  4	1	0
  5	1	1
  6	1	2
  7	1	3

A shared RX queue must be polled on a single thread or core. If both PF0
and representor0 joined the same share group, pf0rxq0 can't be polled on
core1 and rep0rxq0 on core2. Actually, polling one port within a share
group is sufficient, since polling any port in the group will return
packets for any port in the group.
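
As an illustration only (not part of this series), a single-core polling
loop for one share group could dispatch on mbuf->port as below, where
MAX_PKT_BURST and handle_pkt() are placeholders for the application's
burst size and per-port processing:

	struct rte_mbuf *pkts[MAX_PKT_BURST];
	uint16_t i, nb;

	/* Poll one member port, e.g. pf0rxq0; the burst may contain
	 * packets of any port in the share group. */
	nb = rte_eth_rx_burst(pf_port_id, 0, pkts, MAX_PKT_BURST);
	for (i = 0; i < nb; i++)
		handle_pkt(pkts[i]->port, pkts[i]);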

v1:
  - initial version
v2:
  - add testpmd patches
v3:
  - change common forwarding api to macro for performance, thanks Jerin.
  - save global variable accessed in forwarding to flowstream to minimize
    cache miss
  - combined patches for each forwarding engine
  - support multiple groups in testpmd "--share-rxq" parameter
  - new api to aggerate shared rxq group
v4:
  - spelling fixes
  - remove shared-rxq support for all forwarding engines
  - add dedicate shared-rxq forwarding engine

Xueming Li (6):
  ethdev: introduce shared Rx queue
  ethdev: new API to aggregate shared Rx queue group
  app/testpmd: new parameter to enable shared Rx queue
  app/testpmd: dump port info for shared Rx queue
  app/testpmd: force shared Rx queue polled on same core
  app/testpmd: add forwarding engine for shared Rx queue

 app/test-pmd/config.c                         | 102 +++++++++++-
 app/test-pmd/meson.build                      |   1 +
 app/test-pmd/parameters.c                     |  13 ++
 app/test-pmd/shared_rxq_fwd.c                 | 148 ++++++++++++++++++
 app/test-pmd/testpmd.c                        |  23 ++-
 app/test-pmd/testpmd.h                        |   5 +
 app/test-pmd/util.c                           |   3 +
 doc/guides/nics/features.rst                  |  11 ++
 doc/guides/nics/features/default.ini          |   1 +
 .../prog_guide/switch_representation.rst      |  10 ++
 doc/guides/testpmd_app_ug/run_app.rst         |   8 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
 lib/ethdev/ethdev_driver.h                    |  23 ++-
 lib/ethdev/rte_ethdev.c                       |  23 +++
 lib/ethdev/rte_ethdev.h                       |  23 +++
 lib/ethdev/version.map                        |   3 +
 16 files changed, 398 insertions(+), 4 deletions(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 1/6] ethdev: introduce shared Rx queue
  2021-09-30 14:55 ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue Xueming Li
@ 2021-09-30 14:55   ` Xueming Li
  2021-10-11 10:47     ` Andrew Rybchenko
  2021-09-30 14:55   ` [dpdk-dev] [PATCH v4 2/6] ethdev: new API to aggregate shared Rx queue group Xueming Li
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-09-30 14:55 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin

In the current DPDK framework, each RX queue is pre-loaded with mbufs for
incoming packets. When the number of representors scales out in a switch
domain, the memory consumption becomes significant. Most importantly,
polling all ports leads to high cache miss, high latency and low
throughput.

This patch introduces shared RX queue. Ports with the same configuration
in a switch domain could share an RX queue set by specifying the sharing
group. Polling any queue using the same shared RX queue receives packets
from all member ports. The source port is identified by mbuf->port.

Port queue numbers in a shared group should be identical. Queue index is
1:1 mapped in the shared group.

A shared RX queue must be polled on a single thread or core.

Multiple groups are supported by group ID.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Cc: Jerin Jacob <jerinjacobk@gmail.com>
---
The Rx queue object could be used as a shared Rx queue object; it's
important to review all queue control callback APIs that use the queue
object:
  https://mails.dpdk.org/archives/dev/2021-July/215574.html
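
As a rough, non-normative configuration sketch (assuming port_id is a
member port that advertises the capability and mbuf_pool is an already
initialized mempool):

    struct rte_eth_conf conf = {0};
    struct rte_eth_rxconf rxq_conf;
    struct rte_eth_dev_info info;

    rte_eth_dev_info_get(port_id, &info);
    if (info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
        conf.rxmode.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
    rte_eth_dev_configure(port_id, 1, 1, &conf);

    rxq_conf = info.default_rxconf;
    rxq_conf.offloads = conf.rxmode.offloads;
    rxq_conf.shared_group = 0; /* same group on every member port */
    rte_eth_rx_queue_setup(port_id, 0, 512,
                           rte_eth_dev_socket_id(port_id),
                           &rxq_conf, mbuf_pool);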
---
 doc/guides/nics/features.rst                    | 11 +++++++++++
 doc/guides/nics/features/default.ini            |  1 +
 doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
 lib/ethdev/rte_ethdev.c                         |  1 +
 lib/ethdev/rte_ethdev.h                         |  7 +++++++
 5 files changed, 30 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 4fce8cd1c97..69bc1d5719c 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -626,6 +626,17 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same switch domain.
+
+* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 754184ddd4d..ebeb4c18512 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c806..bc7ce65fa3d 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- Memory usage of representors is huge when the number of representors grows,
+  because the PMD always allocates mbufs for each descriptor of the Rx queue.
+  Polling a large number of ports brings more CPU load, cache misses and
+  latency. A shared Rx queue can be used to share the Rx queue between the PF
+  and representors in the same switch. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ`` is
+  present in the Rx offloading capability of device info. Set the
+  offloading flag in the device Rx mode or Rx queue configuration to enable
+  the shared Rx queue. Polling any member port of the shared Rx queue returns
+  packets of all ports in the group; the port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 61aa49efec6..73270c10492 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -127,6 +127,7 @@ static const struct {
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
 	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index afdc53b674c..d7ac625ee74 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1077,6 +1077,7 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	uint32_t shared_group; /**< Shared port group index in switch domain. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1403,6 +1404,12 @@ struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
 #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
+/**
+ * The Rx queue is shared among ports in the same switch domain to save
+ * memory and avoid polling each port. Any port in the group can be used
+ * to receive packets. The real source port number is saved in mbuf->port.
+ */
+#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 2/6] ethdev: new API to aggregate shared Rx queue group
  2021-09-30 14:55 ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue Xueming Li
  2021-09-30 14:55   ` [dpdk-dev] [PATCH v4 1/6] " Xueming Li
@ 2021-09-30 14:55   ` Xueming Li
  2021-09-30 14:55   ` [dpdk-dev] [PATCH v4 3/6] app/testpmd: new parameter to enable shared Rx queue Xueming Li
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-30 14:55 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ray Kinsella

This patch introduces a new API to aggregate ports belonging to the same
shared Rx queue group. Only queues with the specified share group are
aggregated. Rx burst and device close are expected to be supported by the
new device.
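
A hedged usage sketch of the new call (member_port_id and the burst size
are illustrative only):

    struct rte_mbuf *burst[32];
    uint16_t nb, agg_port;

    /* Aggregate share group 0 into one pollable port. */
    agg_port = rte_eth_shared_rxq_aggregate(member_port_id, 0);
    if (agg_port == UINT16_MAX)
        rte_exit(EXIT_FAILURE, "shared rxq aggregation failed\n");
    /* Poll only the aggregated port; mbuf->port still identifies the
     * real member port of each packet. */
    nb = rte_eth_rx_burst(agg_port, 0, burst, 32);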

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 lib/ethdev/ethdev_driver.h | 23 ++++++++++++++++++++++-
 lib/ethdev/rte_ethdev.c    | 22 ++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h    | 16 ++++++++++++++++
 lib/ethdev/version.map     |  3 +++
 4 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 2f0fd3516d8..88c7bd3f698 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -782,10 +782,28 @@ typedef int (*eth_get_monitor_addr_t)(void *rxq,
  * @return
  *   Negative errno value on error, number of info entries otherwise.
  */
-
 typedef int (*eth_representor_info_get_t)(struct rte_eth_dev *dev,
 	struct rte_eth_representor_info *info);
 
+/**
+ * @internal
+ * Aggregate shared Rx queue.
+ *
+ * Create a new port used for shared Rx queue polling.
+ *
+ * Only queues with specified share group are aggregated.
+ * At least Rx burst and device close should be supported.
+ *
+ * @param dev
+ *   Ethdev handle of port.
+ * @param group
+ *   Shared Rx queue group to aggregate.
+ * @return
+ *   UINT16_MAX if failed, otherwise aggregated port number.
+ */
+typedef uint16_t (*eth_shared_rxq_aggregate_t)(struct rte_eth_dev *dev,
+					       uint32_t group);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -946,6 +964,9 @@ struct eth_dev_ops {
 
 	eth_representor_info_get_t representor_info_get;
 	/**< Get representor info. */
+
+	eth_shared_rxq_aggregate_t shared_rxq_aggregate;
+	/**< Aggregate shared Rx queue. */
 };
 
 /**
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 73270c10492..d78b50e1fa7 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -6297,6 +6297,28 @@ rte_eth_representor_info_get(uint16_t port_id,
 	return eth_err(port_id, (*dev->dev_ops->representor_info_get)(dev, info));
 }
 
+uint16_t
+rte_eth_shared_rxq_aggregate(uint16_t port_id, uint32_t group)
+{
+	struct rte_eth_dev *dev;
+	uint64_t offloads;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->shared_rxq_aggregate,
+				UINT16_MAX);
+
+	offloads = dev->data->dev_conf.rxmode.offloads;
+	if ((offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) == 0) {
+		RTE_ETHDEV_LOG(ERR, "port_id=%u doesn't support Rx offload\n",
+			       port_id);
+		return UINT16_MAX;
+	}
+
+	return (*dev->dev_ops->shared_rxq_aggregate)(dev, group);
+}
+
 RTE_LOG_REGISTER_DEFAULT(rte_eth_dev_logtype, INFO);
 
 RTE_INIT(ethdev_init_telemetry)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index d7ac625ee74..b94f7ba5a3f 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -4909,6 +4909,22 @@ __rte_experimental
 int rte_eth_representor_info_get(uint16_t port_id,
 				 struct rte_eth_representor_info *info);
 
+/**
+ * Aggregate shared Rx queue ports to one port for polling.
+ *
+ * Only queues with specified share group are aggregated.
+ * Any operation besides Rx burst and device close is unexpected.
+ *
+ * @param port_id
+ *   The port identifier of the device from shared Rx queue group.
+ * @param group
+ *   Shared Rx queue group to aggregate.
+ * @return
+ *   UINT16_MAX if failed, otherwise aggregated port number.
+ */
+__rte_experimental
+uint16_t rte_eth_shared_rxq_aggregate(uint16_t port_id, uint32_t group);
+
 #include <rte_ethdev_core.h>
 
 /**
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 904bce6ea14..6f261bb923a 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -247,6 +247,9 @@ EXPERIMENTAL {
 	rte_mtr_meter_policy_delete;
 	rte_mtr_meter_policy_update;
 	rte_mtr_meter_policy_validate;
+
+	# added in 21.11
+	rte_eth_shared_rxq_aggregate;
 };
 
 INTERNAL {
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 3/6] app/testpmd: new parameter to enable shared Rx queue
  2021-09-30 14:55 ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue Xueming Li
  2021-09-30 14:55   ` [dpdk-dev] [PATCH v4 1/6] " Xueming Li
  2021-09-30 14:55   ` [dpdk-dev] [PATCH v4 2/6] ethdev: new API to aggregate shared Rx queue group Xueming Li
@ 2021-09-30 14:55   ` Xueming Li
  2021-09-30 14:56   ` [dpdk-dev] [PATCH v4 4/6] app/testpmd: dump port info for " Xueming Li
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-30 14:55 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

Adds "--rxq-share" parameter to enable shared rxq for each rxq.

Default shared rxq group 0 is used, RX queues in same switch domain
shares same rxq according to queue index.

Shared Rx queue is enabled only if device support offloading flag
RTE_ETH_RX_OFFLOAD_SHARED_RXQ.
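
For example (illustrative only), running "dpdk-testpmd ... -- --rxq-share=2"
asks testpmd to place ports into shared Rx queue groups of two, provided the
devices advertise RTE_ETH_RX_OFFLOAD_SHARED_RXQ; "--rxq-share" without a
value puts all ports into a single group.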

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c                 |  6 +++++-
 app/test-pmd/parameters.c             | 13 +++++++++++++
 app/test-pmd/testpmd.c                | 18 ++++++++++++++++++
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  7 +++++++
 5 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 9c66329e96e..6c7f9dee065 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2709,7 +2709,11 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+				printf(" share group=%u",
+				       rx_conf->shared_group);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 3f94a82e321..30dae326310 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -167,6 +167,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share: number of ports per shared rxq groups, defaults to MAX(1 group)\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -607,6 +608,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			2, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1271,6 +1273,17 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
+				if (optarg == NULL) {
+					rxq_share = UINT32_MAX;
+				} else {
+					n = atoi(optarg);
+					if (n >= 0)
+						rxq_share = (uint32_t)n;
+					else
+						rte_exit(EXIT_FAILURE, "rxq-share must be >= 0\n");
+				}
+			}
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 97ae52e17ec..417e92ade11 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -498,6 +498,11 @@ uint8_t record_core_cycles;
  */
 uint8_t record_burst_stats;
 
+/*
+ * Number of ports per shared Rx queue group, 0 disable.
+ */
+uint32_t rxq_share;
+
 unsigned int num_sockets = 0;
 unsigned int socket_ids[RTE_MAX_NUMA_NODES];
 
@@ -1506,6 +1511,11 @@ init_config_port_offloads(portid_t pid, uint32_t socket_id)
 		port->dev_conf.txmode.offloads &=
 			~DEV_TX_OFFLOAD_MBUF_FAST_FREE;
 
+	if (rxq_share > 0 &&
+	    (port->dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
+		port->dev_conf.rxmode.offloads |=
+				RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
+
 	/* Apply Rx offloads configuration */
 	for (i = 0; i < port->dev_info.max_rx_queues; i++)
 		port->rx_conf[i].offloads = port->dev_conf.rxmode.offloads;
@@ -3401,6 +3411,14 @@ rxtx_port_config(struct rte_port *port)
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.rx_offload_capa &
+		     RTE_ETH_RX_OFFLOAD_SHARED_RXQ)) {
+			offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
+			port->rx_conf[qid].shared_group = nb_ports / rxq_share;
+		}
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 5863b2f43f3..3dfaaad94c0 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -477,6 +477,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint32_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern int nb_flows_flowgen;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 640eadeff73..43c85959e0b 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -389,6 +389,13 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share=[X]``
+
+    Create queues in shared RX queue mode if the device supports it.
+    The group number grows every X ports; default is group 0 if X is not
+    specified. Only the "shared-rxq" forwarding engine is supposed to
+    resolve the source stream correctly.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 4/6] app/testpmd: dump port info for shared Rx queue
  2021-09-30 14:55 ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue Xueming Li
                     ` (2 preceding siblings ...)
  2021-09-30 14:55   ` [dpdk-dev] [PATCH v4 3/6] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-09-30 14:56   ` Xueming Li
  2021-09-30 14:56   ` [dpdk-dev] [PATCH v4 5/6] app/testpmd: force shared Rx queue polled on same core Xueming Li
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-30 14:56 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

In case of shared Rx queue, polling any member port returns mbufs for
all members. This patch dumps mbuf->port for each packet.
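
With verbose packet dumping enabled, each dumped packet line is then
prefixed with its real source, e.g. "port 1, " (example value), before the
usual Ethernet header summary.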

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/util.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 14a9a251fb9..4c07907c441 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -100,6 +100,9 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_flow_restore_info info = { 0, };
 
 		mb = pkts[i];
+		if (rxq_share > 0)
+			MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ",
+				  mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 5/6] app/testpmd: force shared Rx queue polled on same core
  2021-09-30 14:55 ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue Xueming Li
                     ` (3 preceding siblings ...)
  2021-09-30 14:56   ` [dpdk-dev] [PATCH v4 4/6] app/testpmd: dump port info for " Xueming Li
@ 2021-09-30 14:56   ` Xueming Li
  2021-09-30 14:56   ` [dpdk-dev] [PATCH v4 6/6] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  2021-10-11 11:49   ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce " Andrew Rybchenko
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-30 14:56 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

Shared rxqs in group zero share one Rx queue set. A shared Rx queue must
be polled from a single core.

Check and stop forwarding if a shared rxq is scheduled on multiple
cores.
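
For example (hypothetical layout), if PF rxq0 and representor rxq0 of share
group 0 land on lcore 1 and lcore 2 respectively, the check prints both
conflicting streams and suggests limiting forwarding cores via "--nb-cores"
before any packet is forwarded.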

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c  | 96 ++++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |  4 +-
 app/test-pmd/testpmd.h |  2 +
 3 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 6c7f9dee065..8bfa26570ba 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2885,6 +2885,102 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared rxq is scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, portid_t src_port,
+			   queueid_t src_rxq, lcoreid_t src_lc,
+			   uint32_t shared_group)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((rxq_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (fs->rx_queue != src_rxq)
+				continue;
+			if (rxq_conf->shared_group != shared_group)
+				continue;
+			printf("Shared RX queue group %u can't be scheduled on different cores:\n",
+			       shared_group);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("  please use --nb-cores=%hu to limit forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * A shared group must not be scheduled on different cores.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Update lcore info stream being scheduled. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((rxq_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check shared rxq not scheduled on remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, fs->rx_port,
+						       fs->rx_queue, lc_id,
+						       rxq_conf->shared_group))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 417e92ade11..cab4b36b046 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2241,10 +2241,12 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 3dfaaad94c0..f121a2da90c 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -144,6 +144,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -795,6 +796,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 6/6] app/testpmd: add forwarding engine for shared Rx queue
  2021-09-30 14:55 ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue Xueming Li
                     ` (4 preceding siblings ...)
  2021-09-30 14:56   ` [dpdk-dev] [PATCH v4 5/6] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-09-30 14:56   ` Xueming Li
  2021-10-11 11:49   ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce " Andrew Rybchenko
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-09-30 14:56 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

To support shared Rx queue, this patch introduces a dedicated forwarding
engine. The engine groups received packets by mbuf->port into sub-bursts,
updates stream statistics and simply frees the packets.
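
As a worked example (hypothetical burst), a received burst with source
ports {2, 2, 5, 5, 2} is split into sub-bursts {2, 2}, {5, 5} and {2};
each sub-burst is credited to the stream whose Rx port matches and then
freed.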

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/meson.build                    |   1 +
 app/test-pmd/shared_rxq_fwd.c               | 148 ++++++++++++++++++++
 app/test-pmd/testpmd.c                      |   1 +
 app/test-pmd/testpmd.h                      |   1 +
 doc/guides/testpmd_app_ug/run_app.rst       |   1 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
 6 files changed, 156 insertions(+), 1 deletion(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 98f3289bdfa..07042e45b12 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -21,6 +21,7 @@ sources = files(
         'noisy_vnf.c',
         'parameters.c',
         'rxonly.c',
+        'shared_rxq_fwd.c',
         'testpmd.c',
         'txonly.c',
         'util.c',
diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-pmd/shared_rxq_fwd.c
new file mode 100644
index 00000000000..4e262b99bc7
--- /dev/null
+++ b/app/test-pmd/shared_rxq_fwd.c
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <stdarg.h>
+#include <string.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <sys/queue.h>
+#include <sys/stat.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_debug.h>
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memcpy.h>
+#include <rte_launch.h>
+#include <rte_eal.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_mbuf.h>
+#include <rte_pci.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_string_fns.h>
+#include <rte_ip.h>
+#include <rte_udp.h>
+#include <rte_net.h>
+#include <rte_flow.h>
+
+#include "testpmd.h"
+
+/*
+ * Rx only sub-burst forwarding.
+ */
+static void
+forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst)
+{
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
+}
+
+/**
+ * Get packet source stream by source port and queue.
+ * All streams of the same shared Rx queue are located on the same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_sub_burst(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		  struct rte_mbuf **pkts)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		forward_rx_only(nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * The source port of each packet is identified by mbuf->port.
+ */
+static void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst)
+{
+	uint16_t i, nb_sub_burst, port, last_port;
+
+	nb_sub_burst = 0;
+	last_port = pkts_burst[0]->port;
+	/* Locate sub-burst according to mbuf->port. */
+	for (i = 0; i < nb_rx - 1; ++i) {
+		rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (i > 0 && last_port != port) {
+			/* Forward packets with same source port. */
+			forward_sub_burst(fs, last_port, nb_sub_burst,
+					  &pkts_burst[i - nb_sub_burst]);
+			nb_sub_burst = 0;
+			last_port = port;
+		}
+		nb_sub_burst++;
+	}
+	/* Last sub-burst. */
+	nb_sub_burst++;
+	forward_sub_burst(fs, last_port, nb_sub_burst,
+			  &pkts_burst[nb_rx - nb_sub_burst]);
+}
+
+static void
+shared_rxq_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];
+	uint16_t nb_rx;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
+				 nb_pkt_per_burst);
+	inc_rx_burst_stats(fs, nb_rx);
+	if (unlikely(nb_rx == 0))
+		return;
+	forward_shared_rxq(fs, nb_rx, pkts_burst);
+	get_end_cycles(fs, start_tsc);
+}
+
+struct fwd_engine shared_rxq_engine = {
+	.fwd_mode_name  = "shared_rxq",
+	.port_fwd_begin = NULL,
+	.port_fwd_end   = NULL,
+	.packet_fwd     = shared_rxq_fwd,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index cab4b36b046..681ea591871 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -188,6 +188,7 @@ struct fwd_engine * fwd_engines[] = {
 #ifdef RTE_LIBRTE_IEEE1588
 	&ieee1588_fwd_engine,
 #endif
+	&shared_rxq_engine,
 	NULL,
 };
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f121a2da90c..f1fd607e365 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -299,6 +299,7 @@ extern struct fwd_engine five_tuple_swap_fwd_engine;
 #ifdef RTE_LIBRTE_IEEE1588
 extern struct fwd_engine ieee1588_fwd_engine;
 #endif
+extern struct fwd_engine shared_rxq_engine;
 
 extern struct fwd_engine * fwd_engines[]; /**< NULL terminated array. */
 extern cmdline_parse_inst_t cmd_set_raw;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 43c85959e0b..7b83d4b3944 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -252,6 +252,7 @@ The command line options are:
        tm
        noisy
        5tswap
+       shared-rxq
 
 *   ``--rss-ip``
 
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index bbef7063741..a30e2f4dfae 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -314,7 +314,7 @@ set fwd
 Set the packet forwarding mode::
 
    testpmd> set fwd (io|mac|macswap|flowgen| \
-                     rxonly|txonly|csum|icmpecho|noisy|5tswap) (""|retry)
+                     rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
 
 ``retry`` can be specified for forwarding engines except ``rx_only``.
 
@@ -357,6 +357,9 @@ The available information categories are:
 
   L4 swaps the source port and destination port of transport layer (TCP and UDP).
 
+* ``shared-rxq``: Receive only for shared Rx queue.
+  Resolve the packet source port from the mbuf and update stream statistics accordingly.
+
 Example::
 
    testpmd> set fwd rxonly
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-30  9:59                                     ` Ananyev, Konstantin
@ 2021-10-06  7:54                                       ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-06  7:54 UTC (permalink / raw)
  To: jerinjacobk, Raslan Darawsheh, konstantin.ananyev
  Cc: NBU-Contact-Thomas Monjalon, andrew.rybchenko, dev, ferruh.yigit

On Thu, 2021-09-30 at 09:59 +0000, Ananyev, Konstantin wrote:


On Wed, 2021-09-29 at 10:20 +0000, Ananyev, Konstantin wrote:

In current DPDK framework, each RX queue is pre-loaded with mbufs for
incoming packets. When number of representors scale out in a switch
domain, the memory consumption became significant. Most important,
polling all ports leads to high cache miss, high latency and low
throughput.

This patch introduces shared RX queue. Ports with same configuration in
a switch domain could share RX queue set by specifying sharing group.
Polling any queue using same shared RX queue receives packets from all
member ports. Source port is identified by mbuf->port.

Port queue number in a shared group should be identical. Queue index is
1:1 mapped in shared group.

Share RX queue is supposed to be polled on same thread.

Multiple groups is supported by group ID.

Is this offload specific to the representor? If so can this name be
changed specifically to representor?

Yes, PF and representor in switch domain could take advantage.

If it is for a generic case, how the flow ordering will be maintained?

Not quite sure that I understood your question. The control path of is
almost same as before, PF and representor port still needed, rte flows
not impacted. Queues still needed for each member port,
descriptors(mbuf) will be supplied from shared Rx queue in my PMD
implementation.

My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ
offload, multiple ethdev receive queues land into the same receive
queue, In that case, how the flow order is maintained for respective
receive queues.

I guess the question is testpmd forward stream? The forwarding logic
has to be changed slightly in case of shared rxq. basically for each
packet in rx_burst result, lookup source stream according to
mbuf->port, forwarding to target fs. Packets from same source port
could be grouped as a small burst to process, this will accelerates the
performance if traffic come from limited ports. I'll introduce some
common api to do shard rxq forwarding, call it with packets handling
callback, so it suites for all forwarding engine. Will sent patches
soon.

All ports will put the packets in to the same queue (share queue),
right? Does this means only single core will poll only, what will
happen if there are multiple cores polling, won't it cause problem?

And if this requires specific changes in the application, I am not sure
about the solution, can't this work in a transparent way to the
application?

Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
in same group into one new port. Users could schedule polling on the
aggregated port instead of all member ports.

The v3 still has testpmd changes in fastpath. Right? IMO, For this
feature, we should not change fastpath of testpmd application. Instead,
testpmd can use aggregated ports probably as separate fwd_engine to
show how to use this feature.

Good point to discuss :) There are two strategies to polling a shared
Rxq:
1. polling each member port
   All forwarding engines can be reused to work as before.
   My testpmd patches are efforts towards this direction.
   Does your PMD support this?

Not unfortunately. More than that, every application needs to change to
support this model.

Both strategies need user application to resolve port ID from mbuf and
process accordingly. This one doesn't demand aggregated port, no
polling schedule change.

I was thinking, mbuf will be updated from driver/aggregator port as
when it comes to application.

2. polling aggregated port
   Besides forwarding engine, need more work to to demo it.
   This is an optional API, not supported by my PMD yet.

We are thinking of implementing this PMD when it comes to it, ie.
without application change in fastpath logic.

Fastpath have to resolve port ID anyway and forwarding according to
logic. Forwarding engine need to adapt to support shard Rxq.
Fortunately, in testpmd, this can be done with an abstract API.

Let's defer part 2 until some PMD really support it and tested, how do
you think?

We are not planning to use this feature so either way it is OK to me.
I leave to ethdev maintainers decide between 1 vs 2.

I do have a strong opinion not changing the testpmd basic forward
engines for this feature. I would like to keep it simple as fastpath
optimized and would like to add a separate Forwarding engine as means
to verify this feature.

+1 to that. I don't think it a 'common' feature. So separate FWD mode
seems like a best choice to me.

-1 :) There was some internal requirement from test team, they need to
verify all features like packet content, rss, vlan, checksum,
rte_flow... to be working based on shared rx queue.

Then I suppose you'll need to write really comprehensive fwd-engine to
satisfy your test team :) Speaking seriously, I still don't understand
why do you need all available fwd-engines to verify this feature.

The shared Rxq is low level feature, need to make sure driver higher
level features working properly. fwd-engines like csum checks input
packet and enable L3/L4 checksum and tunnel offloads accordingly, other
engines do their own feature verification. All test automation could be
reused with these engines supported seamlessly.

From what I understand, main purpose of your changes to test-pmd: allow
to fwd packet though different fwd_stream (TX through different HW
queue).

Yes, each mbuf in burst come from differnt port, testpmd current
fwd-engines relies heavily on source forwarding stream, that's why the
patch devide burst result mbufs into sub-burst and use orginal
fwd-engine callback to handle. How to handle is not changed.

In theory, if implemented in generic and extendable way - that might be
a useful add-on to tespmd fwd functionality. But current implementation
looks very case specific. And as I don't think it is a common case, I
don't see much point to pollute basic fwd cases with it.

Shared Rxq is a ethdev feature that impacts how packets get handled.
It's natural to update forwarding engines to avoid broken.

Why is that? All it affects the way you RX the packets. So why *all*
FWD engines have to be updated?

People will ask why some FWD engine can't work?

It can be documented: which fwd engine supposed to work properly with
this feature, which not. BTW, as I understand, as long as RX queues are
properly assigned to lcores, any fwd engine will continue to work. Just
for engines that are not aware about this feature packets can be send
out via wrong TX queue.

Let say what specific you are going to test with macswap vs macfwd mode
for that feature?

If people want to test NIC with real switch, or make sure L2 layer not
get corrupted.

I understand that, what I am saying that you probably don't need both
to test this feature.

I still think one specific FWD engine is enough to cover majority of
test cases.

Yes, rxonly should be sufficient to verify the fundametal, but to
verify csum, timing, need others. Some back2back test system need io
forwarding, real switch depolyment need macswap...

Ok, but nothing stops you to pickup for your purposes a FWD engine with
most comprehensive functionality: macswap, 5tswap, even 5tswap+csum
update ....

The new macro is introduced to minimize performance impact, I'm also
wondering is there an elegant solution :)

I think Jerin suggested a good alternative with eventdev. As another
approach - might be consider to add an RX callback that will return
packets only for one particular port (while keeping packets for other
ports cached internally).

This and the aggreate port API could be options in ethdev layer later.
It can't be the fundamental due performance loss and potential cache
miss.

As a 'wild' thought - change testpmd fwd logic to allow multiple TX
queues per fwd_stream and add a function to do TX switching logic. But
that's probably quite a big change that needs a lot of work.

Current performance penalty is one "if unlikely" per burst.

It is not only about performance impact. It is about keeping test-pmd
code simple and maintainable.

Think in reverse direction, if we don't update fwd-engines here, all
malfunction when shared rxq enabled, users can't verify driver
features, are you expecting this?

I expect developers not to rewrite whole test-pmd fwd code for each new
ethdev feature.

Here just abstract duplicated code from fwd-engines, an improvement,
keep test-pmd code simple and mantainable.

Specially for the feature that is not widely used.

Based on the huge memory saving, performance and latency gains, it will
be popular to users.

But the test-pmd is not critical to this feature, I'm ok to drop the
fwd-engine support if you agree.

Not sure fully understand you here... If you saying that you decided to
demonstrate this feature via some other app/example and prefer to
abandon these changes in testpmd - then, yes I don't see any problems
with that.

Hi Ananyev & Jerin,

New v4 posted with a dedicated fwd engine to demonstrate this feature, do you have time to check?

Thanks,
Xueming

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-09-29  8:05                                 ` Jerin Jacob
@ 2021-10-08  8:26                                   ` Xueming(Steven) Li
  2021-10-10  9:46                                     ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-08  8:26 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, Raslan Darawsheh, andrew.rybchenko,
	konstantin.ananyev, dev, ferruh.yigit

On Wed, 2021-09-29 at 13:35 +0530, Jerin Jacob wrote:
> On Wed, Sep 29, 2021 at 1:11 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > On Tue, 2021-09-28 at 20:29 +0530, Jerin Jacob wrote:
> > > On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > > > > > 
> > > > > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > 
> > > > > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > 
> > > > > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > > > > Monjalon
> > > > > > <thomas@monjalon.net>;
> > > > > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > > ethdev receive queues land into
> > > > > > the same
> > > > > > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > > > > target fs.
> > > > > > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > > > performance if traffic
> > > > > > come from
> > > > > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > > > > happen if there are
> > > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > > > 
> > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > > > > the application?
> > > > > > > > > > > 
> > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > > > > aggregate ports
> > > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > > polling on the
> > > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > > > 
> > > > > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > > > > For this
> > > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > > probably as
> > > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > > > 
> > > > > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > > > > a shared
> > > > > > > > > Rxq:
> > > > > > > > > 1. polling each member port
> > > > > > > > >    All forwarding engines can be reused to work as before.
> > > > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > > > >    Does your PMD support this?
> > > > > > > > 
> > > > > > > > Not unfortunately. More than that, every application needs to
> > > > > > > > change
> > > > > > > > to support this model.
> > > > > > > 
> > > > > > > Both strategies need user application to resolve port ID from
> > > > > > > mbuf and
> > > > > > > process accordingly.
> > > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > > change.
> > > > > > 
> > > > > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > > > > when it
> > > > > > comes to application.
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 2. polling aggregated port
> > > > > > > > >    Besides forwarding engine, need more work to to demo it.
> > > > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > > > 
> > > > > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > > > > ie.
> > > > > > > > without application change in fastpath
> > > > > > > > logic.
> > > > > > > 
> > > > > > > Fastpath have to resolve port ID anyway and forwarding according
> > > > > > > to
> > > > > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > > > 
> > > > > > > Let's defer part 2 until some PMD really support it and tested,
> > > > > > > how do
> > > > > > > you think?
> > > > > > 
> > > > > > We are not planning to use this feature so either way it is OK to
> > > > > > me.
> > > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > > > 
> > > > > > I do have a strong opinion not changing the testpmd basic forward
> > > > > > engines
> > > > > > for this feature.I would like to keep it simple as fastpath
> > > > > > optimized and would
> > > > > > like to add a separate Forwarding engine as means to verify this
> > > > > > feature.
> > > > > 
> > > > > +1 to that.
> > > > > I don't think it a 'common' feature.
> > > > > So separate FWD mode seems like a best choice to me.
> > > > 
> > > > -1 :)
> > > > There was some internal requirement from test team, they need to verify
> > > 
> 
> 
> > > Internal QA requirements may not be the driving factor :-)
> > 
> > It will be a test requirement for any driver to face, not internal. The
> > performance difference almost zero in v3, only an "unlikely if" test on
> > each burst. Shared Rxq is a low level feature, reusing all current FWD
> > engines to verify driver high level features is important IMHO.
> 
> In addition to additional if check, The real concern is polluting the
> common forward engine for the not common feature.

Okay, removed changes to common forward engines in v4, please check.

> 
> If you really want to reuse the existing application without any
> application change,
> I think, you need to hook this to eventdev
> http://code.dpdk.org/dpdk/latest/source/lib/eventdev/rte_eventdev.h#L34
> 
> Where eventdev drivers does this thing in addition to other features, Ie.
> t has ports (which is kind of aggregator),
> it can receive the packets from any queue with mbuf->port as actually
> received port.
> That is in terms of mapping:
> - event queue will be dummy it will be as same as Rx queue
> - Rx adapter will be also a dummy
> - event ports aggregate multiple queues and connect to core via event port
> - On Rxing the packet, mbuf->port will be the actual Port which is received.
> app/test-eventdev written to use this model.

Is this the optional aggregator API we discussed? It's already there,
patch 2/6.
I was trying to make the common forwarding engines perfect to support any
case, but since you all have concerns, that was removed in v4.


> 
> 
> 
> > 
> > > 
> > > > all features like packet content, rss, vlan, checksum, rte_flow... to
> > > > be working based on shared rx queue. Based on the patch, I believe the
> > > > impact has been minimized.
> > > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > > represontors? If so can't we
> > > > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > > > scope can reduce the
> > > > > > > > > > > > complexity it brings?
> > > > > > > > > > > > 
> > > > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > > > representor the case by changing its
> > > > > > name and
> > > > > > > > > > > > > > scope.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > > > apply.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > >  doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > >  7 +++++++
> > > > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each
> > > > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > > > +  Polling the large number of ports brings
> > > > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > > +  is present in Rx offloading capability of
> > > > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port
> > > > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > > > +  packets of all ports in group, port ID is
> > > > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > > > +       uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > > > >         /**
> > > > > > > > > > > > > > > > >          * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > > >          * Only offloads set on
> > > > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > > >                                  DEV_RX_OFFLO
> > > > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > 2.25.1


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-10-08  8:26                                   ` Xueming(Steven) Li
@ 2021-10-10  9:46                                     ` Jerin Jacob
  2021-10-10 13:40                                       ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Jerin Jacob @ 2021-10-10  9:46 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: NBU-Contact-Thomas Monjalon, Raslan Darawsheh, andrew.rybchenko,
	konstantin.ananyev, dev, ferruh.yigit

On Fri, Oct 8, 2021 at 1:56 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Wed, 2021-09-29 at 13:35 +0530, Jerin Jacob wrote:
> > On Wed, Sep 29, 2021 at 1:11 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > > On Tue, 2021-09-28 at 20:29 +0530, Jerin Jacob wrote:
> > > > On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > > > > > >
> > > > > > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > > > > > Monjalon
> > > > > > > <thomas@monjalon.net>;
> > > > > > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > > > ethdev receive queues land into
> > > > > > > the same
> > > > > > > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > > > > > target fs.
> > > > > > > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > > > > performance if traffic
> > > > > > > come from
> > > > > > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > > > > > happen if there are
> > > > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > > > >
> > > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > > > > > the application?
> > > > > > > > > > > >
> > > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > > > > > aggregate ports
> > > > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > > > polling on the
> > > > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > > > >
> > > > > > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > > > > > For this
> > > > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > > > probably as
> > > > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > > > >
> > > > > > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > > > > > a shared
> > > > > > > > > > Rxq:
> > > > > > > > > > 1. polling each member port
> > > > > > > > > >    All forwarding engines can be reused to work as before.
> > > > > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > > > > >    Does your PMD support this?
> > > > > > > > >
> > > > > > > > > Not unfortunately. More than that, every application needs to
> > > > > > > > > change
> > > > > > > > > to support this model.
> > > > > > > >
> > > > > > > > Both strategies need user application to resolve port ID from
> > > > > > > > mbuf and
> > > > > > > > process accordingly.
> > > > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > > > change.
> > > > > > >
> > > > > > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > > > > > when it
> > > > > > > comes to application.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > 2. polling aggregated port
> > > > > > > > > >    Besides forwarding engine, need more work to to demo it.
> > > > > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > > > >
> > > > > > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > > > > > ie.
> > > > > > > > > without application change in fastpath
> > > > > > > > > logic.
> > > > > > > >
> > > > > > > > Fastpath have to resolve port ID anyway and forwarding according
> > > > > > > > to
> > > > > > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > > > >
> > > > > > > > Let's defer part 2 until some PMD really support it and tested,
> > > > > > > > how do
> > > > > > > > you think?
> > > > > > >
> > > > > > > We are not planning to use this feature so either way it is OK to
> > > > > > > me.
> > > > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > > > >
> > > > > > > I do have a strong opinion not changing the testpmd basic forward
> > > > > > > engines
> > > > > > > for this feature.I would like to keep it simple as fastpath
> > > > > > > optimized and would
> > > > > > > like to add a separate Forwarding engine as means to verify this
> > > > > > > feature.
> > > > > >
> > > > > > +1 to that.
> > > > > > I don't think it a 'common' feature.
> > > > > > So separate FWD mode seems like a best choice to me.
> > > > >
> > > > > -1 :)
> > > > > There was some internal requirement from test team, they need to verify
> > > >
> >
> >
> > > > Internal QA requirements may not be the driving factor :-)
> > >
> > > It will be a test requirement for any driver to face, not internal. The
> > > performance difference almost zero in v3, only an "unlikely if" test on
> > > each burst. Shared Rxq is a low level feature, reusing all current FWD
> > > engines to verify driver high level features is important IMHO.
> >
> > In addition to additional if check, The real concern is polluting the
> > common forward engine for the not common feature.
>
> Okay, removed changes to common forward engines in v4, please check.

Thanks.

>
> >
> > If you really want to reuse the existing application without any
> > application change, I think you need to hook this into eventdev:
> > http://code.dpdk.org/dpdk/latest/source/lib/eventdev/rte_eventdev.h#L34
> > 
> > Eventdev drivers do this in addition to other features, i.e. an
> > eventdev has ports (which act as a kind of aggregator) and can receive
> > packets from any queue, with mbuf->port set to the port on which each
> > packet actually arrived. That is, in terms of mapping:
> > - an event queue will be a dummy, the same as an Rx queue
> > - the Rx adapter will also be a dummy
> > - event ports aggregate multiple queues and connect to a core via an
> >   event port
> > - on Rxing a packet, mbuf->port will be the actual port on which it
> >   was received.
> > app/test-eventdev is written to use this model.
>
> Is this the optional aggregator API we discussed? It is already there,
> patch 2/6.
> I was trying to make the common forwarding engines generic enough to
> support any case, but since you all have concerns, it was removed in v4.

The point was, if we take the eventdev Rx adapter path, this whole thing
can be implemented without adding any new APIs in ethdev, as similar
functionality is already supported by the ethdev-eventdev Rx adapter.
Now two things:

1) The aggregator API is not required; we would take the eventdev Rx
adapter route to implement it.
2) The other mode is also possible to implement with the eventdev Rx
adapter. So I leave it to the ethdev maintainers to decide whether this
path is required or not. No strong opinion on this.
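
To make the mapping quoted above concrete, a minimal illustrative
sketch (not from the thread) of the worker side under the eventdev Rx
adapter model; device and adapter setup are elided, and handle_pkt() is
a placeholder for application logic:

#include <rte_common.h>
#include <rte_eventdev.h>
#include <rte_mbuf.h>

/* App-defined packet handler; assumed, not a DPDK API. */
extern void handle_pkt(uint16_t src_port, struct rte_mbuf *m);

static void
worker_loop(uint8_t evdev_id, uint8_t ev_port)
{
	struct rte_event ev[32];
	uint16_t i, nb;

	for (;;) {
		/* one dequeue covers every aggregated Rx queue */
		nb = rte_event_dequeue_burst(evdev_id, ev_port, ev,
					     RTE_DIM(ev), 0);
		for (i = 0; i < nb; i++) {
			struct rte_mbuf *m = ev[i].mbuf;

			/* mbuf->port already holds the real source */
			handle_pkt(m->port, m);
		}
	}
}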



> > > > > all features like packet content, rss, vlan, checksum, rte_flow... to
> > > > > be working based on shared rx queue. Based on the patch, I believe the
> > > > > impact has been minimized.
> > > > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > > > represontors? If so can't we
> > > > > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > > > > scope can reduce the
> > > > > > > > > > > > > complexity it brings?
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > > > > representor the case by changing its
> > > > > > > name and
> > > > > > > > > > > > > > > scope.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > > > > apply.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > >  doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > >  7 +++++++
> > > > > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each
> > > > > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > > > > +  Polling the large number of ports brings
> > > > > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > > > +  is present in Rx offloading capability of
> > > > > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port
> > > > > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > > > > +  packets of all ports in group, port ID is
> > > > > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > > > > +       uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > > > > >         /**
> > > > > > > > > > > > > > > > > >          * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > > > >          * Only offloads set on
> > > > > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > > > >                                  DEV_RX_OFFLO
> > > > > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > 2.25.1

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-10-10  9:46                                     ` Jerin Jacob
@ 2021-10-10 13:40                                       ` Xueming(Steven) Li
  2021-10-11  4:10                                         ` Jerin Jacob
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-10 13:40 UTC (permalink / raw)
  To: jerinjacobk
  Cc: NBU-Contact-Thomas Monjalon, Raslan Darawsheh, andrew.rybchenko,
	konstantin.ananyev, dev, ferruh.yigit

On Sun, 2021-10-10 at 15:16 +0530, Jerin Jacob wrote:
> On Fri, Oct 8, 2021 at 1:56 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > On Wed, 2021-09-29 at 13:35 +0530, Jerin Jacob wrote:
> > > On Wed, Sep 29, 2021 at 1:11 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > On Tue, 2021-09-28 at 20:29 +0530, Jerin Jacob wrote:
> > > > > On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > > > > > > > 
> > > > > > > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > 
> > > > > > > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > > > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > > > > > > Monjalon
> > > > > > > > <thomas@monjalon.net>;
> > > > > > > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > > > > ethdev receive queues land into
> > > > > > > > the same
> > > > > > > > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > > > > > > target fs.
> > > > > > > > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > > > > > performance if traffic
> > > > > > > > come from
> > > > > > > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > > > > > > happen if there are
> > > > > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > > > > > > the application?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > > > > > > aggregate ports
> > > > > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > > > > polling on the
> > > > > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > > > > > 
> > > > > > > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > > > > > > For this
> > > > > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > > > > probably as
> > > > > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > > > > > 
> > > > > > > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > > > > > > a shared
> > > > > > > > > > > Rxq:
> > > > > > > > > > > 1. polling each member port
> > > > > > > > > > >    All forwarding engines can be reused to work as before.
> > > > > > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > > > > > >    Does your PMD support this?
> > > > > > > > > > 
> > > > > > > > > > Not unfortunately. More than that, every application needs to
> > > > > > > > > > change
> > > > > > > > > > to support this model.
> > > > > > > > > 
> > > > > > > > > Both strategies need user application to resolve port ID from
> > > > > > > > > mbuf and
> > > > > > > > > process accordingly.
> > > > > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > > > > change.
> > > > > > > > 
> > > > > > > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > > > > > > when it
> > > > > > > > comes to application.
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 2. polling aggregated port
> > > > > > > > > > >    Besides forwarding engine, need more work to to demo it.
> > > > > > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > > > > > 
> > > > > > > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > > > > > > ie.
> > > > > > > > > > without application change in fastpath
> > > > > > > > > > logic.
> > > > > > > > > 
> > > > > > > > > Fastpath have to resolve port ID anyway and forwarding according
> > > > > > > > > to
> > > > > > > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > > > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > > > > > 
> > > > > > > > > Let's defer part 2 until some PMD really support it and tested,
> > > > > > > > > how do
> > > > > > > > > you think?
> > > > > > > > 
> > > > > > > > We are not planning to use this feature so either way it is OK to
> > > > > > > > me.
> > > > > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > > > > > 
> > > > > > > > I do have a strong opinion not changing the testpmd basic forward
> > > > > > > > engines
> > > > > > > > for this feature.I would like to keep it simple as fastpath
> > > > > > > > optimized and would
> > > > > > > > like to add a separate Forwarding engine as means to verify this
> > > > > > > > feature.
> > > > > > > 
> > > > > > > +1 to that.
> > > > > > > I don't think it a 'common' feature.
> > > > > > > So separate FWD mode seems like a best choice to me.
> > > > > > 
> > > > > > -1 :)
> > > > > > There was some internal requirement from test team, they need to verify
> > > > > 
> > > 
> > > 
> > > > > Internal QA requirements may not be the driving factor :-)
> > > > 
> > > > It will be a test requirement for any driver to face, not internal. The
> > > > performance difference almost zero in v3, only an "unlikely if" test on
> > > > each burst. Shared Rxq is a low level feature, reusing all current FWD
> > > > engines to verify driver high level features is important IMHO.
> > > 
> > > In addition to additional if check, The real concern is polluting the
> > > common forward engine for the not common feature.
> > 
> > Okay, removed changes to common forward engines in v4, please check.
> 
> Thanks.
> 
> > 
> > > 
> > > If you really want to reuse the existing application without any
> > > application change, I think you need to hook this into eventdev:
> > > http://code.dpdk.org/dpdk/latest/source/lib/eventdev/rte_eventdev.h#L34
> > > 
> > > Eventdev drivers do this in addition to other features, i.e. an
> > > eventdev has ports (which act as a kind of aggregator) and can receive
> > > packets from any queue, with mbuf->port set to the port on which each
> > > packet actually arrived. That is, in terms of mapping:
> > > - an event queue will be a dummy, the same as an Rx queue
> > > - the Rx adapter will also be a dummy
> > > - event ports aggregate multiple queues and connect to a core via an
> > >   event port
> > > - on Rxing a packet, mbuf->port will be the actual port on which it
> > >   was received.
> > > app/test-eventdev is written to use this model.
> > 
> > Is this the optional aggregator API we discussed? It is already there,
> > patch 2/6.
> > I was trying to make the common forwarding engines generic enough to
> > support any case, but since you all have concerns, it was removed in v4.
> 
> The point was, if we take the eventdev Rx adapter path, this whole thing
> can be implemented without adding any new APIs in ethdev, as similar
> functionality is already supported by the ethdev-eventdev Rx adapter.
> Now two things:
> 
> 1) The aggregator API is not required; we would take the eventdev Rx
> adapter route to implement it.
> 2) The other mode is also possible to implement with the eventdev Rx
> adapter. So I leave it to the ethdev maintainers to decide whether this
> path is required or not. No strong opinion on this.

It seems you are the expert on eventdev; is this the Rx burst API?
rte_event_dequeue_burst(dev_id, port_id, ev[], nb_events, timeout)

Two concerns from the user perspective:
1. Going through the ethdev-eventdev wrapper impacts performance.
2. For a user application like OVS, having to switch to the event API
just because shared Rxq is enabled looks strange.

Maybe I missed something?

There should be more feedback and ideas on how to aggregate ports once
the fundamentals (offload bit and group) start to work, so I agree to
remove the aggregator API for now.
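
For context, this is roughly what the first strategy (polling a member
port directly and resolving the source from the mbuf) looks like on the
application side; a minimal sketch, with forward_one() as a placeholder
for per-stream application logic:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* App-defined per-stream handler; assumed, not a DPDK API. */
extern void forward_one(uint16_t src_port, struct rte_mbuf *m);

static void
poll_shared_rxq(uint16_t member_port, uint16_t queue_id)
{
	struct rte_mbuf *pkts[32];
	uint16_t i, nb;

	/* one member port returns packets of the whole share group */
	nb = rte_eth_rx_burst(member_port, queue_id, pkts, 32);
	for (i = 0; i < nb; i++)
		forward_one(pkts[i]->port, pkts[i]);
}

Grouping packets with the same mbuf->port into sub-bursts, as discussed
earlier in the thread, would be a further optimization.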

> > > > > > all features like packet content, rss, vlan, checksum, rte_flow... to
> > > > > > be working based on shared rx queue. Based on the patch, I believe the
> > > > > > impact has been minimized.
> > > > > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > > > > represontors? If so can't we
> > > > > > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > > > > > scope can reduce the
> > > > > > > > > > > > > > complexity it brings?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > > > > > representor the case by changing its
> > > > > > > > name and
> > > > > > > > > > > > > > > > scope.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > > > > > apply.
> > > > > > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > >  doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > >  7 +++++++
> > > > > > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each
> > > > > > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > > > > > +  Polling the large number of ports brings
> > > > > > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > > > > +  is present in Rx offloading capability of
> > > > > > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port
> > > > > > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > > > > > +  packets of all ports in group, port ID is
> > > > > > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > > > > > +       uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > > > > > >         /**
> > > > > > > > > > > > > > > > > > >          * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > > > > >          * Only offloads set on
> > > > > > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > > > > >                                  DEV_RX_OFFLO
> > > > > > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > > > > > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
  2021-10-10 13:40                                       ` Xueming(Steven) Li
@ 2021-10-11  4:10                                         ` Jerin Jacob
  0 siblings, 0 replies; 266+ messages in thread
From: Jerin Jacob @ 2021-10-11  4:10 UTC (permalink / raw)
  To: Xueming(Steven) Li
  Cc: NBU-Contact-Thomas Monjalon, Raslan Darawsheh, andrew.rybchenko,
	konstantin.ananyev, dev, ferruh.yigit

On Sun, Oct 10, 2021 at 7:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Sun, 2021-10-10 at 15:16 +0530, Jerin Jacob wrote:
> > On Fri, Oct 8, 2021 at 1:56 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > > On Wed, 2021-09-29 at 13:35 +0530, Jerin Jacob wrote:
> > > > On Wed, Sep 29, 2021 at 1:11 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > On Tue, 2021-09-28 at 20:29 +0530, Jerin Jacob wrote:
> > > > > > On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > > > > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > > > > > > > Monjalon
> > > > > > > > > <thomas@monjalon.net>;
> > > > > > > > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > > > > > ethdev receive queues land into
> > > > > > > > > the same
> > > > > > > > > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > > > > > > > target fs.
> > > > > > > > > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > > > > > > performance if traffic
> > > > > > > > > come from
> > > > > > > > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > > > > > > > happen if there are
> > > > > > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > > > > > > > the application?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > > > > > > > aggregate ports
> > > > > > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > > > > > polling on the
> > > > > > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > > > > > > > For this
> > > > > > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > > > > > probably as
> > > > > > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > > > > > >
> > > > > > > > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > > > > > > > a shared
> > > > > > > > > > > > Rxq:
> > > > > > > > > > > > 1. polling each member port
> > > > > > > > > > > >    All forwarding engines can be reused to work as before.
> > > > > > > > > > > >    My testpmd patches are efforts towards this direction.
> > > > > > > > > > > >    Does your PMD support this?
> > > > > > > > > > >
> > > > > > > > > > > Not unfortunately. More than that, every application needs to
> > > > > > > > > > > change
> > > > > > > > > > > to support this model.
> > > > > > > > > >
> > > > > > > > > > Both strategies need user application to resolve port ID from
> > > > > > > > > > mbuf and
> > > > > > > > > > process accordingly.
> > > > > > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > > > > > change.
> > > > > > > > >
> > > > > > > > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > > > > > > > when it
> > > > > > > > > comes to application.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > 2. polling aggregated port
> > > > > > > > > > > >    Besides forwarding engine, need more work to to demo it.
> > > > > > > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > > > > > >
> > > > > > > > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > > > > > > > ie.
> > > > > > > > > > > without application change in fastpath
> > > > > > > > > > > logic.
> > > > > > > > > >
> > > > > > > > > > Fastpath have to resolve port ID anyway and forwarding according
> > > > > > > > > > to
> > > > > > > > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > > > > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > > > > > >
> > > > > > > > > > Let's defer part 2 until some PMD really support it and tested,
> > > > > > > > > > how do
> > > > > > > > > > you think?
> > > > > > > > >
> > > > > > > > > We are not planning to use this feature so either way it is OK to
> > > > > > > > > me.
> > > > > > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > > > > > >
> > > > > > > > > I do have a strong opinion not changing the testpmd basic forward
> > > > > > > > > engines
> > > > > > > > > for this feature.I would like to keep it simple as fastpath
> > > > > > > > > optimized and would
> > > > > > > > > like to add a separate Forwarding engine as means to verify this
> > > > > > > > > feature.
> > > > > > > >
> > > > > > > > +1 to that.
> > > > > > > > I don't think it a 'common' feature.
> > > > > > > > So separate FWD mode seems like a best choice to me.
> > > > > > >
> > > > > > > -1 :)
> > > > > > > There was some internal requirement from test team, they need to verify
> > > > > >
> > > >
> > > >
> > > > > > Internal QA requirements may not be the driving factor :-)
> > > > >
> > > > > It will be a test requirement for any driver to face, not internal. The
> > > > > performance difference almost zero in v3, only an "unlikely if" test on
> > > > > each burst. Shared Rxq is a low level feature, reusing all current FWD
> > > > > engines to verify driver high level features is important IMHO.
> > > >
> > > > In addition to additional if check, The real concern is polluting the
> > > > common forward engine for the not common feature.
> > >
> > > Okay, removed changes to common forward engines in v4, please check.
> >
> > Thanks.
> >
> > >
> > > >
> > > > If you really want to reuse the existing application without any
> > > > application change,
> > > > I think, you need to hook this to eventdev
> > > > http://code.dpdk.org/dpdk/latest/source/lib/eventdev/rte_eventdev.h#L34
> > > >
> > > > Where eventdev drivers does this thing in addition to other features, Ie.
> > > > t has ports (which is kind of aggregator),
> > > > it can receive the packets from any queue with mbuf->port as actually
> > > > received port.
> > > > That is in terms of mapping:
> > > > - event queue will be dummy it will be as same as Rx queue
> > > > - Rx adapter will be also a dummy
> > > > - event ports aggregate multiple queues and connect to core via event port
> > > > - On Rxing the packet, mbuf->port will be the actual Port which is received.
> > > > app/test-eventdev written to use this model.
> > >
> > > Is this the optional aggregator api we discussed? already there, patch
> > > 2/6.
> > > I was trying to make common forwarding engines perfect to support any
> > > case, but since you all have concerns, removed in v4.
> >
> > The point was, If we take eventdev Rx adapter path, This all thing can
> > be implemented
> > without adding any new APIs in ethdev as similar functionality is
> > supported ethdeev-eventdev
> > Rx adapter. Now two things,
> >
> > 1) Aggregator API is not required, We will be taking the eventdev Rx
> > adapter route this implement it
> > 2) Another mode it is possible to implement it with  eventdev Rx
> > adapter. So I leave to ethdev
> > maintainers to decide if this path is required or not. No strong
> > opinion on this.
>
> Seems you are expert of event, is this the Rx burst api?
> rte_event_dequeue_burst(dev_id, port_id, ev[], nb_events, timeout)

Yes.

>
> Two concerns from user perspective:
> 1. By using ethdev-eventdev wrapper, it impacts performance.

It is not a wrapper. If HW is doing the work then there will not be any
regression with the Rx adapter.
Like tx_burst, packets/events come through rte_event_dequeue_burst(),
i.e. a single callback function pointer overhead.
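
For illustration, a minimal sketch of the consuming loop under that model
(event device and port setup omitted; dev_id/ev_port are assumed to be
configured already, and process_pkt() is a made-up application hook):

    struct rte_event ev[32];
    uint16_t i, n;

    /* One dequeue serves all Rx queues aggregated behind the event
     * port, so no per-port polling loop is needed. */
    n = rte_event_dequeue_burst(dev_id, ev_port, ev, RTE_DIM(ev), 0);
    for (i = 0; i < n; i++) {
        struct rte_mbuf *m = ev[i].mbuf;
        /* mbuf->port is the port the packet actually arrived on. */
        process_pkt(m->port, m);
    }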

> 2. For user application like OVS, using event api just when shared rxq
> enable looks strange.
>
> Maybe I missed something?
>
> There should be more feedkback and idea on how to aggregate ports after
> the fundamental(offload bit and group) start to work, agree to remove
> the aggregator api for now.

OK.

>
> >
> >
> >
> > >
> > >
> > > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > > all features like packet content, rss, vlan, checksum, rte_flow... to
> > > > > > > be working based on shared rx queue. Based on the patch, I believe the
> > > > > > > impact has been minimized.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > > > > > represontors? If so can't we
> > > > > > > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > > > > > > scope can reduce the
> > > > > > > > > > > > > > > complexity it brings?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > > > > > > representor the case by changing its
> > > > > > > > > name and
> > > > > > > > > > > > > > > > > scope.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > > > > > > apply.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > > >  doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > > >  1 +
> > > > > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > > >  7 +++++++
> > > > > > > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each
> > > > > > > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > > > > > > +  Polling the large number of ports brings
> > > > > > > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > > > > > > +  representors in same switch domain.
> > > > > > > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > > > > > +  is present in Rx offloading capability of
> > > > > > > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port
> > > > > > > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > > > > > > +  packets of all ports in group, port ID is
> > > > > > > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > > > > > > >         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > > > > > >         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > > > > > >         uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > > > > > > +       uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > > > > > > >         /**
> > > > > > > > > > > > > > > > > > > >          * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > > > > > >          * Only offloads set on
> > > > > > > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > > > > > >                                  DEV_RX_OFFLO
> > > > > > > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > > > > > >

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/6] ethdev: introduce shared Rx queue
  2021-09-30 14:55   ` [dpdk-dev] [PATCH v4 1/6] " Xueming Li
@ 2021-10-11 10:47     ` Andrew Rybchenko
  2021-10-11 13:12       ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Andrew Rybchenko @ 2021-10-11 10:47 UTC (permalink / raw)
  To: Xueming Li, dev
  Cc: Jerin Jacob, Ferruh Yigit, Viacheslav Ovsiienko, Thomas Monjalon,
	Lior Margalit, Ananyev Konstantin

On 9/30/21 5:55 PM, Xueming Li wrote:
> In current DPDK framework, each RX queue is pre-loaded with mbufs for

RX -> Rx

> incoming packets. When number of representors scale out in a switch
> domain, the memory consumption became significant. Most important,
> polling all ports leads to high cache miss, high latency and low
> throughput.

It should be highlighted that it is a problem of some PMDs.
Not all.

> 
> This patch introduces shared RX queue. Ports with same configuration in

"This patch introduces" -> "Introduce"

RX -> Rx

> a switch domain could share RX queue set by specifying sharing group.

RX -> Rx

> Polling any queue using same shared RX queue receives packets from all

RX -> Rx

> member ports. Source port is identified by mbuf->port.
> 
> Port queue number in a shared group should be identical. Queue index is
> 1:1 mapped in shared group.
> 
> Share RX queue must be polled on single thread or core.

RX -> Rx

> 
> Multiple groups is supported by group ID.

is -> are

> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Cc: Jerin Jacob <jerinjacobk@gmail.com>

The patch should update release notes.

> ---
> Rx queue object could be used as shared Rx queue object, it's important
> to clear all queue control callback api that using queue object:
>   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> ---
>  doc/guides/nics/features.rst                    | 11 +++++++++++
>  doc/guides/nics/features/default.ini            |  1 +
>  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
>  lib/ethdev/rte_ethdev.c                         |  1 +
>  lib/ethdev/rte_ethdev.h                         |  7 +++++++
>  5 files changed, 30 insertions(+)
> 
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index 4fce8cd1c97..69bc1d5719c 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -626,6 +626,17 @@ Supports inner packet L4 checksum.
>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
>  
>  
> +.. _nic_features_shared_rx_queue:
> +
> +Shared Rx queue
> +---------------
> +
> +Supports shared Rx queue for ports in same switch domain.
> +
> +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> +* **[provides] mbuf**: ``mbuf.port``.
> +
> +
>  .. _nic_features_packet_type_parsing:
>  
>  Packet type parsing
> diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> index 754184ddd4d..ebeb4c18512 100644
> --- a/doc/guides/nics/features/default.ini
> +++ b/doc/guides/nics/features/default.ini
> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
>  Queue start/stop     =
>  Runtime Rx queue setup =
>  Runtime Tx queue setup =
> +Shared Rx queue      =
>  Burst mode info      =
>  Power mgmt address monitor =
>  MTU update           =
> diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> index ff6aa91c806..bc7ce65fa3d 100644
> --- a/doc/guides/prog_guide/switch_representation.rst
> +++ b/doc/guides/prog_guide/switch_representation.rst
> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
>  .. [1] `Ethernet switch device driver model (switchdev)
>         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
>  
> +- Memory usage of representors is huge when number of representor grows,
> +  because PMD always allocate mbuf for each descriptor of Rx queue.

It is a problem of some PMDs only. So, it must be rewritten to
highlight it.

> +  Polling the large number of ports brings more CPU load, cache miss and
> +  latency. Shared Rx queue can be used to share Rx queue between PF and
> +  representors in same switch. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ`` is
> +  present in Rx offloading capability of device info. Setting the
> +  offloading flag in device Rx mode or Rx queue configuration to enable
> +  shared Rx queue. Polling any member port of the shared Rx queue can return
> +  packets of all ports in the group, port ID is saved in ``mbuf.port``.
> +
>  Basic SR-IOV
>  ------------
>  
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 61aa49efec6..73270c10492 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -127,6 +127,7 @@ static const struct {
>  	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
>  	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
>  	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> +	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
>  };
>  
>  #undef RTE_RX_OFFLOAD_BIT2STR
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index afdc53b674c..d7ac625ee74 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1077,6 +1077,7 @@ struct rte_eth_rxconf {
>  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>  	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +	uint32_t shared_group; /**< Shared port group index in switch domain. */
>  	/**
>  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1403,6 +1404,12 @@ struct rte_eth_conf {
>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>  #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> +/**
> + * Rx queue is shared among ports in same switch domain to save memory,
> + * avoid polling each port. Any port in the group can be used to receive
> + * packets. Real source port number saved in mbuf->port field.
> + */
> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
>  
>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>  				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> 

IMHO it should be squashed with the second patch to make it
easier to review. Otherwise it is hard to understand what
shared_group and the offload are, since they are dead in the patch.
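
For reviewers' reference, the intended use of the two additions taken
together is along the lines of the sketch below (based only on this
patch; nb_rxd, socket_id and mp are the usual setup-time placeholders):

    struct rte_eth_rxconf rxconf = dev_info.default_rxconf;

    /* The same offload flag and group must be set on every member
     * port; queue index 0 maps 1:1 across the group. */
    rxconf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
    rxconf.shared_group = 0;
    if (rte_eth_rx_queue_setup(port_id, 0, nb_rxd, socket_id,
                               &rxconf, mp) != 0)
        rte_exit(EXIT_FAILURE, "rx queue setup failed\n");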

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue
  2021-09-30 14:55 ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue Xueming Li
                     ` (5 preceding siblings ...)
  2021-09-30 14:56   ` [dpdk-dev] [PATCH v4 6/6] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
@ 2021-10-11 11:49   ` Andrew Rybchenko
  2021-10-11 15:11     ` Xueming(Steven) Li
  6 siblings, 1 reply; 266+ messages in thread
From: Andrew Rybchenko @ 2021-10-11 11:49 UTC (permalink / raw)
  To: Xueming Li, dev
  Cc: Jerin Jacob, Ferruh Yigit, Viacheslav Ovsiienko, Thomas Monjalon,
	Lior Margalit, Ananyev Konstantin

Hi Xueming,

On 9/30/21 5:55 PM, Xueming Li wrote:
> In current DPDK framework, all RX queues is pre-loaded with mbufs for
> incoming packets. When number of representors scale out in a switch
> domain, the memory consumption became significant. Further more,
> polling all ports leads to high cache miss, high latency and low
> throughputs.
> 
> This patch introduces shared RX queue. PF and representors with same
> configuration in same switch domain could share RX queue set by
> specifying shared Rx queue offloading flag and sharing group.
> 
> All ports that Shared Rx queue actually shares One Rx queue and only
> pre-load mbufs to one Rx queue, memory is saved.
> 
> Polling any queue using same shared RX queue receives packets from all
> member ports. Source port is identified by mbuf->port.
> 
> Multiple groups is supported by group ID. Port queue number in a shared
> group should be identical. Queue index is 1:1 mapped in shared group.
> An example of polling two share groups:
>   core	group	queue
>   0	0	0
>   1	0	1
>   2	0	2
>   3	0	3
>   4	1	0
>   5	1	1
>   6	1	2
>   7	1	3
> 
> Shared RX queue must be polled on single thread or core. If both PF0 and
> representor0 joined same share group, can't poll pf0rxq0 on core1 and
> rep0rxq0 on core2. Actually, polling one port within share group is
> sufficient since polling any port in group will return packets for any
> port in group.

I apologize for jumping into the review process so late.

Frankly speaking, I doubt that it is the best design to solve
the problem. Yes, I confirm that the problem exists, but I
think there is a better and simpler way to solve it.

The problem of the suggested solution is that it puts all
the headache about consistency on the application and PMDs
without any help from the ethdev layer to guarantee the
consistency. As a result I believe it will lead to either
missing/lost consistency checks or huge duplication in
each PMD which supports the feature. Shared RxQs must be
equally configured, including number of queues, offloads
(taking device-level Rx offloads into account), RSS
settings etc. So, applications must care about it and
PMDs (or the ethdev layer) must check it.
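
To illustrate the kind of duplicated checking this implies, each PMD (or
the ethdev layer) would need something like the following whenever a port
joins a share group (hypothetical sketch; struct shared_rxq_group and its
fields are made up for this example):

    /* Reject a join if the port configuration differs from the group. */
    static int
    shared_rxq_group_check(const struct shared_rxq_group *grp,
                           const struct rte_eth_conf *conf,
                           uint16_t nb_rx_queues)
    {
        if (nb_rx_queues != grp->nb_rx_queues ||
            conf->rxmode.offloads != grp->rx_offloads ||
            conf->rx_adv_conf.rss_conf.rss_hf != grp->rss_hf)
            return -EINVAL;
        return 0;
    }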

The advantage of the solution is that any device may
create a group and subsequent devices may join it. The
absence of a primary device is nice. But do we really need it?
Will the design work if some representors are configured
to use a shared RxQ, but some are not? Theoretically it
is possible, but it could require extra non-trivial code
on the fast path.

Also, looking at the first two patches I don't understand
how an application will find out which devices may share
RxQs. E.g. if we have two different NICs which support
sharing, we can try to set up only one group 0, but
will finally have two devices (not one) which must be
polled.

1. We need an extra flag in dev_info->dev_capa
   RTE_ETH_DEV_CAPA_RX_SHARE to advertise that
   the device supports Rx sharing.

2. I think we need "rx_domain" in device info
   (which should be treated within the boundaries of the
   switch_domain) if and only if
   RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
   Otherwise the rx_domain value does not make sense.

(1) and (2) will allow the application to find out which
devices can share Rx (see the discovery sketch after this list).

3. Primary device (representors backing device) should
   advertise shared RxQ offload. Enabling the offload
   tells the device to provide packets to all devices in
   the Rx domain with mbuf->port filled in appropriately.
   Also it allows the app to identify the primary device
   in the Rx domain. When the application enables the
   offload, it must ensure that it does not treat the used
   port_id as an input port_id, but always checks
   mbuf->port for each packet.

4. A new Rx mode should be introduced for secondary
   devices. It should not allow configuring RSS, specifying
   any Rx offloads etc. ethdev must ensure it.
   It is an open question right now whether it should require
   providing the primary port_id. In theory representors
   have it. However, maybe it is nice for consistency
   to ensure that the application knows that it does.
   If shared Rx mode is specified for a device, the application
   does not need to set up RxQs, and attempts to do it
   should be discarded in ethdev.
   For consistency it is better to ensure that the number of
   queues matches.
   It is an interesting question what should happen if the
   primary device is reconfigured and shared Rx is
   disabled on reconfiguration.

5. If so, in theory the implementation of the Rx burst
   in the secondary could simply call Rx burst on the
   primary device.
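
For (1) and (2), application-side discovery could then look like the
sketch below (RTE_ETH_DEV_CAPA_RX_SHARE and the rx_domain field are the
proposed additions above, not existing API, and register_in_rx_domain()
is a made-up application helper):

    uint16_t pid;
    struct rte_eth_dev_info info;

    RTE_ETH_FOREACH_DEV(pid) {
        if (rte_eth_dev_info_get(pid, &info) != 0)
            continue;
        if ((info.dev_capa & RTE_ETH_DEV_CAPA_RX_SHARE) == 0)
            continue;
        /* Ports reporting the same (switch domain, rx_domain) pair
         * may share Rx queues with each other. */
        register_in_rx_domain(info.switch_info.domain_id,
                              info.rx_domain, pid);
    }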

Andrew.

^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v5 0/5] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (4 preceding siblings ...)
  2021-09-30 14:55 ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue Xueming Li
@ 2021-10-11 12:37 ` Xueming Li
  2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 1/5] " Xueming Li
                     ` (4 more replies)
  2021-10-12 14:39 ` [dpdk-dev] [PATCH v6 0/5] ethdev: introduce " Xueming Li
                   ` (10 subsequent siblings)
  16 siblings, 5 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-11 12:37 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin

In the current DPDK framework, all Rx queues are pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces shared Rx queue. A PF and representors with the
same configuration in the same switch domain can share an Rx queue set
by specifying the shared Rx queue offloading flag and a sharing group.

All member ports of a shared Rx queue actually share one Rx queue, and
mbufs are pre-loaded into that single queue only, so memory is saved.

Polling any queue using the same shared Rx queue receives packets from
all member ports. The source port is identified by mbuf->port.

Multiple groups are supported by group ID. The queue number of each port
in a shared group should be identical. The queue index is 1:1 mapped in a
shared group.
An example of two share groups:
 Group0, 4 shared Rx queues per member port: PF, repr0, repr1
 Group1, 2 shared Rx queues per member port: repr2, repr3, ... repr127
 Poll first port for each group:
  core	port	queue
  0	0	0
  1	0	1
  2	0	2
  3	0	3
  4	2	0
  5	2	1

A shared Rx queue must be polled on a single thread or core. If both PF0
and representor0 join the same share group, pf0rxq0 cannot be polled on
core1 while rep0rxq0 is polled on core2. Actually, polling one port within
the share group is sufficient, since polling any port in the group will
return packets for any port in the group.
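
For example, the polling loop on each core above reduces to one rx_burst
on the group's front port plus a per-packet dispatch by mbuf->port
(sketch only; front_port/queue_id are per-core values from the table
above and dispatch() is a made-up application hook):

    struct rte_mbuf *pkts[32];
    uint16_t i, n;

    /* Polling one member port returns packets received on every
     * member port of the shared Rx queue. */
    n = rte_eth_rx_burst(front_port, queue_id, pkts, RTE_DIM(pkts));
    for (i = 0; i < n; i++)
        dispatch(pkts[i]->port, pkts[i]);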

There was some discussion about aggregating member ports in the same group
into a dummy port, and several ways to achieve it. Since it is optional,
more feedback and requirements need to be collected from users to make a
better decision later.

v1:
  - initial version
v2:
  - add testpmd patches
v3:
  - change common forwarding api to macro for performance, thanks Jerin.
  - save global variable accessed in forwarding to flowstream to minimize
    cache miss
  - combined patches for each forwarding engine
  - support multiple groups in testpmd "--share-rxq" parameter
  - new api to aggregate shared rxq group
v4:
  - spelling fixes
  - remove shared-rxq support for all forwarding engines
  - add dedicate shared-rxq forwarding engine
v5:
 - fix grammars
 - remove aggregate api and leave it for later discussion
 - add release notes
 - add deployment example

Xueming Li (5):
  ethdev: introduce shared Rx queue
  app/testpmd: new parameter to enable shared Rx queue
  app/testpmd: dump port info for shared Rx queue
  app/testpmd: force shared Rx queue polled on same core
  app/testpmd: add forwarding engine for shared Rx queue

 app/test-pmd/config.c                         | 102 +++++++++++-
 app/test-pmd/meson.build                      |   1 +
 app/test-pmd/parameters.c                     |  13 ++
 app/test-pmd/shared_rxq_fwd.c                 | 148 ++++++++++++++++++
 app/test-pmd/testpmd.c                        |  23 ++-
 app/test-pmd/testpmd.h                        |   5 +
 app/test-pmd/util.c                           |   3 +
 doc/guides/nics/features.rst                  |  11 ++
 doc/guides/nics/features/default.ini          |   1 +
 .../prog_guide/switch_representation.rst      |  10 ++
 doc/guides/rel_notes/release_21_11.rst        |   4 +
 doc/guides/testpmd_app_ug/run_app.rst         |   8 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
 lib/ethdev/rte_ethdev.c                       |   1 +
 lib/ethdev/rte_ethdev.h                       |   7 +
 15 files changed, 339 insertions(+), 3 deletions(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v5 1/5] ethdev: introduce shared Rx queue
  2021-10-11 12:37 ` [dpdk-dev] [PATCH v5 0/5] " Xueming Li
@ 2021-10-11 12:37   ` Xueming Li
  2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 2/5] app/testpmd: new parameter to enable " Xueming Li
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-11 12:37 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin

In the current DPDK framework, each Rx queue is pre-loaded with mbufs
for incoming packets. For some PMDs, when the number of representors
scales out in a switch domain, the memory consumption becomes
significant. Polling all ports also leads to high cache miss rates, high
latency and low throughput.

This patch introduces shared Rx queue. Ports with the same configuration
in a switch domain can share an Rx queue set by specifying a sharing
group. Polling any queue using the same shared Rx queue receives packets
from all member ports. The source port is identified by mbuf->port.

The queue number of each port in a shared group should be identical.
The queue index is 1:1 mapped in a shared group.

A shared Rx queue must be polled on a single thread or core.

Multiple groups are supported by group ID.

Example grouping and polling model to reflect service priority:
 Group0, 2 shared Rx queues per port: PF, rep0, rep1
 Group1, 1 shared Rx queue per port: rep2, rep3, ... rep127
 Core0: poll PF queue0
 Core1: poll PF queue1
 Core2: poll rep2 queue0

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Cc: Jerin Jacob <jerinjacobk@gmail.com>
---
The Rx queue object could be used as a shared Rx queue object; it's
important to sort out all queue control callback APIs that use the queue
object:
  https://mails.dpdk.org/archives/dev/2021-July/215574.html
---
 doc/guides/nics/features.rst                    | 11 +++++++++++
 doc/guides/nics/features/default.ini            |  1 +
 doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
 doc/guides/rel_notes/release_21_11.rst          |  4 ++++
 lib/ethdev/rte_ethdev.c                         |  1 +
 lib/ethdev/rte_ethdev.h                         |  7 +++++++
 6 files changed, 34 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 4fce8cd1c97..69bc1d5719c 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -626,6 +626,17 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same switch domain.
+
+* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 754184ddd4d..ebeb4c18512 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c806..47205f5f1cc 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- For some PMDs, memory usage of representors is huge when the number of
+  representors grows, since mbufs are allocated for each descriptor of the
+  Rx queue. Polling a large number of ports brings more CPU load, cache
+  misses and latency. A shared Rx queue can be used to share the Rx queue
+  between the PF and representors in the same switch domain.
+  ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ`` is present in the Rx offloading
+  capability of the device info. Set the offloading flag in the device Rx
+  mode or Rx queue configuration to enable the shared Rx queue. Polling any
+  member port returns packets of all ports in the group; port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index c0a7f755189..aa6b5e2e9c5 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -134,6 +134,10 @@ New Features
   * Added tests to validate packets hard expiry.
   * Added tests to verify tunnel header verification in IPsec inbound.
 
+* **Added ethdev shared Rx queue support.**
+
+  * Added new Rx queue offloading capability flag.
+  * Added share group to Rx queue configuration.
 
 Removed Items
 -------------
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index fb69f6ea8d1..d78b50e1fa7 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -127,6 +127,7 @@ static const struct {
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
 	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 39d2cd612cb..b94f7ba5a3f 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1077,6 +1077,7 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	uint32_t shared_group; /**< Shared port group index in switch domain. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1403,6 +1404,12 @@ struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
 #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
+/**
+ * Rx queue is shared among ports in same switch domain to save memory,
+ * avoid polling each port. Any port in the group can be used to receive
+ * packets. Real source port number saved in mbuf->port field.
+ */
+#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread
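
For illustration, a minimal sketch of how an application might consume
the API added above; the port list, queue count, descriptor count and
mempool are assumptions, and error handling is pared down:

    #include <string.h>
    #include <rte_ethdev.h>
    #include <rte_mempool.h>

    /* Sketch only: join every Rx queue of each member port to shared
     * group 0. Assumes the ports belong to one switch domain and that
     * mp was created earlier, e.g. with rte_pktmbuf_pool_create().
     */
    static int
    setup_shared_rxqs(const uint16_t *ports, uint16_t nb_ports,
                      uint16_t nb_rxq, struct rte_mempool *mp)
    {
        struct rte_eth_dev_info info;
        struct rte_eth_conf conf;
        struct rte_eth_rxconf rxq_conf;
        uint16_t p, q;
        int ret;

        for (p = 0; p < nb_ports; p++) {
            ret = rte_eth_dev_info_get(ports[p], &info);
            if (ret < 0)
                return ret;
            memset(&conf, 0, sizeof(conf));
            if (info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
                conf.rxmode.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
            ret = rte_eth_dev_configure(ports[p], nb_rxq, 1, &conf);
            if (ret < 0)
                return ret;
            for (q = 0; q < nb_rxq; q++) {
                rxq_conf = info.default_rxconf;
                rxq_conf.offloads = conf.rxmode.offloads;
                rxq_conf.shared_group = 0; /* all members join group 0 */
                ret = rte_eth_rx_queue_setup(ports[p], q, 512,
                                             rte_eth_dev_socket_id(ports[p]),
                                             &rxq_conf, mp);
                if (ret < 0)
                    return ret;
            }
        }
        return 0;
    }

Queue counts must be identical across members, and queue index q of
every member port then lands in the same shared queue of group 0.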

* [dpdk-dev] [PATCH v5 2/5] app/testpmd: new parameter to enable shared Rx queue
  2021-10-11 12:37 ` [dpdk-dev] [PATCH v5 0/5] " Xueming Li
  2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 1/5] " Xueming Li
@ 2021-10-11 12:37   ` Xueming Li
  2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 3/5] app/testpmd: dump port info for " Xueming Li
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-11 12:37 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

Adds "--rxq-share" parameter to enable shared rxq for each rxq.

Default shared rxq group 0 is used, Rx queues in same switch domain
shares same rxq according to queue index.

Shared Rx queue is enabled only if device support offloading flag
RTE_ETH_RX_OFFLOAD_SHARED_RXQ.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c                 |  6 +++++-
 app/test-pmd/parameters.c             | 13 +++++++++++++
 app/test-pmd/testpmd.c                | 18 ++++++++++++++++++
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  7 +++++++
 5 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 9c66329e96e..6c7f9dee065 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2709,7 +2709,11 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+				printf(" share group=%u",
+				       rx_conf->shared_group);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 3f94a82e321..30dae326310 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -167,6 +167,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share: number of ports per shared rxq groups, defaults to MAX(1 group)\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -607,6 +608,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			2, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1271,6 +1273,17 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
+				if (optarg == NULL) {
+					rxq_share = UINT32_MAX;
+				} else {
+					n = atoi(optarg);
+					if (n >= 0)
+						rxq_share = (uint32_t)n;
+					else
+						rte_exit(EXIT_FAILURE, "rxq-share must be >= 0\n");
+				}
+			}
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 97ae52e17ec..417e92ade11 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -498,6 +498,11 @@ uint8_t record_core_cycles;
  */
 uint8_t record_burst_stats;
 
+/*
+ * Number of ports per shared Rx queue group; 0 means disabled.
+ */
+uint32_t rxq_share;
+
 unsigned int num_sockets = 0;
 unsigned int socket_ids[RTE_MAX_NUMA_NODES];
 
@@ -1506,6 +1511,11 @@ init_config_port_offloads(portid_t pid, uint32_t socket_id)
 		port->dev_conf.txmode.offloads &=
 			~DEV_TX_OFFLOAD_MBUF_FAST_FREE;
 
+	if (rxq_share > 0 &&
+	    (port->dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
+		port->dev_conf.rxmode.offloads |=
+				RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
+
 	/* Apply Rx offloads configuration */
 	for (i = 0; i < port->dev_info.max_rx_queues; i++)
 		port->rx_conf[i].offloads = port->dev_conf.rxmode.offloads;
@@ -3401,6 +3411,14 @@ rxtx_port_config(struct rte_port *port)
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.rx_offload_capa &
+		     RTE_ETH_RX_OFFLOAD_SHARED_RXQ)) {
+			offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
+			port->rx_conf[qid].shared_group = nb_ports / rxq_share;
+		}
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 5863b2f43f3..3dfaaad94c0 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -477,6 +477,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint32_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern int nb_flows_flowgen;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 640eadeff73..6cfe88951f7 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -389,6 +389,13 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share=[X]``
+
+    Create queues in shared Rx queue mode if the device supports it.
+    A new group is started every X ports; all ports fall into group 0
+    if X is not specified. Only the "shared-rxq" forwarding engine is
+    expected to resolve the source stream correctly.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread
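
For reference, one possible invocation with the new option (the PCI
address and representor devargs are purely illustrative):

    dpdk-testpmd -a 0000:03:00.0,representor=[0-3] -- -i --rxq=4 --txq=4 --rxq-share=2

Each capable port then gets RTE_ETH_RX_OFFLOAD_SHARED_RXQ set on all of
its Rx queues, as done in init_config_port_offloads() above.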

* [dpdk-dev] [PATCH v5 3/5] app/testpmd: dump port info for shared Rx queue
  2021-10-11 12:37 ` [dpdk-dev] [PATCH v5 0/5] " Xueming Li
  2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 1/5] " Xueming Li
  2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 2/5] app/testpmd: new parameter to enable " Xueming Li
@ 2021-10-11 12:37   ` Xueming Li
  2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 4/5] app/testpmd: force shared Rx queue polled on same core Xueming Li
  2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 5/5] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  4 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-11 12:37 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

In case of shared Rx queue, polling any member port returns mbufs of
all member ports. Dump mbuf->port for each packet to show the real
source port.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/util.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 51506e49404..e98f136d5ed 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -100,6 +100,9 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_flow_restore_info info = { 0, };
 
 		mb = pkts[i];
+		if (rxq_share > 0)
+			MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ",
+				  mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v5 4/5] app/testpmd: force shared Rx queue polled on same core
  2021-10-11 12:37 ` [dpdk-dev] [PATCH v5 0/5] " Xueming Li
                     ` (2 preceding siblings ...)
  2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 3/5] app/testpmd: dump port info for " Xueming Li
@ 2021-10-11 12:37   ` Xueming Li
  2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 5/5] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  4 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-11 12:37 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

Shared Rx queues share one underlying set of Rx queues (group zero by
default). A shared Rx queue must be polled from one core only.

Check and stop forwarding if a shared Rx queue is scheduled on
multiple cores.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c  | 96 ++++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |  4 +-
 app/test-pmd/testpmd.h |  2 +
 3 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 6c7f9dee065..4ab500569ee 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2885,6 +2885,102 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared Rx queue is scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, portid_t src_port,
+			   queueid_t src_rxq, lcoreid_t src_lc,
+			   uint32_t shared_group)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((rxq_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (fs->rx_queue != src_rxq)
+				continue;
+			if (rxq_conf->shared_group != shared_group)
+				continue;
+			printf("Shared Rx queue group %u can't be scheduled on different cores:\n",
+			       shared_group);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("  please use --nb-cores=%hu to limit forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * A shared group must not be scheduled on different cores.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Update lcore info for the stream being scheduled. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((rxq_conf->offloads & RTE_ETH_RX_OFFLOAD_SHARED_RXQ)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check shared rxq not scheduled on remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, fs->rx_port,
+						       fs->rx_queue, lc_id,
+						       rxq_conf->shared_group))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 417e92ade11..cab4b36b046 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2241,10 +2241,12 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 3dfaaad94c0..f121a2da90c 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -144,6 +144,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -795,6 +796,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v5 5/5] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-11 12:37 ` [dpdk-dev] [PATCH v5 0/5] " Xueming Li
                     ` (3 preceding siblings ...)
  2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 4/5] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-10-11 12:37   ` Xueming Li
  4 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-11 12:37 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

To support shared Rx queue, introduce a dedicated forwarding engine.
The engine groups received packets into sub-bursts by mbuf->port,
updates the matching stream's statistics and simply frees the packets.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/meson.build                    |   1 +
 app/test-pmd/shared_rxq_fwd.c               | 148 ++++++++++++++++++++
 app/test-pmd/testpmd.c                      |   1 +
 app/test-pmd/testpmd.h                      |   1 +
 doc/guides/testpmd_app_ug/run_app.rst       |   1 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
 6 files changed, 156 insertions(+), 1 deletion(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 98f3289bdfa..07042e45b12 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -21,6 +21,7 @@ sources = files(
         'noisy_vnf.c',
         'parameters.c',
         'rxonly.c',
+        'shared_rxq_fwd.c',
         'testpmd.c',
         'txonly.c',
         'util.c',
diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-pmd/shared_rxq_fwd.c
new file mode 100644
index 00000000000..4e262b99bc7
--- /dev/null
+++ b/app/test-pmd/shared_rxq_fwd.c
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <stdarg.h>
+#include <string.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <sys/queue.h>
+#include <sys/stat.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_debug.h>
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memcpy.h>
+#include <rte_launch.h>
+#include <rte_eal.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_mbuf.h>
+#include <rte_pci.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_string_fns.h>
+#include <rte_ip.h>
+#include <rte_udp.h>
+#include <rte_net.h>
+#include <rte_flow.h>
+
+#include "testpmd.h"
+
+/*
+ * Rx only sub-burst forwarding.
+ */
+static void
+forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst)
+{
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
+}
+
+/**
+ * Get packet source stream by source port and queue.
+ * All streams of the same shared Rx queue are located on the same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_sub_burst(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		  struct rte_mbuf **pkts)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		forward_rx_only(nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * The source port of each packet is identified by mbuf->port.
+ */
+static void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst)
+{
+	uint16_t i, nb_sub_burst, port, last_port;
+
+	nb_sub_burst = 0;
+	last_port = pkts_burst[0]->port;
+	/* Locate sub-burst according to mbuf->port. */
+	for (i = 0; i < nb_rx - 1; ++i) {
+		rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (i > 0 && last_port != port) {
+			/* Forward packets with same source port. */
+			forward_sub_burst(fs, last_port, nb_sub_burst,
+					  &pkts_burst[i - nb_sub_burst]);
+			nb_sub_burst = 0;
+			last_port = port;
+		}
+		nb_sub_burst++;
+	}
+	/* Last sub-burst. */
+	nb_sub_burst++;
+	forward_sub_burst(fs, last_port, nb_sub_burst,
+			  &pkts_burst[nb_rx - nb_sub_burst]);
+}
+
+static void
+shared_rxq_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];
+	uint16_t nb_rx;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
+				 nb_pkt_per_burst);
+	inc_rx_burst_stats(fs, nb_rx);
+	if (unlikely(nb_rx == 0))
+		return;
+	forward_shared_rxq(fs, nb_rx, pkts_burst);
+	get_end_cycles(fs, start_tsc);
+}
+
+struct fwd_engine shared_rxq_engine = {
+	.fwd_mode_name  = "shared_rxq",
+	.port_fwd_begin = NULL,
+	.port_fwd_end   = NULL,
+	.packet_fwd     = shared_rxq_fwd,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index cab4b36b046..681ea591871 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -188,6 +188,7 @@ struct fwd_engine * fwd_engines[] = {
 #ifdef RTE_LIBRTE_IEEE1588
 	&ieee1588_fwd_engine,
 #endif
+	&shared_rxq_engine,
 	NULL,
 };
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f121a2da90c..f1fd607e365 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -299,6 +299,7 @@ extern struct fwd_engine five_tuple_swap_fwd_engine;
 #ifdef RTE_LIBRTE_IEEE1588
 extern struct fwd_engine ieee1588_fwd_engine;
 #endif
+extern struct fwd_engine shared_rxq_engine;
 
 extern struct fwd_engine * fwd_engines[]; /**< NULL terminated array. */
 extern cmdline_parse_inst_t cmd_set_raw;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 6cfe88951f7..277ef79ed9d 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -252,6 +252,7 @@ The command line options are:
        tm
        noisy
        5tswap
+       shared-rxq
 
 *   ``--rss-ip``
 
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index bbef7063741..a30e2f4dfae 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -314,7 +314,7 @@ set fwd
 Set the packet forwarding mode::
 
    testpmd> set fwd (io|mac|macswap|flowgen| \
-                     rxonly|txonly|csum|icmpecho|noisy|5tswap) (""|retry)
+                     rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
 
 ``retry`` can be specified for forwarding engines except ``rx_only``.
 
@@ -357,6 +357,9 @@ The available information categories are:
 
   L4 swaps the source port and destination port of transport layer (TCP and UDP).
 
+* ``shared-rxq``: Receive-only mode for shared Rx queues.
+  Resolves the packet source port from the mbuf and updates stream statistics accordingly.
+
 Example::
 
    testpmd> set fwd rxonly
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread
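
Putting the series together, a hypothetical testpmd session exercising
the new engine could look as follows (EAL arguments elided):

    dpdk-testpmd <EAL args> -- -i --rxq-share
    testpmd> set fwd shared-rxq
    testpmd> start
    testpmd> stop

The engine then resolves each packet's source stream from mbuf->port,
as implemented in forward_shared_rxq() above.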

* Re: [dpdk-dev] [PATCH v4 1/6] ethdev: introduce shared Rx queue
  2021-10-11 10:47     ` Andrew Rybchenko
@ 2021-10-11 13:12       ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-11 13:12 UTC (permalink / raw)
  To: andrew.rybchenko, dev
  Cc: jerinjacobk, NBU-Contact-Thomas Monjalon, Lior Margalit,
	Slava Ovsiienko, konstantin.ananyev, ferruh.yigit

On Mon, 2021-10-11 at 13:47 +0300, Andrew Rybchenko wrote:
> On 9/30/21 5:55 PM, Xueming Li wrote:
> > In current DPDK framework, each RX queue is pre-loaded with mbufs for
> 
> RX -> Rx
> 
> > incoming packets. When number of representors scale out in a switch
> > domain, the memory consumption became significant. Most important,
> > polling all ports leads to high cache miss, high latency and low
> > throughput.
> 
> It should be highlighted that it is a problem of some PMDs.
> Not all.
> 
> > 
> > This patch introduces shared RX queue. Ports with same configuration in
> 
> "This patch introduces" -> "Introduce"
> 
> RX -> Rx
> 
> > a switch domain could share RX queue set by specifying sharing group.
> 
> RX -> Rx
> 
> > Polling any queue using same shared RX queue receives packets from all
> 
> RX -> Rx
> 
> > member ports. Source port is identified by mbuf->port.
> > 
> > Port queue number in a shared group should be identical. Queue index is
> > 1:1 mapped in shared group.
> > 
> > Share RX queue must be polled on single thread or core.
> 
> RX -> Rx
> 
> > 
> > Multiple groups is supported by group ID.
> 
> is -> are
> 
> > 
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> 
> The patch should update release notes.
> 
> > ---
> > Rx queue object could be used as shared Rx queue object, it's important
> > to clear all queue control callback api that using queue object:
> >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > ---
> >  doc/guides/nics/features.rst                    | 11 +++++++++++
> >  doc/guides/nics/features/default.ini            |  1 +
> >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> >  lib/ethdev/rte_ethdev.c                         |  1 +
> >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> >  5 files changed, 30 insertions(+)
> > 
> > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > index 4fce8cd1c97..69bc1d5719c 100644
> > --- a/doc/guides/nics/features.rst
> > +++ b/doc/guides/nics/features.rst
> > @@ -626,6 +626,17 @@ Supports inner packet L4 checksum.
> >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> >  
> >  
> > +.. _nic_features_shared_rx_queue:
> > +
> > +Shared Rx queue
> > +---------------
> > +
> > +Supports shared Rx queue for ports in same switch domain.
> > +
> > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > +* **[provides] mbuf**: ``mbuf.port``.
> > +
> > +
> >  .. _nic_features_packet_type_parsing:
> >  
> >  Packet type parsing
> > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > index 754184ddd4d..ebeb4c18512 100644
> > --- a/doc/guides/nics/features/default.ini
> > +++ b/doc/guides/nics/features/default.ini
> > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> >  Queue start/stop     =
> >  Runtime Rx queue setup =
> >  Runtime Tx queue setup =
> > +Shared Rx queue      =
> >  Burst mode info      =
> >  Power mgmt address monitor =
> >  MTU update           =
> > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > index ff6aa91c806..bc7ce65fa3d 100644
> > --- a/doc/guides/prog_guide/switch_representation.rst
> > +++ b/doc/guides/prog_guide/switch_representation.rst
> > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> >  .. [1] `Ethernet switch device driver model (switchdev)
> >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> >  
> > +- Memory usage of representors is huge when number of representor grows,
> > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> 
> It is a problem of some PMDs only. So, it must be rewritten to
> highlight it.
> 
> > +  Polling the large number of ports brings more CPU load, cache miss and
> > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > +  representors in same switch. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ`` is
> > +  present in Rx offloading capability of device info. Setting the
> > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > +  shared Rx queue. Polling any member port of the shared Rx queue can return
> > +  packets of all ports in the group, port ID is saved in ``mbuf.port``.
> > +
> >  Basic SR-IOV
> >  ------------
> >  
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > index 61aa49efec6..73270c10492 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -127,6 +127,7 @@ static const struct {
> >  	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> >  	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> >  	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > +	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> >  };
> >  
> >  #undef RTE_RX_OFFLOAD_BIT2STR
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > index afdc53b674c..d7ac625ee74 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1077,6 +1077,7 @@ struct rte_eth_rxconf {
> >  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >  	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > +	uint32_t shared_group; /**< Shared port group index in switch domain. */
> >  	/**
> >  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> >  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > @@ -1403,6 +1404,12 @@ struct rte_eth_conf {
> >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> >  #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
> >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > +/**
> > + * Rx queue is shared among ports in same switch domain to save memory,
> > + * avoid polling each port. Any port in the group can be used to receive
> > + * packets. Real source port number saved in mbuf->port field.
> > + */
> > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> >  
> >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> >  				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> > 
> 
> IMHO it should be squashed with the second patch to make it
> easier to review. Otherwise it is hard to understand what is
> shared_group and the offlaod which are dead in the patch.

Hi Andrew,

Thanks for the review! After discussing with Jerin, we want to drop the
second patch and decide how to aggregate ports later, after collecting
more feedback and ideas. To make the offload and group clear, I'll add
an example to the commit message. v5 was sent before I saw your other
review on 0/6; please ignore it for now.

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue
  2021-10-11 11:49   ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce " Andrew Rybchenko
@ 2021-10-11 15:11     ` Xueming(Steven) Li
  2021-10-12  6:37       ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-11 15:11 UTC (permalink / raw)
  To: andrew.rybchenko, dev
  Cc: jerinjacobk, NBU-Contact-Thomas Monjalon, Lior Margalit,
	Slava Ovsiienko, konstantin.ananyev, ferruh.yigit

On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
> Hi Xueming,
> 
> On 9/30/21 5:55 PM, Xueming Li wrote:
> > In current DPDK framework, all RX queues is pre-loaded with mbufs for
> > incoming packets. When number of representors scale out in a switch
> > domain, the memory consumption became significant. Further more,
> > polling all ports leads to high cache miss, high latency and low
> > throughputs.
> > 
> > This patch introduces shared RX queue. PF and representors with same
> > configuration in same switch domain could share RX queue set by
> > specifying shared Rx queue offloading flag and sharing group.
> > 
> > All ports that Shared Rx queue actually shares One Rx queue and only
> > pre-load mbufs to one Rx queue, memory is saved.
> > 
> > Polling any queue using same shared RX queue receives packets from all
> > member ports. Source port is identified by mbuf->port.
> > 
> > Multiple groups is supported by group ID. Port queue number in a shared
> > group should be identical. Queue index is 1:1 mapped in shared group.
> > An example of polling two share groups:
> >   core	group	queue
> >   0	0	0
> >   1	0	1
> >   2	0	2
> >   3	0	3
> >   4	1	0
> >   5	1	1
> >   6	1	2
> >   7	1	3
> > 
> > Shared RX queue must be polled on single thread or core. If both PF0 and
> > representor0 joined same share group, can't poll pf0rxq0 on core1 and
> > rep0rxq0 on core2. Actually, polling one port within share group is
> > sufficient since polling any port in group will return packets for any
> > port in group.
> 
> I apologize that I jump in into the review process that late.

Appreciate the bold suggestion, never too late :)

> 
> Frankly speaking I doubt that it is the best design to solve
> the problem. Yes, I confirm that the problem exists, but I
> think there is better and simpler way to solve it.
> 
> The problem of the suggested solution is that it puts all
> the headache about consistency to application and PMDs
> without any help from ethdev layer to guarantee the
> consistency. As the result I believe it will be either
> missing/lost consistency checks or huge duplication in
> each PMD which supports the feature. Shared RxQs must be
> equally configured including number of queues, offloads
> (taking device level Rx offloads into account), RSS
> settings etc. So, applications must care about it and
> PMDs (or ethdev layer) must check it.

The name might be confusing; here is my understanding:
1. The NIC shares the buffer supply HW queue between shared RxQs - to
save memory.
2. The PMD polls one shared RxQ - for latency and performance.
3. Most per-queue features like offloads and RSS are not impacted;
that's why they are not mentioned. Some offloads might not be supported
due to PMD or HW limitations; checks need to be added in the PMD case
by case.
4. Multiple groups are defined for service-level flexibility. For
example, PF and VIP customers' load is distributed via dedicated queues
and cores, while low-priority customers share one core with one shared
queue. Multiple groups enable more combinations.
5. One port could assign queues to different groups for polling
flexibility. For example, with the first 4 queues in group 0 and the
next 4 queues in group 1, where each group has other member ports with
4 queues, the port with 8 queues could be polled with 8 cores w/o the
non-shared RxQ penalty; in other words, each core polls only one shared
RxQ, as the sketch below shows.
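
A minimal sketch of that polling model, assuming a hypothetical
application-defined handle_pkt() callback:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SZ 32

    /* Application-defined consumer; hypothetical. */
    extern void handle_pkt(uint16_t src_port, struct rte_mbuf *m);

    /* The core owning this shared RxQ polls it through any single
     * member port; packets of all member ports arrive here and are
     * demultiplexed by mbuf->port, the real source port.
     */
    static void
    poll_shared_rxq(uint16_t any_member_port, uint16_t queue_id)
    {
        struct rte_mbuf *pkts[BURST_SZ];
        uint16_t nb, i;

        for (;;) {
            nb = rte_eth_rx_burst(any_member_port, queue_id,
                                  pkts, BURST_SZ);
            for (i = 0; i < nb; i++)
                handle_pkt(pkts[i]->port, pkts[i]);
        }
    }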

> 
> The advantage of the solution is that any device may
> create group and subsequent devices join. Absence of
> primary device is nice. But do we really need it?
> Will the design work if some representors are configured
> to use shared RxQ, but some do not? Theoretically it
> is possible, but could require extra non-trivial code
> on fast path.

With multiple groups, any device could be hot-unplugged.

Mixed configuration is supported; the only difference is how to set
mbuf->port. Since the group is per queue, mixed mode is better to
support, and I didn't see any difficulty here.

A PMD could choose to support only group 0 with the same settings for
each RxQ; that fits most scenarios.

> 
> Also looking at the first two patch I don't understand
> how application will find out which devices may share
> RxQs. E.g. if we have two difference NICs which support
> sharing, we can try to setup only one group 0, but
> finally will have two devices (not one) which must be
> polled.
> 
> 1. We need extra flag in dev_info->dev_capa
>    RTE_ETH_DEV_CAPA_RX_SHARE to advertise that
>    the device supports Rx sharing.

dev_info->rx_queue_offload_capa could be used here, no?

> 
> 2. I think we need "rx_domain" in device info
>    (which should be treated in boundaries of the
>    switch_domain) if and only if
>    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
>    Otherwise rx_domain value does not make sense.

I see, this will give flexibility for different HW; I will add it.

> 
> (1) and (2) will allow application to find out which
> devices can share Rx.
> 
> 3. Primary device (representors backing device) should
>    advertise shared RxQ offload. Enabling of the offload
>    tells the device to provide packets to all device in
>    the Rx domain with mbuf->port filled in appropriately.
>    Also it allows app to identify primary device in the
>    Rx domain. When application enables the offload, it
>    must ensure that it does not treat used port_id as an
>    input port_id, but always check mbuf->port for each
>    packet.
> 
> 4. A new Rx mode should be introduced for secondary
>    devices. It should not allow to configure RSS, specify
>    any Rx offloads etc. ethdev must ensure it.
>    It is an open question right now if it should require
>    to provide primary port_id. In theory representors
>    have it. However, may be it is nice for consistency
>    to ensure that application knows that it does.
>    If shared Rx mode is specified for device, application
>    does not need to setup RxQs and attempts to do it
>    should be discarded in ethdev.
>    For consistency it is better to ensure that number of
>    queues match.

RSS and Rx offloads should be supported individually; the PMD needs to
check and reject what is not supported.

>    It is an interesting question what should happen if
>    primary device is reconfigured and shared Rx is
>    disabled on reconfiguration.

I'd prefer no primary port/queue assumption in the configuration: all
members are treated equally, and each queue can join or quit a share
group. That's important to support multiple groups.

> 
> 5. If so, in theory implementation of the Rx burst
>    in the secondary could simply call Rx burst on
>    primary device.
> 
> Andrew.


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue
  2021-10-11 15:11     ` Xueming(Steven) Li
@ 2021-10-12  6:37       ` Xueming(Steven) Li
  2021-10-12  8:48         ` Andrew Rybchenko
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-12  6:37 UTC (permalink / raw)
  To: andrew.rybchenko, dev
  Cc: jerinjacobk, NBU-Contact-Thomas Monjalon, Lior Margalit,
	Slava Ovsiienko, konstantin.ananyev, ferruh.yigit

On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
> On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
> > Hi Xueming,
> > 
> > On 9/30/21 5:55 PM, Xueming Li wrote:
> > > In current DPDK framework, all RX queues is pre-loaded with mbufs for
> > > incoming packets. When number of representors scale out in a switch
> > > domain, the memory consumption became significant. Further more,
> > > polling all ports leads to high cache miss, high latency and low
> > > throughputs.
> > >  
> > > This patch introduces shared RX queue. PF and representors with same
> > > configuration in same switch domain could share RX queue set by
> > > specifying shared Rx queue offloading flag and sharing group.
> > > 
> > > All ports that Shared Rx queue actually shares One Rx queue and only
> > > pre-load mbufs to one Rx queue, memory is saved.
> > > 
> > > Polling any queue using same shared RX queue receives packets from all
> > > member ports. Source port is identified by mbuf->port.
> > > 
> > > Multiple groups is supported by group ID. Port queue number in a shared
> > > group should be identical. Queue index is 1:1 mapped in shared group.
> > > An example of polling two share groups:
> > >   core	group	queue
> > >   0	0	0
> > >   1	0	1
> > >   2	0	2
> > >   3	0	3
> > >   4	1	0
> > >   5	1	1
> > >   6	1	2
> > >   7	1	3
> > > 
> > > Shared RX queue must be polled on single thread or core. If both PF0 and
> > > representor0 joined same share group, can't poll pf0rxq0 on core1 and
> > > rep0rxq0 on core2. Actually, polling one port within share group is
> > > sufficient since polling any port in group will return packets for any
> > > port in group.
> > 
> > I apologize that I jump in into the review process that late.
> 
> Appreciate the bold suggestion, never too late :)
> 
> > 
> > Frankly speaking I doubt that it is the best design to solve
> > the problem. Yes, I confirm that the problem exists, but I
> > think there is better and simpler way to solve it.
> > 
> > The problem of the suggested solution is that it puts all
> > the headache about consistency to application and PMDs
> > without any help from ethdev layer to guarantee the
> > consistency. As the result I believe it will be either
> > missing/lost consistency checks or huge duplication in
> > each PMD which supports the feature. Shared RxQs must be
> > equally configured including number of queues, offloads
> > (taking device level Rx offloads into account), RSS
> > settings etc. So, applications must care about it and
> > PMDs (or ethdev layer) must check it.
> 
> The name might be confusing, here is my understanding:
> 1. NIC  shares the buffer supply HW Q between shared RxQs - for memory
> 2. PMD polls one shared RxQ - for latency and performance
> 3. Most per queue features like offloads and RSS not impacted. That's
> why this not mentioned. Some offloading might not being supported due
> to PMD or hw limitation, need to add check in PMD case by case.
> 4. Multiple group is defined for service level flexibility. For
> example, PF and VIP customer's load distributed via queues and dedicate
> cores. Low priority customers share one core with one shared queue.
> multiple groups enables more combination.
> 5. One port could assign queues to different group for polling
> flexibility. For example first 4 queues in group 0 and next 4 queues in
> group1, each group have other member ports with 4 queues, so the port
> with 8 queues could be polled with 8 cores w/o non-shared rxq penalty,
> in other words, each core only poll one shared RxQ.
> 
> > 
> > The advantage of the solution is that any device may
> > create group and subsequent devices join. Absence of
> > primary device is nice. But do we really need it?
> > Will the design work if some representors are configured
> > to use shared RxQ, but some do not? Theoretically it
> > is possible, but could require extra non-trivial code
> > on fast path.
> 
> If multiple groups, any device could be hot-unplugged.
> 
> Mixed configuration is supported, the only difference is how to set
> mbuf->port. Since group is per queue, mixed is better to be supported,
> didn't see any difficulty here.
> 
> PDM could select to support only group 0, same settings for each rxq,
> that fits most scenario.
> 
> > 
> > Also looking at the first two patch I don't understand
> > how application will find out which devices may share
> > RxQs. E.g. if we have two difference NICs which support
> > sharing, we can try to setup only one group 0, but
> > finally will have two devices (not one) which must be
> > polled.
> > 
> > 1. We need extra flag in dev_info->dev_capa
> >    RTE_ETH_DEV_CAPA_RX_SHARE to advertise that
> >    the device supports Rx sharing.
> 
> dev_info->rx_queue_offload_capa could be used here, no?
> 
> > 
> > 2. I think we need "rx_domain" in device info
> >    (which should be treated in boundaries of the
> >    switch_domain) if and only if
> >    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
> >    Otherwise rx_domain value does not make sense.
> 
> I see, this will give flexibility of different hw, will add it.
> 
> > 
> > (1) and (2) will allow application to find out which
> > devices can share Rx.
> > 
> > 3. Primary device (representors backing device) should
> >    advertise shared RxQ offload. Enabling of the offload
> >    tells the device to provide packets to all device in
> >    the Rx domain with mbuf->port filled in appropriately.
> >    Also it allows app to identify primary device in the
> >    Rx domain. When application enables the offload, it
> >    must ensure that it does not treat used port_id as an
> >    input port_id, but always check mbuf->port for each
> >    packet.
> > 
> > 4. A new Rx mode should be introduced for secondary
> >    devices. It should not allow to configure RSS, specify
> >    any Rx offloads etc. ethdev must ensure it.
> >    It is an open question right now if it should require
> >    to provide primary port_id. In theory representors
> >    have it. However, may be it is nice for consistency
> >    to ensure that application knows that it does.
> >    If shared Rx mode is specified for device, application
> >    does not need to setup RxQs and attempts to do it
> >    should be discarded in ethdev.
> >    For consistency it is better to ensure that number of
> >    queues match.
> 
> RSS and Rx offloads should be supported as individual, PMD needs to
> check if not supported.
> 
> >    It is an interesting question what should happen if
> >    primary device is reconfigured and shared Rx is
> >    disabled on reconfiguration.
> 
> I feel better no primary port/queue assumption in configuration, all
> members are equally treated, each queue can join or quit share group,
> that's important to support multiple groups.
> 
> > 
> > 5. If so, in theory implementation of the Rx burst
> >    in the secondary could simply call Rx burst on
> >    primary device.
> > 
> > Andrew.
> 

Hi Andrew,

I realized that we are talking about different things; this feature
introduces two kinds of RxQ sharing:
1. Sharing the mempool, to save memory
2. Sharing the polling, to save latency

What you suggested is reusing the whole RxQ configuration, IIUC. Maybe
we should break the flag into three, so the application could learn the
PMD capability and configure accordingly. What do you think?
RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL
RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL
RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL

Regards,
Xueming


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue
  2021-10-12  6:37       ` Xueming(Steven) Li
@ 2021-10-12  8:48         ` Andrew Rybchenko
  2021-10-12 10:55           ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Andrew Rybchenko @ 2021-10-12  8:48 UTC (permalink / raw)
  To: Xueming(Steven) Li, dev
  Cc: jerinjacobk, NBU-Contact-Thomas Monjalon, Lior Margalit,
	Slava Ovsiienko, konstantin.ananyev, ferruh.yigit

On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
> On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
>> On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
>>> Hi Xueming,
>>>
>>> On 9/30/21 5:55 PM, Xueming Li wrote:
>>>> In current DPDK framework, all RX queues is pre-loaded with mbufs for
>>>> incoming packets. When number of representors scale out in a switch
>>>> domain, the memory consumption became significant. Further more,
>>>> polling all ports leads to high cache miss, high latency and low
>>>> throughputs.
>>>>  
>>>> This patch introduces shared RX queue. PF and representors with same
>>>> configuration in same switch domain could share RX queue set by
>>>> specifying shared Rx queue offloading flag and sharing group.
>>>>
>>>> All ports that Shared Rx queue actually shares One Rx queue and only
>>>> pre-load mbufs to one Rx queue, memory is saved.
>>>>
>>>> Polling any queue using same shared RX queue receives packets from all
>>>> member ports. Source port is identified by mbuf->port.
>>>>
>>>> Multiple groups is supported by group ID. Port queue number in a shared
>>>> group should be identical. Queue index is 1:1 mapped in shared group.
>>>> An example of polling two share groups:
>>>>   core	group	queue
>>>>   0	0	0
>>>>   1	0	1
>>>>   2	0	2
>>>>   3	0	3
>>>>   4	1	0
>>>>   5	1	1
>>>>   6	1	2
>>>>   7	1	3
>>>>
>>>> Shared RX queue must be polled on single thread or core. If both PF0 and
>>>> representor0 joined same share group, can't poll pf0rxq0 on core1 and
>>>> rep0rxq0 on core2. Actually, polling one port within share group is
>>>> sufficient since polling any port in group will return packets for any
>>>> port in group.
>>>
>>> I apologize that I jump in into the review process that late.
>>
>> Appreciate the bold suggestion, never too late :)
>>
>>>
>>> Frankly speaking I doubt that it is the best design to solve
>>> the problem. Yes, I confirm that the problem exists, but I
>>> think there is better and simpler way to solve it.
>>>
>>> The problem of the suggested solution is that it puts all
>>> the headache about consistency to application and PMDs
>>> without any help from ethdev layer to guarantee the
>>> consistency. As the result I believe it will be either
>>> missing/lost consistency checks or huge duplication in
>>> each PMD which supports the feature. Shared RxQs must be
>>> equally configured including number of queues, offloads
>>> (taking device level Rx offloads into account), RSS
>>> settings etc. So, applications must care about it and
>>> PMDs (or ethdev layer) must check it.
>>
>> The name might be confusing, here is my understanding:
>> 1. NIC  shares the buffer supply HW Q between shared RxQs - for memory
>> 2. PMD polls one shared RxQ - for latency and performance
>> 3. Most per queue features like offloads and RSS not impacted. That's
>> why this not mentioned. Some offloading might not being supported due
>> to PMD or hw limitation, need to add check in PMD case by case.
>> 4. Multiple group is defined for service level flexibility. For
>> example, PF and VIP customer's load distributed via queues and dedicate
>> cores. Low priority customers share one core with one shared queue.
>> multiple groups enables more combination.
>> 5. One port could assign queues to different group for polling
>> flexibility. For example first 4 queues in group 0 and next 4 queues in
>> group1, each group have other member ports with 4 queues, so the port
>> with 8 queues could be polled with 8 cores w/o non-shared rxq penalty,
>> in other words, each core only poll one shared RxQ.
>>
>>>
>>> The advantage of the solution is that any device may
>>> create group and subsequent devices join. Absence of
>>> primary device is nice. But do we really need it?
>>> Will the design work if some representors are configured
>>> to use shared RxQ, but some do not? Theoretically it
>>> is possible, but could require extra non-trivial code
>>> on fast path.
>>
>> If multiple groups, any device could be hot-unplugged.
>>
>> Mixed configuration is supported, the only difference is how to set
>> mbuf->port. Since group is per queue, mixed is better to be supported,
>> didn't see any difficulty here.
>>
>> PDM could select to support only group 0, same settings for each rxq,
>> that fits most scenario.
>>
>>>
>>> Also looking at the first two patch I don't understand
>>> how application will find out which devices may share
>>> RxQs. E.g. if we have two difference NICs which support
>>> sharing, we can try to setup only one group 0, but
>>> finally will have two devices (not one) which must be
>>> polled.
>>>
>>> 1. We need extra flag in dev_info->dev_capa
>>>    RTE_ETH_DEV_CAPA_RX_SHARE to advertise that
>>>    the device supports Rx sharing.
>>
>> dev_info->rx_queue_offload_capa could be used here, no?

It depends. But we definitely need a flag which
says that the rx_domain below makes sense. It could be
either RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload
capability.

The question is whether it is really an offload. An offload is
when something can be done by HW/FW and the result is provided
to SW. Maybe it is just nit-picking...

Maybe we don't need an offload at all. Just have
RTE_ETH_DEV_CAPA_RXQ_SHARE and use a non-zero group ID
as a flag that an RxQ should be shared (zero - default,
no sharing). The ethdev layer may check consistency on
its layer to ensure that the device capability is
reported if a non-zero group is specified on queue setup.
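
A minimal sketch of such a check, assuming the proposed
RTE_ETH_DEV_CAPA_RXQ_SHARE bit (not an existing flag) and the
zero-group-means-no-sharing convention:

    /* Hypothetical addition inside rte_eth_rx_queue_setup(), after
     * dev_info has been fetched.
     */
    if (rx_conf != NULL && rx_conf->shared_group > 0 &&
        (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
        RTE_ETHDEV_LOG(ERR,
            "port %u: shared group %u requested, but RxQ share capability is not reported\n",
            port_id, rx_conf->shared_group);
        return -EINVAL;
    }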

>>
>>>
>>> 2. I think we need "rx_domain" in device info
>>>    (which should be treated in boundaries of the
>>>    switch_domain) if and only if
>>>    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
>>>    Otherwise rx_domain value does not make sense.
>>
>> I see, this will give flexibility of different hw, will add it.
>>
>>>
>>> (1) and (2) will allow application to find out which
>>> devices can share Rx.
>>>
>>> 3. Primary device (representors backing device) should
>>>    advertise shared RxQ offload. Enabling of the offload
>>>    tells the device to provide packets to all device in
>>>    the Rx domain with mbuf->port filled in appropriately.
>>>    Also it allows app to identify primary device in the
>>>    Rx domain. When application enables the offload, it
>>>    must ensure that it does not treat used port_id as an
>>>    input port_id, but always check mbuf->port for each
>>>    packet.
>>>
>>> 4. A new Rx mode should be introduced for secondary
>>>    devices. It should not allow to configure RSS, specify
>>>    any Rx offloads etc. ethdev must ensure it.
>>>    It is an open question right now if it should require
>>>    to provide primary port_id. In theory representors
>>>    have it. However, may be it is nice for consistency
>>>    to ensure that application knows that it does.
>>>    If shared Rx mode is specified for device, application
>>>    does not need to setup RxQs and attempts to do it
>>>    should be discarded in ethdev.
>>>    For consistency it is better to ensure that number of
>>>    queues match.
>>
>> RSS and Rx offloads should be supported as individual, PMD needs to
>> check if not supported.

Thinking a bit more about it, I agree that RSS settings could
be individual. Offloads could be individual as well, but I'm
not sure about all of them. E.g. Rx scatter, which is related
to the Rx buffer size (shared, since the Rx mempool is shared),
vs MTU. Maybe it is acceptable. We just must define rules for
what should happen if offloads contradict each other.
It should be highlighted in the description, including the
driver callback, to ensure that PMD maintainers are responsible
for consistency checks.

>>
>>>    It is an interesting question what should happen if
>>>    primary device is reconfigured and shared Rx is
>>>    disabled on reconfiguration.
>>
>> I feel better no primary port/queue assumption in configuration, all
>> members are equally treated, each queue can join or quit share group,
>> that's important to support multiple groups.

I agree. The problem with many flexible solutions is the
complexity of supporting them. We'll see how it goes.

>>
>>>
>>> 5. If so, in theory implementation of the Rx burst
>>>    in the secondary could simply call Rx burst on
>>>    primary device.
>>>
>>> Andrew.
>>
> 
> Hi Andrew,
> 
> I realized that we are talking different things, this feature
> introduced 2 RxQ share:
> 1. Share mempool to save memory
> 2. Share polling to save latency
> 
> What you suggested is reuse all RxQ configuration IIUC, maybe we should
> break the flag into 3, so application could learn PMD capability and
> configure accordingly, how do you think?
> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL

Not sure that I understand. Just specify the same mempool
on Rx queue setup. Isn't it sufficient?

> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL

It implies pool sharing if I'm not mistaken. Of course,
we can poll many different HW queues in one polling loop, but it
hardly makes sense to care specially about it.
IMHO, RxQ sharing means sharing the underlying HW Rx queue.

> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL

It is hardly a feature. Rather a possible limitation.

Andrew.

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue
  2021-10-12  8:48         ` Andrew Rybchenko
@ 2021-10-12 10:55           ` Xueming(Steven) Li
  2021-10-12 11:28             ` Andrew Rybchenko
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-12 10:55 UTC (permalink / raw)
  To: andrew.rybchenko, dev
  Cc: jerinjacobk, NBU-Contact-Thomas Monjalon, Lior Margalit,
	Slava Ovsiienko, konstantin.ananyev, ferruh.yigit

On Tue, 2021-10-12 at 11:48 +0300, Andrew Rybchenko wrote:
> On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
> > On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
> > > On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
> > > > Hi Xueming,
> > > > 
> > > > On 9/30/21 5:55 PM, Xueming Li wrote:
> > > > > In current DPDK framework, all RX queues is pre-loaded with mbufs for
> > > > > incoming packets. When number of representors scale out in a switch
> > > > > domain, the memory consumption became significant. Further more,
> > > > > polling all ports leads to high cache miss, high latency and low
> > > > > throughputs.
> > > > >  
> > > > > This patch introduces shared RX queue. PF and representors with same
> > > > > configuration in same switch domain could share RX queue set by
> > > > > specifying shared Rx queue offloading flag and sharing group.
> > > > > 
> > > > > All ports that Shared Rx queue actually shares One Rx queue and only
> > > > > pre-load mbufs to one Rx queue, memory is saved.
> > > > > 
> > > > > Polling any queue using same shared RX queue receives packets from all
> > > > > member ports. Source port is identified by mbuf->port.
> > > > > 
> > > > > Multiple groups is supported by group ID. Port queue number in a shared
> > > > > group should be identical. Queue index is 1:1 mapped in shared group.
> > > > > An example of polling two share groups:
> > > > >   core	group	queue
> > > > >   0	0	0
> > > > >   1	0	1
> > > > >   2	0	2
> > > > >   3	0	3
> > > > >   4	1	0
> > > > >   5	1	1
> > > > >   6	1	2
> > > > >   7	1	3
> > > > > 
> > > > > Shared RX queue must be polled on single thread or core. If both PF0 and
> > > > > representor0 joined same share group, can't poll pf0rxq0 on core1 and
> > > > > rep0rxq0 on core2. Actually, polling one port within share group is
> > > > > sufficient since polling any port in group will return packets for any
> > > > > port in group.
> > > > 
> > > > I apologize that I jump in into the review process that late.
> > > 
> > > Appreciate the bold suggestion, never too late :)
> > > 
> > > > 
> > > > Frankly speaking I doubt that it is the best design to solve
> > > > the problem. Yes, I confirm that the problem exists, but I
> > > > think there is better and simpler way to solve it.
> > > > 
> > > > The problem of the suggested solution is that it puts all
> > > > the headache about consistency to application and PMDs
> > > > without any help from ethdev layer to guarantee the
> > > > consistency. As the result I believe it will be either
> > > > missing/lost consistency checks or huge duplication in
> > > > each PMD which supports the feature. Shared RxQs must be
> > > > equally configured including number of queues, offloads
> > > > (taking device level Rx offloads into account), RSS
> > > > settings etc. So, applications must care about it and
> > > > PMDs (or ethdev layer) must check it.
> > > 
> > > The name might be confusing, here is my understanding:
> > > 1. NIC  shares the buffer supply HW Q between shared RxQs - for memory
> > > 2. PMD polls one shared RxQ - for latency and performance
> > > 3. Most per queue features like offloads and RSS not impacted. That's
> > > why this not mentioned. Some offloading might not being supported due
> > > to PMD or hw limitation, need to add check in PMD case by case.
> > > 4. Multiple group is defined for service level flexibility. For
> > > example, PF and VIP customer's load distributed via queues and dedicate
> > > cores. Low priority customers share one core with one shared queue.
> > > multiple groups enables more combination.
> > > 5. One port could assign queues to different group for polling
> > > flexibility. For example first 4 queues in group 0 and next 4 queues in
> > > group1, each group have other member ports with 4 queues, so the port
> > > with 8 queues could be polled with 8 cores w/o non-shared rxq penalty,
> > > in other words, each core only poll one shared RxQ.
> > > 
> > > > 
> > > > The advantage of the solution is that any device may
> > > > create group and subsequent devices join. Absence of
> > > > primary device is nice. But do we really need it?
> > > > Will the design work if some representors are configured
> > > > to use shared RxQ, but some do not? Theoretically it
> > > > is possible, but could require extra non-trivial code
> > > > on fast path.
> > > 
> > > If multiple groups, any device could be hot-unplugged.
> > > 
> > > Mixed configuration is supported, the only difference is how to set
> > > mbuf->port. Since group is per queue, mixed is better to be supported,
> > > didn't see any difficulty here.
> > > 
> > > PDM could select to support only group 0, same settings for each rxq,
> > > that fits most scenario.
> > > 
> > > > 
> > > > Also looking at the first two patch I don't understand
> > > > how application will find out which devices may share
> > > > RxQs. E.g. if we have two difference NICs which support
> > > > sharing, we can try to setup only one group 0, but
> > > > finally will have two devices (not one) which must be
> > > > polled.
> > > > 
> > > > 1. We need extra flag in dev_info->dev_capa
> > > >    RTE_ETH_DEV_CAPA_RX_SHARE to advertise that
> > > >    the device supports Rx sharing.
> > > 
> > > dev_info->rx_queue_offload_capa could be used here, no?
> 
> It depends. But we definitely need a flag which
> says that below rx_domain makes sense. It could be
> either RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload
> capability.
> 
> The question is if it is really an offload. The offload is
> when something could be done by HW/FW and result is provided
> to SW. May be it is just a nit picking...
> 
> May be we don't need an offload at all. Just have
> RTE_ETH_DEV_CAPA_RXQ_SHARE and use non-zero group ID
> as a flag that an RxQ should be shared (zero - default,
> no sharing). ethdev layer may check consistency on
> its layer to ensure that the device capability is
> reported if non-zero group is specified on queue setup.
> 
> > > 
> > > > 
> > > > 2. I think we need "rx_domain" in device info
> > > >    (which should be treated in boundaries of the
> > > >    switch_domain) if and only if
> > > >    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
> > > >    Otherwise rx_domain value does not make sense.
> > > 
> > > I see, this will give flexibility of different hw, will add it.
> > > 
> > > > 
> > > > (1) and (2) will allow application to find out which
> > > > devices can share Rx.
> > > > 
> > > > 3. Primary device (representors backing device) should
> > > >    advertise shared RxQ offload. Enabling of the offload
> > > >    tells the device to provide packets to all device in
> > > >    the Rx domain with mbuf->port filled in appropriately.
> > > >    Also it allows app to identify primary device in the
> > > >    Rx domain. When application enables the offload, it
> > > >    must ensure that it does not treat used port_id as an
> > > >    input port_id, but always check mbuf->port for each
> > > >    packet.
> > > > 
> > > > 4. A new Rx mode should be introduced for secondary
> > > >    devices. It should not allow to configure RSS, specify
> > > >    any Rx offloads etc. ethdev must ensure it.
> > > >    It is an open question right now if it should require
> > > >    to provide primary port_id. In theory representors
> > > >    have it. However, may be it is nice for consistency
> > > >    to ensure that application knows that it does.
> > > >    If shared Rx mode is specified for device, application
> > > >    does not need to setup RxQs and attempts to do it
> > > >    should be discarded in ethdev.
> > > >    For consistency it is better to ensure that number of
> > > >    queues match.
> > > 
> > > RSS and Rx offloads should be supported as individual, PMD needs to
> > > check if not supported.
> 
> Thinking a bit more about it I agree that RSS settings could
> be individual. Offload could be individual as well, but I'm
> not sure about all offloads. E.g. Rx scatter which is related
> to Rx buffer size (which is shared since Rx mempool is shared)
> vs MTU. May be it is acceptable. We just must define rules
> what should happen if offloads contradict to each other.
> It should be highlighted in the description including
> driver callback to ensure that PMD maintainers are responsible
> for consistency checks.
> 
> > > 
> > > >    It is an interesting question what should happen if
> > > >    primary device is reconfigured and shared Rx is
> > > >    disabled on reconfiguration.
> > > 
> > > I feel better no primary port/queue assumption in configuration, all
> > > members are equally treated, each queue can join or quit share group,
> > > that's important to support multiple groups.
> 
> I agree. The problem of many flexible solutions is
> complexity to support. We'll see how it goes.
> 
> > > 
> > > > 
> > > > 5. If so, in theory implementation of the Rx burst
> > > >    in the secondary could simply call Rx burst on
> > > >    primary device.
> > > > 
> > > > Andrew.
> > > 
> > 
> > Hi Andrew,
> > 
> > I realized that we are talking different things, this feature
> > introduced 2 RxQ share:
> > 1. Share mempool to save memory
> > 2. Share polling to save latency
> > 
> > What you suggested is reuse all RxQ configuration IIUC, maybe we should
> > break the flag into 3, so application could learn PMD capability and
> > configure accordingly, how do you think?
> > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL
> 
> Not sure that I understand. Just specify the same mempool
> on Rx queue setup. Isn't it sufficient?
> 
> > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL
> 
> It implies pool sharing if I'm not mistaken. Of course,
> we can pool many different HW queues in one poll, but it
> hardly makes sense to care specially about it.
> IMHO RxQ sharing is a sharing of the underlying HW Rx queue.
> 
> > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL
> 
> It is hardly a feature. Rather a possible limitation.

Thanks, I'll drop this suggestion then.

Here is the TODO list, let me know if anything is missing:
1. change offload flag to RTE_ETH_DEV_CAPA_RX_SHARE
2. RxQ share group check in ethdev
3. add rx_domain into device info

> 
> Andrew.


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue
  2021-10-12 10:55           ` Xueming(Steven) Li
@ 2021-10-12 11:28             ` Andrew Rybchenko
  2021-10-12 11:33               ` Xueming(Steven) Li
  2021-10-13  7:53               ` Xueming(Steven) Li
  0 siblings, 2 replies; 266+ messages in thread
From: Andrew Rybchenko @ 2021-10-12 11:28 UTC (permalink / raw)
  To: Xueming(Steven) Li, dev
  Cc: jerinjacobk, NBU-Contact-Thomas Monjalon, Lior Margalit,
	Slava Ovsiienko, konstantin.ananyev, ferruh.yigit

On 10/12/21 1:55 PM, Xueming(Steven) Li wrote:
> On Tue, 2021-10-12 at 11:48 +0300, Andrew Rybchenko wrote:
>> On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
>>> On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
>>>> On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
>>>>> Hi Xueming,
>>>>>
>>>>> On 9/30/21 5:55 PM, Xueming Li wrote:
>>>>>> In current DPDK framework, all RX queues is pre-loaded with mbufs for
>>>>>> incoming packets. When number of representors scale out in a switch
>>>>>> domain, the memory consumption became significant. Further more,
>>>>>> polling all ports leads to high cache miss, high latency and low
>>>>>> throughputs.
>>>>>>  
>>>>>> This patch introduces shared RX queue. PF and representors with same
>>>>>> configuration in same switch domain could share RX queue set by
>>>>>> specifying shared Rx queue offloading flag and sharing group.
>>>>>>
>>>>>> All ports that Shared Rx queue actually shares One Rx queue and only
>>>>>> pre-load mbufs to one Rx queue, memory is saved.
>>>>>>
>>>>>> Polling any queue using same shared RX queue receives packets from all
>>>>>> member ports. Source port is identified by mbuf->port.
>>>>>>
>>>>>> Multiple groups is supported by group ID. Port queue number in a shared
>>>>>> group should be identical. Queue index is 1:1 mapped in shared group.
>>>>>> An example of polling two share groups:
>>>>>>   core	group	queue
>>>>>>   0	0	0
>>>>>>   1	0	1
>>>>>>   2	0	2
>>>>>>   3	0	3
>>>>>>   4	1	0
>>>>>>   5	1	1
>>>>>>   6	1	2
>>>>>>   7	1	3
>>>>>>
>>>>>> Shared RX queue must be polled on single thread or core. If both PF0 and
>>>>>> representor0 joined same share group, can't poll pf0rxq0 on core1 and
>>>>>> rep0rxq0 on core2. Actually, polling one port within share group is
>>>>>> sufficient since polling any port in group will return packets for any
>>>>>> port in group.
>>>>>
>>>>> I apologize that I jump in into the review process that late.
>>>>
>>>> Appreciate the bold suggestion, never too late :)
>>>>
>>>>>
>>>>> Frankly speaking I doubt that it is the best design to solve
>>>>> the problem. Yes, I confirm that the problem exists, but I
>>>>> think there is better and simpler way to solve it.
>>>>>
>>>>> The problem of the suggested solution is that it puts all
>>>>> the headache about consistency to application and PMDs
>>>>> without any help from ethdev layer to guarantee the
>>>>> consistency. As the result I believe it will be either
>>>>> missing/lost consistency checks or huge duplication in
>>>>> each PMD which supports the feature. Shared RxQs must be
>>>>> equally configured including number of queues, offloads
>>>>> (taking device level Rx offloads into account), RSS
>>>>> settings etc. So, applications must care about it and
>>>>> PMDs (or ethdev layer) must check it.
>>>>
>>>> The name might be confusing, here is my understanding:
>>>> 1. NIC  shares the buffer supply HW Q between shared RxQs - for memory
>>>> 2. PMD polls one shared RxQ - for latency and performance
>>>> 3. Most per queue features like offloads and RSS not impacted. That's
>>>> why this not mentioned. Some offloading might not being supported due
>>>> to PMD or hw limitation, need to add check in PMD case by case.
>>>> 4. Multiple group is defined for service level flexibility. For
>>>> example, PF and VIP customer's load distributed via queues and dedicate
>>>> cores. Low priority customers share one core with one shared queue.
>>>> multiple groups enables more combination.
>>>> 5. One port could assign queues to different group for polling
>>>> flexibility. For example first 4 queues in group 0 and next 4 queues in
>>>> group1, each group have other member ports with 4 queues, so the port
>>>> with 8 queues could be polled with 8 cores w/o non-shared rxq penalty,
>>>> in other words, each core only poll one shared RxQ.
>>>>
>>>>>
>>>>> The advantage of the solution is that any device may
>>>>> create group and subsequent devices join. Absence of
>>>>> primary device is nice. But do we really need it?
>>>>> Will the design work if some representors are configured
>>>>> to use shared RxQ, but some do not? Theoretically it
>>>>> is possible, but could require extra non-trivial code
>>>>> on fast path.
>>>>
>>>> If multiple groups, any device could be hot-unplugged.
>>>>
>>>> Mixed configuration is supported, the only difference is how to set
>>>> mbuf->port. Since group is per queue, mixed is better to be supported,
>>>> didn't see any difficulty here.
>>>>
>>>> PDM could select to support only group 0, same settings for each rxq,
>>>> that fits most scenario.
>>>>
>>>>>
>>>>> Also looking at the first two patch I don't understand
>>>>> how application will find out which devices may share
>>>>> RxQs. E.g. if we have two difference NICs which support
>>>>> sharing, we can try to setup only one group 0, but
>>>>> finally will have two devices (not one) which must be
>>>>> polled.
>>>>>
>>>>> 1. We need extra flag in dev_info->dev_capa
>>>>>    RTE_ETH_DEV_CAPA_RX_SHARE to advertise that
>>>>>    the device supports Rx sharing.
>>>>
>>>> dev_info->rx_queue_offload_capa could be used here, no?
>>
>> It depends. But we definitely need a flag which
>> says that below rx_domain makes sense. It could be
>> either RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload
>> capability.
>>
>> The question is if it is really an offload. The offload is
>> when something could be done by HW/FW and result is provided
>> to SW. May be it is just a nit picking...
>>
>> May be we don't need an offload at all. Just have
>> RTE_ETH_DEV_CAPA_RXQ_SHARE and use non-zero group ID
>> as a flag that an RxQ should be shared (zero - default,
>> no sharing). ethdev layer may check consistency on
>> its layer to ensure that the device capability is
>> reported if non-zero group is specified on queue setup.
>>
>>>>
>>>>>
>>>>> 2. I think we need "rx_domain" in device info
>>>>>    (which should be treated in boundaries of the
>>>>>    switch_domain) if and only if
>>>>>    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
>>>>>    Otherwise rx_domain value does not make sense.
>>>>
>>>> I see, this will give flexibility of different hw, will add it.
>>>>
>>>>>
>>>>> (1) and (2) will allow application to find out which
>>>>> devices can share Rx.
>>>>>
>>>>> 3. Primary device (representors backing device) should
>>>>>    advertise shared RxQ offload. Enabling of the offload
>>>>>    tells the device to provide packets to all device in
>>>>>    the Rx domain with mbuf->port filled in appropriately.
>>>>>    Also it allows app to identify primary device in the
>>>>>    Rx domain. When application enables the offload, it
>>>>>    must ensure that it does not treat used port_id as an
>>>>>    input port_id, but always check mbuf->port for each
>>>>>    packet.
>>>>>
>>>>> 4. A new Rx mode should be introduced for secondary
>>>>>    devices. It should not allow to configure RSS, specify
>>>>>    any Rx offloads etc. ethdev must ensure it.
>>>>>    It is an open question right now if it should require
>>>>>    to provide primary port_id. In theory representors
>>>>>    have it. However, may be it is nice for consistency
>>>>>    to ensure that application knows that it does.
>>>>>    If shared Rx mode is specified for device, application
>>>>>    does not need to setup RxQs and attempts to do it
>>>>>    should be discarded in ethdev.
>>>>>    For consistency it is better to ensure that number of
>>>>>    queues match.
>>>>
>>>> RSS and Rx offloads should be supported as individual, PMD needs to
>>>> check if not supported.
>>
>> Thinking a bit more about it I agree that RSS settings could
>> be individual. Offload could be individual as well, but I'm
>> not sure about all offloads. E.g. Rx scatter which is related
>> to Rx buffer size (which is shared since Rx mempool is shared)
>> vs MTU. May be it is acceptable. We just must define rules
>> what should happen if offloads contradict to each other.
>> It should be highlighted in the description including
>> driver callback to ensure that PMD maintainers are responsible
>> for consistency checks.
>>
>>>>
>>>>>    It is an interesting question what should happen if
>>>>>    primary device is reconfigured and shared Rx is
>>>>>    disabled on reconfiguration.
>>>>
>>>> I feel better no primary port/queue assumption in configuration, all
>>>> members are equally treated, each queue can join or quit share group,
>>>> that's important to support multiple groups.
>>
>> I agree. The problem of many flexible solutions is
>> complexity to support. We'll see how it goes.
>>
>>>>
>>>>>
>>>>> 5. If so, in theory implementation of the Rx burst
>>>>>    in the secondary could simply call Rx burst on
>>>>>    primary device.
>>>>>
>>>>> Andrew.
>>>>
>>>
>>> Hi Andrew,
>>>
>>> I realized that we are talking different things, this feature
>>> introduced 2 RxQ share:
>>> 1. Share mempool to save memory
>>> 2. Share polling to save latency
>>>
>>> What you suggested is reuse all RxQ configuration IIUC, maybe we should
>>> break the flag into 3, so application could learn PMD capability and
>>> configure accordingly, how do you think?
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL
>>
>> Not sure that I understand. Just specify the same mempool
>> on Rx queue setup. Isn't it sufficient?
>>
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL
>>
>> It implies pool sharing if I'm not mistaken. Of course,
>> we can pool many different HW queues in one poll, but it
>> hardly makes sense to care specially about it.
>> IMHO RxQ sharing is a sharing of the underlying HW Rx queue.
>>
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL
>>
>> It is hardly a feature. Rather a possible limitation.
> 
> Thanks, then I'd drop this suggestion then.
> 
> Here is the TODO list, let me know if anything missing:
> 1. change offload flag to RTE_ETH_DEV_CAPA_RX_SHARE

RTE_ETH_DEV_CAPA_RXQ_SHARE, since it is not sharing of the
entire Rx path, but just some queues.

> 2. RxQ share group check in ethdev
> 3. add rx_domain into device info
> 
>>
>> Andrew.
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue
  2021-10-12 11:28             ` Andrew Rybchenko
@ 2021-10-12 11:33               ` Xueming(Steven) Li
  2021-10-13  7:53               ` Xueming(Steven) Li
  1 sibling, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-12 11:33 UTC (permalink / raw)
  To: andrew.rybchenko, dev
  Cc: jerinjacobk, NBU-Contact-Thomas Monjalon, Lior Margalit,
	Slava Ovsiienko, konstantin.ananyev, ferruh.yigit

On Tue, 2021-10-12 at 14:28 +0300, Andrew Rybchenko wrote:
> On 10/12/21 1:55 PM, Xueming(Steven) Li wrote:
> > On Tue, 2021-10-12 at 11:48 +0300, Andrew Rybchenko wrote:
> > > On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
> > > > On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
> > > > > On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
> > > > > > Hi Xueming,
> > > > > > 
> > > > > > On 9/30/21 5:55 PM, Xueming Li wrote:
> > > > > > > In current DPDK framework, all RX queues is pre-loaded with mbufs for
> > > > > > > incoming packets. When number of representors scale out in a switch
> > > > > > > domain, the memory consumption became significant. Further more,
> > > > > > > polling all ports leads to high cache miss, high latency and low
> > > > > > > throughputs.
> > > > > > >  
> > > > > > > This patch introduces shared RX queue. PF and representors with same
> > > > > > > configuration in same switch domain could share RX queue set by
> > > > > > > specifying shared Rx queue offloading flag and sharing group.
> > > > > > > 
> > > > > > > All ports that Shared Rx queue actually shares One Rx queue and only
> > > > > > > pre-load mbufs to one Rx queue, memory is saved.
> > > > > > > 
> > > > > > > Polling any queue using same shared RX queue receives packets from all
> > > > > > > member ports. Source port is identified by mbuf->port.
> > > > > > > 
> > > > > > > Multiple groups is supported by group ID. Port queue number in a shared
> > > > > > > group should be identical. Queue index is 1:1 mapped in shared group.
> > > > > > > An example of polling two share groups:
> > > > > > >   core	group	queue
> > > > > > >   0	0	0
> > > > > > >   1	0	1
> > > > > > >   2	0	2
> > > > > > >   3	0	3
> > > > > > >   4	1	0
> > > > > > >   5	1	1
> > > > > > >   6	1	2
> > > > > > >   7	1	3
> > > > > > > 
> > > > > > > Shared RX queue must be polled on single thread or core. If both PF0 and
> > > > > > > representor0 joined same share group, can't poll pf0rxq0 on core1 and
> > > > > > > rep0rxq0 on core2. Actually, polling one port within share group is
> > > > > > > sufficient since polling any port in group will return packets for any
> > > > > > > port in group.
> > > > > > 
> > > > > > I apologize that I jump in into the review process that late.
> > > > > 
> > > > > Appreciate the bold suggestion, never too late :)
> > > > > 
> > > > > > 
> > > > > > Frankly speaking I doubt that it is the best design to solve
> > > > > > the problem. Yes, I confirm that the problem exists, but I
> > > > > > think there is better and simpler way to solve it.
> > > > > > 
> > > > > > The problem of the suggested solution is that it puts all
> > > > > > the headache about consistency to application and PMDs
> > > > > > without any help from ethdev layer to guarantee the
> > > > > > consistency. As the result I believe it will be either
> > > > > > missing/lost consistency checks or huge duplication in
> > > > > > each PMD which supports the feature. Shared RxQs must be
> > > > > > equally configured including number of queues, offloads
> > > > > > (taking device level Rx offloads into account), RSS
> > > > > > settings etc. So, applications must care about it and
> > > > > > PMDs (or ethdev layer) must check it.
> > > > > 
> > > > > The name might be confusing, here is my understanding:
> > > > > 1. NIC  shares the buffer supply HW Q between shared RxQs - for memory
> > > > > 2. PMD polls one shared RxQ - for latency and performance
> > > > > 3. Most per queue features like offloads and RSS not impacted. That's
> > > > > why this not mentioned. Some offloading might not being supported due
> > > > > to PMD or hw limitation, need to add check in PMD case by case.
> > > > > 4. Multiple group is defined for service level flexibility. For
> > > > > example, PF and VIP customer's load distributed via queues and dedicate
> > > > > cores. Low priority customers share one core with one shared queue.
> > > > > multiple groups enables more combination.
> > > > > 5. One port could assign queues to different group for polling
> > > > > flexibility. For example first 4 queues in group 0 and next 4 queues in
> > > > > group1, each group have other member ports with 4 queues, so the port
> > > > > with 8 queues could be polled with 8 cores w/o non-shared rxq penalty,
> > > > > in other words, each core only poll one shared RxQ.
> > > > > 
> > > > > > 
> > > > > > The advantage of the solution is that any device may
> > > > > > create group and subsequent devices join. Absence of
> > > > > > primary device is nice. But do we really need it?
> > > > > > Will the design work if some representors are configured
> > > > > > to use shared RxQ, but some do not? Theoretically it
> > > > > > is possible, but could require extra non-trivial code
> > > > > > on fast path.
> > > > > 
> > > > > If multiple groups, any device could be hot-unplugged.
> > > > > 
> > > > > Mixed configuration is supported, the only difference is how to set
> > > > > mbuf->port. Since group is per queue, mixed is better to be supported,
> > > > > didn't see any difficulty here.
> > > > > 
> > > > > PDM could select to support only group 0, same settings for each rxq,
> > > > > that fits most scenario.
> > > > > 
> > > > > > 
> > > > > > Also looking at the first two patch I don't understand
> > > > > > how application will find out which devices may share
> > > > > > RxQs. E.g. if we have two difference NICs which support
> > > > > > sharing, we can try to setup only one group 0, but
> > > > > > finally will have two devices (not one) which must be
> > > > > > polled.
> > > > > > 
> > > > > > 1. We need extra flag in dev_info->dev_capa
> > > > > >    RTE_ETH_DEV_CAPA_RX_SHARE to advertise that
> > > > > >    the device supports Rx sharing.
> > > > > 
> > > > > dev_info->rx_queue_offload_capa could be used here, no?
> > > 
> > > It depends. But we definitely need a flag which
> > > says that below rx_domain makes sense. It could be
> > > either RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload
> > > capability.
> > > 
> > > The question is if it is really an offload. The offload is
> > > when something could be done by HW/FW and result is provided
> > > to SW. May be it is just a nit picking...
> > > 
> > > May be we don't need an offload at all. Just have
> > > RTE_ETH_DEV_CAPA_RXQ_SHARE and use non-zero group ID
> > > as a flag that an RxQ should be shared (zero - default,
> > > no sharing). ethdev layer may check consistency on
> > > its layer to ensure that the device capability is
> > > reported if non-zero group is specified on queue setup.
> > > 
> > > > > 
> > > > > > 
> > > > > > 2. I think we need "rx_domain" in device info
> > > > > >    (which should be treated in boundaries of the
> > > > > >    switch_domain) if and only if
> > > > > >    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
> > > > > >    Otherwise rx_domain value does not make sense.
> > > > > 
> > > > > I see, this will give flexibility of different hw, will add it.
> > > > > 
> > > > > > 
> > > > > > (1) and (2) will allow application to find out which
> > > > > > devices can share Rx.
> > > > > > 
> > > > > > 3. Primary device (representors backing device) should
> > > > > >    advertise shared RxQ offload. Enabling of the offload
> > > > > >    tells the device to provide packets to all device in
> > > > > >    the Rx domain with mbuf->port filled in appropriately.
> > > > > >    Also it allows app to identify primary device in the
> > > > > >    Rx domain. When application enables the offload, it
> > > > > >    must ensure that it does not treat used port_id as an
> > > > > >    input port_id, but always check mbuf->port for each
> > > > > >    packet.
> > > > > > 
> > > > > > 4. A new Rx mode should be introduced for secondary
> > > > > >    devices. It should not allow to configure RSS, specify
> > > > > >    any Rx offloads etc. ethdev must ensure it.
> > > > > >    It is an open question right now if it should require
> > > > > >    to provide primary port_id. In theory representors
> > > > > >    have it. However, may be it is nice for consistency
> > > > > >    to ensure that application knows that it does.
> > > > > >    If shared Rx mode is specified for device, application
> > > > > >    does not need to setup RxQs and attempts to do it
> > > > > >    should be discarded in ethdev.
> > > > > >    For consistency it is better to ensure that number of
> > > > > >    queues match.
> > > > > 
> > > > > RSS and Rx offloads should be supported as individual, PMD needs to
> > > > > check if not supported.
> > > 
> > > Thinking a bit more about it I agree that RSS settings could
> > > be individual. Offload could be individual as well, but I'm
> > > not sure about all offloads. E.g. Rx scatter which is related
> > > to Rx buffer size (which is shared since Rx mempool is shared)
> > > vs MTU. May be it is acceptable. We just must define rules
> > > what should happen if offloads contradict to each other.
> > > It should be highlighted in the description including
> > > driver callback to ensure that PMD maintainers are responsible
> > > for consistency checks.
> > > 
> > > > > 
> > > > > >    It is an interesting question what should happen if
> > > > > >    primary device is reconfigured and shared Rx is
> > > > > >    disabled on reconfiguration.
> > > > > 
> > > > > I feel better no primary port/queue assumption in configuration, all
> > > > > members are equally treated, each queue can join or quit share group,
> > > > > that's important to support multiple groups.
> > > 
> > > I agree. The problem of many flexible solutions is
> > > complexity to support. We'll see how it goes.
> > > 
> > > > > 
> > > > > > 
> > > > > > 5. If so, in theory implementation of the Rx burst
> > > > > >    in the secondary could simply call Rx burst on
> > > > > >    primary device.
> > > > > > 
> > > > > > Andrew.
> > > > > 
> > > > 
> > > > Hi Andrew,
> > > > 
> > > > I realized that we are talking different things, this feature
> > > > introduced 2 RxQ share:
> > > > 1. Share mempool to save memory
> > > > 2. Share polling to save latency
> > > > 
> > > > What you suggested is reuse all RxQ configuration IIUC, maybe we should
> > > > break the flag into 3, so application could learn PMD capability and
> > > > configure accordingly, how do you think?
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL
> > > 
> > > Not sure that I understand. Just specify the same mempool
> > > on Rx queue setup. Isn't it sufficient?
> > > 
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL
> > > 
> > > It implies pool sharing if I'm not mistaken. Of course,
> > > we can pool many different HW queues in one poll, but it
> > > hardly makes sense to care specially about it.
> > > IMHO RxQ sharing is a sharing of the underlying HW Rx queue.
> > > 
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL
> > > 
> > > It is hardly a feature. Rather a possible limitation.
> > 
> > Thanks, then I'd drop this suggestion then.
> > 
> > Here is the TODO list, let me know if anything missing:
> > 1. change offload flag to RTE_ETH_DEV_CAPA_RX_SHARE
> 
> RTE_ETH_DEV_CAPA_RXQ_SHARE since it is not sharing of
> entire Rx, but just some queues.

OK

> 
> > 2. RxQ share group check in ethdev
> > 3. add rx_domain into device info

Seems rte_eth_switch_info is a better place for rx_domain.
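
Just to make the discovery flow concrete, a rough sketch (assuming
rx_domain ends up in rte_eth_switch_info as discussed; a hypothetical
helper, names may still change):

/* Can ports "base" and "peer" share Rx queues? */
static int
can_share_rxq(uint16_t base, uint16_t peer)
{
	struct rte_eth_dev_info bi, pi;

	if (rte_eth_dev_info_get(base, &bi) != 0 ||
	    rte_eth_dev_info_get(peer, &pi) != 0)
		return 0;
	if ((bi.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0 ||
	    (pi.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0)
		return 0;
	return bi.switch_info.domain_id == pi.switch_info.domain_id &&
	       bi.switch_info.rx_domain == pi.switch_info.rx_domain;
}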

> > 
> > > 
> > > Andrew.
> > 
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v6 0/5] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (5 preceding siblings ...)
  2021-10-11 12:37 ` [dpdk-dev] [PATCH v5 0/5] " Xueming Li
@ 2021-10-12 14:39 ` Xueming Li
  2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 1/5] " Xueming Li
                     ` (4 more replies)
  2021-10-16  8:42 ` [dpdk-dev] [PATCH v7 0/5] ethdev: introduce " Xueming Li
                   ` (9 subsequent siblings)
  16 siblings, 5 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-12 14:39 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin

In the current DPDK framework, all Rx queues are pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces shared Rx queues. A PF and representors in the
same Rx domain and switch domain can share an Rx queue set by
specifying a non-zero share group value in the Rx queue configuration.

All ports that share an Rx queue actually share one hardware descriptor
queue and feed all Rx queues from a single descriptor supply, so memory
is saved.

Polling any queue that uses the same shared Rx queue receives packets
from all member ports. The source port is identified by mbuf->port.

Multiple groups are supported by group ID. The number of queues per
port in a shared group should be identical, and queue indexes are 1:1
mapped within a shared group. An example of two share groups:
 Group1, 4 shared Rx queues per member port: PF, repr0, repr1
 Group2, 2 shared Rx queues per member port: repr2, repr3, ... repr127
 Poll first port for each group:
  core	port	queue
  0	0	0
  1	0	1
  2	0	2
  3	0	3
  4	2	0
  5	2	1

Shared Rx queue must be polled on single thread or core. If both PF0 and
representor0 joined same share group, can't poll pf0rxq0 on core1 and
rep0rxq0 on core2. Actually, polling one port within share group is
sufficient since polling any port in group will return packets for any
port in group.
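
As an illustration only (not part of this series), a minimal polling
sketch on one member port; member_port_id, queue_id and handle_pkt()
are placeholders:

	struct rte_mbuf *pkts[32];
	uint16_t i, nb;

	/* Poll one member port; packets of all member ports arrive here. */
	nb = rte_eth_rx_burst(member_port_id, queue_id, pkts, 32);
	for (i = 0; i < nb; i++) {
		/* mbuf->port is the real source port, not member_port_id. */
		handle_pkt(pkts[i]->port, pkts[i]);
	}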

There was some discussion about aggregating member ports in the same
group into a dummy port, with several ways to achieve it. Since it is
optional, more user feedback and requirements should be collected to
make a better decision later.

v1:
  - initial version
v2:
  - add testpmd patches
v3:
  - change common forwarding API to macro for performance, thanks Jerin.
  - save global variables accessed in forwarding into flowstream to
    minimize cache misses
  - combined patches for each forwarding engine
  - support multiple groups in testpmd "--share-rxq" parameter
  - new api to aggregate shared rxq group
v4:
  - spelling fixes
  - remove shared-rxq support for all forwarding engines
  - add dedicate shared-rxq forwarding engine
v5:
 - fix grammar
 - remove aggregate api and leave it for later discussion
 - add release notes
 - add deployment example
v6:
 - replace RxQ offload flag with device offload capability flag
 - add Rx domain
 - RxQ is shared when share group > 0
 - update testpmd accordingly

Xueming Li (5):
  ethdev: introduce shared Rx queue
  app/testpmd: new parameter to enable shared Rx queue
  app/testpmd: dump port info for shared Rx queue
  app/testpmd: force shared Rx queue polled on same core
  app/testpmd: add forwarding engine for shared Rx queue

 app/test-pmd/config.c                         | 106 ++++++++++++-
 app/test-pmd/meson.build                      |   1 +
 app/test-pmd/parameters.c                     |  13 ++
 app/test-pmd/shared_rxq_fwd.c                 | 148 ++++++++++++++++++
 app/test-pmd/testpmd.c                        |  17 +-
 app/test-pmd/testpmd.h                        |   5 +
 app/test-pmd/util.c                           |   3 +
 doc/guides/nics/features.rst                  |  13 ++
 doc/guides/nics/features/default.ini          |   1 +
 .../prog_guide/switch_representation.rst      |  10 ++
 doc/guides/rel_notes/release_21_11.rst        |   5 +
 doc/guides/testpmd_app_ug/run_app.rst         |   8 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
 lib/ethdev/rte_ethdev.c                       |   9 ++
 lib/ethdev/rte_ethdev.h                       |  21 +++
 15 files changed, 362 insertions(+), 3 deletions(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v6 1/5] ethdev: introduce shared Rx queue
  2021-10-12 14:39 ` [dpdk-dev] [PATCH v6 0/5] ethdev: introduce " Xueming Li
@ 2021-10-12 14:39   ` Xueming Li
  2021-10-15  9:28     ` Andrew Rybchenko
  2021-10-15 17:20     ` Ferruh Yigit
  2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 2/5] app/testpmd: new parameter to enable " Xueming Li
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-12 14:39 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin

In the current DPDK framework, each Rx queue is pre-loaded with mbufs
to store incoming packets. For some PMDs, when the number of
representors scales out in a switch domain, the memory consumption
becomes significant. Polling all ports also leads to high cache miss
rates, high latency and low throughput.

This patch introduces shared Rx queues. Ports in the same Rx domain
and switch domain can share an Rx queue set by specifying a non-zero
share group in the Rx queue configuration.

No special API is defined to receive packets from a shared Rx queue.
Polling any member port of a shared Rx queue receives packets of that
queue for all member ports; the source port is identified by mbuf->port.

A shared Rx queue must be polled in the same thread or on the same
core; polling the queue ID of any member port is essentially the same.

Multiple share groups are supported via non-zero share group IDs. A
device should support mixed configuration by allowing multiple share
groups alongside non-shared Rx queues.

Even when an Rx queue is shared, queue configuration such as offloads
and RSS should not be impacted.

Example grouping and polling model to reflect service priority:
 Group1, 2 shared Rx queues per port: PF, rep0, rep1
 Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
 Core0: poll PF queue0
 Core1: poll PF queue1
 Core2: poll rep2 queue0

The PMD advertises the shared Rx queue capability via
RTE_ETH_DEV_CAPA_RXQ_SHARE.
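
For illustration, a minimal application-side setup sketch; port_id,
nb_desc and mp (mempool) are placeholders:

	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;

	if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
		return -1;
	rxconf = dev_info.default_rxconf;
	if (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
		rxconf.share_group = 1; /* non-zero value joins group 1 */
	if (rte_eth_rx_queue_setup(port_id, 0, nb_desc,
				   rte_eth_dev_socket_id(port_id),
				   &rxconf, mp) != 0)
		return -1;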

The PMD is responsible for shared Rx queue consistency checks, to
avoid member ports' configurations contradicting each other.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 doc/guides/nics/features.rst                  | 13 ++++++++++++
 doc/guides/nics/features/default.ini          |  1 +
 .../prog_guide/switch_representation.rst      | 10 +++++++++
 doc/guides/rel_notes/release_21_11.rst        |  5 +++++
 lib/ethdev/rte_ethdev.c                       |  9 ++++++++
 lib/ethdev/rte_ethdev.h                       | 21 +++++++++++++++++++
 6 files changed, 59 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index e346018e4b8..b64433b8ea5 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queues for ports in the same Rx domain of a switch domain.
+
+* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
+* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
+* **[uses]     rte_eth_rxconf**: ``share_group``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index d473b94091a..93f5d1b46f4 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c806..de41db8385d 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- For some PMDs, the memory usage of representors is huge when the number
+  of representors grows, since mbufs are allocated for each descriptor of
+  every Rx queue. Polling a large number of ports brings more CPU load,
+  cache misses and latency. A shared Rx queue can be used to share an Rx
+  queue between the PF and representors within the same Rx domain.
+  ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` in the device info capabilities indicates
+  support. Setting a non-zero share group in the Rx queue configuration
+  enables sharing. Polling any member port receives packets of all member
+  ports in the group; the source port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 5036641842c..d72fc97f4fb 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -141,6 +141,11 @@ New Features
   * Added tests to validate packets hard expiry.
   * Added tests to verify tunnel header verification in IPsec inbound.
 
+* **Added ethdev shared Rx queue support.**
+
+  * Added new device capability flag and rx domain field to switch info.
+  * Added share group to Rx queue configuration.
+  * Added testpmd support and a dedicated forwarding engine.
 
 Removed Items
 -------------
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 028907bc4b9..9b1b66370a7 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -2159,6 +2159,15 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
 		return -EINVAL;
 	}
 
+	if (local_conf.share_group > 0 &&
+	    (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
+		RTE_ETHDEV_LOG(ERR,
+			"Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%u while device doesn't support Rx queue share in %s()\n",
+			port_id, rx_queue_id, local_conf.share_group,
+			__func__);
+		return -EINVAL;
+	}
+
 	/*
 	 * If LRO is enabled, check that the maximum aggregated packet
 	 * size is supported by the configured device.
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 6d80514ba7a..041da6ee52f 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1044,6 +1044,13 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	/**
+	 * Share group index in Rx domain and switch domain.
+	 * A non-zero value enables Rx queue sharing; zero disables it.
+	 * The PMD is responsible for Rx queue consistency checks to avoid
+	 * member ports' configurations contradicting each other.
+	 */
+	uint32_t share_group;
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1445,6 +1452,14 @@ struct rte_eth_conf {
 #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
 /** Device supports Tx queue setup after device started. */
 #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
+/**
+ * Device supports shared Rx queues among ports within an Rx domain and
+ * switch domain. Mbufs are consumed by the shared Rx queue instead of
+ * by every port. Multiple groups are supported via the share_group field
+ * of the Rx queue configuration. Polling any port in the group receives
+ * packets of all member ports; the source port is identified by mbuf->port.
+ */
+#define RTE_ETH_DEV_CAPA_RXQ_SHARE              0x00000004
 /**@}*/
 
 /*
@@ -1488,6 +1503,12 @@ struct rte_eth_switch_info {
 	 * but each driver should explicitly define the mapping of switch
 	 * port identifier to that physical interconnect/switch
 	 */
+	uint16_t rx_domain;
+	/**<
+	 * Shared Rx queue sub-domain boundary. Only ports in the same Rx
+	 * domain and switch domain can share an Rx queue. Valid only if the
+	 * device advertises the RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
+	 */
 };
 
 /**
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v6 2/5] app/testpmd: new parameter to enable shared Rx queue
  2021-10-12 14:39 ` [dpdk-dev] [PATCH v6 0/5] ethdev: introduce " Xueming Li
  2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 1/5] " Xueming Li
@ 2021-10-12 14:39   ` Xueming Li
  2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 3/5] app/testpmd: dump port info for " Xueming Li
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-12 14:39 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

Adds a "--rxq-share=X" parameter to enable shared RxQ: queues are
shared if the device supports it, otherwise fall back to standard RxQ.

The share group number grows every X ports. X defaults to MAX, which
implies that all ports join share group 1.

The forwarding engine "shared-rxq" should be used; it is Rx-only and
updates stream statistics correctly.
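
An illustrative invocation (the PCI address and representor range are
placeholders):

  dpdk-testpmd -a <pf_pci>,representor=[0-1] -- -i --rxq=4 --txq=4 \
      --rxq-share=2 --forward-mode=shared-rxq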

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c                 |  6 +++++-
 app/test-pmd/parameters.c             | 13 +++++++++++++
 app/test-pmd/testpmd.c                | 12 ++++++++++++
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  7 +++++++
 5 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 9c66329e96e..96fc2ab888b 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2709,7 +2709,11 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->share_group > 0)
+				printf(" share group=%u",
+				       rx_conf->share_group);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 3f94a82e321..30dae326310 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -167,6 +167,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share: number of ports per shared Rx queue group, defaults to MAX (1 group)\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -607,6 +608,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			2, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1271,6 +1273,17 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
+				if (optarg == NULL) {
+					rxq_share = UINT32_MAX;
+				} else {
+					n = atoi(optarg);
+					if (n >= 0)
+						rxq_share = (uint32_t)n;
+					else
+						rte_exit(EXIT_FAILURE, "rxq-share must be >= 0\n");
+				}
+			}
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 97ae52e17ec..9c26301d397 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -498,6 +498,11 @@ uint8_t record_core_cycles;
  */
 uint8_t record_burst_stats;
 
+/*
+ * Number of ports per shared Rx queue group, 0 to disable.
+ */
+uint32_t rxq_share;
+
 unsigned int num_sockets = 0;
 unsigned int socket_ids[RTE_MAX_NUMA_NODES];
 
@@ -3401,6 +3406,13 @@ rxtx_port_config(struct rte_port *port)
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE))
+			/* Non-zero share group to enable RxQ share. */
+			port->rx_conf[qid].share_group = nb_ports / rxq_share
+							 + 1;
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 5863b2f43f3..3dfaaad94c0 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -477,6 +477,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint32_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern int nb_flows_flowgen;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 640eadeff73..ff5908dcd50 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -389,6 +389,13 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share=[X]``
+
+    Create queues in shared Rx queue mode if the device supports it.
+    The group number grows every X ports. X defaults to MAX, which implies
+    that all ports join share group 1. The forwarding engine "shared-rxq"
+    should be used; it is Rx-only and updates stream statistics correctly.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v6 3/5] app/testpmd: dump port info for shared Rx queue
  2021-10-12 14:39 ` [dpdk-dev] [PATCH v6 0/5] ethdev: introduce " Xueming Li
  2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 1/5] " Xueming Li
  2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 2/5] app/testpmd: new parameter to enable " Xueming Li
@ 2021-10-12 14:39   ` Xueming Li
  2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 4/5] app/testpmd: force shared Rx queue polled on same core Xueming Li
  2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 5/5] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  4 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-12 14:39 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

In the case of a shared Rx queue, polling any member port returns
mbufs for all members. This patch dumps mbuf->port for each packet.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/util.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 51506e49404..e98f136d5ed 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -100,6 +100,9 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_flow_restore_info info = { 0, };
 
 		mb = pkts[i];
+		if (rxq_share > 0)
+			MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ",
+				  mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v6 4/5] app/testpmd: force shared Rx queue polled on same core
  2021-10-12 14:39 ` [dpdk-dev] [PATCH v6 0/5] ethdev: introduce " Xueming Li
                     ` (2 preceding siblings ...)
  2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 3/5] app/testpmd: dump port info for " Xueming Li
@ 2021-10-12 14:39   ` Xueming Li
  2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 5/5] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  4 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-12 14:39 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

A shared Rx queue must be polled on the same core. This patch checks
and stops forwarding if a shared RxQ is scheduled on multiple cores.

It's suggested to use the same number of Rx queues and polling cores.
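
For example (illustrative), with 4 shared Rx queues per port, limit the
forwarding cores so that each core polls exactly one shared queue:

  dpdk-testpmd ... -- -i --rxq=4 --txq=4 --rxq-share --nb-cores=4 \
      --forward-mode=shared-rxq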

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c  | 100 +++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |   4 +-
 app/test-pmd/testpmd.h |   2 +
 3 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 96fc2ab888b..9acd2705f18 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2885,6 +2885,106 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared rxq is scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, portid_t src_port,
+			   queueid_t src_rxq, lcoreid_t src_lc,
+			   uint32_t share_group)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (fs->rx_queue != src_rxq)
+				continue;
+			if (rxq_conf->share_group != share_group)
+				continue;
+			printf("Shared Rx queue group %u can't be scheduled on different cores:\n",
+			       share_group);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("  please use --nb-cores=%hu to limit forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * A shared group must not be scheduled on different cores.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Update lcore info of the stream being scheduled. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check shared rxq not scheduled on remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, fs->rx_port,
+						       fs->rx_queue, lc_id,
+						       rxq_conf->share_group))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 9c26301d397..df301e2e683 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2236,10 +2236,12 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 3dfaaad94c0..f121a2da90c 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -144,6 +144,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -795,6 +796,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v6 5/5] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-12 14:39 ` [dpdk-dev] [PATCH v6 0/5] ethdev: introduce " Xueming Li
                     ` (3 preceding siblings ...)
  2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 4/5] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-10-12 14:39   ` Xueming Li
  4 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-12 14:39 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

To support shared Rx queues, this patch introduces a dedicated
forwarding engine. The engine groups received packets by mbuf->port
into sub-bursts, updates stream statistics and simply frees the packets.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/meson.build                    |   1 +
 app/test-pmd/shared_rxq_fwd.c               | 148 ++++++++++++++++++++
 app/test-pmd/testpmd.c                      |   1 +
 app/test-pmd/testpmd.h                      |   1 +
 doc/guides/testpmd_app_ug/run_app.rst       |   1 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
 6 files changed, 156 insertions(+), 1 deletion(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 98f3289bdfa..07042e45b12 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -21,6 +21,7 @@ sources = files(
         'noisy_vnf.c',
         'parameters.c',
         'rxonly.c',
+        'shared_rxq_fwd.c',
         'testpmd.c',
         'txonly.c',
         'util.c',
diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-pmd/shared_rxq_fwd.c
new file mode 100644
index 00000000000..4e262b99bc7
--- /dev/null
+++ b/app/test-pmd/shared_rxq_fwd.c
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <stdarg.h>
+#include <string.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <sys/queue.h>
+#include <sys/stat.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_debug.h>
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memcpy.h>
+#include <rte_launch.h>
+#include <rte_eal.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_mbuf.h>
+#include <rte_pci.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_string_fns.h>
+#include <rte_ip.h>
+#include <rte_udp.h>
+#include <rte_net.h>
+#include <rte_flow.h>
+
+#include "testpmd.h"
+
+/*
+ * Rx only sub-burst forwarding.
+ */
+static void
+forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst)
+{
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
+}
+
+/**
+ * Get packet source stream by source port and queue.
+ * All streams of the same shared Rx queue are located on the same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_sub_burst(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		  struct rte_mbuf **pkts)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		forward_rx_only(nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * The source port of each packet is identified by mbuf->port.
+ */
+static void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst)
+{
+	uint16_t i, nb_sub_burst, port, last_port;
+
+	nb_sub_burst = 0;
+	last_port = pkts_burst[0]->port;
+	/* Locate sub-burst according to mbuf->port. */
+	for (i = 0; i < nb_rx; ++i) {
+		if (i + 1 < nb_rx)
+			rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (i > 0 && last_port != port) {
+			/* Forward packets with same source port. */
+			forward_sub_burst(fs, last_port, nb_sub_burst,
+					  &pkts_burst[i - nb_sub_burst]);
+			nb_sub_burst = 0;
+			last_port = port;
+		}
+		nb_sub_burst++;
+	}
+	/* Last sub-burst; the final packet's port was checked in the loop. */
+	forward_sub_burst(fs, last_port, nb_sub_burst,
+			  &pkts_burst[nb_rx - nb_sub_burst]);
+}
+
+static void
+shared_rxq_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];
+	uint16_t nb_rx;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
+				 nb_pkt_per_burst);
+	inc_rx_burst_stats(fs, nb_rx);
+	if (unlikely(nb_rx == 0))
+		return;
+	forward_shared_rxq(fs, nb_rx, pkts_burst);
+	get_end_cycles(fs, start_tsc);
+}
+
+struct fwd_engine shared_rxq_engine = {
+	.fwd_mode_name  = "shared_rxq",
+	.port_fwd_begin = NULL,
+	.port_fwd_end   = NULL,
+	.packet_fwd     = shared_rxq_fwd,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index df301e2e683..e29837b3b16 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -188,6 +188,7 @@ struct fwd_engine * fwd_engines[] = {
 #ifdef RTE_LIBRTE_IEEE1588
 	&ieee1588_fwd_engine,
 #endif
+	&shared_rxq_engine,
 	NULL,
 };
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f121a2da90c..f1fd607e365 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -299,6 +299,7 @@ extern struct fwd_engine five_tuple_swap_fwd_engine;
 #ifdef RTE_LIBRTE_IEEE1588
 extern struct fwd_engine ieee1588_fwd_engine;
 #endif
+extern struct fwd_engine shared_rxq_engine;
 
 extern struct fwd_engine * fwd_engines[]; /**< NULL terminated array. */
 extern cmdline_parse_inst_t cmd_set_raw;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index ff5908dcd50..e4b97844ced 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -252,6 +252,7 @@ The command line options are:
        tm
        noisy
        5tswap
+       shared-rxq
 
 *   ``--rss-ip``
 
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index 8ead7a4a712..499874187f2 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -314,7 +314,7 @@ set fwd
 Set the packet forwarding mode::
 
    testpmd> set fwd (io|mac|macswap|flowgen| \
-                     rxonly|txonly|csum|icmpecho|noisy|5tswap) (""|retry)
+                     rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
 
 ``retry`` can be specified for forwarding engines except ``rx_only``.
 
@@ -357,6 +357,9 @@ The available information categories are:
 
   L4 swaps the source port and destination port of transport layer (TCP and UDP).
 
+* ``shared-rxq``: Receive only for shared Rx queues.
+  Resolves the packet source port from mbuf and updates stream statistics accordingly.
+
 Example::
 
    testpmd> set fwd rxonly
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread
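
As a usage sketch (not part of the patch; the representor devargs
syntax is PMD-specific and only an assumption here), the engine pairs
with the --rxq-share parameter from patch 2/5::

   dpdk-testpmd -l 0-3 -a 0000:03:00.0,representor=[0-1] -- \
       --rxq-share --forward-mode=shared-rxq --rxq=2 --txq=2

or interactively::

   testpmd> set fwd shared-rxq
   testpmd> start

Whichever member port a burst arrives on, statistics are still reported
per stream, i.e. per source port, because the engine splits each burst
by mbuf->port before counting.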

* Re: [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue
  2021-10-12 11:28             ` Andrew Rybchenko
  2021-10-12 11:33               ` Xueming(Steven) Li
@ 2021-10-13  7:53               ` Xueming(Steven) Li
  1 sibling, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-13  7:53 UTC (permalink / raw)
  To: andrew.rybchenko, dev
  Cc: jerinjacobk, NBU-Contact-Thomas Monjalon, Lior Margalit,
	Slava Ovsiienko, konstantin.ananyev, ferruh.yigit

On Tue, 2021-10-12 at 14:28 +0300, Andrew Rybchenko wrote:
> On 10/12/21 1:55 PM, Xueming(Steven) Li wrote:
> > On Tue, 2021-10-12 at 11:48 +0300, Andrew Rybchenko wrote:
> > > On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
> > > > On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
> > > > > On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
> > > > > > Hi Xueming,
> > > > > > 
> > > > > > On 9/30/21 5:55 PM, Xueming Li wrote:
> > > > > > > In current DPDK framework, all RX queues is pre-loaded with mbufs for
> > > > > > > incoming packets. When number of representors scale out in a switch
> > > > > > > domain, the memory consumption became significant. Further more,
> > > > > > > polling all ports leads to high cache miss, high latency and low
> > > > > > > throughputs.
> > > > > > >  
> > > > > > > This patch introduces shared RX queue. PF and representors with same
> > > > > > > configuration in same switch domain could share RX queue set by
> > > > > > > specifying shared Rx queue offloading flag and sharing group.
> > > > > > > 
> > > > > > > All ports that Shared Rx queue actually shares One Rx queue and only
> > > > > > > pre-load mbufs to one Rx queue, memory is saved.
> > > > > > > 
> > > > > > > Polling any queue using same shared RX queue receives packets from all
> > > > > > > member ports. Source port is identified by mbuf->port.
> > > > > > > 
> > > > > > > Multiple groups is supported by group ID. Port queue number in a shared
> > > > > > > group should be identical. Queue index is 1:1 mapped in shared group.
> > > > > > > An example of polling two share groups:
> > > > > > >   core	group	queue
> > > > > > >   0	0	0
> > > > > > >   1	0	1
> > > > > > >   2	0	2
> > > > > > >   3	0	3
> > > > > > >   4	1	0
> > > > > > >   5	1	1
> > > > > > >   6	1	2
> > > > > > >   7	1	3
> > > > > > > 
> > > > > > > Shared RX queue must be polled on single thread or core. If both PF0 and
> > > > > > > representor0 joined same share group, can't poll pf0rxq0 on core1 and
> > > > > > > rep0rxq0 on core2. Actually, polling one port within share group is
> > > > > > > sufficient since polling any port in group will return packets for any
> > > > > > > port in group.
> > > > > > 
> > > > > > I apologize that I jump in into the review process that late.
> > > > > 
> > > > > Appreciate the bold suggestion, never too late :)
> > > > > 
> > > > > > 
> > > > > > Frankly speaking I doubt that it is the best design to solve
> > > > > > the problem. Yes, I confirm that the problem exists, but I
> > > > > > think there is better and simpler way to solve it.
> > > > > > 
> > > > > > The problem of the suggested solution is that it puts all
> > > > > > the headache about consistency to application and PMDs
> > > > > > without any help from ethdev layer to guarantee the
> > > > > > consistency. As the result I believe it will be either
> > > > > > missing/lost consistency checks or huge duplication in
> > > > > > each PMD which supports the feature. Shared RxQs must be
> > > > > > equally configured including number of queues, offloads
> > > > > > (taking device level Rx offloads into account), RSS
> > > > > > settings etc. So, applications must care about it and
> > > > > > PMDs (or ethdev layer) must check it.
> > > > > 
> > > > > The name might be confusing, here is my understanding:
> > > > > 1. NIC  shares the buffer supply HW Q between shared RxQs - for memory
> > > > > 2. PMD polls one shared RxQ - for latency and performance
> > > > > 3. Most per queue features like offloads and RSS not impacted. That's
> > > > > why this not mentioned. Some offloading might not being supported due
> > > > > to PMD or hw limitation, need to add check in PMD case by case.
> > > > > 4. Multiple group is defined for service level flexibility. For
> > > > > example, PF and VIP customer's load distributed via queues and dedicate
> > > > > cores. Low priority customers share one core with one shared queue.
> > > > > multiple groups enables more combination.
> > > > > 5. One port could assign queues to different group for polling
> > > > > flexibility. For example first 4 queues in group 0 and next 4 queues in
> > > > > group1, each group have other member ports with 4 queues, so the port
> > > > > with 8 queues could be polled with 8 cores w/o non-shared rxq penalty,
> > > > > in other words, each core only poll one shared RxQ.
> > > > > 
> > > > > > 
> > > > > > The advantage of the solution is that any device may
> > > > > > create group and subsequent devices join. Absence of
> > > > > > primary device is nice. But do we really need it?
> > > > > > Will the design work if some representors are configured
> > > > > > to use shared RxQ, but some do not? Theoretically it
> > > > > > is possible, but could require extra non-trivial code
> > > > > > on fast path.
> > > > > 
> > > > > If multiple groups, any device could be hot-unplugged.
> > > > > 
> > > > > Mixed configuration is supported, the only difference is how to set
> > > > > mbuf->port. Since group is per queue, mixed is better to be supported,
> > > > > didn't see any difficulty here.
> > > > > 
> > > > > PDM could select to support only group 0, same settings for each rxq,
> > > > > that fits most scenario.
> > > > > 
> > > > > > 
> > > > > > Also looking at the first two patch I don't understand
> > > > > > how application will find out which devices may share
> > > > > > RxQs. E.g. if we have two difference NICs which support
> > > > > > sharing, we can try to setup only one group 0, but
> > > > > > finally will have two devices (not one) which must be
> > > > > > polled.
> > > > > > 
> > > > > > 1. We need extra flag in dev_info->dev_capa
> > > > > >    RTE_ETH_DEV_CAPA_RX_SHARE to advertise that
> > > > > >    the device supports Rx sharing.
> > > > > 
> > > > > dev_info->rx_queue_offload_capa could be used here, no?
> > > 
> > > It depends. But we definitely need a flag which
> > > says that below rx_domain makes sense. It could be
> > > either RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload
> > > capability.
> > > 
> > > The question is if it is really an offload. The offload is
> > > when something could be done by HW/FW and result is provided
> > > to SW. May be it is just a nit picking...
> > > 
> > > May be we don't need an offload at all. Just have
> > > RTE_ETH_DEV_CAPA_RXQ_SHARE and use non-zero group ID
> > > as a flag that an RxQ should be shared (zero - default,
> > > no sharing). ethdev layer may check consistency on
> > > its layer to ensure that the device capability is
> > > reported if non-zero group is specified on queue setup.
> > > 
> > > > > 
> > > > > > 
> > > > > > 2. I think we need "rx_domain" in device info
> > > > > >    (which should be treated in boundaries of the
> > > > > >    switch_domain) if and only if
> > > > > >    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
> > > > > >    Otherwise rx_domain value does not make sense.
> > > > > 
> > > > > I see, this will give flexibility of different hw, will add it.
> > > > > 
> > > > > > 
> > > > > > (1) and (2) will allow application to find out which
> > > > > > devices can share Rx.
> > > > > > 
> > > > > > 3. Primary device (representors backing device) should
> > > > > >    advertise shared RxQ offload. Enabling of the offload
> > > > > >    tells the device to provide packets to all device in
> > > > > >    the Rx domain with mbuf->port filled in appropriately.
> > > > > >    Also it allows app to identify primary device in the
> > > > > >    Rx domain. When application enables the offload, it
> > > > > >    must ensure that it does not treat used port_id as an
> > > > > >    input port_id, but always check mbuf->port for each
> > > > > >    packet.
> > > > > > 
> > > > > > 4. A new Rx mode should be introduced for secondary
> > > > > >    devices. It should not allow to configure RSS, specify
> > > > > >    any Rx offloads etc. ethdev must ensure it.
> > > > > >    It is an open question right now if it should require
> > > > > >    to provide primary port_id. In theory representors
> > > > > >    have it. However, may be it is nice for consistency
> > > > > >    to ensure that application knows that it does.
> > > > > >    If shared Rx mode is specified for device, application
> > > > > >    does not need to setup RxQs and attempts to do it
> > > > > >    should be discarded in ethdev.
> > > > > >    For consistency it is better to ensure that number of
> > > > > >    queues match.
> > > > > 
> > > > > RSS and Rx offloads should be supported as individual, PMD needs to
> > > > > check if not supported.
> > > 
> > > Thinking a bit more about it I agree that RSS settings could
> > > be individual. Offload could be individual as well, but I'm
> > > not sure about all offloads. E.g. Rx scatter which is related
> > > to Rx buffer size (which is shared since Rx mempool is shared)
> > > vs MTU. May be it is acceptable. We just must define rules
> > > what should happen if offloads contradict to each other.
> > > It should be highlighted in the description including
> > > driver callback to ensure that PMD maintainers are responsible
> > > for consistency checks.
> > > 
> > > > > 
> > > > > >    It is an interesting question what should happen if
> > > > > >    primary device is reconfigured and shared Rx is
> > > > > >    disabled on reconfiguration.
> > > > > 
> > > > > I feel better no primary port/queue assumption in configuration, all
> > > > > members are equally treated, each queue can join or quit share group,
> > > > > that's important to support multiple groups.
> > > 
> > > I agree. The problem of many flexible solutions is
> > > complexity to support. We'll see how it goes.
> > > 
> > > > > 
> > > > > > 
> > > > > > 5. If so, in theory implementation of the Rx burst
> > > > > >    in the secondary could simply call Rx burst on
> > > > > >    primary device.
> > > > > > 
> > > > > > Andrew.
> > > > > 
> > > > 
> > > > Hi Andrew,
> > > > 
> > > > I realized that we are talking different things, this feature
> > > > introduced 2 RxQ share:
> > > > 1. Share mempool to save memory
> > > > 2. Share polling to save latency
> > > > 
> > > > What you suggested is reuse all RxQ configuration IIUC, maybe we should
> > > > break the flag into 3, so application could learn PMD capability and
> > > > configure accordingly, how do you think?
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL
> > > 
> > > Not sure that I understand. Just specify the same mempool
> > > on Rx queue setup. Isn't it sufficient?
> > > 
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL
> > > 
> > > It implies pool sharing if I'm not mistaken. Of course,
> > > we can pool many different HW queues in one poll, but it
> > > hardly makes sense to care specially about it.
> > > IMHO RxQ sharing is a sharing of the underlying HW Rx queue.
> > > 
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL
> > > 
> > > It is hardly a feature. Rather a possible limitation.
> > 
> > Thanks, then I'd drop this suggestion then.
> > 
> > Here is the TODO list, let me know if anything missing:
> > 1. change offload flag to RTE_ETH_DEV_CAPA_RX_SHARE
> 
> RTE_ETH_DEV_CAPA_RXQ_SHARE since it is not sharing of
> entire Rx, but just some queues.

V6 posted, thanks!

> 
> > 2. RxQ share group check in ethdev
> > 3. add rx_domain into device info
> > 
> > > 
> > > Andrew.
> > 
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread
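
A minimal sketch of the application-side discovery flow settled on
above, assuming the RTE_ETH_DEV_CAPA_RXQ_SHARE flag and the rx_domain
field land as proposed (error handling trimmed):

#include <stdbool.h>
#include <rte_ethdev.h>

/* Two ports may share Rx queues only if both advertise the capability
 * and report the same switch domain and Rx domain.
 */
static bool
ports_may_share_rxq(uint16_t pa, uint16_t pb)
{
	struct rte_eth_dev_info a, b;

	if (rte_eth_dev_info_get(pa, &a) != 0 ||
	    rte_eth_dev_info_get(pb, &b) != 0)
		return false;
	if ((a.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0 ||
	    (b.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0)
		return false;
	return a.switch_info.domain_id == b.switch_info.domain_id &&
	       a.switch_info.rx_domain == b.switch_info.rx_domain;
}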

* Re: [dpdk-dev] [PATCH v6 1/5] ethdev: introduce shared Rx queue
  2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 1/5] " Xueming Li
@ 2021-10-15  9:28     ` Andrew Rybchenko
  2021-10-15 10:54       ` Xueming(Steven) Li
  2021-10-15 17:20     ` Ferruh Yigit
  1 sibling, 1 reply; 266+ messages in thread
From: Andrew Rybchenko @ 2021-10-15  9:28 UTC (permalink / raw)
  To: Xueming Li, Thomas Monjalon
  Cc: Jerin Jacob, Ferruh Yigit, Viacheslav Ovsiienko, Lior Margalit,
	Ananyev Konstantin, dev

On 10/12/21 5:39 PM, Xueming Li wrote:
> In current DPDK framework, each Rx queue is pre-loaded with mbufs to
> save incoming packets. For some PMDs, when number of representors scale
> out in a switch domain, the memory consumption became significant.
> Polling all ports also leads to high cache miss, high latency and low
> throughput.
> 
> This patch introduce shared Rx queue. Ports in same Rx domain and
> switch domain could share Rx queue set by specifying non-zero sharing
> group in Rx queue configuration.
> 
> No special API is defined to receive packets from shared Rx queue.
> Polling any member port of a shared Rx queue receives packets of that
> queue for all member ports, source port is identified by mbuf->port.
> 
> Shared Rx queue must be polled in same thread or core, polling a queue
> ID of any member port is essentially same.
> 
> Multiple share groups are supported by non-zero share group ID. Device

"by non-zero share group ID" is not required. Since it must be
always non-zero to enable sharing.

> should support mixed configuration by allowing multiple share
> groups and non-shared Rx queue.
> 
> Even Rx queue shared, queue configuration like offloads and RSS should
> not be impacted.

I don't understand the above sentence.
Even when Rx queues are shared, queue configuration like
offloads and RSS may differ. If a PMD has some limitation,
it should care about consistency itself. These limitations
should be documented in the PMD documentation.

> 
> Example grouping and polling model to reflect service priority:
>  Group1, 2 shared Rx queues per port: PF, rep0, rep1
>  Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
>  Core0: poll PF queue0
>  Core1: poll PF queue1
>  Core2: poll rep2 queue0


Can I have:
PF RxQ#0, RxQ#1
Rep0 RxQ#0 shared with PF RxQ#0
Rep1 RxQ#0 shared with PF RxQ#1

I guess no, since it looks like RxQ ID must be equal.
Or am I missing something? Otherwise the grouping rules
are not obvious to me. Maybe we need a dedicated
shared_qid within the boundaries of the share_group?

> 
> PMD driver advertise shared Rx queue capability via
> RTE_ETH_DEV_CAPA_RXQ_SHARE.
> 
> PMD driver is responsible for shared Rx queue consistency checks to
> avoid member port's configuration contradict to each other.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> ---
>  doc/guides/nics/features.rst                  | 13 ++++++++++++
>  doc/guides/nics/features/default.ini          |  1 +
>  .../prog_guide/switch_representation.rst      | 10 +++++++++
>  doc/guides/rel_notes/release_21_11.rst        |  5 +++++
>  lib/ethdev/rte_ethdev.c                       |  9 ++++++++
>  lib/ethdev/rte_ethdev.h                       | 21 +++++++++++++++++++
>  6 files changed, 59 insertions(+)
> 
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index e346018e4b8..b64433b8ea5 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
>  
>  
> +.. _nic_features_shared_rx_queue:
> +
> +Shared Rx queue
> +---------------
> +
> +Supports shared Rx queue for ports in same Rx domain of a switch domain.
> +
> +* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
> +* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
> +* **[uses]     rte_eth_rxconf**: ``share_group``.
> +* **[provides] mbuf**: ``mbuf.port``.
> +
> +
>  .. _nic_features_packet_type_parsing:
>  
>  Packet type parsing
> diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> index d473b94091a..93f5d1b46f4 100644
> --- a/doc/guides/nics/features/default.ini
> +++ b/doc/guides/nics/features/default.ini
> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
>  Queue start/stop     =
>  Runtime Rx queue setup =
>  Runtime Tx queue setup =
> +Shared Rx queue      =
>  Burst mode info      =
>  Power mgmt address monitor =
>  MTU update           =
> diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> index ff6aa91c806..de41db8385d 100644
> --- a/doc/guides/prog_guide/switch_representation.rst
> +++ b/doc/guides/prog_guide/switch_representation.rst
> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
>  .. [1] `Ethernet switch device driver model (switchdev)
>         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
>  
> +- For some PMDs, memory usage of representors is huge when number of
> +  representor grows, mbufs are allocated for each descriptor of Rx queue.
> +  Polling large number of ports brings more CPU load, cache miss and
> +  latency. Shared Rx queue can be used to share Rx queue between PF and
> +  representors among same Rx domain. ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` is
> +  present in device capability of device info. Setting non-zero share group
> +  in Rx queue configuration to enable share. Polling any member port can
> +  receive packets of all member ports in the group, port ID is saved in
> +  ``mbuf.port``.
> +
>  Basic SR-IOV
>  ------------
>  
> diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
> index 5036641842c..d72fc97f4fb 100644
> --- a/doc/guides/rel_notes/release_21_11.rst
> +++ b/doc/guides/rel_notes/release_21_11.rst
> @@ -141,6 +141,11 @@ New Features
>    * Added tests to validate packets hard expiry.
>    * Added tests to verify tunnel header verification in IPsec inbound.
>  
> +* **Added ethdev shared Rx queue support.**
> +
> +  * Added new device capability flag and rx domain field to switch info.
> +  * Added share group to Rx queue configuration.
> +  * Added testpmd support and dedicate forwarding engine.

Please, add one more empty line since it must be two
before the next section. Also it should be put after
the last ethdev item above, since the list of features has
a defined order.

>  
>  Removed Items
>  -------------
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 028907bc4b9..9b1b66370a7 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -2159,6 +2159,15 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
>  		return -EINVAL;
>  	}
>  
> +	if (local_conf.share_group > 0 &&
> +	    (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
> +		RTE_ETHDEV_LOG(ERR,
> +			"Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%u while device doesn't support Rx queue share in %s()\n",
> +			port_id, rx_queue_id, local_conf.share_group,
> +			__func__);

I'd remove the function name logging here. The log is unique enough.

> +		return -EINVAL;
> +	}
> +
>  	/*
>  	 * If LRO is enabled, check that the maximum aggregated packet
>  	 * size is supported by the configured device.
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 6d80514ba7a..041da6ee52f 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1044,6 +1044,13 @@ struct rte_eth_rxconf {
>  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>  	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +	/**
> +	 * Share group index in Rx domain and switch domain.
> +	 * Non-zero value to enable Rx queue share, zero value disable share.
> +	 * PMD driver is responsible for Rx queue consistency checks to avoid
> +	 * member port's configuration contradict to each other.
> +	 */
> +	uint32_t share_group;

I think that we don't need 32 bits for share groups.
16 bits sound more than enough.

>  	/**
>  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1445,6 +1452,14 @@ struct rte_eth_conf {
>  #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
>  /** Device supports Tx queue setup after device started. */
>  #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
> +/**
> + * Device supports shared Rx queue among ports within Rx domain and
> + * switch domain. Mbufs are consumed by shared Rx queue instead of
> + * every port. Multiple groups is supported by share_group of Rx
> + * queue configuration. Polling any port in the group receive packets
> + * of all member ports, source port identified by mbuf->port field.
> + */
> +#define RTE_ETH_DEV_CAPA_RXQ_SHARE              0x00000004

Let's use RTE_BIT64(2)

I think the above two should be fixed in a separate
cleanup patch.

>  /**@}*/
>  
>  /*
> @@ -1488,6 +1503,12 @@ struct rte_eth_switch_info {
>  	 * but each driver should explicitly define the mapping of switch
>  	 * port identifier to that physical interconnect/switch
>  	 */
> +	uint16_t rx_domain;
> +	/**<
> +	 * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
> +	 * and switch domain can share Rx queue. Valid only if device advertised
> +	 * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
> +	 */

Please, put the documentation before the documented
field.

[snip]

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v6 1/5] ethdev: introduce shared Rx queue
  2021-10-15  9:28     ` Andrew Rybchenko
@ 2021-10-15 10:54       ` Xueming(Steven) Li
  2021-10-18  6:46         ` Andrew Rybchenko
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-15 10:54 UTC (permalink / raw)
  To: NBU-Contact-Thomas Monjalon, andrew.rybchenko
  Cc: jerinjacobk, Lior Margalit, Slava Ovsiienko, konstantin.ananyev,
	dev, ferruh.yigit

On Fri, 2021-10-15 at 12:28 +0300, Andrew Rybchenko wrote:
> On 10/12/21 5:39 PM, Xueming Li wrote:
> > In current DPDK framework, each Rx queue is pre-loaded with mbufs to
> > save incoming packets. For some PMDs, when number of representors scale
> > out in a switch domain, the memory consumption became significant.
> > Polling all ports also leads to high cache miss, high latency and low
> > throughput.
> > 
> > This patch introduce shared Rx queue. Ports in same Rx domain and
> > switch domain could share Rx queue set by specifying non-zero sharing
> > group in Rx queue configuration.
> > 
> > No special API is defined to receive packets from shared Rx queue.
> > Polling any member port of a shared Rx queue receives packets of that
> > queue for all member ports, source port is identified by mbuf->port.
> > 
> > Shared Rx queue must be polled in same thread or core, polling a queue
> > ID of any member port is essentially same.
> > 
> > Multiple share groups are supported by non-zero share group ID. Device
> 
> "by non-zero share group ID" is not required. Since it must be
> always non-zero to enable sharing.
> 
> > should support mixed configuration by allowing multiple share
> > groups and non-shared Rx queue.
> > 
> > Even Rx queue shared, queue configuration like offloads and RSS should
> > not be impacted.
> 
> I don't understand the above sentence.
> Even when Rx queues are shared, queue configuration like
> offloads and RSS may differ. If a PMD has some limitation,
> it should care about consistency itself. These limitations
> should be documented in the PMD documentation.
> 

OK, I'll remove this line.

> > 
> > Example grouping and polling model to reflect service priority:
> >  Group1, 2 shared Rx queues per port: PF, rep0, rep1
> >  Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
> >  Core0: poll PF queue0
> >  Core1: poll PF queue1
> >  Core2: poll rep2 queue0
> 
> 
> Can I have:
> PF RxQ#0, RxQ#1
> Rep0 RxQ#0 shared with PF RxQ#0
> Rep1 RxQ#0 shared with PF RxQ#1
> 
> I guess no, since it looks like RxQ ID must be equal.
> Or am I missing something? Otherwise the grouping rules
> are not obvious to me. Maybe we need a dedicated
> shared_qid within the boundaries of the share_group?

Yes, RxQ ID must be equal; the following configuration should work:
  Rep1 RxQ#1 shared with PF RxQ#1
Equal mapping should work by default instead of a new field that must
be set. I'll add some description to emphasize this, what do you think?
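
For illustration, a minimal setup sketch of this 1:1 mapping (port IDs,
descriptor count, socket ID and mempool are placeholders):

	struct rte_eth_rxconf rxconf = dev_info.default_rxconf;

	rxconf.share_group = 1;
	/* PF RxQ#1 pairs with Rep1 RxQ#1 in one shared queue; RxQ#0 of
	 * one port never pairs with RxQ#1 of another.
	 */
	rte_eth_rx_queue_setup(pf_port, 1, nb_rxd, socket_id, &rxconf, mp);
	rte_eth_rx_queue_setup(rep1_port, 1, nb_rxd, socket_id, &rxconf, mp);
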

> 
> > 
> > PMD driver advertise shared Rx queue capability via
> > RTE_ETH_DEV_CAPA_RXQ_SHARE.
> > 
> > PMD driver is responsible for shared Rx queue consistency checks to
> > avoid member port's configuration contradict to each other.
> > 
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > ---
> >  doc/guides/nics/features.rst                  | 13 ++++++++++++
> >  doc/guides/nics/features/default.ini          |  1 +
> >  .../prog_guide/switch_representation.rst      | 10 +++++++++
> >  doc/guides/rel_notes/release_21_11.rst        |  5 +++++
> >  lib/ethdev/rte_ethdev.c                       |  9 ++++++++
> >  lib/ethdev/rte_ethdev.h                       | 21 +++++++++++++++++++
> >  6 files changed, 59 insertions(+)
> > 
> > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > index e346018e4b8..b64433b8ea5 100644
> > --- a/doc/guides/nics/features.rst
> > +++ b/doc/guides/nics/features.rst
> > @@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
> >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> >  
> >  
> > +.. _nic_features_shared_rx_queue:
> > +
> > +Shared Rx queue
> > +---------------
> > +
> > +Supports shared Rx queue for ports in same Rx domain of a switch domain.
> > +
> > +* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
> > +* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
> > +* **[uses]     rte_eth_rxconf**: ``share_group``.
> > +* **[provides] mbuf**: ``mbuf.port``.
> > +
> > +
> >  .. _nic_features_packet_type_parsing:
> >  
> >  Packet type parsing
> > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > index d473b94091a..93f5d1b46f4 100644
> > --- a/doc/guides/nics/features/default.ini
> > +++ b/doc/guides/nics/features/default.ini
> > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> >  Queue start/stop     =
> >  Runtime Rx queue setup =
> >  Runtime Tx queue setup =
> > +Shared Rx queue      =
> >  Burst mode info      =
> >  Power mgmt address monitor =
> >  MTU update           =
> > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > index ff6aa91c806..de41db8385d 100644
> > --- a/doc/guides/prog_guide/switch_representation.rst
> > +++ b/doc/guides/prog_guide/switch_representation.rst
> > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> >  .. [1] `Ethernet switch device driver model (switchdev)
> >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> >  
> > +- For some PMDs, memory usage of representors is huge when number of
> > +  representor grows, mbufs are allocated for each descriptor of Rx queue.
> > +  Polling large number of ports brings more CPU load, cache miss and
> > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > +  representors among same Rx domain. ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` is
> > +  present in device capability of device info. Setting non-zero share group
> > +  in Rx queue configuration to enable share. Polling any member port can
> > +  receive packets of all member ports in the group, port ID is saved in
> > +  ``mbuf.port``.
> > +
> >  Basic SR-IOV
> >  ------------
> >  
> > diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
> > index 5036641842c..d72fc97f4fb 100644
> > --- a/doc/guides/rel_notes/release_21_11.rst
> > +++ b/doc/guides/rel_notes/release_21_11.rst
> > @@ -141,6 +141,11 @@ New Features
> >    * Added tests to validate packets hard expiry.
> >    * Added tests to verify tunnel header verification in IPsec inbound.
> >  
> > +* **Added ethdev shared Rx queue support.**
> > +
> > +  * Added new device capability flag and rx domain field to switch info.
> > +  * Added share group to Rx queue configuration.
> > +  * Added testpmd support and dedicate forwarding engine.
> 
> Please, add one more empty line since it must be two
> before the next section. Also it should be put after
> the last ethdev item above, since the list of features has
> a defined order.
> 
> >  
> >  Removed Items
> >  -------------
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > index 028907bc4b9..9b1b66370a7 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -2159,6 +2159,15 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
> >  		return -EINVAL;
> >  	}
> >  
> > +	if (local_conf.share_group > 0 &&
> > +	    (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
> > +		RTE_ETHDEV_LOG(ERR,
> > +			"Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%u while device doesn't support Rx queue share in %s()\n",
> > +			port_id, rx_queue_id, local_conf.share_group,
> > +			__func__);
> 
> I'd remove the function name logging here. The log is unique enough.
> 
> > +		return -EINVAL;
> > +	}
> > +
> >  	/*
> >  	 * If LRO is enabled, check that the maximum aggregated packet
> >  	 * size is supported by the configured device.
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > index 6d80514ba7a..041da6ee52f 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1044,6 +1044,13 @@ struct rte_eth_rxconf {
> >  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >  	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > +	/**
> > +	 * Share group index in Rx domain and switch domain.
> > +	 * Non-zero value to enable Rx queue share, zero value disable share.
> > +	 * PMD driver is responsible for Rx queue consistency checks to avoid
> > +	 * member port's configuration contradict to each other.
> > +	 */
> > +	uint32_t share_group;
> 
> I think that we don't need 32 bits for share groups.
> 16 bits sound more than enough.
> 
> >  	/**
> >  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> >  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > @@ -1445,6 +1452,14 @@ struct rte_eth_conf {
> >  #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
> >  /** Device supports Tx queue setup after device started. */
> >  #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
> > +/**
> > + * Device supports shared Rx queue among ports within Rx domain and
> > + * switch domain. Mbufs are consumed by shared Rx queue instead of
> > + * every port. Multiple groups is supported by share_group of Rx
> > + * queue configuration. Polling any port in the group receive packets
> > + * of all member ports, source port identified by mbuf->port field.
> > + */
> > +#define RTE_ETH_DEV_CAPA_RXQ_SHARE              0x00000004
> 
> Let's use RTE_BIT64(2)
> 
> I think the above two should be fixed in a separate
> cleanup patch.

Not only the above two, there are more Rx/Tx offload bits to clean up;
let's do it later.

> 
> >  /**@}*/
> >  
> >  /*
> > @@ -1488,6 +1503,12 @@ struct rte_eth_switch_info {
> >  	 * but each driver should explicitly define the mapping of switch
> >  	 * port identifier to that physical interconnect/switch
> >  	 */
> > +	uint16_t rx_domain;
> > +	/**<
> > +	 * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
> > +	 * and switch domain can share Rx queue. Valid only if device advertised
> > +	 * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
> > +	 */
> 
> Please, put the documentation before the documented
> field.
> 
> [snip]


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v6 1/5] ethdev: introduce shared Rx queue
  2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 1/5] " Xueming Li
  2021-10-15  9:28     ` Andrew Rybchenko
@ 2021-10-15 17:20     ` Ferruh Yigit
  2021-10-16  9:14       ` Xueming(Steven) Li
  1 sibling, 1 reply; 266+ messages in thread
From: Ferruh Yigit @ 2021-10-15 17:20 UTC (permalink / raw)
  To: Xueming Li, dev
  Cc: Jerin Jacob, Andrew Rybchenko, Viacheslav Ovsiienko,
	Thomas Monjalon, Lior Margalit, Ananyev Konstantin

On 10/12/2021 3:39 PM, Xueming Li wrote:
> index 6d80514ba7a..041da6ee52f 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1044,6 +1044,13 @@ struct rte_eth_rxconf {
>   	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>   	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>   	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +	/**
> +	 * Share group index in Rx domain and switch domain.
> +	 * Non-zero value to enable Rx queue share, zero value disable share.
> +	 * PMD driver is responsible for Rx queue consistency checks to avoid

When you update the set, can you please update 'PMD driver' usage too?

PMD = Poll Mode Driver, so the second 'driver' is redundant; there are
a few more instances of this usage in this set.

^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v7 0/5] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (6 preceding siblings ...)
  2021-10-12 14:39 ` [dpdk-dev] [PATCH v6 0/5] ethdev: introduce " Xueming Li
@ 2021-10-16  8:42 ` Xueming Li
  2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 1/5] " Xueming Li
                     ` (4 more replies)
  2021-10-18 12:59 ` [dpdk-dev] [PATCH v8 0/6] ethdev: introduce " Xueming Li
                   ` (8 subsequent siblings)
  16 siblings, 5 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-16  8:42 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin

In the current DPDK framework, all Rx queues are pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces shared Rx queues. PF and representors in the same
Rx domain and switch domain can share an Rx queue set by specifying a
non-zero share group value in the Rx queue configuration.

All ports that share an Rx queue actually share one hardware descriptor
queue and feed all Rx queues from one descriptor supply, so memory is saved.

Polling any queue of the same shared Rx queue receives packets from all
member ports. The source port is identified by mbuf->port.

Multiple groups are supported by group ID. Port queue numbers in a shared
group should be identical, and queue indexes are 1:1 mapped within a
shared group. An example of two share groups:
 Group1, 4 shared Rx queues per member port: PF, repr0, repr1
 Group2, 2 shared Rx queues per member port: repr2, repr3, ... repr127
 Poll first port for each group:
  core	port	queue
  0	0	0
  1	0	1
  2	0	2
  3	0	3
  4	2	0
  5	2	1

A shared Rx queue must be polled on a single thread or core. If both PF0
and representor0 joined the same share group, pf0rxq0 can't be polled on
core1 and rep0rxq0 on core2. Actually, polling one port within a share
group is sufficient, since polling any port in the group will return
packets for all ports in the group.
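
A sketch of that polling model in code (one member port per core;
handle_pkt() is a hypothetical consumer and MAX_PKT_BURST a
placeholder):

	struct rte_mbuf *pkts[MAX_PKT_BURST];
	uint16_t i, nb;

	/* One poll on any member port returns packets of all member
	 * ports; mbuf->port carries the real source port, which may
	 * differ from the polled port ID.
	 */
	nb = rte_eth_rx_burst(member_port, queue_id, pkts, MAX_PKT_BURST);
	for (i = 0; i < nb; i++)
		handle_pkt(pkts[i]->port, pkts[i]);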

There was some discussion about aggregating member ports of the same group
into a dummy port, with several ways to achieve it. Since it is optional,
we need to collect more feedback and requirements from users and make a
better decision later.

v1:
  - initial version
v2:
  - add testpmd patches
v3:
  - change common forwarding api to macro for performance, thanks Jerin.
  - save global variable accessed in forwarding to flowstream to minimize
    cache miss
  - combined patches for each forwarding engine
  - support multiple groups in testpmd "--share-rxq" parameter
  - new api to aggregate shared rxq group
v4:
  - spelling fixes
  - remove shared-rxq support for all forwarding engines
  - add dedicate shared-rxq forwarding engine
v5:
 - fix grammar
 - remove aggregate api and leave it for later discussion
 - add release notes
 - add deployment example
v6:
 - replace RxQ offload flag with device offload capability flag
 - add Rx domain
 - RxQ is shared when share group > 0
 - update testpmd accordingly
v7:
 - fix testpmd share group id allocation
 - change rx_domain to 16bits

Xueming Li (5):
  ethdev: introduce shared Rx queue
  app/testpmd: new parameter to enable shared Rx queue
  app/testpmd: dump port info for shared Rx queue
  app/testpmd: force shared Rx queue polled on same core
  app/testpmd: add forwarding engine for shared Rx queue

 app/test-pmd/config.c                         | 106 ++++++++++++-
 app/test-pmd/meson.build                      |   1 +
 app/test-pmd/parameters.c                     |  13 ++
 app/test-pmd/shared_rxq_fwd.c                 | 148 ++++++++++++++++++
 app/test-pmd/testpmd.c                        |  23 ++-
 app/test-pmd/testpmd.h                        |   5 +
 app/test-pmd/util.c                           |   3 +
 doc/guides/nics/features.rst                  |  13 ++
 doc/guides/nics/features/default.ini          |   1 +
 .../prog_guide/switch_representation.rst      |  10 ++
 doc/guides/rel_notes/release_21_11.rst        |   6 +
 doc/guides/testpmd_app_ug/run_app.rst         |   8 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
 lib/ethdev/rte_ethdev.c                       |   8 +
 lib/ethdev/rte_ethdev.h                       |  21 +++
 15 files changed, 365 insertions(+), 6 deletions(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v7 1/5] ethdev: introduce shared Rx queue
  2021-10-16  8:42 ` [dpdk-dev] [PATCH v7 0/5] ethdev: introduce " Xueming Li
@ 2021-10-16  8:42   ` Xueming Li
  2021-10-17  5:33     ` Ajit Khaparde
  2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 2/5] app/testpmd: new parameter to enable " Xueming Li
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-10-16  8:42 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin

In the current DPDK framework, each Rx queue is pre-loaded with mbufs to
save incoming packets. For some PMDs, when the number of representors
scales out in a switch domain, the memory consumption becomes significant.
Polling all ports also leads to high cache miss rates, high latency and
low throughput.

This patch introduces shared Rx queues. Ports in the same Rx domain and
switch domain can share an Rx queue set by specifying a non-zero share
group in the Rx queue configuration.

Port A RxQ X can be shared with port B RxQ X, but not with RxQ Y. All
member ports in a share group share a list of shared Rx queues indexed
by Rx queue ID.

No special API is defined to receive packets from a shared Rx queue.
Polling any member port of a shared Rx queue receives packets of that
queue for all member ports; the source port is identified by mbuf->port.

A shared Rx queue must be polled in the same thread or core; polling the
queue ID of any member port is essentially the same.

Multiple share groups are supported. A device should support mixed
configuration by allowing multiple share groups and non-shared Rx queues.

Example grouping and polling model to reflect service priority:
 Group1, 2 shared Rx queues per port: PF, rep0, rep1
 Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
 Core0: poll PF queue0
 Core1: poll PF queue1
 Core2: poll rep2 queue0

The PMD advertises shared Rx queue capability via RTE_ETH_DEV_CAPA_RXQ_SHARE.

The PMD is responsible for shared Rx queue consistency checks to avoid
member ports' configurations contradicting each other.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 doc/guides/nics/features.rst                  | 13 ++++++++++++
 doc/guides/nics/features/default.ini          |  1 +
 .../prog_guide/switch_representation.rst      | 10 +++++++++
 doc/guides/rel_notes/release_21_11.rst        |  6 ++++++
 lib/ethdev/rte_ethdev.c                       |  8 +++++++
 lib/ethdev/rte_ethdev.h                       | 21 +++++++++++++++++++
 6 files changed, 59 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index e346018e4b8..b64433b8ea5 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same Rx domain of a switch domain.
+
+* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
+* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
+* **[uses]     rte_eth_rxconf**: ``share_group``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index d473b94091a..93f5d1b46f4 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c806..de41db8385d 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- For some PMDs, memory usage of representors is huge when number of
+  representor grows, mbufs are allocated for each descriptor of Rx queue.
+  Polling large number of ports brings more CPU load, cache miss and
+  latency. Shared Rx queue can be used to share Rx queue between PF and
+  representors among same Rx domain. ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` is
+  present in device capability of device info. Setting non-zero share group
+  in Rx queue configuration to enable share. Polling any member port can
+  receive packets of all member ports in the group, port ID is saved in
+  ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 4c56cdfeaaa..1c84e896554 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -67,6 +67,12 @@ New Features
   * Modified to allow ``--huge-dir`` option to specify a sub-directory
     within a hugetlbfs mountpoint.
 
+* **Added ethdev shared Rx queue support.**
+
+  * Added new device capability flag and rx domain field to switch info.
+  * Added share group to Rx queue configuration.
+  * Added testpmd support and dedicate forwarding engine.
+
 * **Added new RSS offload types for IPv4/L4 checksum in RSS flow.**
 
   Added macros ETH_RSS_IPV4_CHKSUM and ETH_RSS_L4_CHKSUM, now IPv4 and
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 028907bc4b9..bc55f899f72 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -2159,6 +2159,14 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
 		return -EINVAL;
 	}
 
+	if (local_conf.share_group > 0 &&
+	    (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
+		RTE_ETHDEV_LOG(ERR,
+			"Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%hu while device doesn't support Rx queue share\n",
+			port_id, rx_queue_id, local_conf.share_group);
+		return -EINVAL;
+	}
+
 	/*
 	 * If LRO is enabled, check that the maximum aggregated packet
 	 * size is supported by the configured device.
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 6d80514ba7a..59d8904ac7c 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1044,6 +1044,13 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	/**
+	 * Share group index in Rx domain and switch domain.
+	 * Non-zero value to enable Rx queue share, zero value disable share.
+	 * PMD driver is responsible for Rx queue consistency checks to avoid
+	 * member port's configuration contradict to each other.
+	 */
+	uint16_t share_group;
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1445,6 +1452,14 @@ struct rte_eth_conf {
 #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
 /** Device supports Tx queue setup after device started. */
 #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
+/**
+ * Device supports shared Rx queue among ports within Rx domain and
+ * switch domain. Mbufs are consumed by shared Rx queue instead of
+ * every port. Multiple groups is supported by share_group of Rx
+ * queue configuration. Polling any port in the group receive packets
+ * of all member ports, source port identified by mbuf->port field.
+ */
+#define RTE_ETH_DEV_CAPA_RXQ_SHARE              RTE_BIT64(2)
 /**@}*/
 
 /*
@@ -1488,6 +1503,12 @@ struct rte_eth_switch_info {
 	 * but each driver should explicitly define the mapping of switch
 	 * port identifier to that physical interconnect/switch
 	 */
+	/**
+	 * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
+	 * and switch domain can share Rx queue. Valid only if device advertised
+	 * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
+	 */
+	uint16_t rx_domain;
 };
 
 /**
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread
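
Since the commit message above delegates consistency checking to the
PMD, here is a sketch of what such a check could look like at queue
setup time (all structure and function names are hypothetical
placeholders, not from any real PMD):

#include <errno.h>
#include <stdint.h>
#include <rte_mempool.h>

struct shared_rxq_group {	/* hypothetical per-group bookkeeping */
	uint16_t nb_desc;
	uint64_t offloads;
	struct rte_mempool *mp;
};

/* Reject a queue whose settings contradict what the share group has
 * already registered; the first member of the group sets the baseline.
 */
static int
shared_rxq_group_check(struct shared_rxq_group *grp, uint16_t nb_desc,
		       uint64_t offloads, struct rte_mempool *mp)
{
	if (grp->mp == NULL) {	/* first member defines the group */
		grp->nb_desc = nb_desc;
		grp->offloads = offloads;
		grp->mp = mp;
		return 0;
	}
	if (grp->nb_desc != nb_desc || grp->offloads != offloads ||
	    grp->mp != mp)
		return -EINVAL;
	return 0;
}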

* [dpdk-dev] [PATCH v7 2/5] app/testpmd: new parameter to enable shared Rx queue
  2021-10-16  8:42 ` [dpdk-dev] [PATCH v7 0/5] ethdev: introduce " Xueming Li
  2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 1/5] " Xueming Li
@ 2021-10-16  8:42   ` Xueming Li
  2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 3/5] app/testpmd: dump port info for " Xueming Li
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-16  8:42 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

Adds "--rxq-share=X" parameter to enable shared RxQ, share if device
supports, otherwise fallback to standard RxQ.

The share group number grows every X ports. X defaults to MAX, which
implies all ports join share group 1.

Forwarding engine "shared-rxq" should be used which Rx only and update
stream statistics correctly.
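
A worked example of the grouping rule (share_group = port / X + 1, as
in the hunk below): with 8 ports and --rxq-share=4, ports 0-3 join
share group 1 and ports 4-7 join share group 2; with a bare
--rxq-share, all 8 ports join share group 1.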

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c                 |  6 +++++-
 app/test-pmd/parameters.c             | 13 +++++++++++++
 app/test-pmd/testpmd.c                | 18 +++++++++++++++---
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  7 +++++++
 5 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 9c66329e96e..96fc2ab888b 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2709,7 +2709,11 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->share_group > 0)
+				printf(" share group=%u",
+				       rx_conf->share_group);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 3f94a82e321..30dae326310 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -167,6 +167,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share: number of ports per shared rxq groups, defaults to MAX(1 group)\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -607,6 +608,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			2, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1271,6 +1273,17 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
+				if (optarg == NULL) {
+					rxq_share = UINT32_MAX;
+				} else {
+					n = atoi(optarg);
+					if (n >= 0)
+						rxq_share = (uint32_t)n;
+					else
+						rte_exit(EXIT_FAILURE, "rxq-share must be >= 0\n");
+				}
+			}
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 97ae52e17ec..4c501bf43f3 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -498,6 +498,11 @@ uint8_t record_core_cycles;
  */
 uint8_t record_burst_stats;
 
+/*
+ * Number of ports per shared Rx queue group, 0 disable.
+ */
+uint32_t rxq_share;
+
 unsigned int num_sockets = 0;
 unsigned int socket_ids[RTE_MAX_NUMA_NODES];
 
@@ -3393,14 +3398,21 @@ dev_event_callback(const char *device_name, enum rte_dev_event_type type,
 }
 
 static void
-rxtx_port_config(struct rte_port *port)
+rxtx_port_config(portid_t pid)
 {
 	uint16_t qid;
 	uint64_t offloads;
+	struct rte_port *port = &ports[pid];
 
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE))
+			/* Non-zero share group to enable RxQ share. */
+			port->rx_conf[qid].share_group = pid / rxq_share + 1;
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
@@ -3558,7 +3570,7 @@ init_port_config(void)
 				port->dev_conf.rxmode.mq_mode = ETH_MQ_RX_NONE;
 		}
 
-		rxtx_port_config(port);
+		rxtx_port_config(pid);
 
 		ret = eth_macaddr_get_print_err(pid, &port->eth_addr);
 		if (ret != 0)
@@ -3772,7 +3784,7 @@ init_port_dcb_config(portid_t pid,
 
 	memcpy(&rte_port->dev_conf, &port_conf, sizeof(struct rte_eth_conf));
 
-	rxtx_port_config(rte_port);
+	rxtx_port_config(pid);
 	/* VLAN filter */
 	rte_port->dev_conf.rxmode.offloads |= DEV_RX_OFFLOAD_VLAN_FILTER;
 	for (i = 0; i < RTE_DIM(vlan_tags); i++)
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 5863b2f43f3..3dfaaad94c0 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -477,6 +477,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint32_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern int nb_flows_flowgen;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 640eadeff73..ff5908dcd50 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -389,6 +389,13 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share=[X]``
+
+    Create queues in shared Rx queue mode if the device supports it.
+    A new share group is started every X ports; X defaults to MAX, which
+    puts all ports in share group 1. The Rx-only forwarding engine
+    "shared-rxq" should be used; it updates stream statistics correctly.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v7 3/5] app/testpmd: dump port info for shared Rx queue
  2021-10-16  8:42 ` [dpdk-dev] [PATCH v7 0/5] ethdev: introduce " Xueming Li
  2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 1/5] " Xueming Li
  2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 2/5] app/testpmd: new parameter to enable " Xueming Li
@ 2021-10-16  8:42   ` Xueming Li
  2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 4/5] app/testpmd: force shared Rx queue polled on same core Xueming Li
  2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 5/5] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  4 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-16  8:42 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

In case of shared Rx queue, polling any member port returns mbufs for
all members. This patch dumps mbuf->port for each packet.
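
As a minimal sketch of the application-side view (assumed variable and
handler names, not part of this patch), the source port of every packet
polled from a shared Rx queue is read back from the mbuf:

	struct rte_mbuf *pkts[32];
	uint16_t i, nb_rx;

	/* Poll one member port; the burst may contain packets that
	 * arrived on any member port of the shared Rx queue. */
	nb_rx = rte_eth_rx_burst(member_port_id, queue_id, pkts, 32);
	for (i = 0; i < nb_rx; i++)
		handle_packet(pkts[i]->port, pkts[i]);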

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/util.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 51506e49404..e98f136d5ed 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -100,6 +100,9 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_flow_restore_info info = { 0, };
 
 		mb = pkts[i];
+		if (rxq_share > 0)
+			MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ",
+				  mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v7 4/5] app/testpmd: force shared Rx queue polled on same core
  2021-10-16  8:42 ` [dpdk-dev] [PATCH v7 0/5] ethdev: introduce " Xueming Li
                     ` (2 preceding siblings ...)
  2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 3/5] app/testpmd: dump port info for " Xueming Li
@ 2021-10-16  8:42   ` Xueming Li
  2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 5/5] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  4 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-16  8:42 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

Shared Rx queue must be polled on the same core. This patch checks the
forwarding configuration and stops forwarding if a shared Rx queue is
scheduled on multiple cores.

It is suggested to use the same number of Rx queues and polling cores.
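
For example, if PF and rep0 share Rx queue 0 in the same group,
scheduling PF rxq0 on lcore 0 and rep0 rxq0 on lcore 1 is rejected by
this check, while polling both streams on lcore 0 passes.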

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c  | 100 +++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |   4 +-
 app/test-pmd/testpmd.h |   2 +
 3 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 96fc2ab888b..9acd2705f18 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2885,6 +2885,106 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared Rx queue is scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, portid_t src_port,
+			   queueid_t src_rxq, lcoreid_t src_lc,
+			   uint32_t share_group)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (fs->rx_queue != src_rxq)
+				continue;
+			if (rxq_conf->share_group != share_group)
+				continue;
+			printf("Shared Rx queue group %u can't be scheduled on different cores:\n",
+			       share_group);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("  please use --nb-cores=%hu to limit forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * A shared group must not be scheduled on different cores.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Update lcore info stream being scheduled. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check shared rxq not scheduled on remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, fs->rx_port,
+						       fs->rx_queue, lc_id,
+						       rxq_conf->share_group))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 4c501bf43f3..49c04de2501 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2236,10 +2236,12 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 3dfaaad94c0..f121a2da90c 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -144,6 +144,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -795,6 +796,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v7 5/5] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-16  8:42 ` [dpdk-dev] [PATCH v7 0/5] ethdev: introduce " Xueming Li
                     ` (3 preceding siblings ...)
  2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 4/5] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-10-16  8:42   ` Xueming Li
  4 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-16  8:42 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

To support shared Rx queue, this patch introduces a dedicated
forwarding engine. The engine groups received packets into sub-bursts
by mbuf->port, updates stream statistics and simply frees the packets.
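
Usage note: together with the --rxq-share parameter added earlier in
this series, the engine is selected at runtime as documented in the
testpmd_funcs.rst hunk below:

	testpmd> set fwd shared-rxq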

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/meson.build                    |   1 +
 app/test-pmd/shared_rxq_fwd.c               | 148 ++++++++++++++++++++
 app/test-pmd/testpmd.c                      |   1 +
 app/test-pmd/testpmd.h                      |   1 +
 doc/guides/testpmd_app_ug/run_app.rst       |   1 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
 6 files changed, 156 insertions(+), 1 deletion(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 98f3289bdfa..07042e45b12 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -21,6 +21,7 @@ sources = files(
         'noisy_vnf.c',
         'parameters.c',
         'rxonly.c',
+        'shared_rxq_fwd.c',
         'testpmd.c',
         'txonly.c',
         'util.c',
diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-pmd/shared_rxq_fwd.c
new file mode 100644
index 00000000000..4e262b99bc7
--- /dev/null
+++ b/app/test-pmd/shared_rxq_fwd.c
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <stdarg.h>
+#include <string.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <sys/queue.h>
+#include <sys/stat.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_debug.h>
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memcpy.h>
+#include <rte_launch.h>
+#include <rte_eal.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_mbuf.h>
+#include <rte_pci.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_string_fns.h>
+#include <rte_ip.h>
+#include <rte_udp.h>
+#include <rte_net.h>
+#include <rte_flow.h>
+
+#include "testpmd.h"
+
+/*
+ * Rx only sub-burst forwarding.
+ */
+static void
+forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst)
+{
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
+}
+
+/**
+ * Get packet source stream by source port and queue.
+ * All streams of the same shared Rx queue are located on the same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_sub_burst(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		  struct rte_mbuf **pkts)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		forward_rx_only(nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * The source port of each packet is identified by mbuf->port.
+ */
+static void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst)
+{
+	uint16_t i, nb_sub_burst, port, last_port;
+
+	nb_sub_burst = 0;
+	last_port = pkts_burst[0]->port;
+	/* Locate sub-burst according to mbuf->port. */
+	for (i = 0; i < nb_rx - 1; ++i) {
+		rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (i > 0 && last_port != port) {
+			/* Forward packets with same source port. */
+			forward_sub_burst(fs, last_port, nb_sub_burst,
+					  &pkts_burst[i - nb_sub_burst]);
+			nb_sub_burst = 0;
+			last_port = port;
+		}
+		nb_sub_burst++;
+	}
+	/* Last sub-burst. */
+	nb_sub_burst++;
+	forward_sub_burst(fs, last_port, nb_sub_burst,
+			  &pkts_burst[nb_rx - nb_sub_burst]);
+}
+
+static void
+shared_rxq_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];
+	uint16_t nb_rx;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
+				 nb_pkt_per_burst);
+	inc_rx_burst_stats(fs, nb_rx);
+	if (unlikely(nb_rx == 0))
+		return;
+	forward_shared_rxq(fs, nb_rx, pkts_burst);
+	get_end_cycles(fs, start_tsc);
+}
+
+struct fwd_engine shared_rxq_engine = {
+	.fwd_mode_name  = "shared_rxq",
+	.port_fwd_begin = NULL,
+	.port_fwd_end   = NULL,
+	.packet_fwd     = shared_rxq_fwd,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 49c04de2501..02c8a86f321 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -188,6 +188,7 @@ struct fwd_engine * fwd_engines[] = {
 #ifdef RTE_LIBRTE_IEEE1588
 	&ieee1588_fwd_engine,
 #endif
+	&shared_rxq_engine,
 	NULL,
 };
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f121a2da90c..f1fd607e365 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -299,6 +299,7 @@ extern struct fwd_engine five_tuple_swap_fwd_engine;
 #ifdef RTE_LIBRTE_IEEE1588
 extern struct fwd_engine ieee1588_fwd_engine;
 #endif
+extern struct fwd_engine shared_rxq_engine;
 
 extern struct fwd_engine * fwd_engines[]; /**< NULL terminated array. */
 extern cmdline_parse_inst_t cmd_set_raw;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index ff5908dcd50..e4b97844ced 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -252,6 +252,7 @@ The command line options are:
        tm
        noisy
        5tswap
+       shared-rxq
 
 *   ``--rss-ip``
 
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index 8ead7a4a712..499874187f2 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -314,7 +314,7 @@ set fwd
 Set the packet forwarding mode::
 
    testpmd> set fwd (io|mac|macswap|flowgen| \
-                     rxonly|txonly|csum|icmpecho|noisy|5tswap) (""|retry)
+                     rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
 
 ``retry`` can be specified for forwarding engines except ``rx_only``.
 
@@ -357,6 +357,9 @@ The available information categories are:
 
   L4 swaps the source port and destination port of transport layer (TCP and UDP).
 
+* ``shared-rxq``: Receive only for shared Rx queue.
+  Resolve packet source port from mbuf and update stream statistics accordingly.
+
 Example::
 
    testpmd> set fwd rxonly
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v6 1/5] ethdev: introduce shared Rx queue
  2021-10-15 17:20     ` Ferruh Yigit
@ 2021-10-16  9:14       ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-16  9:14 UTC (permalink / raw)
  To: dev, ferruh.yigit
  Cc: jerinjacobk, NBU-Contact-Thomas Monjalon, andrew.rybchenko,
	Slava Ovsiienko, konstantin.ananyev, Lior Margalit

On Fri, 2021-10-15 at 18:20 +0100, Ferruh Yigit wrote:
> On 10/12/2021 3:39 PM, Xueming Li wrote:
> > index 6d80514ba7a..041da6ee52f 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1044,6 +1044,13 @@ struct rte_eth_rxconf {
> >   	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >   	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >   	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > +	/**
> > +	 * Share group index in Rx domain and switch domain.
> > +	 * Non-zero value to enable Rx queue share, zero value disable share.
> > +	 * PMD driver is responsible for Rx queue consistency checks to avoid
> 
> When you update the set, can you please update 'PMD driver' usage too?
> 
> PMD = Poll Mode Driver, so second 'driver' is duplicate, there are a
> few more instance of this usage in this set.

Got it, thanks!

BTW, PMD patches updated:
https://patches.dpdk.org/project/dpdk/list/?series=19709

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v7 1/5] ethdev: introduce shared Rx queue
  2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 1/5] " Xueming Li
@ 2021-10-17  5:33     ` Ajit Khaparde
  2021-10-17  7:29       ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Ajit Khaparde @ 2021-10-17  5:33 UTC (permalink / raw)
  To: Xueming Li
  Cc: dpdk-dev, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin

On Sat, Oct 16, 2021 at 1:43 AM Xueming Li <xuemingl@nvidia.com> wrote:
>
> In current DPDK framework, each Rx queue is pre-loaded with mbufs to
> save incoming packets. For some PMDs, when number of representors scale
> out in a switch domain, the memory consumption became significant.
> Polling all ports also leads to high cache miss, high latency and low
> throughput.
>
> This patch introduce shared Rx queue. Ports in same Rx domain and
> switch domain could share Rx queue set by specifying non-zero sharing
> group in Rx queue configuration.
>
> Port A RxQ X can share RxQ with Port B RxQ X, but can't share with RxQ
> Y. All member ports in share group share a list of shared Rx queue
> indexed by Rx queue ID.
>
> No special API is defined to receive packets from shared Rx queue.
> Polling any member port of a shared Rx queue receives packets of that
> queue for all member ports, source port is identified by mbuf->port.
Is this port the physical port which received the packet?
Or does this port number correlate with the port_id seen by the application?



>
> Shared Rx queue must be polled in same thread or core, polling a queue
> ID of any member port is essentially same.
So it is up to the application to poll the queue of any member port,
or all ports, or a designated port to handle Rx?

>
> Multiple share groups are supported. Device should support mixed
> configuration by allowing multiple share groups and non-shared Rx queue.
>
> Example grouping and polling model to reflect service priority:
>  Group1, 2 shared Rx queues per port: PF, rep0, rep1
>  Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
>  Core0: poll PF queue0
>  Core1: poll PF queue1
>  Core2: poll rep2 queue0
>
> PMD advertise shared Rx queue capability via RTE_ETH_DEV_CAPA_RXQ_SHARE.
>
> PMD is responsible for shared Rx queue consistency checks to avoid
> member port's configuration contradict to each other.
>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> ---
>  doc/guides/nics/features.rst                  | 13 ++++++++++++
>  doc/guides/nics/features/default.ini          |  1 +
>  .../prog_guide/switch_representation.rst      | 10 +++++++++
>  doc/guides/rel_notes/release_21_11.rst        |  6 ++++++
>  lib/ethdev/rte_ethdev.c                       |  8 +++++++
>  lib/ethdev/rte_ethdev.h                       | 21 +++++++++++++++++++
>  6 files changed, 59 insertions(+)
>
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index e346018e4b8..b64433b8ea5 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
>
>
> +.. _nic_features_shared_rx_queue:
> +
> +Shared Rx queue
> +---------------
> +
> +Supports shared Rx queue for ports in same Rx domain of a switch domain.
> +
> +* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
> +* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
> +* **[uses]     rte_eth_rxconf**: ``share_group``.
> +* **[provides] mbuf**: ``mbuf.port``.
> +
> +
>  .. _nic_features_packet_type_parsing:
>
>  Packet type parsing
> diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> index d473b94091a..93f5d1b46f4 100644
> --- a/doc/guides/nics/features/default.ini
> +++ b/doc/guides/nics/features/default.ini
> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
>  Queue start/stop     =
>  Runtime Rx queue setup =
>  Runtime Tx queue setup =
> +Shared Rx queue      =
>  Burst mode info      =
>  Power mgmt address monitor =
>  MTU update           =
> diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> index ff6aa91c806..de41db8385d 100644
> --- a/doc/guides/prog_guide/switch_representation.rst
> +++ b/doc/guides/prog_guide/switch_representation.rst
> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
>  .. [1] `Ethernet switch device driver model (switchdev)
>         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
>
> +- For some PMDs, memory usage of representors is huge when number of
> +  representor grows, mbufs are allocated for each descriptor of Rx queue.
> +  Polling large number of ports brings more CPU load, cache miss and
> +  latency. Shared Rx queue can be used to share Rx queue between PF and
> +  representors among same Rx domain. ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` is
> +  present in device capability of device info. Setting non-zero share group
> +  in Rx queue configuration to enable share. Polling any member port can
> +  receive packets of all member ports in the group, port ID is saved in
> +  ``mbuf.port``.
> +
>  Basic SR-IOV
>  ------------
>
> diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
> index 4c56cdfeaaa..1c84e896554 100644
> --- a/doc/guides/rel_notes/release_21_11.rst
> +++ b/doc/guides/rel_notes/release_21_11.rst
> @@ -67,6 +67,12 @@ New Features
>    * Modified to allow ``--huge-dir`` option to specify a sub-directory
>      within a hugetlbfs mountpoint.
>
> +* **Added ethdev shared Rx queue support.**
> +
> +  * Added new device capability flag and rx domain field to switch info.
> +  * Added share group to Rx queue configuration.
> +  * Added testpmd support and dedicate forwarding engine.
> +
>  * **Added new RSS offload types for IPv4/L4 checksum in RSS flow.**
>
>    Added macros ETH_RSS_IPV4_CHKSUM and ETH_RSS_L4_CHKSUM, now IPv4 and
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 028907bc4b9..bc55f899f72 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -2159,6 +2159,14 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
>                 return -EINVAL;
>         }
>
> +       if (local_conf.share_group > 0 &&
> +           (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
> +               RTE_ETHDEV_LOG(ERR,
> +                       "Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%hu while device doesn't support Rx queue share\n",
> +                       port_id, rx_queue_id, local_conf.share_group);
> +               return -EINVAL;
> +       }
> +
>         /*
>          * If LRO is enabled, check that the maximum aggregated packet
>          * size is supported by the configured device.
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 6d80514ba7a..59d8904ac7c 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1044,6 +1044,13 @@ struct rte_eth_rxconf {
>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +       /**
> +        * Share group index in Rx domain and switch domain.
> +        * Non-zero value to enable Rx queue share, zero value disable share.
> +        * PMD driver is responsible for Rx queue consistency checks to avoid
> +        * member port's configuration contradict to each other.
> +        */
> +       uint16_t share_group;
>         /**
>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1445,6 +1452,14 @@ struct rte_eth_conf {
>  #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
>  /** Device supports Tx queue setup after device started. */
>  #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
> +/**
> + * Device supports shared Rx queue among ports within Rx domain and
> + * switch domain. Mbufs are consumed by shared Rx queue instead of
> + * every port. Multiple groups is supported by share_group of Rx
> + * queue configuration. Polling any port in the group receive packets
> + * of all member ports, source port identified by mbuf->port field.
> + */
> +#define RTE_ETH_DEV_CAPA_RXQ_SHARE              RTE_BIT64(2)
>  /**@}*/
>
>  /*
> @@ -1488,6 +1503,12 @@ struct rte_eth_switch_info {
>          * but each driver should explicitly define the mapping of switch
>          * port identifier to that physical interconnect/switch
>          */
> +       /**
> +        * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
> +        * and switch domain can share Rx queue. Valid only if device advertised
> +        * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
> +        */
> +       uint16_t rx_domain;
>  };
>
>  /**
> --
> 2.33.0
>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v7 1/5] ethdev: introduce shared Rx queue
  2021-10-17  5:33     ` Ajit Khaparde
@ 2021-10-17  7:29       ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-17  7:29 UTC (permalink / raw)
  To: ajit.khaparde
  Cc: konstantin.ananyev, jerinjacobk, NBU-Contact-Thomas Monjalon,
	Slava Ovsiienko, ferruh.yigit, andrew.rybchenko, Lior Margalit,
	dev

On Sat, 2021-10-16 at 22:33 -0700, Ajit Khaparde wrote:
> On Sat, Oct 16, 2021 at 1:43 AM Xueming Li <xuemingl@nvidia.com> wrote:
> > 
> > In current DPDK framework, each Rx queue is pre-loaded with mbufs to
> > save incoming packets. For some PMDs, when number of representors scale
> > out in a switch domain, the memory consumption became significant.
> > Polling all ports also leads to high cache miss, high latency and low
> > throughput.
> > 
> > This patch introduce shared Rx queue. Ports in same Rx domain and
> > switch domain could share Rx queue set by specifying non-zero sharing
> > group in Rx queue configuration.
> > 
> > Port A RxQ X can share RxQ with Port B RxQ X, but can't share with RxQ
> > Y. All member ports in share group share a list of shared Rx queue
> > indexed by Rx queue ID.
> > 
> > No special API is defined to receive packets from shared Rx queue.
> > Polling any member port of a shared Rx queue receives packets of that
> > queue for all member ports, source port is identified by mbuf->port.
> Is this port the physical port which received the packet?
> Or does this port number correlate with the port_id seen by the application?

Hi Ajit,

It's the port_id of member ports - PF or representor port.
I'll update commit message to avoid confusion.

> 
> 
> 
> > 
> > Shared Rx queue must be polled in same thread or core, polling a queue
> > ID of any member port is essentially same.
> So it is up to the application to poll the queue of any member port,
> or all ports, or a designated port to handle Rx?

Yes, up to the application. As described in the cover letter, an
aggregator port will be considered after collecting more suggestions
and user feedback.

> 
> > 
> > Multiple share groups are supported. Device should support mixed
> > configuration by allowing multiple share groups and non-shared Rx queue.
> > 
> > Example grouping and polling model to reflect service priority:
> >  Group1, 2 shared Rx queues per port: PF, rep0, rep1
> >  Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
> >  Core0: poll PF queue0
> >  Core1: poll PF queue1
> >  Core2: poll rep2 queue0
> > 
> > PMD advertise shared Rx queue capability via RTE_ETH_DEV_CAPA_RXQ_SHARE.
> > 
> > PMD is responsible for shared Rx queue consistency checks to avoid
> > member port's configuration contradict to each other.
> > 
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > ---
> >  doc/guides/nics/features.rst                  | 13 ++++++++++++
> >  doc/guides/nics/features/default.ini          |  1 +
> >  .../prog_guide/switch_representation.rst      | 10 +++++++++
> >  doc/guides/rel_notes/release_21_11.rst        |  6 ++++++
> >  lib/ethdev/rte_ethdev.c                       |  8 +++++++
> >  lib/ethdev/rte_ethdev.h                       | 21 +++++++++++++++++++
> >  6 files changed, 59 insertions(+)
> > 
> > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > index e346018e4b8..b64433b8ea5 100644
> > --- a/doc/guides/nics/features.rst
> > +++ b/doc/guides/nics/features.rst
> > @@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
> >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > 
> > 
> > +.. _nic_features_shared_rx_queue:
> > +
> > +Shared Rx queue
> > +---------------
> > +
> > +Supports shared Rx queue for ports in same Rx domain of a switch domain.
> > +
> > +* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
> > +* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
> > +* **[uses]     rte_eth_rxconf**: ``share_group``.
> > +* **[provides] mbuf**: ``mbuf.port``.
> > +
> > +
> >  .. _nic_features_packet_type_parsing:
> > 
> >  Packet type parsing
> > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > index d473b94091a..93f5d1b46f4 100644
> > --- a/doc/guides/nics/features/default.ini
> > +++ b/doc/guides/nics/features/default.ini
> > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> >  Queue start/stop     =
> >  Runtime Rx queue setup =
> >  Runtime Tx queue setup =
> > +Shared Rx queue      =
> >  Burst mode info      =
> >  Power mgmt address monitor =
> >  MTU update           =
> > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > index ff6aa91c806..de41db8385d 100644
> > --- a/doc/guides/prog_guide/switch_representation.rst
> > +++ b/doc/guides/prog_guide/switch_representation.rst
> > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> >  .. [1] `Ethernet switch device driver model (switchdev)
> >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > 
> > +- For some PMDs, memory usage of representors is huge when number of
> > +  representor grows, mbufs are allocated for each descriptor of Rx queue.
> > +  Polling large number of ports brings more CPU load, cache miss and
> > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > +  representors among same Rx domain. ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` is
> > +  present in device capability of device info. Setting non-zero share group
> > +  in Rx queue configuration to enable share. Polling any member port can
> > +  receive packets of all member ports in the group, port ID is saved in
> > +  ``mbuf.port``.
> > +
> >  Basic SR-IOV
> >  ------------
> > 
> > diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
> > index 4c56cdfeaaa..1c84e896554 100644
> > --- a/doc/guides/rel_notes/release_21_11.rst
> > +++ b/doc/guides/rel_notes/release_21_11.rst
> > @@ -67,6 +67,12 @@ New Features
> >    * Modified to allow ``--huge-dir`` option to specify a sub-directory
> >      within a hugetlbfs mountpoint.
> > 
> > +* **Added ethdev shared Rx queue support.**
> > +
> > +  * Added new device capability flag and rx domain field to switch info.
> > +  * Added share group to Rx queue configuration.
> > +  * Added testpmd support and dedicate forwarding engine.
> > +
> >  * **Added new RSS offload types for IPv4/L4 checksum in RSS flow.**
> > 
> >    Added macros ETH_RSS_IPV4_CHKSUM and ETH_RSS_L4_CHKSUM, now IPv4 and
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > index 028907bc4b9..bc55f899f72 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -2159,6 +2159,14 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
> >                 return -EINVAL;
> >         }
> > 
> > +       if (local_conf.share_group > 0 &&
> > +           (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
> > +               RTE_ETHDEV_LOG(ERR,
> > +                       "Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%hu while device doesn't support Rx queue share\n",
> > +                       port_id, rx_queue_id, local_conf.share_group);
> > +               return -EINVAL;
> > +       }
> > +
> >         /*
> >          * If LRO is enabled, check that the maximum aggregated packet
> >          * size is supported by the configured device.
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > index 6d80514ba7a..59d8904ac7c 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1044,6 +1044,13 @@ struct rte_eth_rxconf {
> >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > +       /**
> > +        * Share group index in Rx domain and switch domain.
> > +        * Non-zero value to enable Rx queue share, zero value disable share.
> > +        * PMD driver is responsible for Rx queue consistency checks to avoid
> > +        * member port's configuration contradict to each other.
> > +        */
> > +       uint16_t share_group;
> >         /**
> >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> >          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > @@ -1445,6 +1452,14 @@ struct rte_eth_conf {
> >  #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
> >  /** Device supports Tx queue setup after device started. */
> >  #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
> > +/**
> > + * Device supports shared Rx queue among ports within Rx domain and
> > + * switch domain. Mbufs are consumed by shared Rx queue instead of
> > + * every port. Multiple groups is supported by share_group of Rx
> > + * queue configuration. Polling any port in the group receive packets
> > + * of all member ports, source port identified by mbuf->port field.
> > + */
> > +#define RTE_ETH_DEV_CAPA_RXQ_SHARE              RTE_BIT64(2)
> >  /**@}*/
> > 
> >  /*
> > @@ -1488,6 +1503,12 @@ struct rte_eth_switch_info {
> >          * but each driver should explicitly define the mapping of switch
> >          * port identifier to that physical interconnect/switch
> >          */
> > +       /**
> > +        * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
> > +        * and switch domain can share Rx queue. Valid only if device advertised
> > +        * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
> > +        */
> > +       uint16_t rx_domain;
> >  };
> > 
> >  /**
> > --
> > 2.33.0
> > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v6 1/5] ethdev: introduce shared Rx queue
  2021-10-15 10:54       ` Xueming(Steven) Li
@ 2021-10-18  6:46         ` Andrew Rybchenko
  2021-10-18  6:57           ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Andrew Rybchenko @ 2021-10-18  6:46 UTC (permalink / raw)
  To: Xueming(Steven) Li, NBU-Contact-Thomas Monjalon
  Cc: jerinjacobk, Lior Margalit, Slava Ovsiienko, konstantin.ananyev,
	dev, ferruh.yigit

On 10/15/21 1:54 PM, Xueming(Steven) Li wrote:
> On Fri, 2021-10-15 at 12:28 +0300, Andrew Rybchenko wrote:
>> On 10/12/21 5:39 PM, Xueming Li wrote:
>>> In current DPDK framework, each Rx queue is pre-loaded with mbufs to
>>> save incoming packets. For some PMDs, when number of representors scale
>>> out in a switch domain, the memory consumption became significant.
>>> Polling all ports also leads to high cache miss, high latency and low
>>> throughput.
>>>
>>> This patch introduce shared Rx queue. Ports in same Rx domain and
>>> switch domain could share Rx queue set by specifying non-zero sharing
>>> group in Rx queue configuration.
>>>
>>> No special API is defined to receive packets from shared Rx queue.
>>> Polling any member port of a shared Rx queue receives packets of that
>>> queue for all member ports, source port is identified by mbuf->port.
>>>
>>> Shared Rx queue must be polled in same thread or core, polling a queue
>>> ID of any member port is essentially same.
>>>
>>> Multiple share groups are supported by non-zero share group ID. Device
>>
>> "by non-zero share group ID" is not required. Since it must be
>> always non-zero to enable sharing.
>>
>>> should support mixed configuration by allowing multiple share
>>> groups and non-shared Rx queue.
>>>
>>> Even Rx queue shared, queue configuration like offloads and RSS should
>>> not be impacted.
>>
>> I don't understand above sentence.
>> Even when Rx queues are shared, queue configuration like
>> offloads and RSS may differ. If a PMD has some limitation,
>> it should care about consistency itself. These limitations
>> should be documented in the PMD documentation.
>>
> 
> OK, I'll remove this line.
> 
>>>
>>> Example grouping and polling model to reflect service priority:
>>>  Group1, 2 shared Rx queues per port: PF, rep0, rep1
>>>  Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
>>>  Core0: poll PF queue0
>>>  Core1: poll PF queue1
>>>  Core2: poll rep2 queue0
>>
>>
>> Can I have:
>> PF RxQ#0, RxQ#1
>> Rep0 RxQ#0 shared with PF RxQ#0
>> Rep1 RxQ#0 shared with PF RxQ#1
>>
>> I guess no, since it looks like RxQ ID must be equal.
>> Or am I missing something? Otherwise grouping rules
>> are not obvious to me. May be we need dedicated
>> shared_qid in boundaries of the share_group?
> 
> Yes, RxQ ID must be equal, following configuration should work:
>   Rep1 RxQ#1 shared with PF RxQ#1

But I want just one RxQ on Rep1. I don't need two.

> Equal mapping should work by default instead of a new field that must
> be set. I'll add some description to emphasis, how do you think?

Sorry for delay with reply. I think that above limitation is
not nice. It is better to avoid it.

[snip]

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v6 1/5] ethdev: introduce shared Rx queue
  2021-10-18  6:46         ` Andrew Rybchenko
@ 2021-10-18  6:57           ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-18  6:57 UTC (permalink / raw)
  To: NBU-Contact-Thomas Monjalon, andrew.rybchenko
  Cc: jerinjacobk, Lior Margalit, Slava Ovsiienko, konstantin.ananyev,
	dev, ferruh.yigit

On Mon, 2021-10-18 at 09:46 +0300, Andrew Rybchenko wrote:
> On 10/15/21 1:54 PM, Xueming(Steven) Li wrote:
> > On Fri, 2021-10-15 at 12:28 +0300, Andrew Rybchenko wrote:
> > > On 10/12/21 5:39 PM, Xueming Li wrote:
> > > > In current DPDK framework, each Rx queue is pre-loaded with mbufs to
> > > > save incoming packets. For some PMDs, when number of representors scale
> > > > out in a switch domain, the memory consumption became significant.
> > > > Polling all ports also leads to high cache miss, high latency and low
> > > > throughput.
> > > > 
> > > > This patch introduce shared Rx queue. Ports in same Rx domain and
> > > > switch domain could share Rx queue set by specifying non-zero sharing
> > > > group in Rx queue configuration.
> > > > 
> > > > No special API is defined to receive packets from shared Rx queue.
> > > > Polling any member port of a shared Rx queue receives packets of that
> > > > queue for all member ports, source port is identified by mbuf->port.
> > > > 
> > > > Shared Rx queue must be polled in same thread or core, polling a queue
> > > > ID of any member port is essentially same.
> > > > 
> > > > Multiple share groups are supported by non-zero share group ID. Device
> > > 
> > > "by non-zero share group ID" is not required. Since it must be
> > > always non-zero to enable sharing.
> > > 
> > > > should support mixed configuration by allowing multiple share
> > > > groups and non-shared Rx queue.
> > > > 
> > > > Even Rx queue shared, queue configuration like offloads and RSS should
> > > > not be impacted.
> > > 
> > > I don't understand above sentence.
> > > Even when Rx queues are shared, queue configuration like
> > > offloads and RSS may differ. If a PMD has some limitation,
> > > it should care about consistency itself. These limitations
> > > should be documented in the PMD documentation.
> > > 
> > 
> > OK, I'll remove this line.
> > 
> > > > 
> > > > Example grouping and polling model to reflect service priority:
> > > >  Group1, 2 shared Rx queues per port: PF, rep0, rep1
> > > >  Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
> > > >  Core0: poll PF queue0
> > > >  Core1: poll PF queue1
> > > >  Core2: poll rep2 queue0
> > > 
> > > 
> > > Can I have:
> > > PF RxQ#0, RxQ#1
> > > Rep0 RxQ#0 shared with PF RxQ#0
> > > Rep1 RxQ#0 shared with PF RxQ#1
> > > 
> > > I guess no, since it looks like RxQ ID must be equal.
> > > Or am I missing something? Otherwise grouping rules
> > > are not obvious to me. May be we need dedicated
> > > shared_qid in boundaries of the share_group?
> > 
> > Yes, RxQ ID must be equal, following configuration should work:
> >   Rep1 RxQ#1 shared with PF RxQ#1
> 
> But I want just one RxQ on Rep1. I don't need two.
> 
> > Equal mapping should work by default instead of a new field that must
> > be set. I'll add some description to emphasis, how do you think?
> 
> Sorry for delay with reply. I think that above limitation is
> not nice. It is better to avoid it.

Okay, it will offer more flexibility. I will add it in the next
version. The user has to be aware of the indirect mapping relation; the
following pollings in the above example Rx burst the same shared RxQ:
  PF RxQ#1
  Rep1 RxQ#0 shared with PF RxQ#1
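
To make the indirect mapping concrete, a rough sketch using the
share_qid field proposed for the next revision (assumed variables, no
error handling):

	struct rte_eth_rxconf conf = dev_info.default_rxconf;

	conf.share_group = 1;
	conf.share_qid = 1;	/* both setups target shared queue #1 */
	/* PF queue 1 maps to shared queue 1 of group 1. */
	rte_eth_rx_queue_setup(pf_port_id, 1, nb_desc, socket_id,
			       &conf, mp);
	/* Rep1 queue 0 maps to the same shared queue 1. */
	rte_eth_rx_queue_setup(rep1_port_id, 0, nb_desc, socket_id,
			       &conf, mp);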

> 
> [snip]


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v8 0/6] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (7 preceding siblings ...)
  2021-10-16  8:42 ` [dpdk-dev] [PATCH v7 0/5] ethdev: introduce " Xueming Li
@ 2021-10-18 12:59 ` Xueming Li
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 1/6] " Xueming Li
                     ` (5 more replies)
  2021-10-19  8:17 ` [dpdk-dev] [PATCH v9 0/6] ethdev: introduce " Xueming Li
                   ` (7 subsequent siblings)
  16 siblings, 6 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-18 12:59 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin

In the current DPDK framework, all Rx queues are pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces shared Rx queue. PF and representors in the same
Rx domain and switch domain can share an Rx queue set by specifying a
non-zero share group value in the Rx queue configuration.

All ports that share an Rx queue actually share the hardware descriptor
queue and feed all Rx queues with one descriptor supply, so memory is
saved.

Polling any queue of a shared Rx queue set receives packets from all
member ports. The source port is identified by mbuf->port.

Multiple groups are supported, selected by group ID. The queue number of
each port in a shared group should be identical. Queue index is 1:1
mapped within a shared group.
An example of two share groups:
 Group1, 4 shared Rx queues per member port: PF, repr0, repr1
 Group2, 2 shared Rx queues per member port: repr2, repr3, ... repr127
 Poll first port for each group:
  core	port	queue
  0	0	0
  1	0	1
  2	0	2
  3	0	3
  4	2	0
  5	2	1

A shared Rx queue must be polled on a single thread or core. If both
PF0 and representor0 join the same share group, pf0rxq0 cannot be
polled on core1 while rep0rxq0 is polled on core2. Polling one port
within a share group is sufficient, since polling any port in the group
returns packets for all ports in the group.
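
A minimal sketch matching the table above (assumed names; each shared
Rx queue is polled by exactly one lcore):

	/* Hypothetical lcore assignment mirroring the table: lcores
	 * 0-3 poll port 0 (PF) queues 0-3 of group 1, lcores 4-5
	 * poll port 2 (repr2) queues 0-1 of group 2. */
	static const struct { uint16_t port, queue; } poll_map[] = {
		{0, 0}, {0, 1}, {0, 2}, {0, 3}, {2, 0}, {2, 1},
	};

	/* On lcore N, one poll iteration; returned mbufs may belong
	 * to any member port of the group, tagged in mbuf->port. */
	nb_rx = rte_eth_rx_burst(poll_map[lcore_idx].port,
				 poll_map[lcore_idx].queue, pkts, burst);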

There was some discussion about aggregating member ports of a group into
a dummy port, with several ways to achieve it. Since it is optional,
more feedback and requirements from users need to be collected before
making a decision.

v1:
  - initial version
v2:
  - add testpmd patches
v3:
  - change common forwarding api to macro for performance, thanks Jerin.
  - save global variable accessed in forwarding to flowstream to minimize
    cache miss
  - combined patches for each forwarding engine
  - support multiple groups in testpmd "--share-rxq" parameter
  - new api to aggregate shared rxq group
v4:
  - spelling fixes
  - remove shared-rxq support for all forwarding engines
  - add dedicate shared-rxq forwarding engine
v5:
 - fix grammars
 - remove aggregate api and leave it for later discussion
 - add release notes
 - add deployment example
v6:
 - replace RxQ offload flag with device offload capability flag
 - add Rx domain
 - RxQ is shared when share group > 0
 - update testpmd accordingly
v7:
 - fix testpmd share group id allocation
 - change rx_domain to 16bits
v8:
 - add new patch for testpmd to show device Rx domain ID and capability
 - new share_qid in RxQ configuration

Xueming Li (6):
  ethdev: introduce shared Rx queue
  app/testpmd: dump device capability and Rx domain info
  app/testpmd: new parameter to enable shared Rx queue
  app/testpmd: dump port info for shared Rx queue
  app/testpmd: force shared Rx queue polled on same core
  app/testpmd: add forwarding engine for shared Rx queue

 app/test-pmd/config.c                         | 114 +++++++++++++-
 app/test-pmd/meson.build                      |   1 +
 app/test-pmd/parameters.c                     |  13 ++
 app/test-pmd/shared_rxq_fwd.c                 | 148 ++++++++++++++++++
 app/test-pmd/testpmd.c                        |  25 ++-
 app/test-pmd/testpmd.h                        |   5 +
 app/test-pmd/util.c                           |   3 +
 doc/guides/nics/features.rst                  |  13 ++
 doc/guides/nics/features/default.ini          |   1 +
 .../prog_guide/switch_representation.rst      |  11 ++
 doc/guides/rel_notes/release_21_11.rst        |   6 +
 doc/guides/testpmd_app_ug/run_app.rst         |   8 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
 lib/ethdev/rte_ethdev.c                       |   8 +
 lib/ethdev/rte_ethdev.h                       |  24 +++
 15 files changed, 379 insertions(+), 6 deletions(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v8 1/6] ethdev: introduce shared Rx queue
  2021-10-18 12:59 ` [dpdk-dev] [PATCH v8 0/6] ethdev: introduce " Xueming Li
@ 2021-10-18 12:59   ` Xueming Li
  2021-10-19  0:21     ` Ajit Khaparde
  2021-10-19  6:28     ` Andrew Rybchenko
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 2/6] app/testpmd: dump device capability and Rx domain info Xueming Li
                     ` (4 subsequent siblings)
  5 siblings, 2 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-18 12:59 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin

In the current DPDK framework, each Rx queue is pre-loaded with mbufs
to save incoming packets. For some PMDs, when the number of
representors scales out in a switch domain, the memory consumption
becomes significant. Polling all ports also leads to high cache miss
rates, high latency and low throughput.

This patch introduces shared Rx queue. Ports in the same Rx domain and
switch domain can share an Rx queue set by specifying a non-zero share
group in the Rx queue configuration.

Shared Rx queue is identified by the share_qid field of the Rx queue
configuration. Port A RxQ X can share an RxQ with Port B RxQ Y by
using the same shared Rx queue ID.

No special API is defined to receive packets from a shared Rx queue.
Polling any member port of a shared Rx queue receives packets of that
queue for all member ports; the source port_id is identified by
mbuf->port. The PMD is responsible for resolving the shared Rx queue
from the device and queue data.

A shared Rx queue must be polled in the same thread or core; polling
the queue ID of any member port is essentially the same.

Multiple share groups are supported. A device should support mixed
configuration by allowing multiple share groups and non-shared Rx
queues on one port.

Example grouping and polling model to reflect service priority:
 Group1, 2 shared Rx queues per port: PF, rep0, rep1
 Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
 Core0: poll PF queue0
 Core1: poll PF queue1
 Core2: poll rep2 queue0

The PMD advertises the shared Rx queue capability via
RTE_ETH_DEV_CAPA_RXQ_SHARE.

The PMD is responsible for shared Rx queue consistency checks to avoid
member ports' configurations contradicting each other.
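
A rough sketch of the intended application usage (assumed variables,
error handling omitted): check the capability, then request sharing
per queue:

	struct rte_eth_dev_info info;
	struct rte_eth_rxconf rxconf;

	rte_eth_dev_info_get(port_id, &info);
	rxconf = info.default_rxconf;
	if (info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) {
		rxconf.share_group = 1;	/* non-zero enables sharing */
		rxconf.share_qid = 0;	/* shared queue ID inside the group */
	}
	rte_eth_rx_queue_setup(port_id, 0, nb_desc, socket_id, &rxconf,
			       mb_pool);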

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 doc/guides/nics/features.rst                  | 13 ++++++++++
 doc/guides/nics/features/default.ini          |  1 +
 .../prog_guide/switch_representation.rst      | 11 +++++++++
 doc/guides/rel_notes/release_21_11.rst        |  6 +++++
 lib/ethdev/rte_ethdev.c                       |  8 +++++++
 lib/ethdev/rte_ethdev.h                       | 24 +++++++++++++++++++
 6 files changed, 63 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index e346018e4b8..89f9accbca1 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same Rx domain of a switch domain.
+
+* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
+* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
+* **[uses]     rte_eth_rxconf**: ``share_group``, ``share_qid``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index d473b94091a..93f5d1b46f4 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c806..fe89a7f5c33 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,17 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- For some PMDs, memory usage of representors is huge when number of
+  representor grows, mbufs are allocated for each descriptor of Rx queue.
+  Polling large number of ports brings more CPU load, cache miss and
+  latency. Shared Rx queue can be used to share Rx queue between PF and
+  representors among same Rx domain. ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` in
+  device info is used to indicate the capability. Setting non-zero share
+  group in Rx queue configuration to enable share, share_qid is used to
+  identifiy the shared Rx queue in group. Polling any member port can
+  receive packets of all member ports in the group, port ID is saved in
+  ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index d5435a64aa1..2143e38ff11 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -75,6 +75,12 @@ New Features
     operations.
   * Added multi-process support.
 
+* **Added ethdev shared Rx queue support.**
+
+  * Added new device capability flag and rx domain field to switch info.
+  * Added share group and share queue ID to Rx queue configuration.
+  * Added testpmd support and dedicate forwarding engine.
+
 * **Added new RSS offload types for IPv4/L4 checksum in RSS flow.**
 
   Added macros ETH_RSS_IPV4_CHKSUM and ETH_RSS_L4_CHKSUM, now IPv4 and
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 028907bc4b9..bc55f899f72 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -2159,6 +2159,14 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
 		return -EINVAL;
 	}
 
+	if (local_conf.share_group > 0 &&
+	    (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
+		RTE_ETHDEV_LOG(ERR,
+			"Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%hu while device doesn't support Rx queue share\n",
+			port_id, rx_queue_id, local_conf.share_group);
+		return -EINVAL;
+	}
+
 	/*
 	 * If LRO is enabled, check that the maximum aggregated packet
 	 * size is supported by the configured device.
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 6d80514ba7a..465293fd66d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1044,6 +1044,14 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	/**
+	 * Share group index in Rx domain and switch domain.
+	 * A non-zero value enables Rx queue share, zero disables it.
+	 * The PMD is responsible for Rx queue consistency checks to avoid
+	 * member ports' configurations contradicting each other.
+	 */
+	uint16_t share_group;
+	uint16_t share_qid; /**< Shared Rx queue ID in group. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1445,6 +1453,16 @@ struct rte_eth_conf {
 #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
 /** Device supports Tx queue setup after device started. */
 #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
+/**
+ * Device supports shared Rx queue among ports within Rx domain and
+ * switch domain. Mbufs are consumed by shared Rx queue instead of
+ * each queue. Multiple groups is supported by share_group of Rx
+ * queue configuration. Shared Rx queue is identified by PMD using
+ * share_qid of Rx queue configuration. Polling any port in the group
+ * receive packets of all member ports, source port identified by
+ * mbuf->port field.
+ */
+#define RTE_ETH_DEV_CAPA_RXQ_SHARE              RTE_BIT64(2)
 /**@}*/
 
 /*
@@ -1488,6 +1506,12 @@ struct rte_eth_switch_info {
 	 * but each driver should explicitly define the mapping of switch
 	 * port identifier to that physical interconnect/switch
 	 */
+	/**
+	 * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
+	 * and switch domain can share Rx queue. Valid only if device advertised
+	 * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
+	 */
+	uint16_t rx_domain;
 };
 
 /**
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v8 2/6] app/testpmd: dump device capability and Rx domain info
  2021-10-18 12:59 ` [dpdk-dev] [PATCH v8 0/6] ethdev: introduce " Xueming Li
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 1/6] " Xueming Li
@ 2021-10-18 12:59   ` Xueming Li
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 3/6] app/testpmd: new parameter to enable shared Rx queue Xueming Li
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-18 12:59 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

Dump the device capability and the Rx domain ID if shared Rx queue is
supported by the device.
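
For illustration, with a PMD that advertises only the shared Rx queue
capability (RTE_BIT64(2), i.e. 0x4), the relevant lines of a hypothetical
"show port info 0" output would be:

  Device capabilities: 0x4
  Switch Rx domain: 0

The exact capability value depends on which other RTE_ETH_DEV_CAPA_* bits
the device sets, and the Rx domain line is printed only when the
capability is present.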

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 9c66329e96e..c0616dcd2fd 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -733,6 +733,7 @@ port_infos_display(portid_t port_id)
 	printf("Max segment number per MTU/TSO: %hu\n",
 		dev_info.tx_desc_lim.nb_mtu_seg_max);
 
+	printf("Device capabilities: 0x%"PRIx64"\n", dev_info.dev_capa);
 	/* Show switch info only if valid switch domain and port id is set */
 	if (dev_info.switch_info.domain_id !=
 		RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID) {
@@ -743,6 +744,9 @@ port_infos_display(portid_t port_id)
 			dev_info.switch_info.domain_id);
 		printf("Switch Port Id: %u\n",
 			dev_info.switch_info.port_id);
+		if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) != 0)
+			printf("Switch Rx domain: %u\n",
+			       dev_info.switch_info.rx_domain);
 	}
 }
 
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v8 3/6] app/testpmd: new parameter to enable shared Rx queue
  2021-10-18 12:59 ` [dpdk-dev] [PATCH v8 0/6] ethdev: introduce " Xueming Li
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 1/6] " Xueming Li
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 2/6] app/testpmd: dump device capability and Rx domain info Xueming Li
@ 2021-10-18 12:59   ` Xueming Li
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 4/6] app/testpmd: dump port info for " Xueming Li
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-18 12:59 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

Adds "--rxq-share=X" parameter to enable shared RxQ. Queues are shared
if the device supports it, otherwise they fall back to standard RxQ.

Share group number grows per X ports. X defaults to MAX, implying that
all ports join share group 1. Queue ID is mapped 1:1 to shared Rx
queue ID.
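
As a standalone sketch (illustration only, not part of the patch) of how
the share group and shared queue ID are derived with "--rxq-share=2" and
four ports:

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          uint32_t rxq_share = 2;         /* --rxq-share=2 */
          uint16_t nb_ports = 4, nb_rxq = 2, pid, qid;

          for (pid = 0; pid < nb_ports; pid++)
                  for (qid = 0; qid < nb_rxq; qid++)
                          /* Non-zero group enables sharing, queues map 1:1. */
                          printf("port %u rxq %u -> share_group %u share_qid %u\n",
                                 pid, qid, pid / rxq_share + 1, qid);
          return 0;
  }

Ports 0-1 land in share group 1 and ports 2-3 in share group 2, matching
the "pid / rxq_share + 1" assignment in rxtx_port_config() below.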

The "shared-rxq" forwarding engine should be used; it is Rx-only and
updates stream statistics correctly.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c                 |  7 ++++++-
 app/test-pmd/parameters.c             | 13 +++++++++++++
 app/test-pmd/testpmd.c                | 20 +++++++++++++++++---
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  7 +++++++
 5 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index c0616dcd2fd..f8fb8961cae 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2713,7 +2713,12 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->share_group > 0)
+				printf(" share_group=%u share_qid=%u",
+				       rx_conf->share_group,
+				       rx_conf->share_qid);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 3f94a82e321..30dae326310 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -167,6 +167,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share: number of ports per shared rxq groups, defaults to MAX(1 group)\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -607,6 +608,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			2, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1271,6 +1273,17 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
+				if (optarg == NULL) {
+					rxq_share = UINT32_MAX;
+				} else {
+					n = atoi(optarg);
+					if (n >= 0)
+						rxq_share = (uint32_t)n;
+					else
+						rte_exit(EXIT_FAILURE, "rxq-share must be >= 0\n");
+				}
+			}
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 97ae52e17ec..123142ed110 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -498,6 +498,11 @@ uint8_t record_core_cycles;
  */
 uint8_t record_burst_stats;
 
+/*
+ * Number of ports per shared Rx queue group, 0 to disable.
+ */
+uint32_t rxq_share;
+
 unsigned int num_sockets = 0;
 unsigned int socket_ids[RTE_MAX_NUMA_NODES];
 
@@ -3393,14 +3398,23 @@ dev_event_callback(const char *device_name, enum rte_dev_event_type type,
 }
 
 static void
-rxtx_port_config(struct rte_port *port)
+rxtx_port_config(portid_t pid)
 {
 	uint16_t qid;
 	uint64_t offloads;
+	struct rte_port *port = &ports[pid];
 
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)) {
+			/* Non-zero share group to enable RxQ share. */
+			port->rx_conf[qid].share_group = pid / rxq_share + 1;
+			port->rx_conf[qid].share_qid = qid; /* Equal mapping. */
+		}
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
@@ -3558,7 +3572,7 @@ init_port_config(void)
 				port->dev_conf.rxmode.mq_mode = ETH_MQ_RX_NONE;
 		}
 
-		rxtx_port_config(port);
+		rxtx_port_config(pid);
 
 		ret = eth_macaddr_get_print_err(pid, &port->eth_addr);
 		if (ret != 0)
@@ -3772,7 +3786,7 @@ init_port_dcb_config(portid_t pid,
 
 	memcpy(&rte_port->dev_conf, &port_conf, sizeof(struct rte_eth_conf));
 
-	rxtx_port_config(rte_port);
+	rxtx_port_config(pid);
 	/* VLAN filter */
 	rte_port->dev_conf.rxmode.offloads |= DEV_RX_OFFLOAD_VLAN_FILTER;
 	for (i = 0; i < RTE_DIM(vlan_tags); i++)
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 5863b2f43f3..3dfaaad94c0 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -477,6 +477,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint32_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern int nb_flows_flowgen;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 640eadeff73..ff5908dcd50 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -389,6 +389,13 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share=[X]``
+
+    Create queues in shared Rx queue mode if the device supports it.
+    Group number grows per X ports. X defaults to MAX, implying that all
+    ports join share group 1. The "shared-rxq" forwarding engine should
+    be used; it is Rx-only and updates stream statistics correctly.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v8 4/6] app/testpmd: dump port info for shared Rx queue
  2021-10-18 12:59 ` [dpdk-dev] [PATCH v8 0/6] ethdev: introduce " Xueming Li
                     ` (2 preceding siblings ...)
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 3/6] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-10-18 12:59   ` Xueming Li
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 5/6] app/testpmd: force shared Rx queue polled on same core Xueming Li
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 6/6] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-18 12:59 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

In case of shared Rx queue, polling any member port returns mbufs for
all member ports. This patch dumps mbuf->port for each packet in the
verbose packet dump.
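
A hypothetical dump line would then be prefixed with the member port that
the packet actually arrived on, e.g.:

  port 1, src=3C:FD:FE:9E:7F:71 - dst=02:00:00:00:00:01 - type=0x0800 - length=64 - nb_segs=1

(the addresses above are arbitrary; the fields after the "port N, " prefix
follow the existing dump format).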

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/util.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 51506e49404..e98f136d5ed 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -100,6 +100,9 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_flow_restore_info info = { 0, };
 
 		mb = pkts[i];
+		if (rxq_share > 0)
+			MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ",
+				  mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v8 5/6] app/testpmd: force shared Rx queue polled on same core
  2021-10-18 12:59 ` [dpdk-dev] [PATCH v8 0/6] ethdev: introduce " Xueming Li
                     ` (3 preceding siblings ...)
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 4/6] app/testpmd: dump port info for " Xueming Li
@ 2021-10-18 12:59   ` Xueming Li
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 6/6] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-18 12:59 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

Shared Rx queue must be polled on the same core. This patch checks and
stops forwarding if a shared RxQ is scheduled on multiple cores.

It is suggested to use the same number of Rx queues and polling cores.
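
As a hypothetical invocation (PCI address, representor list and core count
chosen arbitrarily), matching the number of forwarding cores to the number
of Rx queues keeps each shared queue on a single core:

  dpdk-testpmd -l 0-2 -a 0000:08:00.0,representor=[0,1] -- -i \
      --rxq=2 --txq=2 --nb-cores=2 --rxq-share=4 --forward-mode=shared-rxq

If more forwarding cores than Rx queues are used, streams of the same
shared queue land on different cores and this check reports the conflict
instead of starting forwarding.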

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c  | 103 +++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |   4 +-
 app/test-pmd/testpmd.h |   2 +
 3 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index f8fb8961cae..c4150d77589 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2890,6 +2890,109 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared rxq is scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, lcoreid_t src_lc,
+			   portid_t src_port, queueid_t src_rxq,
+			   uint32_t share_group, queueid_t share_rxq)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/* Check remaining cores. */
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (rxq_conf->share_group != share_group)
+				continue;
+			if (rxq_conf->share_qid != share_rxq)
+				continue;
+			printf("Shared Rx queue group %u queue %hu can't be scheduled on different cores:\n",
+			       share_group, share_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("Please use --nb-cores=%hu to limit number of forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * A shared group must not be scheduled on different cores.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Update lcore info stream being scheduled. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check shared rxq not scheduled on remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, lc_id,
+						       fs->rx_port,
+						       fs->rx_queue,
+						       rxq_conf->share_group,
+						       rxq_conf->share_qid))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 123142ed110..f3f81ef561f 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2236,10 +2236,12 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 3dfaaad94c0..f121a2da90c 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -144,6 +144,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -795,6 +796,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v8 6/6] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-18 12:59 ` [dpdk-dev] [PATCH v8 0/6] ethdev: introduce " Xueming Li
                     ` (4 preceding siblings ...)
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 5/6] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-10-18 12:59   ` Xueming Li
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-18 12:59 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Xiaoyun Li

To support shared Rx queue, this patch introduces a dedicated forwarding
engine. The engine groups received packets into sub-bursts by mbuf->port,
updates stream statistics and simply frees the packets.
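
As a worked illustration of the sub-burst grouping (a standalone sketch,
not part of the patch; port IDs chosen arbitrarily), a burst whose
mbuf->port sequence is {0, 0, 1, 1, 3} is forwarded as three sub-bursts:

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          /* mbuf->port of each packet in a received burst. */
          uint16_t ports[] = {0, 0, 1, 1, 3};
          uint16_t nb_rx = 5, i, start = 0;

          for (i = 1; i <= nb_rx; i++)
                  if (i == nb_rx || ports[i] != ports[start]) {
                          /* One forward_sub_burst() call per sub-burst. */
                          printf("port %u: %u packet(s) at offset %u\n",
                                 ports[start], i - start, start);
                          start = i;
                  }
          return 0;
  }

This prints sub-bursts of 2, 2 and 1 packets for ports 0, 1 and 3; each
sub-burst is accounted to the stream matching its source port.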

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/meson.build                    |   1 +
 app/test-pmd/shared_rxq_fwd.c               | 148 ++++++++++++++++++++
 app/test-pmd/testpmd.c                      |   1 +
 app/test-pmd/testpmd.h                      |   1 +
 doc/guides/testpmd_app_ug/run_app.rst       |   1 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
 6 files changed, 156 insertions(+), 1 deletion(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 98f3289bdfa..07042e45b12 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -21,6 +21,7 @@ sources = files(
         'noisy_vnf.c',
         'parameters.c',
         'rxonly.c',
+        'shared_rxq_fwd.c',
         'testpmd.c',
         'txonly.c',
         'util.c',
diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-pmd/shared_rxq_fwd.c
new file mode 100644
index 00000000000..4e262b99bc7
--- /dev/null
+++ b/app/test-pmd/shared_rxq_fwd.c
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <stdarg.h>
+#include <string.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <sys/queue.h>
+#include <sys/stat.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_debug.h>
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memcpy.h>
+#include <rte_launch.h>
+#include <rte_eal.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_mbuf.h>
+#include <rte_pci.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_string_fns.h>
+#include <rte_ip.h>
+#include <rte_udp.h>
+#include <rte_net.h>
+#include <rte_flow.h>
+
+#include "testpmd.h"
+
+/*
+ * Rx only sub-burst forwarding.
+ */
+static void
+forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst)
+{
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
+}
+
+/**
+ * Get packet source stream by source port and queue.
+ * All streams of the same shared Rx queue locate on the same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_sub_burst(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		  struct rte_mbuf **pkts)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		forward_rx_only(nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * Source port of packets are identified by mbuf->port.
+ */
+static void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst)
+{
+	uint16_t i, nb_sub_burst, port, last_port;
+
+	nb_sub_burst = 0;
+	last_port = pkts_burst[0]->port;
+	/* Locate sub-bursts according to mbuf->port. */
+	for (i = 0; i < nb_rx; ++i) {
+		if (i + 1 < nb_rx)
+			rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (last_port != port) {
+			/* Forward packets with same source port. */
+			forward_sub_burst(fs, last_port, nb_sub_burst,
+					  &pkts_burst[i - nb_sub_burst]);
+			nb_sub_burst = 0;
+			last_port = port;
+		}
+		nb_sub_burst++;
+	}
+	/* Flush the last sub-burst. */
+	forward_sub_burst(fs, last_port, nb_sub_burst,
+			  &pkts_burst[nb_rx - nb_sub_burst]);
+}
+
+static void
+shared_rxq_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];
+	uint16_t nb_rx;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
+				 nb_pkt_per_burst);
+	inc_rx_burst_stats(fs, nb_rx);
+	if (unlikely(nb_rx == 0))
+		return;
+	forward_shared_rxq(fs, nb_rx, pkts_burst);
+	get_end_cycles(fs, start_tsc);
+}
+
+struct fwd_engine shared_rxq_engine = {
+	.fwd_mode_name  = "shared_rxq",
+	.port_fwd_begin = NULL,
+	.port_fwd_end   = NULL,
+	.packet_fwd     = shared_rxq_fwd,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index f3f81ef561f..11a85d92d9a 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -188,6 +188,7 @@ struct fwd_engine * fwd_engines[] = {
 #ifdef RTE_LIBRTE_IEEE1588
 	&ieee1588_fwd_engine,
 #endif
+	&shared_rxq_engine,
 	NULL,
 };
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f121a2da90c..f1fd607e365 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -299,6 +299,7 @@ extern struct fwd_engine five_tuple_swap_fwd_engine;
 #ifdef RTE_LIBRTE_IEEE1588
 extern struct fwd_engine ieee1588_fwd_engine;
 #endif
+extern struct fwd_engine shared_rxq_engine;
 
 extern struct fwd_engine * fwd_engines[]; /**< NULL terminated array. */
 extern cmdline_parse_inst_t cmd_set_raw;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index ff5908dcd50..e4b97844ced 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -252,6 +252,7 @@ The command line options are:
        tm
        noisy
        5tswap
+       shared-rxq
 
 *   ``--rss-ip``
 
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index 8ead7a4a712..499874187f2 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -314,7 +314,7 @@ set fwd
 Set the packet forwarding mode::
 
    testpmd> set fwd (io|mac|macswap|flowgen| \
-                     rxonly|txonly|csum|icmpecho|noisy|5tswap) (""|retry)
+                     rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
 
 ``retry`` can be specified for forwarding engines except ``rx_only``.
 
@@ -357,6 +357,9 @@ The available information categories are:
 
   L4 swaps the source port and destination port of transport layer (TCP and UDP).
 
+* ``shared-rxq``: Receive only, for shared Rx queues.
+  Resolves the packet source port from the mbuf and updates stream statistics accordingly.
+
 Example::
 
    testpmd> set fwd rxonly
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/6] ethdev: introduce shared Rx queue
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 1/6] " Xueming Li
@ 2021-10-19  0:21     ` Ajit Khaparde
  2021-10-19  5:54       ` Xueming(Steven) Li
  2021-10-19  6:28     ` Andrew Rybchenko
  1 sibling, 1 reply; 266+ messages in thread
From: Ajit Khaparde @ 2021-10-19  0:21 UTC (permalink / raw)
  To: Xueming Li
  Cc: dpdk-dev, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin

On Mon, Oct 18, 2021 at 6:00 AM Xueming Li <xuemingl@nvidia.com> wrote:
>
> In current DPDK framework, each Rx queue is pre-loaded with mbufs to
> save incoming packets. For some PMDs, when number of representors scale
> out in a switch domain, the memory consumption became significant.
> Polling all ports also leads to high cache miss, high latency and low
> throughput.
>
> This patch introduce shared Rx queue. Ports in same Rx domain and
s/introduce/introduces

> switch domain could share Rx queue set by specifying non-zero sharing
> group in Rx queue configuration.
>
> Shared Rx queue is identified by share_rxq field of Rx queue
> configuration. Port A RxQ X can share RxQ with Port B RxQ Y by using
> same shared Rx queue ID.
>
> No special API is defined to receive packets from shared Rx queue.
> Polling any member port of a shared Rx queue receives packets of that
> queue for all member ports, port_id is identified by mbuf->port. PMD is
> responsible to resolve shared Rx queue from device and queue data.
>
> Shared Rx queue must be polled in same thread or core, polling a queue
> ID of any member port is essentially same.
>
> Multiple share groups are supported. Device should support mixed
> configuration by allowing multiple share groups and non-shared Rx queue
> on one port.
More than a device, it should be the PMD which should support it. Right?

>
> Example grouping and polling model to reflect service priority:
>  Group1, 2 shared Rx queues per port: PF, rep0, rep1
>  Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
>  Core0: poll PF queue0
>  Core1: poll PF queue1
>  Core2: poll rep2 queue0
>
> PMD advertise shared Rx queue capability via RTE_ETH_DEV_CAPA_RXQ_SHARE.
>
> PMD is responsible for shared Rx queue consistency checks to avoid
> member port's configuration contradict to each other.
contradict each other.

>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>

> ---
>  doc/guides/nics/features.rst                  | 13 ++++++++++
>  doc/guides/nics/features/default.ini          |  1 +
>  .../prog_guide/switch_representation.rst      | 11 +++++++++
>  doc/guides/rel_notes/release_21_11.rst        |  6 +++++
>  lib/ethdev/rte_ethdev.c                       |  8 +++++++
>  lib/ethdev/rte_ethdev.h                       | 24 +++++++++++++++++++
>  6 files changed, 63 insertions(+)
>
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index e346018e4b8..89f9accbca1 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
>
>
> +.. _nic_features_shared_rx_queue:
> +
> +Shared Rx queue
> +---------------
> +
> +Supports shared Rx queue for ports in same Rx domain of a switch domain.
> +
> +* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
> +* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
> +* **[uses]     rte_eth_rxconf**: ``share_group``, ``share_qid``.
> +* **[provides] mbuf**: ``mbuf.port``.
> +
> +
>  .. _nic_features_packet_type_parsing:
>
>  Packet type parsing
> diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> index d473b94091a..93f5d1b46f4 100644
> --- a/doc/guides/nics/features/default.ini
> +++ b/doc/guides/nics/features/default.ini
> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
>  Queue start/stop     =
>  Runtime Rx queue setup =
>  Runtime Tx queue setup =
> +Shared Rx queue      =
>  Burst mode info      =
>  Power mgmt address monitor =
>  MTU update           =
> diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> index ff6aa91c806..fe89a7f5c33 100644
> --- a/doc/guides/prog_guide/switch_representation.rst
> +++ b/doc/guides/prog_guide/switch_representation.rst
> @@ -123,6 +123,17 @@ thought as a software "patch panel" front-end for applications.
>  .. [1] `Ethernet switch device driver model (switchdev)
>         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
>
> +- For some PMDs, memory usage of representors is huge when number of
> +  representor grows, mbufs are allocated for each descriptor of Rx queue.
> +  Polling large number of ports brings more CPU load, cache miss and
> +  latency. Shared Rx queue can be used to share Rx queue between PF and
> +  representors among same Rx domain. ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` in
> +  device info is used to indicate the capability. Setting non-zero share
> +  group in Rx queue configuration to enable share, share_qid is used to
> +  identifiy the shared Rx queue in group. Polling any member port can
s/identifiy/identify

> +  receive packets of all member ports in the group, port ID is saved in
> +  ``mbuf.port``.
> +
>  Basic SR-IOV
>  ------------
>
> diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
> index d5435a64aa1..2143e38ff11 100644
> --- a/doc/guides/rel_notes/release_21_11.rst
> +++ b/doc/guides/rel_notes/release_21_11.rst
> @@ -75,6 +75,12 @@ New Features
>      operations.
>    * Added multi-process support.
>
> +* **Added ethdev shared Rx queue support.**
> +
> +  * Added new device capability flag and rx domain field to switch info.
> +  * Added share group and share queue ID to Rx queue configuration.
> +  * Added testpmd support and dedicate forwarding engine.
> +
>  * **Added new RSS offload types for IPv4/L4 checksum in RSS flow.**
>
>    Added macros ETH_RSS_IPV4_CHKSUM and ETH_RSS_L4_CHKSUM, now IPv4 and
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 028907bc4b9..bc55f899f72 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -2159,6 +2159,14 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
>                 return -EINVAL;
>         }
>
> +       if (local_conf.share_group > 0 &&
> +           (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
> +               RTE_ETHDEV_LOG(ERR,
> +                       "Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%hu while device doesn't support Rx queue share\n",
> +                       port_id, rx_queue_id, local_conf.share_group);
> +               return -EINVAL;
> +       }
> +
>         /*
>          * If LRO is enabled, check that the maximum aggregated packet
>          * size is supported by the configured device.
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 6d80514ba7a..465293fd66d 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1044,6 +1044,14 @@ struct rte_eth_rxconf {
>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +       /**
> +        * Share group index in Rx domain and switch domain.
> +        * Non-zero value to enable Rx queue share, zero value disable share.
> +        * PMD is responsible for Rx queue consistency checks to avoid member
> +        * port's configuration contradict to each other.
> +        */
> +       uint16_t share_group;
> +       uint16_t share_qid; /**< Shared Rx queue ID in group. */
>         /**
>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1445,6 +1453,16 @@ struct rte_eth_conf {
>  #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
>  /** Device supports Tx queue setup after device started. */
>  #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
> +/**
> + * Device supports shared Rx queue among ports within Rx domain and
> + * switch domain. Mbufs are consumed by shared Rx queue instead of
> + * each queue. Multiple groups is supported by share_group of Rx
are supported by..

> + * queue configuration. Shared Rx queue is identified by PMD using
> + * share_qid of Rx queue configuration. Polling any port in the group
> + * receive packets of all member ports, source port identified by
> + * mbuf->port field.
> + */
> +#define RTE_ETH_DEV_CAPA_RXQ_SHARE              RTE_BIT64(2)
>  /**@}*/
>
>  /*
> @@ -1488,6 +1506,12 @@ struct rte_eth_switch_info {
>          * but each driver should explicitly define the mapping of switch
>          * port identifier to that physical interconnect/switch
>          */
> +       /**
> +        * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
> +        * and switch domain can share Rx queue. Valid only if device advertised
> +        * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
> +        */
> +       uint16_t rx_domain;
>  };
>
>  /**
> --
> 2.33.0
>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/6] ethdev: introduce shared Rx queue
  2021-10-19  0:21     ` Ajit Khaparde
@ 2021-10-19  5:54       ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-19  5:54 UTC (permalink / raw)
  To: ajit.khaparde
  Cc: konstantin.ananyev, jerinjacobk, NBU-Contact-Thomas Monjalon,
	Slava Ovsiienko, ferruh.yigit, andrew.rybchenko, Lior Margalit,
	dev

On Mon, 2021-10-18 at 17:21 -0700, Ajit Khaparde wrote:
> On Mon, Oct 18, 2021 at 6:00 AM Xueming Li <xuemingl@nvidia.com> wrote:
> > 
> > In current DPDK framework, each Rx queue is pre-loaded with mbufs to
> > save incoming packets. For some PMDs, when number of representors scale
> > out in a switch domain, the memory consumption became significant.
> > Polling all ports also leads to high cache miss, high latency and low
> > throughput.
> > 
> > This patch introduce shared Rx queue. Ports in same Rx domain and
> s/introduce/introduces

Accepted all the comments, thanks!

> 
> > switch domain could share Rx queue set by specifying non-zero sharing
> > group in Rx queue configuration.
> > 
> > Shared Rx queue is identified by share_rxq field of Rx queue
> > configuration. Port A RxQ X can share RxQ with Port B RxQ Y by using
> > same shared Rx queue ID.
> > 
> > No special API is defined to receive packets from shared Rx queue.
> > Polling any member port of a shared Rx queue receives packets of that
> > queue for all member ports, port_id is identified by mbuf->port. PMD is
> > responsible to resolve shared Rx queue from device and queue data.
> > 
> > Shared Rx queue must be polled in same thread or core, polling a queue
> > ID of any member port is essentially same.
> > 
> > Multiple share groups are supported. Device should support mixed
> > configuration by allowing multiple share groups and non-shared Rx queue
> > on one port.
> More than a device, it should be the PMD which should support it. Right?
> 
> > 
> > Example grouping and polling model to reflect service priority:
> >  Group1, 2 shared Rx queues per port: PF, rep0, rep1
> >  Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
> >  Core0: poll PF queue0
> >  Core1: poll PF queue1
> >  Core2: poll rep2 queue0
> > 
> > PMD advertise shared Rx queue capability via RTE_ETH_DEV_CAPA_RXQ_SHARE.
> > 
> > PMD is responsible for shared Rx queue consistency checks to avoid
> > member port's configuration contradict to each other.
> contradict each other.
> 
> > 
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> 
> > ---
> >  doc/guides/nics/features.rst                  | 13 ++++++++++
> >  doc/guides/nics/features/default.ini          |  1 +
> >  .../prog_guide/switch_representation.rst      | 11 +++++++++
> >  doc/guides/rel_notes/release_21_11.rst        |  6 +++++
> >  lib/ethdev/rte_ethdev.c                       |  8 +++++++
> >  lib/ethdev/rte_ethdev.h                       | 24 +++++++++++++++++++
> >  6 files changed, 63 insertions(+)
> > 
> > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > index e346018e4b8..89f9accbca1 100644
> > --- a/doc/guides/nics/features.rst
> > +++ b/doc/guides/nics/features.rst
> > @@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
> >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > 
> > 
> > +.. _nic_features_shared_rx_queue:
> > +
> > +Shared Rx queue
> > +---------------
> > +
> > +Supports shared Rx queue for ports in same Rx domain of a switch domain.
> > +
> > +* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
> > +* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
> > +* **[uses]     rte_eth_rxconf**: ``share_group``, ``share_qid``.
> > +* **[provides] mbuf**: ``mbuf.port``.
> > +
> > +
> >  .. _nic_features_packet_type_parsing:
> > 
> >  Packet type parsing
> > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > index d473b94091a..93f5d1b46f4 100644
> > --- a/doc/guides/nics/features/default.ini
> > +++ b/doc/guides/nics/features/default.ini
> > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> >  Queue start/stop     =
> >  Runtime Rx queue setup =
> >  Runtime Tx queue setup =
> > +Shared Rx queue      =
> >  Burst mode info      =
> >  Power mgmt address monitor =
> >  MTU update           =
> > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > index ff6aa91c806..fe89a7f5c33 100644
> > --- a/doc/guides/prog_guide/switch_representation.rst
> > +++ b/doc/guides/prog_guide/switch_representation.rst
> > @@ -123,6 +123,17 @@ thought as a software "patch panel" front-end for applications.
> >  .. [1] `Ethernet switch device driver model (switchdev)
> >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > 
> > +- For some PMDs, memory usage of representors is huge when number of
> > +  representor grows, mbufs are allocated for each descriptor of Rx queue.
> > +  Polling large number of ports brings more CPU load, cache miss and
> > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > +  representors among same Rx domain. ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` in
> > +  device info is used to indicate the capability. Setting non-zero share
> > +  group in Rx queue configuration to enable share, share_qid is used to
> > +  identifiy the shared Rx queue in group. Polling any member port can
> s/identifiy/identify
> 
> > +  receive packets of all member ports in the group, port ID is saved in
> > +  ``mbuf.port``.
> > +
> >  Basic SR-IOV
> >  ------------
> > 
> > diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
> > index d5435a64aa1..2143e38ff11 100644
> > --- a/doc/guides/rel_notes/release_21_11.rst
> > +++ b/doc/guides/rel_notes/release_21_11.rst
> > @@ -75,6 +75,12 @@ New Features
> >      operations.
> >    * Added multi-process support.
> > 
> > +* **Added ethdev shared Rx queue support.**
> > +
> > +  * Added new device capability flag and rx domain field to switch info.
> > +  * Added share group and share queue ID to Rx queue configuration.
> > +  * Added testpmd support and dedicate forwarding engine.
> > +
> >  * **Added new RSS offload types for IPv4/L4 checksum in RSS flow.**
> > 
> >    Added macros ETH_RSS_IPV4_CHKSUM and ETH_RSS_L4_CHKSUM, now IPv4 and
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > index 028907bc4b9..bc55f899f72 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -2159,6 +2159,14 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
> >                 return -EINVAL;
> >         }
> > 
> > +       if (local_conf.share_group > 0 &&
> > +           (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
> > +               RTE_ETHDEV_LOG(ERR,
> > +                       "Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%hu while device doesn't support Rx queue share\n",
> > +                       port_id, rx_queue_id, local_conf.share_group);
> > +               return -EINVAL;
> > +       }
> > +
> >         /*
> >          * If LRO is enabled, check that the maximum aggregated packet
> >          * size is supported by the configured device.
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > index 6d80514ba7a..465293fd66d 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1044,6 +1044,14 @@ struct rte_eth_rxconf {
> >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > +       /**
> > +        * Share group index in Rx domain and switch domain.
> > +        * Non-zero value to enable Rx queue share, zero value disable share.
> > +        * PMD is responsible for Rx queue consistency checks to avoid member
> > +        * port's configuration contradict to each other.
> > +        */
> > +       uint16_t share_group;
> > +       uint16_t share_qid; /**< Shared Rx queue ID in group. */
> >         /**
> >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> >          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > @@ -1445,6 +1453,16 @@ struct rte_eth_conf {
> >  #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
> >  /** Device supports Tx queue setup after device started. */
> >  #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
> > +/**
> > + * Device supports shared Rx queue among ports within Rx domain and
> > + * switch domain. Mbufs are consumed by shared Rx queue instead of
> > + * each queue. Multiple groups is supported by share_group of Rx
> are supported by..
> 
> > + * queue configuration. Shared Rx queue is identified by PMD using
> > + * share_qid of Rx queue configuration. Polling any port in the group
> > + * receive packets of all member ports, source port identified by
> > + * mbuf->port field.
> > + */
> > +#define RTE_ETH_DEV_CAPA_RXQ_SHARE              RTE_BIT64(2)
> >  /**@}*/
> > 
> >  /*
> > @@ -1488,6 +1506,12 @@ struct rte_eth_switch_info {
> >          * but each driver should explicitly define the mapping of switch
> >          * port identifier to that physical interconnect/switch
> >          */
> > +       /**
> > +        * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
> > +        * and switch domain can share Rx queue. Valid only if device advertised
> > +        * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
> > +        */
> > +       uint16_t rx_domain;
> >  };
> > 
> >  /**
> > --
> > 2.33.0
> > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/6] ethdev: introduce shared Rx queue
  2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 1/6] " Xueming Li
  2021-10-19  0:21     ` Ajit Khaparde
@ 2021-10-19  6:28     ` Andrew Rybchenko
  1 sibling, 0 replies; 266+ messages in thread
From: Andrew Rybchenko @ 2021-10-19  6:28 UTC (permalink / raw)
  To: Xueming Li, dev
  Cc: Jerin Jacob, Ferruh Yigit, Viacheslav Ovsiienko, Thomas Monjalon,
	Lior Margalit, Ananyev Konstantin

On 10/18/21 3:59 PM, Xueming Li wrote:
> In current DPDK framework, each Rx queue is pre-loaded with mbufs to
> save incoming packets. For some PMDs, when number of representors scale
> out in a switch domain, the memory consumption became significant.
> Polling all ports also leads to high cache miss, high latency and low
> throughput.
> 
> This patch introduce shared Rx queue. Ports in same Rx domain and
> switch domain could share Rx queue set by specifying non-zero sharing
> group in Rx queue configuration.
> 
> Shared Rx queue is identified by share_rxq field of Rx queue
> configuration. Port A RxQ X can share RxQ with Port B RxQ Y by using
> same shared Rx queue ID.
> 
> No special API is defined to receive packets from shared Rx queue.
> Polling any member port of a shared Rx queue receives packets of that
> queue for all member ports, port_id is identified by mbuf->port. PMD is
> responsible to resolve shared Rx queue from device and queue data.
> 
> Shared Rx queue must be polled in same thread or core, polling a queue
> ID of any member port is essentially same.
> 
> Multiple share groups are supported. Device should support mixed
> configuration by allowing multiple share groups and non-shared Rx queue
> on one port.
> 
> Example grouping and polling model to reflect service priority:
>  Group1, 2 shared Rx queues per port: PF, rep0, rep1
>  Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
>  Core0: poll PF queue0
>  Core1: poll PF queue1
>  Core2: poll rep2 queue0
> 
> PMD advertise shared Rx queue capability via RTE_ETH_DEV_CAPA_RXQ_SHARE.
> 
> PMD is responsible for shared Rx queue consistency checks to avoid
> member port's configuration contradict to each other.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>

with few nits below:

Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>

[snip]

> diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> index ff6aa91c806..fe89a7f5c33 100644
> --- a/doc/guides/prog_guide/switch_representation.rst
> +++ b/doc/guides/prog_guide/switch_representation.rst
> @@ -123,6 +123,17 @@ thought as a software "patch panel" front-end for applications.
>  .. [1] `Ethernet switch device driver model (switchdev)
>         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
>  
> +- For some PMDs, memory usage of representors is huge when number of
> +  representor grows, mbufs are allocated for each descriptor of Rx queue.
> +  Polling large number of ports brings more CPU load, cache miss and
> +  latency. Shared Rx queue can be used to share Rx queue between PF and
> +  representors among same Rx domain. ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` in
> +  device info is used to indicate the capability. Setting non-zero share
> +  group in Rx queue configuration to enable share, share_qid is used to
> +  identifiy the shared Rx queue in group. Polling any member port can

identifiy -> identify

> +  receive packets of all member ports in the group, port ID is saved in
> +  ``mbuf.port``.
> +
>  Basic SR-IOV
>  ------------
>  
> diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
> index d5435a64aa1..2143e38ff11 100644
> --- a/doc/guides/rel_notes/release_21_11.rst
> +++ b/doc/guides/rel_notes/release_21_11.rst
> @@ -75,6 +75,12 @@ New Features
>      operations.
>    * Added multi-process support.
>  
> +* **Added ethdev shared Rx queue support.**
> +
> +  * Added new device capability flag and rx domain field to switch info.

rx -> Rx

> +  * Added share group and share queue ID to Rx queue configuration.
> +  * Added testpmd support and dedicate forwarding engine.
> +
>  * **Added new RSS offload types for IPv4/L4 checksum in RSS flow.**
>  
>    Added macros ETH_RSS_IPV4_CHKSUM and ETH_RSS_L4_CHKSUM, now IPv4 and

[snip]


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v9 0/6] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (8 preceding siblings ...)
  2021-10-18 12:59 ` [dpdk-dev] [PATCH v8 0/6] ethdev: introduce " Xueming Li
@ 2021-10-19  8:17 ` Xueming Li
  2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 1/6] " Xueming Li
                     ` (5 more replies)
  2021-10-19 15:20 ` [dpdk-dev] [PATCH v10 0/6] ethdev: introduce " Xueming Li
                   ` (6 subsequent siblings)
  16 siblings, 6 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19  8:17 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

In the current DPDK framework, all Rx queues are pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces shared Rx queue. PF and representors in the same
Rx domain and switch domain could share an Rx queue set by specifying a
non-zero share group value in the Rx queue configuration.

All ports that share an Rx queue actually share the hardware descriptor
queue and feed all Rx queues from one descriptor supply, so memory is saved.

Polling any queue of the same shared Rx queue receives packets from all
member ports. The source port is identified by mbuf->port.

Multiple groups are supported by group ID. The queue number of each port
in a shared group should be identical. Queue index is 1:1 mapped within a
shared group. An example of two share groups:
 Group1, 4 shared Rx queues per member port: PF, repr0, repr1
 Group2, 2 shared Rx queues per member port: repr2, repr3, ... repr127
 Poll first port for each group:
  core	port	queue
  0	0	0
  1	0	1
  2	0	2
  3	0	3
  4	2	0
  5	2	1

A shared Rx queue must be polled on a single thread or core. If both PF0
and representor0 join the same share group, pf0rxq0 cannot be polled on
core1 while rep0rxq0 is polled on core2. Actually, polling one port within
a share group is sufficient, since polling any port in the group returns
packets for all ports in the group.
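
As a minimal application-side sketch (assuming the polled port joined a
share group; handle_pkt is a hypothetical stub and error handling is
omitted), polling a single member port yields packets for every member,
demultiplexed via mbuf->port:

  #include <rte_common.h>
  #include <rte_ethdev.h>
  #include <rte_mbuf.h>

  #define BURST 32

  /* Hypothetical per-packet handler; here it just frees the mbuf. */
  static void
  handle_pkt(uint16_t src_port, struct rte_mbuf *m)
  {
          RTE_SET_USED(src_port);
          rte_pktmbuf_free(m);
  }

  /* Poll one member port of the shared queue; returned packets may
   * belong to any member port of the group. */
  static void
  poll_shared_rxq(uint16_t member_port, uint16_t queue_id)
  {
          struct rte_mbuf *pkts[BURST];
          uint16_t i, nb;

          nb = rte_eth_rx_burst(member_port, queue_id, pkts, BURST);
          for (i = 0; i < nb; i++)
                  /* mbuf->port carries the actual source port. */
                  handle_pkt(pkts[i]->port, pkts[i]);
  }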

There was some discussion about aggregating member ports of the same group
into a dummy port, with several ways to achieve it. Since it is optional,
more feedback and requirements from users need to be collected before
making a decision.

v1:
  - initial version
v2:
  - add testpmd patches
v3:
  - change common forwarding api to macro for performance, thanks Jerin.
  - save global variable accessed in forwarding to flowstream to minimize
    cache miss
  - combined patches for each forwarding engine
  - support multiple groups in testpmd "--share-rxq" parameter
  - new api to aggregate shared rxq group
v4:
  - spelling fixes
  - remove shared-rxq support for all forwarding engines
  - add dedicate shared-rxq forwarding engine
v5:
 - fix grammars
 - remove aggregate api and leave it for later discussion
 - add release notes
 - add deployment example
v6:
 - replace RxQ offload flag with device offload capability flag
 - add Rx domain
 - RxQ is shared when share group > 0
 - update testpmd accordingly
v7:
 - fix testpmd share group id allocation
 - change rx_domain to 16bits
v8:
 - add new patch for testpmd to show device Rx domain ID and capability
 - new share_qid in RxQ configuration
v9:
 - fix some spelling

Xueming Li (6):
  ethdev: introduce shared Rx queue
  app/testpmd: dump device capability and Rx domain info
  app/testpmd: new parameter to enable shared Rx queue
  app/testpmd: dump port info for shared Rx queue
  app/testpmd: force shared Rx queue polled on same core
  app/testpmd: add forwarding engine for shared Rx queue

 app/test-pmd/config.c                         | 114 +++++++++++++-
 app/test-pmd/meson.build                      |   1 +
 app/test-pmd/parameters.c                     |  13 ++
 app/test-pmd/shared_rxq_fwd.c                 | 148 ++++++++++++++++++
 app/test-pmd/testpmd.c                        |  25 ++-
 app/test-pmd/testpmd.h                        |   5 +
 app/test-pmd/util.c                           |   3 +
 doc/guides/nics/features.rst                  |  13 ++
 doc/guides/nics/features/default.ini          |   1 +
 .../prog_guide/switch_representation.rst      |  11 ++
 doc/guides/rel_notes/release_21_11.rst        |   6 +
 doc/guides/testpmd_app_ug/run_app.rst         |   8 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
 lib/ethdev/rte_ethdev.c                       |   8 +
 lib/ethdev/rte_ethdev.h                       |  24 +++
 15 files changed, 379 insertions(+), 6 deletions(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v9 1/6] ethdev: introduce shared Rx queue
  2021-10-19  8:17 ` [dpdk-dev] [PATCH v9 0/6] ethdev: introduce " Xueming Li
@ 2021-10-19  8:17   ` Xueming Li
  2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 2/6] app/testpmd: dump device capability and Rx domain info Xueming Li
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19  8:17 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

In the current DPDK framework, each Rx queue is pre-loaded with mbufs to
save incoming packets. For some PMDs, when the number of representors
scales out in a switch domain, the memory consumption becomes significant.
Polling all ports also leads to high cache miss rates, high latency and
low throughput.

This patch introduces shared Rx queue. Ports in the same Rx domain and
switch domain could share an Rx queue set by specifying a non-zero sharing
group in the Rx queue configuration.

Shared Rx queue is identified by the share_qid field of Rx queue
configuration. Port A RxQ X can share an RxQ with Port B RxQ Y by using
the same shared Rx queue ID.

No special API is defined to receive packets from a shared Rx queue.
Polling any member port of a shared Rx queue receives packets of that
queue for all member ports; the source port_id is identified by
mbuf->port. The PMD is responsible for resolving the shared Rx queue
from device and queue data.

A shared Rx queue must be polled in the same thread or core; polling a
queue ID of any member port is essentially the same.

Multiple share groups are supported. The PMD should support mixed
configuration by allowing multiple share groups and non-shared Rx queues
on one port.

Example grouping and polling model to reflect service priority:
 Group1, 2 shared Rx queues per port: PF, rep0, rep1
 Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
 Core0: poll PF queue0
 Core1: poll PF queue1
 Core2: poll rep2 queue0

The PMD advertises shared Rx queue capability via RTE_ETH_DEV_CAPA_RXQ_SHARE.

The PMD is responsible for shared Rx queue consistency checks to avoid
member ports' configurations contradicting each other.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 doc/guides/nics/features.rst                  | 13 ++++++++++
 doc/guides/nics/features/default.ini          |  1 +
 .../prog_guide/switch_representation.rst      | 11 +++++++++
 doc/guides/rel_notes/release_21_11.rst        |  6 +++++
 lib/ethdev/rte_ethdev.c                       |  8 +++++++
 lib/ethdev/rte_ethdev.h                       | 24 +++++++++++++++++++
 6 files changed, 63 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index e346018e4b8..89f9accbca1 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same Rx domain of a switch domain.
+
+* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
+* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
+* **[uses]     rte_eth_rxconf**: ``share_group``, ``share_qid``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index d473b94091a..93f5d1b46f4 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c806..4f2532a91ea 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,17 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- For some PMDs, memory usage of representors is huge when the number
+  of representors grows, since mbufs are allocated for each descriptor
+  of every Rx queue. Polling a large number of ports brings more CPU
+  load, cache misses and latency. A shared Rx queue can be used to
+  share an Rx queue between the PF and representors in the same Rx
+  domain. ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` in device info indicates the
+  capability. Setting a non-zero share group in the Rx queue
+  configuration enables sharing; share_qid identifies the shared Rx
+  queue in the group. Polling any member port receives packets of all
+  member ports in the group; the source port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index d5435a64aa1..b34d9776a15 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -75,6 +75,12 @@ New Features
     operations.
   * Added multi-process support.
 
+* **Added ethdev shared Rx queue support.**
+
+  * Added new device capability flag and Rx domain field to switch info.
+  * Added share group and share queue ID to Rx queue configuration.
+  * Added testpmd support and a dedicated forwarding engine.
+
 * **Added new RSS offload types for IPv4/L4 checksum in RSS flow.**
 
   Added macros ETH_RSS_IPV4_CHKSUM and ETH_RSS_L4_CHKSUM, now IPv4 and
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 028907bc4b9..bc55f899f72 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -2159,6 +2159,14 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
 		return -EINVAL;
 	}
 
+	if (local_conf.share_group > 0 &&
+	    (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
+		RTE_ETHDEV_LOG(ERR,
+			"Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%hu while device doesn't support Rx queue share\n",
+			port_id, rx_queue_id, local_conf.share_group);
+		return -EINVAL;
+	}
+
 	/*
 	 * If LRO is enabled, check that the maximum aggregated packet
 	 * size is supported by the configured device.
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 6d80514ba7a..34acc91273d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1044,6 +1044,14 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	/**
+	 * Share group index in Rx domain and switch domain.
+	 * A non-zero value enables Rx queue sharing; zero disables it.
+	 * The PMD is responsible for Rx queue consistency checks so that
+	 * member ports' configurations do not contradict each other.
+	 */
+	uint16_t share_group;
+	uint16_t share_qid; /**< Shared Rx queue ID in group. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1445,6 +1453,16 @@ struct rte_eth_conf {
 #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
 /** Device supports Tx queue setup after device started. */
 #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
+/**
+ * Device supports shared Rx queue among ports within Rx domain and
+ * switch domain. Mbufs are consumed by the shared Rx queue instead of
+ * each queue. Multiple groups are supported by the share_group field
+ * of Rx queue configuration. A shared Rx queue is identified by the
+ * PMD using the share_qid field of Rx queue configuration. Polling
+ * any port in the group receives packets of all member ports; the
+ * source port is identified by the mbuf->port field.
+ */
+#define RTE_ETH_DEV_CAPA_RXQ_SHARE              RTE_BIT64(2)
 /**@}*/
 
 /*
@@ -1488,6 +1506,12 @@ struct rte_eth_switch_info {
 	 * but each driver should explicitly define the mapping of switch
 	 * port identifier to that physical interconnect/switch
 	 */
+	/**
+	 * Shared Rx queue sub-domain boundary. Only ports in the same Rx
+	 * domain and switch domain can share an Rx queue. Valid only if the
+	 * device advertised the RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
+	 */
+	uint16_t rx_domain;
 };
 
 /**
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v9 2/6] app/testpmd: dump device capability and Rx domain info
  2021-10-19  8:17 ` [dpdk-dev] [PATCH v9 0/6] ethdev: introduce " Xueming Li
  2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 1/6] " Xueming Li
@ 2021-10-19  8:17   ` Xueming Li
  2021-10-19  8:33     ` Andrew Rybchenko
  2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 3/6] app/testpmd: new parameter to enable shared Rx queue Xueming Li
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-10-19  8:17 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

Dump device capability and Rx domain ID if shared Rx queue is supported
by device.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 9c66329e96e..c0616dcd2fd 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -733,6 +733,7 @@ port_infos_display(portid_t port_id)
 	printf("Max segment number per MTU/TSO: %hu\n",
 		dev_info.tx_desc_lim.nb_mtu_seg_max);
 
+	printf("Device capabilities: 0x%"PRIx64"\n", dev_info.dev_capa);
 	/* Show switch info only if valid switch domain and port id is set */
 	if (dev_info.switch_info.domain_id !=
 		RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID) {
@@ -743,6 +744,9 @@ port_infos_display(portid_t port_id)
 			dev_info.switch_info.domain_id);
 		printf("Switch Port Id: %u\n",
 			dev_info.switch_info.port_id);
+		if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) != 0)
+			printf("Switch Rx domain: %u\n",
+			       dev_info.switch_info.rx_domain);
 	}
 }
 
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v9 3/6] app/testpmd: new parameter to enable shared Rx queue
  2021-10-19  8:17 ` [dpdk-dev] [PATCH v9 0/6] ethdev: introduce " Xueming Li
  2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 1/6] " Xueming Li
  2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 2/6] app/testpmd: dump device capability and Rx domain info Xueming Li
@ 2021-10-19  8:17   ` Xueming Li
  2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 4/6] app/testpmd: dump port info for " Xueming Li
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19  8:17 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

Adds "--rxq-share=X" parameter to enable shared RxQ, share if device
supports, otherwise fallback to standard RxQ.

Share group number grows per X ports. X defaults to MAX, implies all
ports join share group 1. Queue ID is mapped equally with shared Rx
queue ID.

Forwarding engine "shared-rxq" should be used which Rx only and update
stream statistics correctly.
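
For illustration only (not part of this patch), a hypothetical
invocation with two ports per share group; the device argument is a
placeholder:

    dpdk-testpmd -a <PCI_BDF,representor=0-3> -- -i --rxq-share=2 \
        --nb-cores=2 --rxq=2 --txq=2
    testpmd> set fwd shared-rxq
    testpmd> start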

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c                 |  7 ++++++-
 app/test-pmd/parameters.c             | 13 +++++++++++++
 app/test-pmd/testpmd.c                | 20 +++++++++++++++++---
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  7 +++++++
 5 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index c0616dcd2fd..f8fb8961cae 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2713,7 +2713,12 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->share_group > 0)
+				printf(" share_group=%u share_qid=%u",
+				       rx_conf->share_group,
+				       rx_conf->share_qid);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 3f94a82e321..30dae326310 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -167,6 +167,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share: number of ports per shared rxq groups, defaults to MAX(1 group)\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -607,6 +608,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			2, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1271,6 +1273,17 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
+				if (optarg == NULL) {
+					rxq_share = UINT32_MAX;
+				} else {
+					n = atoi(optarg);
+					if (n >= 0)
+						rxq_share = (uint32_t)n;
+					else
+						rte_exit(EXIT_FAILURE, "rxq-share must be >= 0\n");
+				}
+			}
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 97ae52e17ec..123142ed110 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -498,6 +498,11 @@ uint8_t record_core_cycles;
  */
 uint8_t record_burst_stats;
 
+/*
+ * Number of ports per shared Rx queue group, 0 to disable.
+ */
+uint32_t rxq_share;
+
 unsigned int num_sockets = 0;
 unsigned int socket_ids[RTE_MAX_NUMA_NODES];
 
@@ -3393,14 +3398,23 @@ dev_event_callback(const char *device_name, enum rte_dev_event_type type,
 }
 
 static void
-rxtx_port_config(struct rte_port *port)
+rxtx_port_config(portid_t pid)
 {
 	uint16_t qid;
 	uint64_t offloads;
+	struct rte_port *port = &ports[pid];
 
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)) {
+			/* Non-zero share group to enable RxQ share. */
+			port->rx_conf[qid].share_group = pid / rxq_share + 1;
+			port->rx_conf[qid].share_qid = qid; /* Equal mapping. */
+		}
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
@@ -3558,7 +3572,7 @@ init_port_config(void)
 				port->dev_conf.rxmode.mq_mode = ETH_MQ_RX_NONE;
 		}
 
-		rxtx_port_config(port);
+		rxtx_port_config(pid);
 
 		ret = eth_macaddr_get_print_err(pid, &port->eth_addr);
 		if (ret != 0)
@@ -3772,7 +3786,7 @@ init_port_dcb_config(portid_t pid,
 
 	memcpy(&rte_port->dev_conf, &port_conf, sizeof(struct rte_eth_conf));
 
-	rxtx_port_config(rte_port);
+	rxtx_port_config(pid);
 	/* VLAN filter */
 	rte_port->dev_conf.rxmode.offloads |= DEV_RX_OFFLOAD_VLAN_FILTER;
 	for (i = 0; i < RTE_DIM(vlan_tags); i++)
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 5863b2f43f3..3dfaaad94c0 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -477,6 +477,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint32_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern int nb_flows_flowgen;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 640eadeff73..ff5908dcd50 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -389,6 +389,13 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share=[X]``
+
+    Create queues in shared Rx queue mode if the device supports it.
+    The group number grows by one per X ports. X defaults to MAX, which
+    implies all ports join share group 1. Use the "shared-rxq" forwarding
+    engine; it is Rx-only and updates stream statistics correctly.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v9 4/6] app/testpmd: dump port info for shared Rx queue
  2021-10-19  8:17 ` [dpdk-dev] [PATCH v9 0/6] ethdev: introduce " Xueming Li
                     ` (2 preceding siblings ...)
  2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 3/6] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-10-19  8:17   ` Xueming Li
  2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 5/6] app/testpmd: force shared Rx queue polled on same core Xueming Li
  2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 6/6] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19  8:17 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

In the case of a shared Rx queue, polling any member port returns mbufs
for all members. This patch dumps mbuf->port for each packet.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/util.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 51506e49404..e98f136d5ed 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -100,6 +100,9 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_flow_restore_info info = { 0, };
 
 		mb = pkts[i];
+		if (rxq_share > 0)
+			MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ",
+				  mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v9 5/6] app/testpmd: force shared Rx queue polled on same core
  2021-10-19  8:17 ` [dpdk-dev] [PATCH v9 0/6] ethdev: introduce " Xueming Li
                     ` (3 preceding siblings ...)
  2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 4/6] app/testpmd: dump port info for " Xueming Li
@ 2021-10-19  8:17   ` Xueming Li
  2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 6/6] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19  8:17 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

A shared Rx queue must be polled on the same core. This patch checks
and stops forwarding if a shared RxQ is scheduled on multiple cores.

It is suggested to use the same number of Rx queues and polling cores.
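
For illustration only (not part of this patch): if ports 0 and 1 share
group 1 with --rxq=2, a valid schedule polls each shared queue on
exactly one core, e.g. core 0 polls port 0 queue 0 and core 1 polls
port 0 queue 1. Scheduling port 1 queue 0 on a third core would hit
the new check and stop forwarding.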

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c  | 103 +++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |   4 +-
 app/test-pmd/testpmd.h |   2 +
 3 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index f8fb8961cae..c4150d77589 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2890,6 +2890,109 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared rxq is scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, lcoreid_t src_lc,
+			   portid_t src_port, queueid_t src_rxq,
+			   uint32_t share_group, queueid_t share_rxq)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/* Check remaining cores. */
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (rxq_conf->share_group != share_group)
+				continue;
+			if (rxq_conf->share_qid != share_rxq)
+				continue;
+			printf("Shared Rx queue group %u queue %hu can't be scheduled on different cores:\n",
+			       share_group, share_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("Please use --nb-cores=%hu to limit number of forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * A shared group must not be scheduled on different cores.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Update lcore info stream being scheduled. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check shared rxq not scheduled on remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, lc_id,
+						       fs->rx_port,
+						       fs->rx_queue,
+						       rxq_conf->share_group,
+						       rxq_conf->share_qid))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 123142ed110..f3f81ef561f 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2236,10 +2236,12 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 3dfaaad94c0..f121a2da90c 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -144,6 +144,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -795,6 +796,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v9 6/6] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-19  8:17 ` [dpdk-dev] [PATCH v9 0/6] ethdev: introduce " Xueming Li
                     ` (4 preceding siblings ...)
  2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 5/6] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-10-19  8:17   ` Xueming Li
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19  8:17 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

To support the shared Rx queue, this patch introduces a dedicated
forwarding engine. The engine groups received packets by mbuf->port
into sub-bursts, updates stream statistics and simply frees the
packets.
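
For illustration only (not part of this patch): a burst received as
{port0, port0, port1, port1, port1} is split into a sub-burst of two
packets credited to port 0's stream and a sub-burst of three packets
credited to port 1's stream, and then both sub-bursts are freed.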

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/meson.build                    |   1 +
 app/test-pmd/shared_rxq_fwd.c               | 148 ++++++++++++++++++++
 app/test-pmd/testpmd.c                      |   1 +
 app/test-pmd/testpmd.h                      |   1 +
 doc/guides/testpmd_app_ug/run_app.rst       |   1 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
 6 files changed, 156 insertions(+), 1 deletion(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 98f3289bdfa..07042e45b12 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -21,6 +21,7 @@ sources = files(
         'noisy_vnf.c',
         'parameters.c',
         'rxonly.c',
+        'shared_rxq_fwd.c',
         'testpmd.c',
         'txonly.c',
         'util.c',
diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-pmd/shared_rxq_fwd.c
new file mode 100644
index 00000000000..4e262b99bc7
--- /dev/null
+++ b/app/test-pmd/shared_rxq_fwd.c
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <stdarg.h>
+#include <string.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <sys/queue.h>
+#include <sys/stat.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_debug.h>
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memcpy.h>
+#include <rte_launch.h>
+#include <rte_eal.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_mbuf.h>
+#include <rte_pci.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_string_fns.h>
+#include <rte_ip.h>
+#include <rte_udp.h>
+#include <rte_net.h>
+#include <rte_flow.h>
+
+#include "testpmd.h"
+
+/*
+ * Rx only sub-burst forwarding.
+ */
+static void
+forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst)
+{
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
+}
+
+/**
+ * Get packet source stream by source port and queue.
+ * All streams of the same shared Rx queue are located on the same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_sub_burst(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		  struct rte_mbuf **pkts)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		forward_rx_only(nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * Source port of packets are identified by mbuf->port.
+ */
+static void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst)
+{
+	uint16_t i, nb_sub_burst, port, last_port;
+
+	nb_sub_burst = 0;
+	last_port = pkts_burst[0]->port;
+	/* Locate sub-burst according to mbuf->port. */
+	for (i = 0; i < nb_rx; ++i) {
+		if (i + 1 < nb_rx)
+			rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (i > 0 && last_port != port) {
+			/* Forward packets with same source port. */
+			forward_sub_burst(fs, last_port, nb_sub_burst,
+					  &pkts_burst[i - nb_sub_burst]);
+			nb_sub_burst = 0;
+			last_port = port;
+		}
+		nb_sub_burst++;
+	}
+	/* Last sub-burst covers the trailing packets. */
+	forward_sub_burst(fs, last_port, nb_sub_burst,
+			  &pkts_burst[nb_rx - nb_sub_burst]);
+}
+
+static void
+shared_rxq_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];
+	uint16_t nb_rx;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
+				 nb_pkt_per_burst);
+	inc_rx_burst_stats(fs, nb_rx);
+	if (unlikely(nb_rx == 0))
+		return;
+	forward_shared_rxq(fs, nb_rx, pkts_burst);
+	get_end_cycles(fs, start_tsc);
+}
+
+struct fwd_engine shared_rxq_engine = {
+	.fwd_mode_name  = "shared_rxq",
+	.port_fwd_begin = NULL,
+	.port_fwd_end   = NULL,
+	.packet_fwd     = shared_rxq_fwd,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index f3f81ef561f..11a85d92d9a 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -188,6 +188,7 @@ struct fwd_engine * fwd_engines[] = {
 #ifdef RTE_LIBRTE_IEEE1588
 	&ieee1588_fwd_engine,
 #endif
+	&shared_rxq_engine,
 	NULL,
 };
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f121a2da90c..f1fd607e365 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -299,6 +299,7 @@ extern struct fwd_engine five_tuple_swap_fwd_engine;
 #ifdef RTE_LIBRTE_IEEE1588
 extern struct fwd_engine ieee1588_fwd_engine;
 #endif
+extern struct fwd_engine shared_rxq_engine;
 
 extern struct fwd_engine * fwd_engines[]; /**< NULL terminated array. */
 extern cmdline_parse_inst_t cmd_set_raw;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index ff5908dcd50..e4b97844ced 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -252,6 +252,7 @@ The command line options are:
        tm
        noisy
        5tswap
+       shared-rxq
 
 *   ``--rss-ip``
 
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index 8ead7a4a712..499874187f2 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -314,7 +314,7 @@ set fwd
 Set the packet forwarding mode::
 
    testpmd> set fwd (io|mac|macswap|flowgen| \
-                     rxonly|txonly|csum|icmpecho|noisy|5tswap) (""|retry)
+                     rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
 
 ``retry`` can be specified for forwarding engines except ``rx_only``.
 
@@ -357,6 +357,9 @@ The available information categories are:
 
   L4 swaps the source port and destination port of transport layer (TCP and UDP).
 
+* ``shared-rxq``: Receive-only mode for shared Rx queues.
+  Resolves the packet source port from the mbuf and updates stream statistics accordingly.
+
 Example::
 
    testpmd> set fwd rxonly
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v9 2/6] app/testpmd: dump device capability and Rx domain info
  2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 2/6] app/testpmd: dump device capability and Rx domain info Xueming Li
@ 2021-10-19  8:33     ` Andrew Rybchenko
  2021-10-19  9:10       ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Andrew Rybchenko @ 2021-10-19  8:33 UTC (permalink / raw)
  To: Xueming Li, dev, Zhang Yuying
  Cc: Jerin Jacob, Ferruh Yigit, Viacheslav Ovsiienko, Thomas Monjalon,
	Lior Margalit, Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

On 10/19/21 11:17 AM, Xueming Li wrote:
> Dump device capability and Rx domain ID if shared Rx queue is supported
> by device.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>

LGTM except one minor note:

Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>

> ---
>  app/test-pmd/config.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
> index 9c66329e96e..c0616dcd2fd 100644
> --- a/app/test-pmd/config.c
> +++ b/app/test-pmd/config.c
> @@ -733,6 +733,7 @@ port_infos_display(portid_t port_id)
>  	printf("Max segment number per MTU/TSO: %hu\n",
>  		dev_info.tx_desc_lim.nb_mtu_seg_max);
>  
> +	printf("Device capabilities: 0x%"PRIx64"\n", dev_info.dev_capa);

IMHO, it should be decoded

>  	/* Show switch info only if valid switch domain and port id is set */
>  	if (dev_info.switch_info.domain_id !=
>  		RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID) {
> @@ -743,6 +744,9 @@ port_infos_display(portid_t port_id)
>  			dev_info.switch_info.domain_id);
>  		printf("Switch Port Id: %u\n",
>  			dev_info.switch_info.port_id);
> +		if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) != 0)
> +			printf("Switch Rx domain: %u\n",
> +			       dev_info.switch_info.rx_domain);
>  	}
>  }
>  
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v9 2/6] app/testpmd: dump device capability and Rx domain info
  2021-10-19  8:33     ` Andrew Rybchenko
@ 2021-10-19  9:10       ` Xueming(Steven) Li
  2021-10-19  9:39         ` Andrew Rybchenko
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-19  9:10 UTC (permalink / raw)
  To: yuying.zhang, andrew.rybchenko, dev
  Cc: konstantin.ananyev, jerinjacobk, NBU-Contact-Thomas Monjalon,
	Slava Ovsiienko, ajit.khaparde, ferruh.yigit, Lior Margalit,
	xiaoyun.li

On Tue, 2021-10-19 at 11:33 +0300, Andrew Rybchenko wrote:
> On 10/19/21 11:17 AM, Xueming Li wrote:
> > Dump device capability and Rx domain ID if shared Rx queue is supported
> > by device.
> > 
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> 
> LGTM except one minor note:
> 
> Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> 
> > ---
> >  app/test-pmd/config.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
> > index 9c66329e96e..c0616dcd2fd 100644
> > --- a/app/test-pmd/config.c
> > +++ b/app/test-pmd/config.c
> > @@ -733,6 +733,7 @@ port_infos_display(portid_t port_id)
> >  	printf("Max segment number per MTU/TSO: %hu\n",
> >  		dev_info.tx_desc_lim.nb_mtu_seg_max);
> >  
> > +	printf("Device capabilities: 0x%"PRIx64"\n", dev_info.dev_capa);
> 
> IMHO, it should be decoded

Thanks for checking this, do you mean decode to readable names?
Then we need a new API rte_eth_dev_capability_name(), it's simple, but
is it ok to add API w/o RFC?

> 
> >  	/* Show switch info only if valid switch domain and port id is set */
> >  	if (dev_info.switch_info.domain_id !=
> >  		RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID) {
> > @@ -743,6 +744,9 @@ port_infos_display(portid_t port_id)
> >  			dev_info.switch_info.domain_id);
> >  		printf("Switch Port Id: %u\n",
> >  			dev_info.switch_info.port_id);
> > +		if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) != 0)
> > +			printf("Switch Rx domain: %u\n",
> > +			       dev_info.switch_info.rx_domain);
> >  	}
> >  }
> >  
> > 
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v9 2/6] app/testpmd: dump device capability and Rx domain info
  2021-10-19  9:10       ` Xueming(Steven) Li
@ 2021-10-19  9:39         ` Andrew Rybchenko
  0 siblings, 0 replies; 266+ messages in thread
From: Andrew Rybchenko @ 2021-10-19  9:39 UTC (permalink / raw)
  To: Xueming(Steven) Li, yuying.zhang, dev
  Cc: konstantin.ananyev, jerinjacobk, NBU-Contact-Thomas Monjalon,
	Slava Ovsiienko, ajit.khaparde, ferruh.yigit, Lior Margalit,
	xiaoyun.li

On 10/19/21 12:10 PM, Xueming(Steven) Li wrote:
> On Tue, 2021-10-19 at 11:33 +0300, Andrew Rybchenko wrote:
>> On 10/19/21 11:17 AM, Xueming Li wrote:
>>> Dump device capability and Rx domain ID if shared Rx queue is supported
>>> by device.
>>>
>>> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
>>
>> LGTM except one minor note:
>>
>> Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
>>
>>> ---
>>>  app/test-pmd/config.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
>>> index 9c66329e96e..c0616dcd2fd 100644
>>> --- a/app/test-pmd/config.c
>>> +++ b/app/test-pmd/config.c
>>> @@ -733,6 +733,7 @@ port_infos_display(portid_t port_id)
>>>  	printf("Max segment number per MTU/TSO: %hu\n",
>>>  		dev_info.tx_desc_lim.nb_mtu_seg_max);
>>>  
>>> +	printf("Device capabilities: 0x%"PRIx64"\n", dev_info.dev_capa);
>>
>> IMHO, it should be decoded
> 
> Thanks for checking this, do you mean decode to readable names?
> Then we need a new API rte_eth_dev_capability_name(), it's simple, but
> is it ok to add API w/o RFC?

It is trivial. So, I think it should be OK.

^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 0/6] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (9 preceding siblings ...)
  2021-10-19  8:17 ` [dpdk-dev] [PATCH v9 0/6] ethdev: introduce " Xueming Li
@ 2021-10-19 15:20 ` Xueming Li
  2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 1/6] ethdev: new API to resolve device capability name Xueming Li
                     ` (5 more replies)
  2021-10-19 15:28 ` [dpdk-dev] [PATCH v10 0/7] ethdev: introduce " Xueming Li
                   ` (5 subsequent siblings)
  16 siblings, 6 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:20 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

In the current DPDK framework, all Rx queues are pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces the shared Rx queue. The PF and representors in
the same Rx domain and switch domain can share an Rx queue set by
specifying a non-zero share group value in the Rx queue configuration.

All ports that share an Rx queue actually share the hardware descriptor
queue and feed all Rx queues from one descriptor supply, so memory is
saved.

Polling any queue of the same shared Rx queue receives packets from all
member ports. The source port is identified by mbuf->port.

Multiple groups are supported by group ID. The queue number of each
port in a shared group should be identical, and queue indexes are 1:1
mapped within a shared group.
An example of two share groups:
 Group1, 4 shared Rx queues per member port: PF, repr0, repr1
 Group2, 2 shared Rx queues per member port: repr2, repr3, ... repr127
 Poll first port for each group:
  core	port	queue
  0	0	0
  1	0	1
  2	0	2
  3	0	3
  4	2	0
  5	2	1

A shared Rx queue must be polled on a single thread or core. If both
PF0 and representor0 joined the same share group, pf0rxq0 cannot be
polled on core1 and rep0rxq0 on core2. Actually, polling one port
within a share group is sufficient, since polling any port in the group
will return packets for all ports in the group.

There was some discussion about aggregating member ports in the same
group into a dummy port, with several ways to achieve it. Since it is
optional, more feedback and requirements need to be collected from
users to make a better decision later.

v1:
  - initial version
v2:
  - add testpmd patches
v3:
  - change common forwarding api to macro for performance, thanks Jerin.
  - save global variable accessed in forwarding to flowstream to minimize
    cache miss
  - combined patches for each forwarding engine
  - support multiple groups in testpmd "--share-rxq" parameter
  - new api to aggregate shared rxq group
v4:
  - spelling fixes
  - remove shared-rxq support for all forwarding engines
  - add dedicate shared-rxq forwarding engine
v5:
 - fix grammar
 - remove aggregate api and leave it for later discussion
 - add release notes
 - add deployment example
v6:
 - replace RxQ offload flag with device offload capability flag
 - add Rx domain
 - RxQ is shared when share group > 0
 - update testpmd accordingly
v7:
 - fix testpmd share group id allocation
 - change rx_domain to 16bits
v8:
 - add new patch for testpmd to show device Rx domain ID and capability
 - new share_qid in RxQ configuration
v9:
 - fix some spelling
v10:
 - add device capability name api

Xueming Li (6):
  ethdev: new API to resolve device capability name
  app/testpmd: dump device capability and Rx domain info
  app/testpmd: new parameter to enable shared Rx queue
  app/testpmd: dump port info for shared Rx queue
  app/testpmd: force shared Rx queue polled on same core
  app/testpmd: add forwarding engine for shared Rx queue

 app/test-pmd/config.c                       | 139 +++++++++++++++++-
 app/test-pmd/meson.build                    |   1 +
 app/test-pmd/parameters.c                   |  13 ++
 app/test-pmd/shared_rxq_fwd.c               | 148 ++++++++++++++++++++
 app/test-pmd/testpmd.c                      |  25 +++-
 app/test-pmd/testpmd.h                      |   5 +
 app/test-pmd/util.c                         |   3 +
 doc/guides/testpmd_app_ug/run_app.rst       |   8 ++
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
 lib/ethdev/rte_ethdev.c                     |  30 ++++
 lib/ethdev/rte_ethdev.h                     |  14 ++
 lib/ethdev/version.map                      |   3 +
 12 files changed, 388 insertions(+), 6 deletions(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 1/6] ethdev: new API to resolve device capability name
  2021-10-19 15:20 ` [dpdk-dev] [PATCH v10 0/6] ethdev: introduce " Xueming Li
@ 2021-10-19 15:20   ` Xueming Li
  2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 2/6] app/testpmd: dump device capability and Rx domain info Xueming Li
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:20 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Ray Kinsella

This patch adds an API to return the name of a device capability.
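
For illustration only (not part of this patch), usage of the new API;
with the flag below it returns the string "RXQ_SHARE":

    printf("%s\n", rte_eth_dev_capability_name(RTE_ETH_DEV_CAPA_RXQ_SHARE));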

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 lib/ethdev/rte_ethdev.c | 30 ++++++++++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h | 14 ++++++++++++++
 lib/ethdev/version.map  |  3 +++
 3 files changed, 47 insertions(+)

diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index bc55f899f72..97217529449 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -165,6 +165,20 @@ static const struct {
 
 #undef RTE_TX_OFFLOAD_BIT2STR
 
+#define RTE_ETH_DEV_CAPA_BIT2STR(_name)	\
+	{ RTE_ETH_DEV_CAPA_##_name, #_name }
+
+static const struct {
+	uint64_t offload;
+	const char *name;
+} rte_eth_dev_capa_names[] = {
+	RTE_ETH_DEV_CAPA_BIT2STR(RUNTIME_RX_QUEUE_SETUP),
+	RTE_ETH_DEV_CAPA_BIT2STR(RUNTIME_TX_QUEUE_SETUP),
+	RTE_ETH_DEV_CAPA_BIT2STR(RXQ_SHARE),
+};
+
+#undef RTE_ETH_DEV_CAPA_BIT2STR
+
 /**
  * The user application callback description.
  *
@@ -1260,6 +1274,22 @@ rte_eth_dev_tx_offload_name(uint64_t offload)
 	return name;
 }
 
+const char *
+rte_eth_dev_capability_name(uint64_t capability)
+{
+	const char *name = "UNKNOWN";
+	unsigned int i;
+
+	for (i = 0; i < RTE_DIM(rte_eth_dev_capa_names); ++i) {
+		if (capability == rte_eth_dev_capa_names[i].offload) {
+			name = rte_eth_dev_capa_names[i].name;
+			break;
+		}
+	}
+
+	return name;
+}
+
 static inline int
 eth_dev_check_lro_pkt_size(uint16_t port_id, uint32_t config_size,
 		   uint32_t max_rx_pkt_len, uint32_t dev_info_size)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 34acc91273d..df8ef9382a9 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -2109,6 +2109,20 @@ const char *rte_eth_dev_rx_offload_name(uint64_t offload);
  */
 const char *rte_eth_dev_tx_offload_name(uint64_t offload);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get RTE_ETH_DEV_CAPA_* flag name.
+ *
+ * @param capability
+ *   Capability flag.
+ * @return
+ *   Capability name or 'UNKNOWN' if the flag cannot be recognized.
+ */
+__rte_experimental
+const char *rte_eth_dev_capability_name(uint64_t capability);
+
 /**
  * Configure an Ethernet device.
  * This function must be invoked first before any other function in the
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index efd729c0f2d..e1d403dd357 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -245,6 +245,9 @@ EXPERIMENTAL {
 	rte_mtr_meter_policy_delete;
 	rte_mtr_meter_policy_update;
 	rte_mtr_meter_policy_validate;
+
+	# added in 21.11
+	rte_eth_dev_capability_name;
 };
 
 INTERNAL {
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 2/6] app/testpmd: dump device capability and Rx domain info
  2021-10-19 15:20 ` [dpdk-dev] [PATCH v10 0/6] ethdev: introduce " Xueming Li
  2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 1/6] ethdev: new API to resolve device capability name Xueming Li
@ 2021-10-19 15:20   ` Xueming Li
  2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 3/6] app/testpmd: new parameter to enable shared Rx queue Xueming Li
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:20 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

Dump device capability and Rx domain ID if shared Rx queue is supported
by device.
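
For illustration only (not part of this patch): with only the RXQ_SHARE
bit (RTE_BIT64(2)) set, the added output lines would look like:

    Device capabilities: 0x4( RXQ_SHARE )
    Switch Rx domain: 0

where the Rx domain value is just an example.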

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 app/test-pmd/config.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 9c66329e96e..2c1b06c544d 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -582,6 +582,29 @@ device_infos_display(const char *identifier)
 	rte_devargs_reset(&da);
 }
 
+static void
+print_dev_capabilities(uint64_t capabilities)
+{
+	uint64_t single_capa;
+	int begin;
+	int end;
+	int bit;
+
+	if (capabilities == 0)
+		return;
+
+	begin = __builtin_ctzll(capabilities);
+	end = sizeof(capabilities) * CHAR_BIT - __builtin_clzll(capabilities);
+
+	single_capa = 1ULL << begin;
+	for (bit = begin; bit < end; bit++) {
+		if (capabilities & single_capa)
+			printf(" %s",
+			       rte_eth_dev_capability_name(single_capa));
+		single_capa <<= 1;
+	}
+}
+
 void
 port_infos_display(portid_t port_id)
 {
@@ -733,6 +756,9 @@ port_infos_display(portid_t port_id)
 	printf("Max segment number per MTU/TSO: %hu\n",
 		dev_info.tx_desc_lim.nb_mtu_seg_max);
 
+	printf("Device capabilities: 0x%"PRIx64"(", dev_info.dev_capa);
+	print_dev_capabilities(dev_info.dev_capa);
+	printf(" )\n");
 	/* Show switch info only if valid switch domain and port id is set */
 	if (dev_info.switch_info.domain_id !=
 		RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID) {
@@ -743,6 +769,9 @@ port_infos_display(portid_t port_id)
 			dev_info.switch_info.domain_id);
 		printf("Switch Port Id: %u\n",
 			dev_info.switch_info.port_id);
+		if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) != 0)
+			printf("Switch Rx domain: %u\n",
+			       dev_info.switch_info.rx_domain);
 	}
 }
 
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 3/6] app/testpmd: new parameter to enable shared Rx queue
  2021-10-19 15:20 ` [dpdk-dev] [PATCH v10 0/6] ethdev: introduce " Xueming Li
  2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 1/6] ethdev: new API to resolve device capability name Xueming Li
  2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 2/6] app/testpmd: dump device capability and Rx domain info Xueming Li
@ 2021-10-19 15:20   ` Xueming Li
  2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 4/6] app/testpmd: dump port info for " Xueming Li
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:20 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

Adds "--rxq-share=X" parameter to enable shared RxQ, share if device
supports, otherwise fallback to standard RxQ.

Share group number grows per X ports. X defaults to MAX, implies all
ports join share group 1. Queue ID is mapped equally with shared Rx
queue ID.

Forwarding engine "shared-rxq" should be used which Rx only and update
stream statistics correctly.
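
For illustration only (not part of this patch), the group assignment
implemented here is share_group = port_id / X + 1, so with
--rxq-share=2 and six ports:

    ports 0-1 -> share group 1
    ports 2-3 -> share group 2
    ports 4-5 -> share group 3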

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c                 |  7 ++++++-
 app/test-pmd/parameters.c             | 13 +++++++++++++
 app/test-pmd/testpmd.c                | 20 +++++++++++++++++---
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  7 +++++++
 5 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 2c1b06c544d..fa951a86704 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2738,7 +2738,12 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->share_group > 0)
+				printf(" share_group=%u share_qid=%u",
+				       rx_conf->share_group,
+				       rx_conf->share_qid);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 3f94a82e321..30dae326310 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -167,6 +167,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share: number of ports per shared rxq groups, defaults to MAX(1 group)\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -607,6 +608,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			2, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1271,6 +1273,17 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
+				if (optarg == NULL) {
+					rxq_share = UINT32_MAX;
+				} else {
+					n = atoi(optarg);
+					if (n >= 0)
+						rxq_share = (uint32_t)n;
+					else
+						rte_exit(EXIT_FAILURE, "rxq-share must be >= 0\n");
+				}
+			}
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 97ae52e17ec..123142ed110 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -498,6 +498,11 @@ uint8_t record_core_cycles;
  */
 uint8_t record_burst_stats;
 
+/*
+ * Number of ports per shared Rx queue group, 0 to disable.
+ */
+uint32_t rxq_share;
+
 unsigned int num_sockets = 0;
 unsigned int socket_ids[RTE_MAX_NUMA_NODES];
 
@@ -3393,14 +3398,23 @@ dev_event_callback(const char *device_name, enum rte_dev_event_type type,
 }
 
 static void
-rxtx_port_config(struct rte_port *port)
+rxtx_port_config(portid_t pid)
 {
 	uint16_t qid;
 	uint64_t offloads;
+	struct rte_port *port = &ports[pid];
 
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)) {
+			/* Non-zero share group to enable RxQ share. */
+			port->rx_conf[qid].share_group = pid / rxq_share + 1;
+			port->rx_conf[qid].share_qid = qid; /* Equal mapping. */
+		}
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
@@ -3558,7 +3572,7 @@ init_port_config(void)
 				port->dev_conf.rxmode.mq_mode = ETH_MQ_RX_NONE;
 		}
 
-		rxtx_port_config(port);
+		rxtx_port_config(pid);
 
 		ret = eth_macaddr_get_print_err(pid, &port->eth_addr);
 		if (ret != 0)
@@ -3772,7 +3786,7 @@ init_port_dcb_config(portid_t pid,
 
 	memcpy(&rte_port->dev_conf, &port_conf, sizeof(struct rte_eth_conf));
 
-	rxtx_port_config(rte_port);
+	rxtx_port_config(pid);
 	/* VLAN filter */
 	rte_port->dev_conf.rxmode.offloads |= DEV_RX_OFFLOAD_VLAN_FILTER;
 	for (i = 0; i < RTE_DIM(vlan_tags); i++)
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 5863b2f43f3..3dfaaad94c0 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -477,6 +477,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint32_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern int nb_flows_flowgen;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 640eadeff73..ff5908dcd50 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -389,6 +389,13 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share=[X]``
+
+    Create queues in shared Rx queue mode if the device supports it.
+    The group number grows by one per X ports. X defaults to MAX, which
+    implies all ports join share group 1. Use the "shared-rxq" forwarding
+    engine; it is Rx-only and updates stream statistics correctly.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 4/6] app/testpmd: dump port info for shared Rx queue
  2021-10-19 15:20 ` [dpdk-dev] [PATCH v10 0/6] ethdev: introduce " Xueming Li
                     ` (2 preceding siblings ...)
  2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 3/6] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-10-19 15:20   ` Xueming Li
  2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 5/6] app/testpmd: force shared Rx queue polled on same core Xueming Li
  2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 6/6] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:20 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

In the case of a shared Rx queue, polling any member port returns mbufs
for all members. This patch dumps mbuf->port for each packet.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/util.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 51506e49404..e98f136d5ed 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -100,6 +100,9 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_flow_restore_info info = { 0, };
 
 		mb = pkts[i];
+		if (rxq_share > 0)
+			MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ",
+				  mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 5/6] app/testpmd: force shared Rx queue polled on same core
  2021-10-19 15:20 ` [dpdk-dev] [PATCH v10 0/6] ethdev: introduce " Xueming Li
                     ` (3 preceding siblings ...)
  2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 4/6] app/testpmd: dump port info for " Xueming Li
@ 2021-10-19 15:20   ` Xueming Li
  2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 6/6] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:20 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

A shared Rx queue must be polled on the same core. This patch checks
and stops forwarding if a shared RxQ is scheduled on multiple cores.

It is suggested to use the same number of Rx queues and polling cores.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c  | 103 +++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |   4 +-
 app/test-pmd/testpmd.h |   2 +
 3 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index fa951a86704..1f1307178be 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2915,6 +2915,109 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared rxq is scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, lcoreid_t src_lc,
+			   portid_t src_port, queueid_t src_rxq,
+			   uint32_t share_group, queueid_t share_rxq)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/* Check remaining cores. */
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (rxq_conf->share_group != share_group)
+				continue;
+			if (rxq_conf->share_qid != share_rxq)
+				continue;
+			printf("Shared Rx queue group %u queue %hu can't be scheduled on different cores:\n",
+			       share_group, share_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("Please use --nb-cores=%hu to limit number of forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * A shared group must not be scheduled on different cores.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Update lcore info of the stream being scheduled. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check shared rxq not scheduled on remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, lc_id,
+						       fs->rx_port,
+						       fs->rx_queue,
+						       rxq_conf->share_group,
+						       rxq_conf->share_qid))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 123142ed110..f3f81ef561f 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2236,10 +2236,12 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 3dfaaad94c0..f121a2da90c 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -144,6 +144,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -795,6 +796,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 6/6] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-19 15:20 ` [dpdk-dev] [PATCH v10 0/6] ethdev: introduce " Xueming Li
                     ` (4 preceding siblings ...)
  2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 5/6] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-10-19 15:20   ` Xueming Li
  5 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:20 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

To support shared Rx queue, this patch introduces a dedicated forwarding
engine. The engine groups received packets by mbuf->port into sub-bursts,
updates stream statistics and simply frees the packets.
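
For example (hypothetical burst), mbuf->port values [0, 0, 1, 1, 0, 0]
are split into three sub-bursts: two packets for port 0, two for port 1
and two more for port 0; each sub-burst is counted on the stream of its
source port and then freed.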

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/meson.build                    |   1 +
 app/test-pmd/shared_rxq_fwd.c               | 148 ++++++++++++++++++++
 app/test-pmd/testpmd.c                      |   1 +
 app/test-pmd/testpmd.h                      |   1 +
 doc/guides/testpmd_app_ug/run_app.rst       |   1 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
 6 files changed, 156 insertions(+), 1 deletion(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 98f3289bdfa..07042e45b12 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -21,6 +21,7 @@ sources = files(
         'noisy_vnf.c',
         'parameters.c',
         'rxonly.c',
+        'shared_rxq_fwd.c',
         'testpmd.c',
         'txonly.c',
         'util.c',
diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-pmd/shared_rxq_fwd.c
new file mode 100644
index 00000000000..4e262b99bc7
--- /dev/null
+++ b/app/test-pmd/shared_rxq_fwd.c
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <stdarg.h>
+#include <string.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <sys/queue.h>
+#include <sys/stat.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_debug.h>
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memcpy.h>
+#include <rte_launch.h>
+#include <rte_eal.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_mbuf.h>
+#include <rte_pci.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_string_fns.h>
+#include <rte_ip.h>
+#include <rte_udp.h>
+#include <rte_net.h>
+#include <rte_flow.h>
+
+#include "testpmd.h"
+
+/*
+ * Rx only sub-burst forwarding.
+ */
+static void
+forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst)
+{
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
+}
+
+/**
+ * Get packet source stream by source port and queue.
+ * All streams of the same shared Rx queue are located on the same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_sub_burst(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		  struct rte_mbuf **pkts)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		forward_rx_only(nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * The source port of each packet is identified by mbuf->port.
+ */
+static void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst)
+{
+	uint16_t i, nb_sub_burst, port, last_port;
+
+	nb_sub_burst = 0;
+	last_port = pkts_burst[0]->port;
+	/* Locate sub-burst according to mbuf->port. */
+	for (i = 0; i < nb_rx - 1; ++i) {
+		rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (i > 0 && last_port != port) {
+			/* Forward packets with same source port. */
+			forward_sub_burst(fs, last_port, nb_sub_burst,
+					  &pkts_burst[i - nb_sub_burst]);
+			nb_sub_burst = 0;
+			last_port = port;
+		}
+		nb_sub_burst++;
+	}
+	/* Last sub-burst. */
+	nb_sub_burst++;
+	forward_sub_burst(fs, last_port, nb_sub_burst,
+			  &pkts_burst[nb_rx - nb_sub_burst]);
+}
+
+static void
+shared_rxq_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];
+	uint16_t nb_rx;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
+				 nb_pkt_per_burst);
+	inc_rx_burst_stats(fs, nb_rx);
+	if (unlikely(nb_rx == 0))
+		return;
+	forward_shared_rxq(fs, nb_rx, pkts_burst);
+	get_end_cycles(fs, start_tsc);
+}
+
+struct fwd_engine shared_rxq_engine = {
+	.fwd_mode_name  = "shared_rxq",
+	.port_fwd_begin = NULL,
+	.port_fwd_end   = NULL,
+	.packet_fwd     = shared_rxq_fwd,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index f3f81ef561f..11a85d92d9a 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -188,6 +188,7 @@ struct fwd_engine * fwd_engines[] = {
 #ifdef RTE_LIBRTE_IEEE1588
 	&ieee1588_fwd_engine,
 #endif
+	&shared_rxq_engine,
 	NULL,
 };
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f121a2da90c..f1fd607e365 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -299,6 +299,7 @@ extern struct fwd_engine five_tuple_swap_fwd_engine;
 #ifdef RTE_LIBRTE_IEEE1588
 extern struct fwd_engine ieee1588_fwd_engine;
 #endif
+extern struct fwd_engine shared_rxq_engine;
 
 extern struct fwd_engine * fwd_engines[]; /**< NULL terminated array. */
 extern cmdline_parse_inst_t cmd_set_raw;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index ff5908dcd50..e4b97844ced 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -252,6 +252,7 @@ The command line options are:
        tm
        noisy
        5tswap
+       shared-rxq
 
 *   ``--rss-ip``
 
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index 8ead7a4a712..499874187f2 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -314,7 +314,7 @@ set fwd
 Set the packet forwarding mode::
 
    testpmd> set fwd (io|mac|macswap|flowgen| \
-                     rxonly|txonly|csum|icmpecho|noisy|5tswap) (""|retry)
+                     rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
 
 ``retry`` can be specified for forwarding engines except ``rx_only``.
 
@@ -357,6 +357,9 @@ The available information categories are:
 
   L4 swaps the source port and destination port of transport layer (TCP and UDP).
 
+* ``shared-rxq``: Receive only for shared Rx queue.
+  Resolve packet source port from mbuf and update stream statistics accordingly.
+
 Example::
 
    testpmd> set fwd rxonly
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 0/7] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (10 preceding siblings ...)
  2021-10-19 15:20 ` [dpdk-dev] [PATCH v10 0/6] ethdev: introduce " Xueming Li
@ 2021-10-19 15:28 ` Xueming Li
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 1/7] " Xueming Li
                     ` (6 more replies)
  2021-10-20  7:53 ` [dpdk-dev] [PATCH v11 0/7] ethdev: introduce " Xueming Li
                   ` (4 subsequent siblings)
  16 siblings, 7 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:28 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

In the current DPDK framework, all Rx queues are pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces shared Rx queue. The PF and representors in the
same Rx domain and switch domain can share an Rx queue set by
specifying a non-zero share group value in the Rx queue configuration.

All ports that share an Rx queue actually share the hardware descriptor
queue and feed all Rx queues with one descriptor supply; memory is saved.

Polling any queue of the same shared Rx queue receives packets from all
member ports. The source port is identified by mbuf->port.

Multiple groups are supported by group ID. The queue number of each
port in a shared group should be identical. The queue index is 1:1
mapped within a shared group.
An example of two share groups:
 Group1, 4 shared Rx queues per member port: PF, repr0, repr1
 Group2, 2 shared Rx queues per member port: repr2, repr3, ... repr127
 Poll first port for each group:
  core	port	queue
  0	0	0
  1	0	1
  2	0	2
  3	0	3
  4	2	0
  5	2	1

A shared Rx queue must be polled on a single thread or core. If both PF0
and representor0 joined the same share group, pf0rxq0 can't be polled on
core1 while rep0rxq0 is polled on core2. Actually, polling one port
within a share group is sufficient, since polling any port in the group
will return packets for any port in the group.
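
A minimal polling sketch (member_port, qid, pkts, handle_pkt and
MAX_BURST are hypothetical; any member port of the group could be
polled instead):

    /* One poll per share group; packets of all member ports arrive
     * here and mbuf->port carries the real source port. */
    nb = rte_eth_rx_burst(member_port, qid, pkts, MAX_BURST);
    for (i = 0; i < nb; i++)
        handle_pkt(pkts[i]->port, pkts[i]);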

There was some discussion about aggregating member ports in the same
group into a dummy port, with several ways to achieve it. Since it is
optional, more feedback and requirements need to be collected from
users to make a better decision later.

v1:
  - initial version
v2:
  - add testpmd patches
v3:
  - change common forwarding api to macro for performance, thanks Jerin.
  - save global variable accessed in forwarding to flowstream to minimize
    cache miss
  - combined patches for each forwarding engine
  - support multiple groups in testpmd "--share-rxq" parameter
  - new api to aggregate shared rxq group
v4:
  - spelling fixes
  - remove shared-rxq support for all forwarding engines
  - add dedicate shared-rxq forwarding engine
v5:
 - fix grammars
 - remove aggregate api and leave it for later discussion
 - add release notes
 - add deployment example
v6:
 - replace RxQ offload flag with device offload capability flag
 - add Rx domain
 - RxQ is shared when share group > 0
 - update testpmd accordingly
v7:
 - fix testpmd share group id allocation
 - change rx_domain to 16bits
v8:
 - add new patch for testpmd to show device Rx domain ID and capability
 - new share_qid in RxQ configuration
v9:
 - fix some spelling
v10:
 - add device capability name api

Xueming Li (7):
  ethdev: introduce shared Rx queue
  ethdev: new API to resolve device capability name
  app/testpmd: dump device capability and Rx domain info
  app/testpmd: new parameter to enable shared Rx queue
  app/testpmd: dump port info for shared Rx queue
  app/testpmd: force shared Rx queue polled on same core
  app/testpmd: add forwarding engine for shared Rx queue

 app/test-pmd/config.c                         | 139 +++++++++++++++-
 app/test-pmd/meson.build                      |   1 +
 app/test-pmd/parameters.c                     |  13 ++
 app/test-pmd/shared_rxq_fwd.c                 | 148 ++++++++++++++++++
 app/test-pmd/testpmd.c                        |  25 ++-
 app/test-pmd/testpmd.h                        |   5 +
 app/test-pmd/util.c                           |   3 +
 doc/guides/nics/features.rst                  |  13 ++
 doc/guides/nics/features/default.ini          |   1 +
 .../prog_guide/switch_representation.rst      |  11 ++
 doc/guides/rel_notes/release_21_11.rst        |   6 +
 doc/guides/testpmd_app_ug/run_app.rst         |   8 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
 lib/ethdev/rte_ethdev.c                       |  38 +++++
 lib/ethdev/rte_ethdev.h                       |  38 +++++
 lib/ethdev/version.map                        |   3 +
 16 files changed, 451 insertions(+), 6 deletions(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 1/7] ethdev: introduce shared Rx queue
  2021-10-19 15:28 ` [dpdk-dev] [PATCH v10 0/7] ethdev: introduce " Xueming Li
@ 2021-10-19 15:28   ` Xueming Li
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 2/7] ethdev: new API to resolve device capability name Xueming Li
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:28 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

In the current DPDK framework, each Rx queue is pre-loaded with mbufs
to save incoming packets. For some PMDs, when the number of
representors scales out in a switch domain, the memory consumption
becomes significant. Polling all ports also leads to high cache miss
rates, high latency and low throughput.

This patch introduces shared Rx queue. Ports in the same Rx domain and
switch domain can share an Rx queue set by specifying a non-zero
sharing group in the Rx queue configuration.

A shared Rx queue is identified by the share_qid field of the Rx queue
configuration. Port A RxQ X can share an RxQ with Port B RxQ Y by using
the same shared Rx queue ID.

No special API is defined to receive packets from a shared Rx queue.
Polling any member port of a shared Rx queue receives packets of that
queue for all member ports; the source port_id is identified by
mbuf->port. The PMD is responsible for resolving the shared Rx queue
from device and queue data.

A shared Rx queue must be polled in the same thread or core; polling
the queue of any member port is essentially the same.

Multiple share groups are supported. The PMD should support mixed
configuration by allowing multiple share groups and non-shared Rx
queues on one port.

Example grouping and polling model to reflect service priority:
 Group1, 2 shared Rx queues per port: PF, rep0, rep1
 Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
 Core0: poll PF queue0
 Core1: poll PF queue1
 Core2: poll rep2 queue0

The PMD advertises the shared Rx queue capability via RTE_ETH_DEV_CAPA_RXQ_SHARE.

The PMD is responsible for shared Rx queue consistency checks to avoid
member ports' configurations contradicting each other.
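
A minimal setup sketch (port_id, nb_rxq, nb_rxd and mbuf_pool are
hypothetical; error handling omitted):

    struct rte_eth_dev_info info;
    struct rte_eth_rxconf rxconf;
    uint16_t qid;

    rte_eth_dev_info_get(port_id, &info);
    if (info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) {
        for (qid = 0; qid < nb_rxq; qid++) {
            rxconf = info.default_rxconf;
            rxconf.share_group = 1; /* non-zero enables sharing */
            rxconf.share_qid = qid; /* 1:1 queue mapping in group */
            rte_eth_rx_queue_setup(port_id, qid, nb_rxd,
                                   rte_eth_dev_socket_id(port_id),
                                   &rxconf, mbuf_pool);
        }
    }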

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 doc/guides/nics/features.rst                  | 13 ++++++++++
 doc/guides/nics/features/default.ini          |  1 +
 .../prog_guide/switch_representation.rst      | 11 +++++++++
 doc/guides/rel_notes/release_21_11.rst        |  6 +++++
 lib/ethdev/rte_ethdev.c                       |  8 +++++++
 lib/ethdev/rte_ethdev.h                       | 24 +++++++++++++++++++
 6 files changed, 63 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index e346018e4b8..89f9accbca1 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same Rx domain of a switch domain.
+
+* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
+* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
+* **[uses]     rte_eth_rxconf**: ``share_group``, ``share_qid``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index d473b94091a..93f5d1b46f4 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c806..4f2532a91ea 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,17 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- For some PMDs, memory usage of representors is huge when the number of
+  representors grows, since mbufs are allocated for each descriptor of
+  every Rx queue. Polling a large number of ports brings more CPU load,
+  cache misses and latency. A shared Rx queue can be used to share an Rx
+  queue between the PF and representors in the same Rx domain.
+  ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` in device info is used to indicate the
+  capability. Set a non-zero share group in the Rx queue configuration
+  to enable sharing; share_qid is used to identify the shared Rx queue
+  in the group. Polling any member port can receive packets of all
+  member ports in the group, the port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index d5435a64aa1..b34d9776a15 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -75,6 +75,12 @@ New Features
     operations.
   * Added multi-process support.
 
+* **Added ethdev shared Rx queue support.**
+
+  * Added new device capability flag and Rx domain field to switch info.
+  * Added share group and share queue ID to Rx queue configuration.
+  * Added testpmd support and dedicate forwarding engine.
+
 * **Added new RSS offload types for IPv4/L4 checksum in RSS flow.**
 
   Added macros ETH_RSS_IPV4_CHKSUM and ETH_RSS_L4_CHKSUM, now IPv4 and
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 028907bc4b9..bc55f899f72 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -2159,6 +2159,14 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
 		return -EINVAL;
 	}
 
+	if (local_conf.share_group > 0 &&
+	    (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
+		RTE_ETHDEV_LOG(ERR,
+			"Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%hu while device doesn't support Rx queue share\n",
+			port_id, rx_queue_id, local_conf.share_group);
+		return -EINVAL;
+	}
+
 	/*
 	 * If LRO is enabled, check that the maximum aggregated packet
 	 * size is supported by the configured device.
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 6d80514ba7a..34acc91273d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1044,6 +1044,14 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	/**
+	 * Share group index in Rx domain and switch domain.
+	 * Non-zero value to enable Rx queue share, zero value disable share.
+	 * The PMD is responsible for Rx queue consistency checks to avoid
+	 * member ports' configurations contradicting each other.
+	 */
+	uint16_t share_group;
+	uint16_t share_qid; /**< Shared Rx queue ID in group. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1445,6 +1453,16 @@ struct rte_eth_conf {
 #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
 /** Device supports Tx queue setup after device started. */
 #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
+/**
+ * Device supports shared Rx queue among ports within Rx domain and
+ * switch domain. Mbufs are consumed by the shared Rx queue instead
+ * of each queue. Multiple groups are supported by the share_group
+ * field of Rx queue configuration. The shared Rx queue is identified
+ * by the PMD using the share_qid field. Polling any port in the group
+ * receives packets of all member ports; the source port is identified
+ * by the mbuf->port field.
+ */
+#define RTE_ETH_DEV_CAPA_RXQ_SHARE              RTE_BIT64(2)
 /**@}*/
 
 /*
@@ -1488,6 +1506,12 @@ struct rte_eth_switch_info {
 	 * but each driver should explicitly define the mapping of switch
 	 * port identifier to that physical interconnect/switch
 	 */
+	/**
+	 * Shared Rx queue sub-domain boundary. Only ports in the same Rx
+	 * domain and switch domain can share an Rx queue. Valid only if the
+	 * device advertises the RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
+	 */
+	uint16_t rx_domain;
 };
 
 /**
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 2/7] ethdev: new API to resolve device capability name
  2021-10-19 15:28 ` [dpdk-dev] [PATCH v10 0/7] ethdev: introduce " Xueming Li
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 1/7] " Xueming Li
@ 2021-10-19 15:28   ` Xueming Li
  2021-10-19 17:57     ` Andrew Rybchenko
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 3/7] app/testpmd: dump device capability and Rx domain info Xueming Li
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:28 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Ray Kinsella

This patch adds an API to return the name of a device capability.
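
A usage sketch (assuming dev_info was already filled by
rte_eth_dev_info_get()):

    uint64_t capa = dev_info.dev_capa;

    while (capa != 0) {
        uint64_t bit = capa & ~(capa - 1); /* extract lowest set bit */
        printf("  %s\n", rte_eth_dev_capability_name(bit));
        capa &= capa - 1; /* clear that bit */
    }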

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 lib/ethdev/rte_ethdev.c | 30 ++++++++++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h | 14 ++++++++++++++
 lib/ethdev/version.map  |  3 +++
 3 files changed, 47 insertions(+)

diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index bc55f899f72..97217529449 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -165,6 +165,20 @@ static const struct {
 
 #undef RTE_TX_OFFLOAD_BIT2STR
 
+#define RTE_ETH_DEV_CAPA_BIT2STR(_name)	\
+	{ RTE_ETH_DEV_CAPA_##_name, #_name }
+
+static const struct {
+	uint64_t offload;
+	const char *name;
+} rte_eth_dev_capa_names[] = {
+	RTE_ETH_DEV_CAPA_BIT2STR(RUNTIME_RX_QUEUE_SETUP),
+	RTE_ETH_DEV_CAPA_BIT2STR(RUNTIME_TX_QUEUE_SETUP),
+	RTE_ETH_DEV_CAPA_BIT2STR(RXQ_SHARE),
+};
+
+#undef RTE_ETH_DEV_CAPA_BIT2STR
+
 /**
  * The user application callback description.
  *
@@ -1260,6 +1274,22 @@ rte_eth_dev_tx_offload_name(uint64_t offload)
 	return name;
 }
 
+const char *
+rte_eth_dev_capability_name(uint64_t capability)
+{
+	const char *name = "UNKNOWN";
+	unsigned int i;
+
+	for (i = 0; i < RTE_DIM(rte_eth_dev_capa_names); ++i) {
+		if (capability == rte_eth_dev_capa_names[i].offload) {
+			name = rte_eth_dev_capa_names[i].name;
+			break;
+		}
+	}
+
+	return name;
+}
+
 static inline int
 eth_dev_check_lro_pkt_size(uint16_t port_id, uint32_t config_size,
 		   uint32_t max_rx_pkt_len, uint32_t dev_info_size)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 34acc91273d..df8ef9382a9 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -2109,6 +2109,20 @@ const char *rte_eth_dev_rx_offload_name(uint64_t offload);
  */
 const char *rte_eth_dev_tx_offload_name(uint64_t offload);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get RTE_ETH_DEV_CAPA_* flag name.
+ *
+ * @param capability
+ *   Capability flag.
+ * @return
+ *   Capability name or 'UNKNOWN' if the flag cannot be recognized.
+ */
+__rte_experimental
+const char *rte_eth_dev_capability_name(uint64_t capability);
+
 /**
  * Configure an Ethernet device.
  * This function must be invoked first before any other function in the
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index efd729c0f2d..e1d403dd357 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -245,6 +245,9 @@ EXPERIMENTAL {
 	rte_mtr_meter_policy_delete;
 	rte_mtr_meter_policy_update;
 	rte_mtr_meter_policy_validate;
+
+	# added in 21.11
+	rte_eth_dev_capability_name;
 };
 
 INTERNAL {
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 3/7] app/testpmd: dump device capability and Rx domain info
  2021-10-19 15:28 ` [dpdk-dev] [PATCH v10 0/7] ethdev: introduce " Xueming Li
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 1/7] " Xueming Li
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 2/7] ethdev: new API to resolve device capability name Xueming Li
@ 2021-10-19 15:28   ` Xueming Li
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:28 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

Dump the device capability and Rx domain ID if shared Rx queue is
supported by the device.
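
For illustration, the resulting "show port info" output could contain
lines like these (values are hypothetical):

    Device capabilities: 0x5( RUNTIME_RX_QUEUE_SETUP RXQ_SHARE )
    Switch Rx domain: 0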

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 app/test-pmd/config.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 9c66329e96e..2c1b06c544d 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -582,6 +582,29 @@ device_infos_display(const char *identifier)
 	rte_devargs_reset(&da);
 }
 
+static void
+print_dev_capabilities(uint64_t capabilities)
+{
+	uint64_t single_capa;
+	int begin;
+	int end;
+	int bit;
+
+	if (capabilities == 0)
+		return;
+
+	begin = __builtin_ctzll(capabilities);
+	end = sizeof(capabilities) * CHAR_BIT - __builtin_clzll(capabilities);
+
+	single_capa = 1ULL << begin;
+	for (bit = begin; bit < end; bit++) {
+		if (capabilities & single_capa)
+			printf(" %s",
+			       rte_eth_dev_capability_name(single_capa));
+		single_capa <<= 1;
+	}
+}
+
 void
 port_infos_display(portid_t port_id)
 {
@@ -733,6 +756,9 @@ port_infos_display(portid_t port_id)
 	printf("Max segment number per MTU/TSO: %hu\n",
 		dev_info.tx_desc_lim.nb_mtu_seg_max);
 
+	printf("Device capabilities: 0x%"PRIx64"(", dev_info.dev_capa);
+	print_dev_capabilities(dev_info.dev_capa);
+	printf(" )\n");
 	/* Show switch info only if valid switch domain and port id is set */
 	if (dev_info.switch_info.domain_id !=
 		RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID) {
@@ -743,6 +769,9 @@ port_infos_display(portid_t port_id)
 			dev_info.switch_info.domain_id);
 		printf("Switch Port Id: %u\n",
 			dev_info.switch_info.port_id);
+		if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) != 0)
+			printf("Switch Rx domain: %u\n",
+			       dev_info.switch_info.rx_domain);
 	}
 }
 
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 4/7] app/testpmd: new parameter to enable shared Rx queue
  2021-10-19 15:28 ` [dpdk-dev] [PATCH v10 0/7] ethdev: introduce " Xueming Li
                     ` (2 preceding siblings ...)
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 3/7] app/testpmd: dump device capability and Rx domain info Xueming Li
@ 2021-10-19 15:28   ` Xueming Li
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 5/7] app/testpmd: dump port info for " Xueming Li
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:28 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

Adds the "--rxq-share=X" parameter to enable shared RxQ. Queues are
shared if the device supports it, otherwise they fall back to standard
RxQ.

The share group number grows every X ports. X defaults to MAX, which
implies that all ports join share group 1. The queue ID is mapped
equally to the shared Rx queue ID.

The "shared-rxq" forwarding engine should be used; it is Rx-only and
updates stream statistics correctly.
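
For example (hypothetical): with 4 ports and --rxq-share=2, ports 0-1
join share group 1 and ports 2-3 join share group 2, while queue N of
each member port maps to shared queue N.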

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c                 |  7 ++++++-
 app/test-pmd/parameters.c             | 13 +++++++++++++
 app/test-pmd/testpmd.c                | 20 +++++++++++++++++---
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  7 +++++++
 5 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 2c1b06c544d..fa951a86704 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2738,7 +2738,12 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->share_group > 0)
+				printf(" share_group=%u share_qid=%u",
+				       rx_conf->share_group,
+				       rx_conf->share_qid);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 3f94a82e321..30dae326310 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -167,6 +167,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share: number of ports per shared Rx queue group, defaults to MAX (1 group)\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -607,6 +608,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			2, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1271,6 +1273,17 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
+				if (optarg == NULL) {
+					rxq_share = UINT32_MAX;
+				} else {
+					n = atoi(optarg);
+					if (n >= 0)
+						rxq_share = (uint32_t)n;
+					else
+						rte_exit(EXIT_FAILURE, "rxq-share must be >= 0\n");
+				}
+			}
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 97ae52e17ec..123142ed110 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -498,6 +498,11 @@ uint8_t record_core_cycles;
  */
 uint8_t record_burst_stats;
 
+/*
+ * Number of ports per shared Rx queue group, 0 disable.
+ */
+uint32_t rxq_share;
+
 unsigned int num_sockets = 0;
 unsigned int socket_ids[RTE_MAX_NUMA_NODES];
 
@@ -3393,14 +3398,23 @@ dev_event_callback(const char *device_name, enum rte_dev_event_type type,
 }
 
 static void
-rxtx_port_config(struct rte_port *port)
+rxtx_port_config(portid_t pid)
 {
 	uint16_t qid;
 	uint64_t offloads;
+	struct rte_port *port = &ports[pid];
 
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)) {
+			/* Non-zero share group to enable RxQ share. */
+			port->rx_conf[qid].share_group = pid / rxq_share + 1;
+			port->rx_conf[qid].share_qid = qid; /* Equal mapping. */
+		}
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
@@ -3558,7 +3572,7 @@ init_port_config(void)
 				port->dev_conf.rxmode.mq_mode = ETH_MQ_RX_NONE;
 		}
 
-		rxtx_port_config(port);
+		rxtx_port_config(pid);
 
 		ret = eth_macaddr_get_print_err(pid, &port->eth_addr);
 		if (ret != 0)
@@ -3772,7 +3786,7 @@ init_port_dcb_config(portid_t pid,
 
 	memcpy(&rte_port->dev_conf, &port_conf, sizeof(struct rte_eth_conf));
 
-	rxtx_port_config(rte_port);
+	rxtx_port_config(pid);
 	/* VLAN filter */
 	rte_port->dev_conf.rxmode.offloads |= DEV_RX_OFFLOAD_VLAN_FILTER;
 	for (i = 0; i < RTE_DIM(vlan_tags); i++)
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 5863b2f43f3..3dfaaad94c0 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -477,6 +477,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint32_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern int nb_flows_flowgen;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 640eadeff73..ff5908dcd50 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -389,6 +389,13 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share=[X]``
+
+    Create queues in shared Rx queue mode if the device supports it.
+    The group number grows every X ports. X defaults to MAX, which
+    implies that all ports join share group 1. The "shared-rxq" forwarding
+    engine should be used; it is Rx-only and updates stream statistics correctly.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 5/7] app/testpmd: dump port info for shared Rx queue
  2021-10-19 15:28 ` [dpdk-dev] [PATCH v10 0/7] ethdev: introduce " Xueming Li
                     ` (3 preceding siblings ...)
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-10-19 15:28   ` Xueming Li
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 6/7] app/testpmd: force shared Rx queue polled on same core Xueming Li
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:28 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

In the case of a shared Rx queue, polling any member port returns mbufs
for all members. This patch dumps mbuf->port for each packet.
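
For illustration (hypothetical values), each verbose dump line is then
prefixed with the originating port, e.g. "port 1, " followed by the
usual packet fields.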

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/util.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 51506e49404..e98f136d5ed 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -100,6 +100,9 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_flow_restore_info info = { 0, };
 
 		mb = pkts[i];
+		if (rxq_share > 0)
+			MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ",
+				  mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 6/7] app/testpmd: force shared Rx queue polled on same core
  2021-10-19 15:28 ` [dpdk-dev] [PATCH v10 0/7] ethdev: introduce " Xueming Li
                     ` (4 preceding siblings ...)
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 5/7] app/testpmd: dump port info for " Xueming Li
@ 2021-10-19 15:28   ` Xueming Li
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:28 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

Shared Rx queue must be polled on the same core. This patch checks and
stops forwarding if a shared RxQ is being scheduled on multiple cores.

It's suggested to use the same number of Rx queues and polling cores.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c  | 103 +++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |   4 +-
 app/test-pmd/testpmd.h |   2 +
 3 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index fa951a86704..1f1307178be 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2915,6 +2915,109 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared Rx queue is scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, lcoreid_t src_lc,
+			   portid_t src_port, queueid_t src_rxq,
+			   uint32_t share_group, queueid_t share_rxq)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/* Check remaining cores. */
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (rxq_conf->share_group != share_group)
+				continue;
+			if (rxq_conf->share_qid != share_rxq)
+				continue;
+			printf("Shared Rx queue group %u queue %hu can't be scheduled on different cores:\n",
+			       share_group, share_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("Please use --nb-cores=%hu to limit number of forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * A shared group must not be scheduled on different cores.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Update lcore info of the stream being scheduled. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check shared rxq not scheduled on remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, lc_id,
+						       fs->rx_port,
+						       fs->rx_queue,
+						       rxq_conf->share_group,
+						       rxq_conf->share_qid))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 123142ed110..f3f81ef561f 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2236,10 +2236,12 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 3dfaaad94c0..f121a2da90c 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -144,6 +144,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -795,6 +796,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v10 7/7] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-19 15:28 ` [dpdk-dev] [PATCH v10 0/7] ethdev: introduce " Xueming Li
                     ` (5 preceding siblings ...)
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 6/7] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-10-19 15:28   ` Xueming Li
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-19 15:28 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

To support shared Rx queue, this patch introduces a dedicated forwarding
engine. The engine groups received packets by mbuf->port into sub-bursts,
updates stream statistics and simply frees the packets.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/meson.build                    |   1 +
 app/test-pmd/shared_rxq_fwd.c               | 148 ++++++++++++++++++++
 app/test-pmd/testpmd.c                      |   1 +
 app/test-pmd/testpmd.h                      |   1 +
 doc/guides/testpmd_app_ug/run_app.rst       |   1 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
 6 files changed, 156 insertions(+), 1 deletion(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 98f3289bdfa..07042e45b12 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -21,6 +21,7 @@ sources = files(
         'noisy_vnf.c',
         'parameters.c',
         'rxonly.c',
+        'shared_rxq_fwd.c',
         'testpmd.c',
         'txonly.c',
         'util.c',
diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-pmd/shared_rxq_fwd.c
new file mode 100644
index 00000000000..4e262b99bc7
--- /dev/null
+++ b/app/test-pmd/shared_rxq_fwd.c
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <stdarg.h>
+#include <string.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <sys/queue.h>
+#include <sys/stat.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_debug.h>
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memcpy.h>
+#include <rte_launch.h>
+#include <rte_eal.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_mbuf.h>
+#include <rte_pci.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_string_fns.h>
+#include <rte_ip.h>
+#include <rte_udp.h>
+#include <rte_net.h>
+#include <rte_flow.h>
+
+#include "testpmd.h"
+
+/*
+ * Rx only sub-burst forwarding.
+ */
+static void
+forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst)
+{
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
+}
+
+/**
+ * Get packet source stream by source port and queue.
+ * All streams of the same shared Rx queue are located on the same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_sub_burst(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		  struct rte_mbuf **pkts)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		forward_rx_only(nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * The source port of each packet is identified by mbuf->port.
+ */
+static void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst)
+{
+	uint16_t i, nb_sub_burst, port, last_port;
+
+	nb_sub_burst = 0;
+	last_port = pkts_burst[0]->port;
+	/* Locate sub-burst according to mbuf->port. */
+	for (i = 0; i < nb_rx - 1; ++i) {
+		rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (i > 0 && last_port != port) {
+			/* Forward packets with same source port. */
+			forward_sub_burst(fs, last_port, nb_sub_burst,
+					  &pkts_burst[i - nb_sub_burst]);
+			nb_sub_burst = 0;
+			last_port = port;
+		}
+		nb_sub_burst++;
+	}
+	/* Last sub-burst. */
+	nb_sub_burst++;
+	forward_sub_burst(fs, last_port, nb_sub_burst,
+			  &pkts_burst[nb_rx - nb_sub_burst]);
+}
+
+static void
+shared_rxq_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];
+	uint16_t nb_rx;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
+				 nb_pkt_per_burst);
+	inc_rx_burst_stats(fs, nb_rx);
+	if (unlikely(nb_rx == 0))
+		return;
+	forward_shared_rxq(fs, nb_rx, pkts_burst);
+	get_end_cycles(fs, start_tsc);
+}
+
+struct fwd_engine shared_rxq_engine = {
+	.fwd_mode_name  = "shared_rxq",
+	.port_fwd_begin = NULL,
+	.port_fwd_end   = NULL,
+	.packet_fwd     = shared_rxq_fwd,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index f3f81ef561f..11a85d92d9a 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -188,6 +188,7 @@ struct fwd_engine * fwd_engines[] = {
 #ifdef RTE_LIBRTE_IEEE1588
 	&ieee1588_fwd_engine,
 #endif
+	&shared_rxq_engine,
 	NULL,
 };
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f121a2da90c..f1fd607e365 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -299,6 +299,7 @@ extern struct fwd_engine five_tuple_swap_fwd_engine;
 #ifdef RTE_LIBRTE_IEEE1588
 extern struct fwd_engine ieee1588_fwd_engine;
 #endif
+extern struct fwd_engine shared_rxq_engine;
 
 extern struct fwd_engine * fwd_engines[]; /**< NULL terminated array. */
 extern cmdline_parse_inst_t cmd_set_raw;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index ff5908dcd50..e4b97844ced 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -252,6 +252,7 @@ The command line options are:
        tm
        noisy
        5tswap
+       shared-rxq
 
 *   ``--rss-ip``
 
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index 8ead7a4a712..499874187f2 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -314,7 +314,7 @@ set fwd
 Set the packet forwarding mode::
 
    testpmd> set fwd (io|mac|macswap|flowgen| \
-                     rxonly|txonly|csum|icmpecho|noisy|5tswap) (""|retry)
+                     rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
 
 ``retry`` can be specified for forwarding engines except ``rx_only``.
 
@@ -357,6 +357,9 @@ The available information categories are:
 
   L4 swaps the source port and destination port of transport layer (TCP and UDP).
 
+* ``shared-rxq``: Receive only for shared Rx queue.
+  Resolve packet source port from mbuf and update stream statistics accordingly.
+
 Example::
 
    testpmd> set fwd rxonly
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v10 2/7] ethdev: new API to resolve device capability name
  2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 2/7] ethdev: new API to resolve device capability name Xueming Li
@ 2021-10-19 17:57     ` Andrew Rybchenko
  2021-10-20  7:47       ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Andrew Rybchenko @ 2021-10-19 17:57 UTC (permalink / raw)
  To: Xueming Li, dev, Zhang Yuying
  Cc: Jerin Jacob, Ferruh Yigit, Viacheslav Ovsiienko, Thomas Monjalon,
	Lior Margalit, Ananyev Konstantin, Ajit Khaparde, Ray Kinsella

On 10/19/21 6:28 PM, Xueming Li wrote:
> This patch adds an API to return the name of a device capability.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>

[snip]

> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index bc55f899f72..97217529449 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -165,6 +165,20 @@ static const struct {
>   
>   #undef RTE_TX_OFFLOAD_BIT2STR
>   
> +#define RTE_ETH_DEV_CAPA_BIT2STR(_name)	\
> +	{ RTE_ETH_DEV_CAPA_##_name, #_name }

In fact, such macros do more harm than add value.
They complicate grepping by capability name. So, it is better
to drop the macro and just duplicate a few symbols below.
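
Something like this (just a sketch of the suggested change):

    static const struct {
        uint64_t offload;
        const char *name;
    } rte_eth_dev_capa_names[] = {
        { RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP, "RUNTIME_RX_QUEUE_SETUP" },
        { RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP, "RUNTIME_TX_QUEUE_SETUP" },
        { RTE_ETH_DEV_CAPA_RXQ_SHARE, "RXQ_SHARE" },
    };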

> +
> +static const struct {
> +	uint64_t offload;
> +	const char *name;
> +} rte_eth_dev_capa_names[] = {
> +	RTE_ETH_DEV_CAPA_BIT2STR(RUNTIME_RX_QUEUE_SETUP),
> +	RTE_ETH_DEV_CAPA_BIT2STR(RUNTIME_TX_QUEUE_SETUP),
> +	RTE_ETH_DEV_CAPA_BIT2STR(RXQ_SHARE),
> +};
> +
> +#undef RTE_ETH_DEV_CAPA_BIT2STR
> +
>   /**
>    * The user application callback description.
>    *

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v10 2/7] ethdev: new API to resolve device capability name
  2021-10-19 17:57     ` Andrew Rybchenko
@ 2021-10-20  7:47       ` Xueming(Steven) Li
  2021-10-20  7:48         ` Andrew Rybchenko
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-20  7:47 UTC (permalink / raw)
  To: yuying.zhang, andrew.rybchenko, dev
  Cc: konstantin.ananyev, mdr, jerinjacobk,
	NBU-Contact-Thomas Monjalon, Slava Ovsiienko, ajit.khaparde,
	ferruh.yigit, Lior Margalit

On Tue, 2021-10-19 at 20:57 +0300, Andrew Rybchenko wrote:
> On 10/19/21 6:28 PM, Xueming Li wrote:
> > This patch adds an API to return the name of a device capability.
> > 
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> 
> [snip]
> 
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > index bc55f899f72..97217529449 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -165,6 +165,20 @@ static const struct {
> >   
> >   #undef RTE_TX_OFFLOAD_BIT2STR
> >   
> > +#define RTE_ETH_DEV_CAPA_BIT2STR(_name)	\
> > +	{ RTE_ETH_DEV_CAPA_##_name, #_name }
> 
> In fact, such macros do more harm than add value.
> They complicate grepping by capability name. So, it is better
> to drop the macro and just duplicate a few symbols below.

Will update in the next version. Eclipse resolves macros and searches
into expanded macros.

BTW, do you plan to review the other patches today? If so I will hold
the new version a little bit to avoid exploding the mailing list.

> 
> > +
> > +static const struct {
> > +	uint64_t offload;
> > +	const char *name;
> > +} rte_eth_dev_capa_names[] = {
> > +	RTE_ETH_DEV_CAPA_BIT2STR(RUNTIME_RX_QUEUE_SETUP),
> > +	RTE_ETH_DEV_CAPA_BIT2STR(RUNTIME_TX_QUEUE_SETUP),
> > +	RTE_ETH_DEV_CAPA_BIT2STR(RXQ_SHARE),
> > +};
> > +
> > +#undef RTE_ETH_DEV_CAPA_BIT2STR
> > +
> >   /**
> >    * The user application callback description.
> >    *


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v10 2/7] ethdev: new API to resolve device capability name
  2021-10-20  7:47       ` Xueming(Steven) Li
@ 2021-10-20  7:48         ` Andrew Rybchenko
  0 siblings, 0 replies; 266+ messages in thread
From: Andrew Rybchenko @ 2021-10-20  7:48 UTC (permalink / raw)
  To: Xueming(Steven) Li, yuying.zhang, dev
  Cc: konstantin.ananyev, mdr, jerinjacobk,
	NBU-Contact-Thomas Monjalon, Slava Ovsiienko, ajit.khaparde,
	ferruh.yigit, Lior Margalit

On 10/20/21 10:47 AM, Xueming(Steven) Li wrote:
> On Tue, 2021-10-19 at 20:57 +0300, Andrew Rybchenko wrote:
>> On 10/19/21 6:28 PM, Xueming Li wrote:
>>> This patch adds an API to return the name of a device capability.
>>>
>>> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
>>
>> [snip]
>>
>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
>>> index bc55f899f72..97217529449 100644
>>> --- a/lib/ethdev/rte_ethdev.c
>>> +++ b/lib/ethdev/rte_ethdev.c
>>> @@ -165,6 +165,20 @@ static const struct {
>>>   
>>>   #undef RTE_TX_OFFLOAD_BIT2STR
>>>   
>>> +#define RTE_ETH_DEV_CAPA_BIT2STR(_name)	\
>>> +	{ RTE_ETH_DEV_CAPA_##_name, #_name }
>>
>> In fact, such macros do more harm than add value.
>> They complicate grepping by capability name, so it is better
>> to drop the macro and just duplicate the few symbols below.
> 
> Will update in the next version. Eclipse resolves macros and searches into
> the expanded macros.
> 
> BTW, do you plan to review the other patches today? If so, I will hold the
> new version a little bit to avoid exploding the mailing list.

Sorry, I have no time to review testpmd patches today.
ethdev part LGTM.

^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v11 0/7] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (11 preceding siblings ...)
  2021-10-19 15:28 ` [dpdk-dev] [PATCH v10 0/7] ethdev: introduce " Xueming Li
@ 2021-10-20  7:53 ` Xueming Li
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 1/7] " Xueming Li
                     ` (6 more replies)
  2021-10-21  5:08 ` [dpdk-dev] [PATCH v12 0/7] ethdev: introduce " Xueming Li
                   ` (3 subsequent siblings)
  16 siblings, 7 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-20  7:53 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

In the current DPDK framework, all Rx queues are pre-loaded with mbufs for
incoming packets. When the number of representors scales out in a switch
domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces the shared Rx queue. The PF and representors in the
same Rx domain and switch domain can share an Rx queue set by specifying a
non-zero share group value in the Rx queue configuration.

All ports that share an Rx queue actually share the hardware descriptor
queue and feed all Rx queues from one descriptor supply, so memory is saved.

Polling any member queue of a shared Rx queue receives packets from all
member ports. The source port is identified by mbuf->port.

Multiple groups are supported by group ID. The queue number of ports in a
shared group should be identical, and the queue index is 1:1 mapped in the
shared group. An example of two share groups:
 Group1, 4 shared Rx queues per member port: PF, repr0, repr1
 Group2, 2 shared Rx queues per member port: repr2, repr3, ... repr127
 Poll first port for each group:
  core	port	queue
  0	0	0
  1	0	1
  2	0	2
  3	0	3
  4	2	0
  5	2	1

A shared Rx queue must be polled on a single thread or core. If both PF0
and representor0 joined the same share group, pf0rxq0 cannot be polled on
core1 and rep0rxq0 on core2. Actually, polling one port within a share
group is sufficient, since polling any port in the group returns packets
for all ports in the group, as the sketch below illustrates.
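
For illustration, a minimal polling loop for one queue of Group1 above
could look like below. This is only a sketch: the port/queue numbers and
handle_pkt() are placeholders, not part of this series.

  struct rte_mbuf *pkts[32];
  uint16_t i, nb_rx;

  /* Polling one member port returns packets of all member ports. */
  nb_rx = rte_eth_rx_burst(0 /* PF */, 0 /* queue 0 */, pkts, 32);
  for (i = 0; i < nb_rx; i++) {
          /* Demultiplex by the real source port saved in mbuf->port. */
          handle_pkt(pkts[i]->port, pkts[i]);
  }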

There was some discussion about aggregating the member ports of a group
into a dummy port, with several ways to achieve it. Since it is optional,
more feedback and requirements need to be collected from users to make a
better decision later.

v1:
  - initial version
v2:
  - add testpmd patches
v3:
  - change common forwarding api to macro for performance, thanks Jerin.
  - save global variable accessed in forwarding to flowstream to minimize
    cache miss
  - combined patches for each forwarding engine
  - support multiple groups in testpmd "--share-rxq" parameter
  - new api to aggregate shared rxq group
v4:
  - spelling fixes
  - remove shared-rxq support for all forwarding engines
  - add dedicate shared-rxq forwarding engine
v5:
 - fix grammar
 - remove aggregate api and leave it for later discussion
 - add release notes
 - add deployment example
v6:
 - replace RxQ offload flag with device offload capability flag
 - add Rx domain
 - RxQ is shared when share group > 0
 - update testpmd accordingly
v7:
 - fix testpmd share group id allocation
 - change rx_domain to 16bits
v8:
 - add new patch for testpmd to show device Rx domain ID and capability
 - new share_qid in RxQ configuration
v9:
 - fix some spelling
v10:
 - add device capability name api
v11:
 - remove macro from device capability name list

Xueming Li (7):
  ethdev: introduce shared Rx queue
  ethdev: new API to resolve device capability name
  app/testpmd: dump device capability and Rx domain info
  app/testpmd: new parameter to enable shared Rx queue
  app/testpmd: dump port info for shared Rx queue
  app/testpmd: force shared Rx queue polled on same core
  app/testpmd: add forwarding engine for shared Rx queue

 app/test-pmd/config.c                         | 139 +++++++++++++++-
 app/test-pmd/meson.build                      |   1 +
 app/test-pmd/parameters.c                     |  13 ++
 app/test-pmd/shared_rxq_fwd.c                 | 148 ++++++++++++++++++
 app/test-pmd/testpmd.c                        |  25 ++-
 app/test-pmd/testpmd.h                        |   5 +
 app/test-pmd/util.c                           |   3 +
 doc/guides/nics/features.rst                  |  13 ++
 doc/guides/nics/features/default.ini          |   1 +
 .../prog_guide/switch_representation.rst      |  11 ++
 doc/guides/rel_notes/release_21_11.rst        |   6 +
 doc/guides/testpmd_app_ug/run_app.rst         |   8 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
 lib/ethdev/rte_ethdev.c                       |  33 ++++
 lib/ethdev/rte_ethdev.h                       |  38 +++++
 lib/ethdev/version.map                        |   3 +
 16 files changed, 446 insertions(+), 6 deletions(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v11 1/7] ethdev: introduce shared Rx queue
  2021-10-20  7:53 ` [dpdk-dev] [PATCH v11 0/7] ethdev: introduce " Xueming Li
@ 2021-10-20  7:53   ` Xueming Li
  2021-10-20 17:14     ` Ajit Khaparde
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 2/7] ethdev: new API to resolve device capability name Xueming Li
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-10-20  7:53 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

In the current DPDK framework, each Rx queue is pre-loaded with mbufs to
store incoming packets. For some PMDs, when the number of representors
scales out in a switch domain, the memory consumption becomes significant.
Polling all ports also leads to high cache miss rates, high latency and low
throughput.

This patch introduces the shared Rx queue. Ports in the same Rx domain and
switch domain can share an Rx queue set by specifying a non-zero sharing
group in the Rx queue configuration.

A shared Rx queue is identified by the share_qid field of the Rx queue
configuration. Port A RxQ X can share an RxQ with Port B RxQ Y by using
the same shared Rx queue ID.

No special API is defined to receive packets from a shared Rx queue.
Polling any member port of a shared Rx queue receives packets of that
queue for all member ports; the source port_id is identified by
mbuf->port. The PMD is responsible for resolving the shared Rx queue
from the device and queue data.

A shared Rx queue must be polled in the same thread or core; polling a
queue ID of any member port is essentially the same.

Multiple share groups are supported. A PMD should support mixed
configuration by allowing multiple share groups and non-shared Rx queues
on one port.

Example grouping and polling model to reflect service priority:
 Group1, 2 shared Rx queues per port: PF, rep0, rep1
 Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
 Core0: poll PF queue0
 Core1: poll PF queue1
 Core2: poll rep2 queue0

The PMD advertises the shared Rx queue capability via RTE_ETH_DEV_CAPA_RXQ_SHARE.

The PMD is responsible for shared Rx queue consistency checks to avoid
member ports' configurations contradicting each other.
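
As an illustration only (not part of this patch), an application could
opt in per queue roughly as below; port_id, qid, nb_desc, socket_id and
mbuf_pool are assumed to be prepared elsewhere and error handling is
omitted:

  struct rte_eth_dev_info info;
  struct rte_eth_rxconf rxconf;

  rte_eth_dev_info_get(port_id, &info);
  rxconf = info.default_rxconf;
  if (info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) {
          rxconf.share_group = 1; /* non-zero value enables sharing */
          rxconf.share_qid = qid; /* shared Rx queue ID in group */
  }
  rte_eth_rx_queue_setup(port_id, qid, nb_desc, socket_id,
                         &rxconf, mbuf_pool);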

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 doc/guides/nics/features.rst                  | 13 ++++++++++
 doc/guides/nics/features/default.ini          |  1 +
 .../prog_guide/switch_representation.rst      | 11 +++++++++
 doc/guides/rel_notes/release_21_11.rst        |  6 +++++
 lib/ethdev/rte_ethdev.c                       |  8 +++++++
 lib/ethdev/rte_ethdev.h                       | 24 +++++++++++++++++++
 6 files changed, 63 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index e346018e4b8..89f9accbca1 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same Rx domain of a switch domain.
+
+* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
+* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
+* **[uses]     rte_eth_rxconf**: ``share_group``, ``share_qid``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index d473b94091a..93f5d1b46f4 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c806..4f2532a91ea 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,17 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- For some PMDs, memory usage of representors is huge when the number of
+  representors grows, because mbufs are allocated for each descriptor of
+  every Rx queue. Polling a large number of ports brings more CPU load,
+  cache misses and latency. A shared Rx queue can be used to share the Rx
+  queue between the PF and representors within the same Rx domain.
+  ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` in device info indicates the capability.
+  Setting a non-zero share group in the Rx queue configuration enables
+  sharing; share_qid identifies the shared Rx queue in the group. Polling
+  any member port receives packets of all member ports in the group; the
+  source port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 3362c52a738..caf82242f2e 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -75,6 +75,12 @@ New Features
     operations.
   * Added multi-process support.
 
+* **Added ethdev shared Rx queue support.**
+
+  * Added new device capability flag and Rx domain field to switch info.
+  * Added share group and share queue ID to Rx queue configuration.
+  * Added testpmd support and dedicate forwarding engine.
+
 * **Added new RSS offload types for IPv4/L4 checksum in RSS flow.**
 
   Added macros ETH_RSS_IPV4_CHKSUM and ETH_RSS_L4_CHKSUM, now IPv4 and
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 028907bc4b9..bc55f899f72 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -2159,6 +2159,14 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
 		return -EINVAL;
 	}
 
+	if (local_conf.share_group > 0 &&
+	    (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
+		RTE_ETHDEV_LOG(ERR,
+			"Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%hu while device doesn't support Rx queue share\n",
+			port_id, rx_queue_id, local_conf.share_group);
+		return -EINVAL;
+	}
+
 	/*
 	 * If LRO is enabled, check that the maximum aggregated packet
 	 * size is supported by the configured device.
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 6d80514ba7a..34acc91273d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1044,6 +1044,14 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	/**
+	 * Share group index in Rx domain and switch domain.
+	 * A non-zero value enables Rx queue share, zero disables share.
+	 * PMD is responsible for Rx queue consistency checks to avoid
+	 * member ports' configurations contradicting each other.
+	 */
+	uint16_t share_group;
+	uint16_t share_qid; /**< Shared Rx queue ID in group. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1445,6 +1453,16 @@ struct rte_eth_conf {
 #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
 /** Device supports Tx queue setup after device started. */
 #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
+/**
+ * Device supports shared Rx queue among ports within Rx domain and
+ * switch domain. Mbufs are consumed by shared Rx queue instead of
+ * each queue. Multiple groups are supported by share_group of Rx
+ * queue configuration. Shared Rx queue is identified by PMD using
+ * share_qid of Rx queue configuration. Polling any port in the group
+ * receives packets of all member ports; the source port is identified
+ * by the mbuf->port field.
+ */
+#define RTE_ETH_DEV_CAPA_RXQ_SHARE              RTE_BIT64(2)
 /**@}*/
 
 /*
@@ -1488,6 +1506,12 @@ struct rte_eth_switch_info {
 	 * but each driver should explicitly define the mapping of switch
 	 * port identifier to that physical interconnect/switch
 	 */
+	/**
+	 * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
+	 * and switch domain can share Rx queue. Valid only if device advertised
+	 * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
+	 */
+	uint16_t rx_domain;
 };
 
 /**
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v11 2/7] ethdev: new API to resolve device capability name
  2021-10-20  7:53 ` [dpdk-dev] [PATCH v11 0/7] ethdev: introduce " Xueming Li
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 1/7] " Xueming Li
@ 2021-10-20  7:53   ` Xueming Li
  2021-10-20 10:52     ` Andrew Rybchenko
  2021-10-20 18:42     ` Thomas Monjalon
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 3/7] app/testpmd: dump device capability and Rx domain info Xueming Li
                     ` (4 subsequent siblings)
  6 siblings, 2 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-20  7:53 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Ray Kinsella

This patch adds API to return name of device capability.
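
A possible usage sketch (illustration only, assuming dev_info was filled
by rte_eth_dev_info_get()) printing the name of every advertised
capability bit:

  uint64_t capa = dev_info.dev_capa;

  while (capa != 0) {
          uint64_t bit = capa & ~(capa - 1); /* lowest set bit */

          printf("%s\n", rte_eth_dev_capability_name(bit));
          capa &= ~bit;
  }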

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 lib/ethdev/rte_ethdev.c | 25 +++++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h | 14 ++++++++++++++
 lib/ethdev/version.map  |  3 +++
 3 files changed, 42 insertions(+)

diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index bc55f899f72..d1a4a0405d6 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -165,6 +165,15 @@ static const struct {
 
 #undef RTE_TX_OFFLOAD_BIT2STR
 
+static const struct {
+	uint64_t offload;
+	const char *name;
+} rte_eth_dev_capa_names[] = {
+	{RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP, "RUNTIME_RX_QUEUE_SETUP"},
+	{RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP, "RUNTIME_TX_QUEUE_SETUP"},
+	{RTE_ETH_DEV_CAPA_RXQ_SHARE, "RXQ_SHARE"},
+};
+
 /**
  * The user application callback description.
  *
@@ -1260,6 +1269,22 @@ rte_eth_dev_tx_offload_name(uint64_t offload)
 	return name;
 }
 
+const char *
+rte_eth_dev_capability_name(uint64_t capability)
+{
+	const char *name = "UNKNOWN";
+	unsigned int i;
+
+	for (i = 0; i < RTE_DIM(rte_eth_dev_capa_names); ++i) {
+		if (capability == rte_eth_dev_capa_names[i].offload) {
+			name = rte_eth_dev_capa_names[i].name;
+			break;
+		}
+	}
+
+	return name;
+}
+
 static inline int
 eth_dev_check_lro_pkt_size(uint16_t port_id, uint32_t config_size,
 		   uint32_t max_rx_pkt_len, uint32_t dev_info_size)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 34acc91273d..df8ef9382a9 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -2109,6 +2109,20 @@ const char *rte_eth_dev_rx_offload_name(uint64_t offload);
  */
 const char *rte_eth_dev_tx_offload_name(uint64_t offload);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get RTE_ETH_DEV_CAPA_* flag name.
+ *
+ * @param capability
+ *   Capability flag.
+ * @return
+ *   Capability name or 'UNKNOWN' if the flag cannot be recognized.
+ */
+__rte_experimental
+const char *rte_eth_dev_capability_name(uint64_t capability);
+
 /**
  * Configure an Ethernet device.
  * This function must be invoked first before any other function in the
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index efd729c0f2d..e1d403dd357 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -245,6 +245,9 @@ EXPERIMENTAL {
 	rte_mtr_meter_policy_delete;
 	rte_mtr_meter_policy_update;
 	rte_mtr_meter_policy_validate;
+
+	# added in 21.11
+	rte_eth_dev_capability_name;
 };
 
 INTERNAL {
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v11 3/7] app/testpmd: dump device capability and Rx domain info
  2021-10-20  7:53 ` [dpdk-dev] [PATCH v11 0/7] ethdev: introduce " Xueming Li
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 1/7] " Xueming Li
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 2/7] ethdev: new API to resolve device capability name Xueming Li
@ 2021-10-20  7:53   ` Xueming Li
  2021-10-21  3:24     ` Li, Xiaoyun
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-10-20  7:53 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

Dump the device capability and Rx domain ID if shared Rx queue is
supported by the device.
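
For example, on a device reporting runtime Rx queue setup and Rx queue
share (capability bits 0 and 2), the new output would look like below;
the values are illustrative only:

  Device capabilities: 0x5( RUNTIME_RX_QUEUE_SETUP RXQ_SHARE )
  Switch Rx domain: 0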

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 app/test-pmd/config.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 9c66329e96e..2c1b06c544d 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -582,6 +582,29 @@ device_infos_display(const char *identifier)
 	rte_devargs_reset(&da);
 }
 
+static void
+print_dev_capabilities(uint64_t capabilities)
+{
+	uint64_t single_capa;
+	int begin;
+	int end;
+	int bit;
+
+	if (capabilities == 0)
+		return;
+
+	begin = __builtin_ctzll(capabilities);
+	end = sizeof(capabilities) * CHAR_BIT - __builtin_clzll(capabilities);
+
+	single_capa = 1ULL << begin;
+	for (bit = begin; bit < end; bit++) {
+		if (capabilities & single_capa)
+			printf(" %s",
+			       rte_eth_dev_capability_name(single_capa));
+		single_capa <<= 1;
+	}
+}
+
 void
 port_infos_display(portid_t port_id)
 {
@@ -733,6 +756,9 @@ port_infos_display(portid_t port_id)
 	printf("Max segment number per MTU/TSO: %hu\n",
 		dev_info.tx_desc_lim.nb_mtu_seg_max);
 
+	printf("Device capabilities: 0x%"PRIx64"(", dev_info.dev_capa);
+	print_dev_capabilities(dev_info.dev_capa);
+	printf(" )\n");
 	/* Show switch info only if valid switch domain and port id is set */
 	if (dev_info.switch_info.domain_id !=
 		RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID) {
@@ -743,6 +769,9 @@ port_infos_display(portid_t port_id)
 			dev_info.switch_info.domain_id);
 		printf("Switch Port Id: %u\n",
 			dev_info.switch_info.port_id);
+		if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) != 0)
+			printf("Switch Rx domain: %u\n",
+			       dev_info.switch_info.rx_domain);
 	}
 }
 
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx queue
  2021-10-20  7:53 ` [dpdk-dev] [PATCH v11 0/7] ethdev: introduce " Xueming Li
                     ` (2 preceding siblings ...)
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 3/7] app/testpmd: dump device capability and Rx domain info Xueming Li
@ 2021-10-20  7:53   ` Xueming Li
  2021-10-20 17:29     ` Ajit Khaparde
  2021-10-21  3:24     ` Li, Xiaoyun
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 5/7] app/testpmd: dump port info for " Xueming Li
                     ` (2 subsequent siblings)
  6 siblings, 2 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-20  7:53 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

Adds "--rxq-share=X" parameter to enable shared RxQ, share if device
supports, otherwise fallback to standard RxQ.

Share group number grows per X ports. X defaults to MAX, implies all
ports join share group 1. Queue ID is mapped equally with shared Rx
queue ID.
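
For example (illustrative only), with "--rxq-share=2" and 4 ports,
ports 0-1 join share group 1 and ports 2-3 join share group 2, i.e.
group = port / X + 1.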

Forwarding engine "shared-rxq" should be used which Rx only and update
stream statistics correctly.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c                 |  7 ++++++-
 app/test-pmd/parameters.c             | 13 +++++++++++++
 app/test-pmd/testpmd.c                | 20 +++++++++++++++++---
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  7 +++++++
 5 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 2c1b06c544d..fa951a86704 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2738,7 +2738,12 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->share_group > 0)
+				printf(" share_group=%u share_qid=%u",
+				       rx_conf->share_group,
+				       rx_conf->share_qid);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 3f94a82e321..30dae326310 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -167,6 +167,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share: number of ports per shared rxq groups, defaults to MAX(1 group)\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -607,6 +608,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			2, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1271,6 +1273,17 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
+				if (optarg == NULL) {
+					rxq_share = UINT32_MAX;
+				} else {
+					n = atoi(optarg);
+					if (n >= 0)
+						rxq_share = (uint32_t)n;
+					else
+						rte_exit(EXIT_FAILURE, "rxq-share must be >= 0\n");
+				}
+			}
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 97ae52e17ec..123142ed110 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -498,6 +498,11 @@ uint8_t record_core_cycles;
  */
 uint8_t record_burst_stats;
 
+/*
+ * Number of ports per shared Rx queue group, 0 disable.
+ */
+uint32_t rxq_share;
+
 unsigned int num_sockets = 0;
 unsigned int socket_ids[RTE_MAX_NUMA_NODES];
 
@@ -3393,14 +3398,23 @@ dev_event_callback(const char *device_name, enum rte_dev_event_type type,
 }
 
 static void
-rxtx_port_config(struct rte_port *port)
+rxtx_port_config(portid_t pid)
 {
 	uint16_t qid;
 	uint64_t offloads;
+	struct rte_port *port = &ports[pid];
 
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)) {
+			/* Non-zero share group to enable RxQ share. */
+			port->rx_conf[qid].share_group = pid / rxq_share + 1;
+			port->rx_conf[qid].share_qid = qid; /* Equal mapping. */
+		}
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
@@ -3558,7 +3572,7 @@ init_port_config(void)
 				port->dev_conf.rxmode.mq_mode = ETH_MQ_RX_NONE;
 		}
 
-		rxtx_port_config(port);
+		rxtx_port_config(pid);
 
 		ret = eth_macaddr_get_print_err(pid, &port->eth_addr);
 		if (ret != 0)
@@ -3772,7 +3786,7 @@ init_port_dcb_config(portid_t pid,
 
 	memcpy(&rte_port->dev_conf, &port_conf, sizeof(struct rte_eth_conf));
 
-	rxtx_port_config(rte_port);
+	rxtx_port_config(pid);
 	/* VLAN filter */
 	rte_port->dev_conf.rxmode.offloads |= DEV_RX_OFFLOAD_VLAN_FILTER;
 	for (i = 0; i < RTE_DIM(vlan_tags); i++)
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 5863b2f43f3..3dfaaad94c0 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -477,6 +477,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint32_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern int nb_flows_flowgen;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 640eadeff73..ff5908dcd50 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -389,6 +389,13 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share=[X]``
+
+    Create queues in shared Rx queue mode if device supports.
+    Group number grows per X ports. X defaults to MAX, implies all ports
+    join share group 1. Forwarding engine "shared-rxq" should be used
+    which Rx only and update stream statistics correctly.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v11 5/7] app/testpmd: dump port info for shared Rx queue
  2021-10-20  7:53 ` [dpdk-dev] [PATCH v11 0/7] ethdev: introduce " Xueming Li
                     ` (3 preceding siblings ...)
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-10-20  7:53   ` Xueming Li
  2021-10-21  3:24     ` Li, Xiaoyun
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 6/7] app/testpmd: force shared Rx queue polled on same core Xueming Li
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  6 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-10-20  7:53 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

In case of shared Rx queue, polling any member port returns mbufs for
all members. This patch dumps mbuf->port for each packet.
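
With verbose mode on, every dumped packet line is then prefixed with the
source port taken from the mbuf, e.g. "port 3, " (port number
illustrative), followed by the usual per-packet fields.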

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/util.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 51506e49404..e98f136d5ed 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -100,6 +100,9 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_flow_restore_info info = { 0, };
 
 		mb = pkts[i];
+		if (rxq_share > 0)
+			MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ",
+				  mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v11 6/7] app/testpmd: force shared Rx queue polled on same core
  2021-10-20  7:53 ` [dpdk-dev] [PATCH v11 0/7] ethdev: introduce " Xueming Li
                     ` (4 preceding siblings ...)
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 5/7] app/testpmd: dump port info for " Xueming Li
@ 2021-10-20  7:53   ` Xueming Li
  2021-10-21  3:24     ` Li, Xiaoyun
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  6 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-10-20  7:53 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

A shared Rx queue must be polled on the same core. This patch checks and
stops forwarding if a shared RxQ is being scheduled on multiple cores.

It is suggested to use the same number of Rx queues and polling cores.
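
For example, if pf0rxq0 and rep0rxq0 of the same share group are
scheduled on different cores, forwarding stops with a message like below
(port, core and queue numbers are illustrative):

  Shared Rx queue group 1 queue 0 can't be scheduled on different cores:
    lcore 1 Port 0 queue 0
    lcore 2 Port 1 queue 0
  Please use --nb-cores=4 to limit number of forwarding cores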

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c  | 103 +++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |   4 +-
 app/test-pmd/testpmd.h |   2 +
 3 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index fa951a86704..1f1307178be 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2915,6 +2915,109 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared Rx queue is scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, lcoreid_t src_lc,
+			   portid_t src_port, queueid_t src_rxq,
+			   uint32_t share_group, queueid_t share_rxq)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/* Check remaining cores. */
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (rxq_conf->share_group != share_group)
+				continue;
+			if (rxq_conf->share_qid != share_rxq)
+				continue;
+			printf("Shared Rx queue group %u queue %hu can't be scheduled on different cores:\n",
+			       share_group, share_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("Please use --nb-cores=%hu to limit number of forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * A shared group must not be scheduled on different cores.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Update lcore info stream being scheduled. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check shared rxq not scheduled on remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, lc_id,
+						       fs->rx_port,
+						       fs->rx_queue,
+						       rxq_conf->share_group,
+						       rxq_conf->share_qid))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 123142ed110..f3f81ef561f 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2236,10 +2236,12 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 3dfaaad94c0..f121a2da90c 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -144,6 +144,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -795,6 +796,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v11 7/7] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-20  7:53 ` [dpdk-dev] [PATCH v11 0/7] ethdev: introduce " Xueming Li
                     ` (5 preceding siblings ...)
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 6/7] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-10-20  7:53   ` Xueming Li
  2021-10-20 19:20     ` Thomas Monjalon
  6 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-10-20  7:53 UTC (permalink / raw)
  To: dev, Zhang Yuying
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Xiaoyun Li

To support shared Rx queue, this patch introduces a dedicated forwarding
engine. The engine groups received packets by mbuf->port into sub-bursts,
updates stream statistics and simply frees the packets.
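
For example (illustrative only), a burst whose mbuf->port values are
[0, 0, 3, 3, 3] is split into two sub-bursts: 2 packets are accounted to
the stream polling port 0 and 3 packets to the stream polling port 3 on
the same queue.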

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/meson.build                    |   1 +
 app/test-pmd/shared_rxq_fwd.c               | 148 ++++++++++++++++++++
 app/test-pmd/testpmd.c                      |   1 +
 app/test-pmd/testpmd.h                      |   1 +
 doc/guides/testpmd_app_ug/run_app.rst       |   1 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
 6 files changed, 156 insertions(+), 1 deletion(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 98f3289bdfa..07042e45b12 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -21,6 +21,7 @@ sources = files(
         'noisy_vnf.c',
         'parameters.c',
         'rxonly.c',
+        'shared_rxq_fwd.c',
         'testpmd.c',
         'txonly.c',
         'util.c',
diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-pmd/shared_rxq_fwd.c
new file mode 100644
index 00000000000..4e262b99bc7
--- /dev/null
+++ b/app/test-pmd/shared_rxq_fwd.c
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <stdarg.h>
+#include <string.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <sys/queue.h>
+#include <sys/stat.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_debug.h>
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memcpy.h>
+#include <rte_launch.h>
+#include <rte_eal.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_mbuf.h>
+#include <rte_pci.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_string_fns.h>
+#include <rte_ip.h>
+#include <rte_udp.h>
+#include <rte_net.h>
+#include <rte_flow.h>
+
+#include "testpmd.h"
+
+/*
+ * Rx only sub-burst forwarding.
+ */
+static void
+forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst)
+{
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
+}
+
+/**
+ * Get the packet source stream by source port and queue.
+ * All streams of the same shared Rx queue are located on the same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_sub_burst(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		  struct rte_mbuf **pkts)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		forward_rx_only(nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * The source port of packets is identified by mbuf->port.
+ */
+static void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst)
+{
+	uint16_t i, nb_sub_burst, port, last_port;
+
+	nb_sub_burst = 0;
+	last_port = pkts_burst[0]->port;
+	/* Locate sub-burst according to mbuf->port. */
+	for (i = 0; i < nb_rx - 1; ++i) {
+		rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (i > 0 && last_port != port) {
+			/* Forward packets with same source port. */
+			forward_sub_burst(fs, last_port, nb_sub_burst,
+					  &pkts_burst[i - nb_sub_burst]);
+			nb_sub_burst = 0;
+			last_port = port;
+		}
+		nb_sub_burst++;
+	}
+	/* Last sub-burst. */
+	nb_sub_burst++;
+	forward_sub_burst(fs, last_port, nb_sub_burst,
+			  &pkts_burst[nb_rx - nb_sub_burst]);
+}
+
+static void
+shared_rxq_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];
+	uint16_t nb_rx;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
+				 nb_pkt_per_burst);
+	inc_rx_burst_stats(fs, nb_rx);
+	if (unlikely(nb_rx == 0))
+		return;
+	forward_shared_rxq(fs, nb_rx, pkts_burst);
+	get_end_cycles(fs, start_tsc);
+}
+
+struct fwd_engine shared_rxq_engine = {
+	.fwd_mode_name  = "shared_rxq",
+	.port_fwd_begin = NULL,
+	.port_fwd_end   = NULL,
+	.packet_fwd     = shared_rxq_fwd,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index f3f81ef561f..11a85d92d9a 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -188,6 +188,7 @@ struct fwd_engine * fwd_engines[] = {
 #ifdef RTE_LIBRTE_IEEE1588
 	&ieee1588_fwd_engine,
 #endif
+	&shared_rxq_engine,
 	NULL,
 };
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f121a2da90c..f1fd607e365 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -299,6 +299,7 @@ extern struct fwd_engine five_tuple_swap_fwd_engine;
 #ifdef RTE_LIBRTE_IEEE1588
 extern struct fwd_engine ieee1588_fwd_engine;
 #endif
+extern struct fwd_engine shared_rxq_engine;
 
 extern struct fwd_engine * fwd_engines[]; /**< NULL terminated array. */
 extern cmdline_parse_inst_t cmd_set_raw;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index ff5908dcd50..e4b97844ced 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -252,6 +252,7 @@ The command line options are:
        tm
        noisy
        5tswap
+       shared-rxq
 
 *   ``--rss-ip``
 
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index 8ead7a4a712..499874187f2 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -314,7 +314,7 @@ set fwd
 Set the packet forwarding mode::
 
    testpmd> set fwd (io|mac|macswap|flowgen| \
-                     rxonly|txonly|csum|icmpecho|noisy|5tswap) (""|retry)
+                     rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
 
 ``retry`` can be specified for forwarding engines except ``rx_only``.
 
@@ -357,6 +357,9 @@ The available information categories are:
 
   L4 swaps the source port and destination port of transport layer (TCP and UDP).
 
+* ``shared-rxq``: Receive-only for shared Rx queues.
+  Resolves the packet source port from the mbuf and updates stream statistics accordingly.
+
 Example::
 
    testpmd> set fwd rxonly
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 2/7] ethdev: new API to resolve device capability name
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 2/7] ethdev: new API to resolve device capability name Xueming Li
@ 2021-10-20 10:52     ` Andrew Rybchenko
  2021-10-20 17:16       ` Ajit Khaparde
  2021-10-20 18:42     ` Thomas Monjalon
  1 sibling, 1 reply; 266+ messages in thread
From: Andrew Rybchenko @ 2021-10-20 10:52 UTC (permalink / raw)
  To: Xueming Li, dev, Zhang Yuying
  Cc: Jerin Jacob, Ferruh Yigit, Viacheslav Ovsiienko, Thomas Monjalon,
	Lior Margalit, Ananyev Konstantin, Ajit Khaparde, Ray Kinsella

On 10/20/21 10:53 AM, Xueming Li wrote:
> This patch adds API to return name of device capability.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>

Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 1/7] ethdev: introduce shared Rx queue
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 1/7] " Xueming Li
@ 2021-10-20 17:14     ` Ajit Khaparde
  0 siblings, 0 replies; 266+ messages in thread
From: Ajit Khaparde @ 2021-10-20 17:14 UTC (permalink / raw)
  To: Xueming Li
  Cc: dpdk-dev, Zhang Yuying, Jerin Jacob, Ferruh Yigit,
	Andrew Rybchenko, Viacheslav Ovsiienko, Thomas Monjalon,
	Lior Margalit, Ananyev Konstantin

On Wed, Oct 20, 2021 at 12:54 AM Xueming Li <xuemingl@nvidia.com> wrote:
>
> In the current DPDK framework, each Rx queue is pre-loaded with mbufs to
> store incoming packets. For some PMDs, when the number of representors
> scales out in a switch domain, the memory consumption becomes significant.
> Polling all ports also leads to high cache miss rates, high latency and low
> throughput.
>
> This patch introduces the shared Rx queue. Ports in the same Rx domain and
> switch domain can share an Rx queue set by specifying a non-zero sharing
> group in the Rx queue configuration.
>
> A shared Rx queue is identified by the share_qid field of the Rx queue
> configuration. Port A RxQ X can share an RxQ with Port B RxQ Y by using
> the same shared Rx queue ID.
>
> No special API is defined to receive packets from a shared Rx queue.
> Polling any member port of a shared Rx queue receives packets of that
> queue for all member ports; the source port_id is identified by
> mbuf->port. The PMD is responsible for resolving the shared Rx queue
> from the device and queue data.
>
> A shared Rx queue must be polled in the same thread or core; polling a
> queue ID of any member port is essentially the same.
>
> Multiple share groups are supported. A PMD should support mixed
> configuration by allowing multiple share groups and non-shared Rx queues
> on one port.
>
> Example grouping and polling model to reflect service priority:
>  Group1, 2 shared Rx queues per port: PF, rep0, rep1
>  Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
>  Core0: poll PF queue0
>  Core1: poll PF queue1
>  Core2: poll rep2 queue0
>
> The PMD advertises the shared Rx queue capability via RTE_ETH_DEV_CAPA_RXQ_SHARE.
>
> The PMD is responsible for shared Rx queue consistency checks to avoid
> member ports' configurations contradicting each other.
>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>

> ---
>  doc/guides/nics/features.rst                  | 13 ++++++++++
>  doc/guides/nics/features/default.ini          |  1 +
>  .../prog_guide/switch_representation.rst      | 11 +++++++++
>  doc/guides/rel_notes/release_21_11.rst        |  6 +++++
>  lib/ethdev/rte_ethdev.c                       |  8 +++++++
>  lib/ethdev/rte_ethdev.h                       | 24 +++++++++++++++++++
>  6 files changed, 63 insertions(+)
>
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index e346018e4b8..89f9accbca1 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
>
>
> +.. _nic_features_shared_rx_queue:
> +
> +Shared Rx queue
> +---------------
> +
> +Supports shared Rx queue for ports in same Rx domain of a switch domain.
> +
> +* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
> +* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
> +* **[uses]     rte_eth_rxconf**: ``share_group``, ``share_qid``.
> +* **[provides] mbuf**: ``mbuf.port``.
> +
> +
>  .. _nic_features_packet_type_parsing:
>
>  Packet type parsing
> diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> index d473b94091a..93f5d1b46f4 100644
> --- a/doc/guides/nics/features/default.ini
> +++ b/doc/guides/nics/features/default.ini
> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
>  Queue start/stop     =
>  Runtime Rx queue setup =
>  Runtime Tx queue setup =
> +Shared Rx queue      =
>  Burst mode info      =
>  Power mgmt address monitor =
>  MTU update           =
> diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> index ff6aa91c806..4f2532a91ea 100644
> --- a/doc/guides/prog_guide/switch_representation.rst
> +++ b/doc/guides/prog_guide/switch_representation.rst
> @@ -123,6 +123,17 @@ thought as a software "patch panel" front-end for applications.
>  .. [1] `Ethernet switch device driver model (switchdev)
>         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
>
> +- For some PMDs, memory usage of representors is huge when the number of
> +  representors grows, because mbufs are allocated for each descriptor of
> +  every Rx queue. Polling a large number of ports brings more CPU load,
> +  cache misses and latency. A shared Rx queue can be used to share the Rx
> +  queue between the PF and representors within the same Rx domain.
> +  ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` in device info indicates the capability.
> +  Setting a non-zero share group in the Rx queue configuration enables
> +  sharing; share_qid identifies the shared Rx queue in the group. Polling
> +  any member port receives packets of all member ports in the group; the
> +  source port ID is saved in ``mbuf.port``.
> +
>  Basic SR-IOV
>  ------------
>
> diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
> index 3362c52a738..caf82242f2e 100644
> --- a/doc/guides/rel_notes/release_21_11.rst
> +++ b/doc/guides/rel_notes/release_21_11.rst
> @@ -75,6 +75,12 @@ New Features
>      operations.
>    * Added multi-process support.
>
> +* **Added ethdev shared Rx queue support.**
> +
> +  * Added new device capability flag and Rx domain field to switch info.
> +  * Added share group and share queue ID to Rx queue configuration.
> +  * Added testpmd support and dedicate forwarding engine.
> +
>  * **Added new RSS offload types for IPv4/L4 checksum in RSS flow.**
>
>    Added macros ETH_RSS_IPV4_CHKSUM and ETH_RSS_L4_CHKSUM, now IPv4 and
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 028907bc4b9..bc55f899f72 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -2159,6 +2159,14 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
>                 return -EINVAL;
>         }
>
> +       if (local_conf.share_group > 0 &&
> +           (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
> +               RTE_ETHDEV_LOG(ERR,
> +                       "Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%hu while device doesn't support Rx queue share\n",
> +                       port_id, rx_queue_id, local_conf.share_group);
> +               return -EINVAL;
> +       }
> +
>         /*
>          * If LRO is enabled, check that the maximum aggregated packet
>          * size is supported by the configured device.
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 6d80514ba7a..34acc91273d 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1044,6 +1044,14 @@ struct rte_eth_rxconf {
>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +       /**
> +        * Share group index in Rx domain and switch domain.
> +        * A non-zero value enables Rx queue share, zero disables share.
> +        * PMD is responsible for Rx queue consistency checks to avoid
> +        * member ports' configurations contradicting each other.
> +        */
> +       uint16_t share_group;
> +       uint16_t share_qid; /**< Shared Rx queue ID in group. */
>         /**
>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1445,6 +1453,16 @@ struct rte_eth_conf {
>  #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
>  /** Device supports Tx queue setup after device started. */
>  #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
> +/**
> + * Device supports shared Rx queue among ports within Rx domain and
> + * switch domain. Mbufs are consumed by shared Rx queue instead of
> + * each queue. Multiple groups are supported by share_group of Rx
> + * queue configuration. Shared Rx queue is identified by PMD using
> + * share_qid of Rx queue configuration. Polling any port in the group
> + * receives packets of all member ports; the source port is identified
> + * by the mbuf->port field.
> + */
> +#define RTE_ETH_DEV_CAPA_RXQ_SHARE              RTE_BIT64(2)
>  /**@}*/
>
>  /*
> @@ -1488,6 +1506,12 @@ struct rte_eth_switch_info {
>          * but each driver should explicitly define the mapping of switch
>          * port identifier to that physical interconnect/switch
>          */
> +       /**
> +        * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
> +        * and switch domain can share Rx queue. Valid only if device advertised
> +        * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
> +        */
> +       uint16_t rx_domain;
>  };
>
>  /**
> --
> 2.33.0
>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 2/7] ethdev: new API to resolve device capability name
  2021-10-20 10:52     ` Andrew Rybchenko
@ 2021-10-20 17:16       ` Ajit Khaparde
  0 siblings, 0 replies; 266+ messages in thread
From: Ajit Khaparde @ 2021-10-20 17:16 UTC (permalink / raw)
  To: Andrew Rybchenko
  Cc: Xueming Li, dpdk-dev, Zhang Yuying, Jerin Jacob, Ferruh Yigit,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ray Kinsella

On Wed, Oct 20, 2021 at 3:52 AM Andrew Rybchenko
<andrew.rybchenko@oktetlabs.ru> wrote:
>
> On 10/20/21 10:53 AM, Xueming Li wrote:
> > This patch adds API to return name of device capability.
> >
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
>
> Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>

>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx queue
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-10-20 17:29     ` Ajit Khaparde
  2021-10-20 19:14       ` Thomas Monjalon
  2021-10-21  3:49       ` Xueming(Steven) Li
  2021-10-21  3:24     ` Li, Xiaoyun
  1 sibling, 2 replies; 266+ messages in thread
From: Ajit Khaparde @ 2021-10-20 17:29 UTC (permalink / raw)
  To: Xueming Li
  Cc: dpdk-dev, Zhang Yuying, Jerin Jacob, Ferruh Yigit,
	Andrew Rybchenko, Viacheslav Ovsiienko, Thomas Monjalon,
	Lior Margalit, Ananyev Konstantin, Xiaoyun Li

On Wed, Oct 20, 2021 at 12:54 AM Xueming Li <xuemingl@nvidia.com> wrote:
>
> Adds "--rxq-share=X" parameter to enable shared RxQ, share if device
> supports, otherwise fallback to standard RxQ.
>
> Share group number grows per X ports. X defaults to MAX, implies all
> ports join share group 1. Queue ID is mapped equally with shared Rx
> queue ID.
>
> Forwarding engine "shared-rxq" should be used which Rx only and update
> stream statistics correctly.
>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> ---
>  app/test-pmd/config.c                 |  7 ++++++-
>  app/test-pmd/parameters.c             | 13 +++++++++++++
>  app/test-pmd/testpmd.c                | 20 +++++++++++++++++---
>  app/test-pmd/testpmd.h                |  2 ++
>  doc/guides/testpmd_app_ug/run_app.rst |  7 +++++++
>  5 files changed, 45 insertions(+), 4 deletions(-)
>
:::snip::::

> +
>  extern uint16_t nb_pkt_per_burst;
>  extern uint16_t nb_pkt_flowgen_clones;
>  extern int nb_flows_flowgen;
> diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
> index 640eadeff73..ff5908dcd50 100644
> --- a/doc/guides/testpmd_app_ug/run_app.rst
> +++ b/doc/guides/testpmd_app_ug/run_app.rst
> @@ -389,6 +389,13 @@ The command line options are:
>
>      Generate multiple flows in txonly mode.
>
> +*   ``--rxq-share=[X]``
> +
> +    Create queues in shared Rx queue mode if device supports.
> +    Group number grows per X ports. X defaults to MAX, implies all ports
> +    join share group 1. Forwarding engine "shared-rxq" should be used
> +    which Rx only and update stream statistics correctly.
Did you mean "with Rx only"?
Something like this?
"shared-rxq" should be used in Rx only mode.

If you say - "the Forwarding engine should update stream statistics correctly",
I think that is expected anyway? So there is no need to mention that
in the guide.


> +
>  *   ``--eth-link-speed``
>
>      Set a forced link speed to the ethernet port::
> --
> 2.33.0
>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 2/7] ethdev: new API to resolve device capability name
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 2/7] ethdev: new API to resolve device capability name Xueming Li
  2021-10-20 10:52     ` Andrew Rybchenko
@ 2021-10-20 18:42     ` Thomas Monjalon
  1 sibling, 0 replies; 266+ messages in thread
From: Thomas Monjalon @ 2021-10-20 18:42 UTC (permalink / raw)
  To: Xueming Li
  Cc: dev, Zhang Yuying, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Lior Margalit, Ananyev Konstantin,
	Ajit Khaparde, Ray Kinsella

20/10/2021 09:53, Xueming Li:
> This patch adds API to return name of device capability.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>

The title of this patch should be:
"ethdev: get device capability name as string"

Acked-by: Thomas Monjalon <thomas@monjalon.net>



^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx queue
  2021-10-20 17:29     ` Ajit Khaparde
@ 2021-10-20 19:14       ` Thomas Monjalon
  2021-10-21  4:09         ` Xueming(Steven) Li
  2021-10-21  3:49       ` Xueming(Steven) Li
  1 sibling, 1 reply; 266+ messages in thread
From: Thomas Monjalon @ 2021-10-20 19:14 UTC (permalink / raw)
  To: Xueming Li
  Cc: dev, Zhang Yuying, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Lior Margalit, Ananyev Konstantin,
	Xiaoyun Li, Ajit Khaparde

20/10/2021 19:29, Ajit Khaparde:
> On Wed, Oct 20, 2021 at 12:54 AM Xueming Li <xuemingl@nvidia.com> wrote:
> >
> > Adds "--rxq-share=X" parameter to enable shared RxQ,

You should end the sentence here.

> > share if device
> > supports, otherwise fallback to standard RxQ.
> >
> > Share group number grows per X ports.

Do you mean "Shared queues are grouped per X ports." ?

> > X defaults to MAX, implies all
> > ports join share group 1. Queue ID is mapped equally with shared Rx
> > queue ID.
> >
> > Forwarding engine "shared-rxq" should be used which Rx only and update
> > stream statistics correctly.

I suggest this wording:
"
A new forwarding engine "shared-rxq" should be used for shared Rx queues.
This engine does Rx only and updates stream statistics accordingly.
"

> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>

[...]
> +	printf("  --rxq-share: number of ports per shared rxq groups, defaults to MAX(1 group)\n");

rxq -> Rx queue
Is MAX a special value? or should it be "all queues"?
Note: space is missing before the parenthesis.

[...]
> > +*   ``--rxq-share=[X]``
> > +
> > +    Create queues in shared Rx queue mode if device supports.
> > +    Group number grows per X ports.

Again I suggest "Shared queues are grouped per X ports."

> > + X defaults to MAX, implies all ports
> > +    join share group 1. Forwarding engine "shared-rxq" should be used
> > +    which Rx only and update stream statistics correctly.
> 
> Did you mean "with Rx only"?
> Something like this?
> "shared-rxq" should be used in Rx only mode.
> 
> If you say - "the Forwarding engine should update stream statistics correctly",
> I think that is expected anyway? So there is no need to mention that
> in the guide.

I suggested a wording above.




^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 7/7] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
@ 2021-10-20 19:20     ` Thomas Monjalon
  2021-10-21  3:26       ` Li, Xiaoyun
  2021-10-21  4:39       ` Xueming(Steven) Li
  0 siblings, 2 replies; 266+ messages in thread
From: Thomas Monjalon @ 2021-10-20 19:20 UTC (permalink / raw)
  To: Xueming Li
  Cc: dev, Zhang Yuying, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Lior Margalit, Ananyev Konstantin,
	Ajit Khaparde, Xiaoyun Li

20/10/2021 09:53, Xueming Li:
> To support shared Rx queue, this patch introduces dedicate forwarding
> engine. The engine groups received packets by mbuf->port into sub-group,
> updates stream statistics and simply frees packets.

Given this engine is mentioned in previous commits,
shouldn't it be placed earlier in the series?

> +#include <stdarg.h>
> +#include <string.h>
> +#include <stdio.h>
> +#include <errno.h>
> +#include <stdint.h>
> +#include <unistd.h>
> +#include <inttypes.h>
> +
> +#include <sys/queue.h>
> +#include <sys/stat.h>
> +
> +#include <rte_common.h>
> +#include <rte_byteorder.h>
> +#include <rte_log.h>
> +#include <rte_debug.h>
> +#include <rte_cycles.h>
> +#include <rte_memory.h>
> +#include <rte_memcpy.h>
> +#include <rte_launch.h>
> +#include <rte_eal.h>
> +#include <rte_per_lcore.h>
> +#include <rte_lcore.h>
> +#include <rte_atomic.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_mempool.h>
> +#include <rte_mbuf.h>
> +#include <rte_pci.h>
> +#include <rte_ether.h>
> +#include <rte_ethdev.h>
> +#include <rte_string_fns.h>
> +#include <rte_ip.h>
> +#include <rte_udp.h>
> +#include <rte_net.h>
> +#include <rte_flow.h>

Please do not include useless files.



^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 3/7] app/testpmd: dump device capability and Rx domain info
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 3/7] app/testpmd: dump device capability and Rx domain info Xueming Li
@ 2021-10-21  3:24     ` Li, Xiaoyun
  2021-10-21  3:28       ` Ajit Khaparde
  0 siblings, 1 reply; 266+ messages in thread
From: Li, Xiaoyun @ 2021-10-21  3:24 UTC (permalink / raw)
  To: Xueming Li, dev, Zhang, Yuying
  Cc: Jerin Jacob, Yigit, Ferruh, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Ananyev,
	Konstantin, Ajit Khaparde

> -----Original Message-----
> From: Xueming Li <xuemingl@nvidia.com>
> Sent: Wednesday, October 20, 2021 15:53
> To: dev@dpdk.org; Zhang, Yuying <yuying.zhang@intel.com>
> Cc: xuemingl@nvidia.com; Jerin Jacob <jerinjacobk@gmail.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>; Viacheslav Ovsiienko
> <viacheslavo@nvidia.com>; Thomas Monjalon <thomas@monjalon.net>; Lior
> Margalit <lmargalit@nvidia.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Ajit Khaparde
> <ajit.khaparde@broadcom.com>; Li, Xiaoyun <xiaoyun.li@intel.com>
> Subject: [PATCH v11 3/7] app/testpmd: dump device capability and Rx domain
> info
> 
> Dump device capability and Rx domain ID if shared Rx queue is supported by
> device.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> ---
>  app/test-pmd/config.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 

Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx queue
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
  2021-10-20 17:29     ` Ajit Khaparde
@ 2021-10-21  3:24     ` Li, Xiaoyun
  2021-10-21  3:58       ` Xueming(Steven) Li
  1 sibling, 1 reply; 266+ messages in thread
From: Li, Xiaoyun @ 2021-10-21  3:24 UTC (permalink / raw)
  To: Xueming Li, dev, Zhang, Yuying
  Cc: Jerin Jacob, Yigit, Ferruh, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Ananyev,
	Konstantin, Ajit Khaparde

Hi

> -----Original Message-----
> From: Xueming Li <xuemingl@nvidia.com>
> Sent: Wednesday, October 20, 2021 15:53
> To: dev@dpdk.org; Zhang, Yuying <yuying.zhang@intel.com>
> Cc: xuemingl@nvidia.com; Jerin Jacob <jerinjacobk@gmail.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>; Viacheslav Ovsiienko
> <viacheslavo@nvidia.com>; Thomas Monjalon <thomas@monjalon.net>; Lior
> Margalit <lmargalit@nvidia.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Ajit Khaparde
> <ajit.khaparde@broadcom.com>; Li, Xiaoyun <xiaoyun.li@intel.com>
> Subject: [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx
> queue
> 
> Adds "--rxq-share=X" parameter to enable shared RxQ, share if device supports,
> otherwise fallback to standard RxQ.
> 
> Share group number grows per X ports. X defaults to MAX, implies all ports join

X defaults to number of probed ports.

> share group 1. Queue ID is mapped equally with shared Rx queue ID.
> 
> Forwarding engine "shared-rxq" should be used which Rx only and update
> stream statistics correctly.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> ---
>  app/test-pmd/config.c                 |  7 ++++++-
>  app/test-pmd/parameters.c             | 13 +++++++++++++
>  app/test-pmd/testpmd.c                | 20 +++++++++++++++++---
>  app/test-pmd/testpmd.h                |  2 ++
>  doc/guides/testpmd_app_ug/run_app.rst |  7 +++++++
>  5 files changed, 45 insertions(+), 4 deletions(-)
> 
> diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c index
> 2c1b06c544d..fa951a86704 100644
> --- a/app/test-pmd/config.c
> +++ b/app/test-pmd/config.c
<snip>
> @@ -1271,6 +1273,17 @@ launch_args_parse(int argc, char** argv)
>  			}
>  			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
>  				txonly_multi_flow = 1;
> +			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
> +				if (optarg == NULL) {
> +					rxq_share = UINT32_MAX;

Why not use "nb_ports" here? nb_ports is the number of probed ports.

> +				} else {
> +					n = atoi(optarg);
> +					if (n >= 0)
> +						rxq_share = (uint32_t)n;
> +					else
> +						rte_exit(EXIT_FAILURE, "rxq-
> share must be >= 0\n");
> +				}
> +			}
>  			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
>  				no_flush_rx = 1;
>  			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed"))
<snip>
> 
> +*   ``--rxq-share=[X]``
> +
> +    Create queues in shared Rx queue mode if device supports.
> +    Group number grows per X ports. X defaults to MAX, implies all ports

X defaults to number of probed ports.
I suppose this is what you mean? Also, I agree with the other comments on the wording part.

> +    join share group 1. Forwarding engine "shared-rxq" should be used
> +    which Rx only and update stream statistics correctly.
> +
>  *   ``--eth-link-speed``
> 
>      Set a forced link speed to the ethernet port::
> --
> 2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 5/7] app/testpmd: dump port info for shared Rx queue
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 5/7] app/testpmd: dump port info for " Xueming Li
@ 2021-10-21  3:24     ` Li, Xiaoyun
  0 siblings, 0 replies; 266+ messages in thread
From: Li, Xiaoyun @ 2021-10-21  3:24 UTC (permalink / raw)
  To: Xueming Li, dev, Zhang, Yuying
  Cc: Jerin Jacob, Yigit, Ferruh, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Ananyev,
	Konstantin, Ajit Khaparde

> -----Original Message-----
> From: Xueming Li <xuemingl@nvidia.com>
> Sent: Wednesday, October 20, 2021 15:53
> To: dev@dpdk.org; Zhang, Yuying <yuying.zhang@intel.com>
> Cc: xuemingl@nvidia.com; Jerin Jacob <jerinjacobk@gmail.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>; Viacheslav Ovsiienko
> <viacheslavo@nvidia.com>; Thomas Monjalon <thomas@monjalon.net>; Lior
> Margalit <lmargalit@nvidia.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Ajit Khaparde
> <ajit.khaparde@broadcom.com>; Li, Xiaoyun <xiaoyun.li@intel.com>
> Subject: [PATCH v11 5/7] app/testpmd: dump port info for shared Rx queue
> 
> In case of shared Rx queue, polling any member port returns mbufs for all
> members. This patch dumps mbuf->port for each packet.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> ---
>  app/test-pmd/util.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c index
> 51506e49404..e98f136d5ed 100644
> --- a/app/test-pmd/util.c
> +++ b/app/test-pmd/util.c
> @@ -100,6 +100,9 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue,
> struct rte_mbuf *pkts[],
>  		struct rte_flow_restore_info info = { 0, };
> 
>  		mb = pkts[i];
> +		if (rxq_share > 0)
> +			MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ",
> +				  mb->port);
>  		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr),
> &_eth_hdr);
>  		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
>  		packet_type = mb->packet_type;
> --
> 2.33.0

Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 6/7] app/testpmd: force shared Rx queue polled on same core
  2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 6/7] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-10-21  3:24     ` Li, Xiaoyun
  2021-10-21  4:21       ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Li, Xiaoyun @ 2021-10-21  3:24 UTC (permalink / raw)
  To: Xueming Li, dev, Zhang, Yuying
  Cc: Jerin Jacob, Yigit, Ferruh, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Ananyev,
	Konstantin, Ajit Khaparde

Hi

> -----Original Message-----
> From: Xueming Li <xuemingl@nvidia.com>
> Sent: Wednesday, October 20, 2021 15:53
> To: dev@dpdk.org; Zhang, Yuying <yuying.zhang@intel.com>
> Cc: xuemingl@nvidia.com; Jerin Jacob <jerinjacobk@gmail.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>; Viacheslav Ovsiienko
> <viacheslavo@nvidia.com>; Thomas Monjalon <thomas@monjalon.net>; Lior
> Margalit <lmargalit@nvidia.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Ajit Khaparde
> <ajit.khaparde@broadcom.com>; Li, Xiaoyun <xiaoyun.li@intel.com>
> Subject: [PATCH v11 6/7] app/testpmd: force shared Rx queue polled on same
> core
> 
> Shared Rx queue must be polled on same core. This patch checks and stops
> forwarding if shared RxQ being scheduled on multiple cores.
> 
> It's suggested to use same number of Rx queues and polling cores.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> ---
>  app/test-pmd/config.c  | 103
> +++++++++++++++++++++++++++++++++++++++++
>  app/test-pmd/testpmd.c |   4 +-
>  app/test-pmd/testpmd.h |   2 +
>  3 files changed, 108 insertions(+), 1 deletion(-)
> 
> diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c index
> fa951a86704..1f1307178be 100644
> --- a/app/test-pmd/config.c
> +++ b/app/test-pmd/config.c
> @@ -2915,6 +2915,109 @@ port_rss_hash_key_update(portid_t port_id, char
> rss_type[], uint8_t *hash_key,
>  	}
>  }
> 
> +/*
> + * Check whether a shared rxq scheduled on other lcores.
> + */
> +static bool
> +fwd_stream_on_other_lcores(uint16_t domain_id, lcoreid_t src_lc,
> +			   portid_t src_port, queueid_t src_rxq,
> +			   uint32_t share_group, queueid_t share_rxq) {
> +	streamid_t sm_id;
> +	streamid_t nb_fs_per_lcore;
> +	lcoreid_t  nb_fc;
> +	lcoreid_t  lc_id;
> +	struct fwd_stream *fs;
> +	struct rte_port *port;
> +	struct rte_eth_dev_info *dev_info;
> +	struct rte_eth_rxconf *rxq_conf;
> +
> +	nb_fc = cur_fwd_config.nb_fwd_lcores;
> +	/* Check remaining cores. */
> +	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
> +		sm_id = fwd_lcores[lc_id]->stream_idx;
> +		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
> +		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
> +		     sm_id++) {
> +			fs = fwd_streams[sm_id];
> +			port = &ports[fs->rx_port];
> +			dev_info = &port->dev_info;
> +			rxq_conf = &port->rx_conf[fs->rx_queue];
> +			if ((dev_info->dev_capa &
> RTE_ETH_DEV_CAPA_RXQ_SHARE)
> +			    == 0)
> +				/* Not shared rxq. */
> +				continue;
> +			if (domain_id != port->dev_info.switch_info.domain_id)
> +				continue;
> +			if (rxq_conf->share_group != share_group)
> +				continue;
> +			if (rxq_conf->share_qid != share_rxq)
> +				continue;
> +			printf("Shared Rx queue group %u queue %hu can't be
> scheduled on different cores:\n",
> +			       share_group, share_rxq);
> +			printf("  lcore %hhu Port %hu queue %hu\n",
> +			       src_lc, src_port, src_rxq);
> +			printf("  lcore %hhu Port %hu queue %hu\n",
> +			       lc_id, fs->rx_port, fs->rx_queue);
> +			printf("Please use --nb-cores=%hu to limit number of
> forwarding cores\n",
> +			       nb_rxq);
> +			return true;
> +		}
> +	}
> +	return false;
> +}
> +
> +/*
> + * Check shared rxq configuration.
> + *
> + * Shared group must not being scheduled on different core.
> + */
> +bool
> +pkt_fwd_shared_rxq_check(void)
> +{
> +	streamid_t sm_id;
> +	streamid_t nb_fs_per_lcore;
> +	lcoreid_t  nb_fc;
> +	lcoreid_t  lc_id;
> +	struct fwd_stream *fs;
> +	uint16_t domain_id;
> +	struct rte_port *port;
> +	struct rte_eth_dev_info *dev_info;
> +	struct rte_eth_rxconf *rxq_conf;
> +
> +	nb_fc = cur_fwd_config.nb_fwd_lcores;
> +	/*
> +	 * Check streams on each core, make sure the same switch domain +
> +	 * group + queue doesn't get scheduled on other cores.
> +	 */
> +	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
> +		sm_id = fwd_lcores[lc_id]->stream_idx;
> +		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
> +		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
> +		     sm_id++) {
> +			fs = fwd_streams[sm_id];
> +			/* Update lcore info stream being scheduled. */
> +			fs->lcore = fwd_lcores[lc_id];
> +			port = &ports[fs->rx_port];
> +			dev_info = &port->dev_info;
> +			rxq_conf = &port->rx_conf[fs->rx_queue];
> +			if ((dev_info->dev_capa &
> RTE_ETH_DEV_CAPA_RXQ_SHARE)
> +			    == 0)
> +				/* Not shared rxq. */
> +				continue;
> +			/* Check shared rxq not scheduled on remaining cores.

The check will be done anyway as long as the dev has the share_rxq capability.
But what if the user wants a normal queue config while using a dev which has the share_rxq capability?
You should only apply the check when "rxq_share > 0".

> */
> +			domain_id = port->dev_info.switch_info.domain_id;
> +			if (fwd_stream_on_other_lcores(domain_id, lc_id,
> +						       fs->rx_port,
> +						       fs->rx_queue,
> +						       rxq_conf->share_group,
> +						       rxq_conf->share_qid))
> +				return false;
> +		}
> +	}
> +	return true;
> +}
> +
>  /*
>   * Setup forwarding configuration for each logical core.
>   */
> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index
> 123142ed110..f3f81ef561f 100644
> --- a/app/test-pmd/testpmd.c
> +++ b/app/test-pmd/testpmd.c
> @@ -2236,10 +2236,12 @@ start_packet_forwarding(int with_tx_first)
> 
>  	fwd_config_setup();
> 
> +	pkt_fwd_config_display(&cur_fwd_config);
> +	if (!pkt_fwd_shared_rxq_check())

Same comment as above
This check should only happen if the user enables "--rxq-share=[X]".
You can limit the check here too.
If (rxq_share > 0 && !pkt_fwd_shared_rxq_check())

> +		return;
>  	if(!no_flush_rx)
>  		flush_fwd_rx_queues();
> 
> -	pkt_fwd_config_display(&cur_fwd_config);
>  	rxtx_config_display();
> 
>  	fwd_stats_reset();
> diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> 3dfaaad94c0..f121a2da90c 100644
> --- a/app/test-pmd/testpmd.h
> +++ b/app/test-pmd/testpmd.h
> @@ -144,6 +144,7 @@ struct fwd_stream {
>  	uint64_t     core_cycles; /**< used for RX and TX processing */
>  	struct pkt_burst_stats rx_burst_stats;
>  	struct pkt_burst_stats tx_burst_stats;
> +	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
>  };
> 
>  /**
> @@ -795,6 +796,7 @@ void port_summary_header_display(void);
>  void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);  void
> tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);  void
> fwd_lcores_config_display(void);
> +bool pkt_fwd_shared_rxq_check(void);
>  void pkt_fwd_config_display(struct fwd_config *cfg);  void
> rxtx_config_display(void);  void fwd_config_setup(void);
> --
> 2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 7/7] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-20 19:20     ` Thomas Monjalon
@ 2021-10-21  3:26       ` Li, Xiaoyun
  2021-10-21  4:39       ` Xueming(Steven) Li
  1 sibling, 0 replies; 266+ messages in thread
From: Li, Xiaoyun @ 2021-10-21  3:26 UTC (permalink / raw)
  To: Thomas Monjalon, Xueming Li
  Cc: dev, Zhang, Yuying, Jerin Jacob, Yigit, Ferruh, Andrew Rybchenko,
	Viacheslav Ovsiienko, Lior Margalit, Ananyev, Konstantin,
	Ajit Khaparde

> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Thursday, October 21, 2021 03:20
> To: Xueming Li <xuemingl@nvidia.com>
> Cc: dev@dpdk.org; Zhang, Yuying <yuying.zhang@intel.com>; Jerin Jacob
> <jerinjacobk@gmail.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; Andrew
> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Viacheslav Ovsiienko
> <viacheslavo@nvidia.com>; Lior Margalit <lmargalit@nvidia.com>; Ananyev,
> Konstantin <konstantin.ananyev@intel.com>; Ajit Khaparde
> <ajit.khaparde@broadcom.com>; Li, Xiaoyun <xiaoyun.li@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v11 7/7] app/testpmd: add forwarding engine
> for shared Rx queue
> 
> 20/10/2021 09:53, Xueming Li:
> > To support shared Rx queue, this patch introduces dedicate forwarding
> > engine. The engine groups received packets by mbuf->port into
> > sub-group, updates stream statistics and simply frees packets.
> 
> Given this engine is mentioned in previous commits, shouldn't it be placed earlier
> in the series?
> 
> > +#include <stdarg.h>
> > +#include <string.h>
> > +#include <stdio.h>
> > +#include <errno.h>
> > +#include <stdint.h>
> > +#include <unistd.h>
> > +#include <inttypes.h>
> > +
> > +#include <sys/queue.h>
> > +#include <sys/stat.h>
> > +
> > +#include <rte_common.h>
> > +#include <rte_byteorder.h>
> > +#include <rte_log.h>
> > +#include <rte_debug.h>
> > +#include <rte_cycles.h>
> > +#include <rte_memory.h>
> > +#include <rte_memcpy.h>
> > +#include <rte_launch.h>
> > +#include <rte_eal.h>
> > +#include <rte_per_lcore.h>
> > +#include <rte_lcore.h>
> > +#include <rte_atomic.h>
> > +#include <rte_branch_prediction.h>
> > +#include <rte_mempool.h>
> > +#include <rte_mbuf.h>
> > +#include <rte_pci.h>
> > +#include <rte_ether.h>
> > +#include <rte_ethdev.h>
> > +#include <rte_string_fns.h>
> > +#include <rte_ip.h>
> > +#include <rte_udp.h>
> > +#include <rte_net.h>
> > +#include <rte_flow.h>
> 
> Please do not include useless files.
+1

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 3/7] app/testpmd: dump device capability and Rx domain info
  2021-10-21  3:24     ` Li, Xiaoyun
@ 2021-10-21  3:28       ` Ajit Khaparde
  0 siblings, 0 replies; 266+ messages in thread
From: Ajit Khaparde @ 2021-10-21  3:28 UTC (permalink / raw)
  To: Li, Xiaoyun
  Cc: Xueming Li, dev, Zhang, Yuying, Jerin Jacob, Yigit, Ferruh,
	Andrew Rybchenko, Viacheslav Ovsiienko, Thomas Monjalon,
	Lior Margalit, Ananyev, Konstantin

On Wed, Oct 20, 2021 at 8:24 PM Li, Xiaoyun <xiaoyun.li@intel.com> wrote:
>
> > -----Original Message-----
> > From: Xueming Li <xuemingl@nvidia.com>
> > Sent: Wednesday, October 20, 2021 15:53
> > To: dev@dpdk.org; Zhang, Yuying <yuying.zhang@intel.com>
> > Cc: xuemingl@nvidia.com; Jerin Jacob <jerinjacobk@gmail.com>; Yigit, Ferruh
> > <ferruh.yigit@intel.com>; Andrew Rybchenko
> > <andrew.rybchenko@oktetlabs.ru>; Viacheslav Ovsiienko
> > <viacheslavo@nvidia.com>; Thomas Monjalon <thomas@monjalon.net>; Lior
> > Margalit <lmargalit@nvidia.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; Ajit Khaparde
> > <ajit.khaparde@broadcom.com>; Li, Xiaoyun <xiaoyun.li@intel.com>
> > Subject: [PATCH v11 3/7] app/testpmd: dump device capability and Rx domain
> > info
> >
> > Dump device capability and Rx domain ID if shared Rx queue is supported by
> > device.
> >
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > ---
> >  app/test-pmd/config.c | 29 +++++++++++++++++++++++++++++
> >  1 file changed, 29 insertions(+)
> >
>
> Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx queue
  2021-10-20 17:29     ` Ajit Khaparde
  2021-10-20 19:14       ` Thomas Monjalon
@ 2021-10-21  3:49       ` Xueming(Steven) Li
  1 sibling, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-21  3:49 UTC (permalink / raw)
  To: ajit.khaparde
  Cc: konstantin.ananyev, jerinjacobk, yuying.zhang, Slava Ovsiienko,
	ferruh.yigit, andrew.rybchenko, Lior Margalit, dev,
	NBU-Contact-Thomas Monjalon, xiaoyun.li

On Wed, 2021-10-20 at 10:29 -0700, Ajit Khaparde wrote:
> On Wed, Oct 20, 2021 at 12:54 AM Xueming Li <xuemingl@nvidia.com> wrote:
> > 
> > Adds "--rxq-share=X" parameter to enable shared RxQ, share if device
> > supports, otherwise fallback to standard RxQ.
> > 
> > Share group number grows per X ports. X defaults to MAX, implies all
> > ports join share group 1. Queue ID is mapped equally with shared Rx
> > queue ID.
> > 
> > Forwarding engine "shared-rxq" should be used which Rx only and update
> > stream statistics correctly.
> > 
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > ---
> >  app/test-pmd/config.c                 |  7 ++++++-
> >  app/test-pmd/parameters.c             | 13 +++++++++++++
> >  app/test-pmd/testpmd.c                | 20 +++++++++++++++++---
> >  app/test-pmd/testpmd.h                |  2 ++
> >  doc/guides/testpmd_app_ug/run_app.rst |  7 +++++++
> >  5 files changed, 45 insertions(+), 4 deletions(-)
> > 
> :::snip::::
> 
> > +
> >  extern uint16_t nb_pkt_per_burst;
> >  extern uint16_t nb_pkt_flowgen_clones;
> >  extern int nb_flows_flowgen;
> > diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
> > index 640eadeff73..ff5908dcd50 100644
> > --- a/doc/guides/testpmd_app_ug/run_app.rst
> > +++ b/doc/guides/testpmd_app_ug/run_app.rst
> > @@ -389,6 +389,13 @@ The command line options are:
> > 
> >      Generate multiple flows in txonly mode.
> > 
> > +*   ``--rxq-share=[X]``
> > +
> > +    Create queues in shared Rx queue mode if device supports.
> > +    Group number grows per X ports. X defaults to MAX, implies all ports
> > +    join share group 1. Forwarding engine "shared-rxq" should be used
> > +    which Rx only and update stream statistics correctly.
> Did you mean "with Rx only"?
> Something like this?
> "shared-rxq" should be used in Rx only mode.
> 
> If you say - "the Forwarding engine should update stream statistics correctly",
> I think that is expected anyway? So there is no need to mention that
> in the guide.

I will change it like this:
"shared-rxq" should be used; other forwarding engines can't resolve the
source stream correctly, so statistics and the forwarding target could
be wrong.

> 
> 
> > +
> >  *   ``--eth-link-speed``
> > 
> >      Set a forced link speed to the ethernet port::
> > --
> > 2.33.0
> > 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx queue
  2021-10-21  3:24     ` Li, Xiaoyun
@ 2021-10-21  3:58       ` Xueming(Steven) Li
  2021-10-21  5:15         ` Li, Xiaoyun
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-21  3:58 UTC (permalink / raw)
  To: xiaoyun.li, yuying.zhang, dev
  Cc: konstantin.ananyev, jerinjacobk, NBU-Contact-Thomas Monjalon,
	Slava Ovsiienko, ajit.khaparde, ferruh.yigit, andrew.rybchenko,
	Lior Margalit

On Thu, 2021-10-21 at 03:24 +0000, Li, Xiaoyun wrote:
> Hi
> 
> > -----Original Message-----
> > From: Xueming Li <xuemingl@nvidia.com>
> > Sent: Wednesday, October 20, 2021 15:53
> > To: dev@dpdk.org; Zhang, Yuying <yuying.zhang@intel.com>
> > Cc: xuemingl@nvidia.com; Jerin Jacob <jerinjacobk@gmail.com>; Yigit, Ferruh
> > <ferruh.yigit@intel.com>; Andrew Rybchenko
> > <andrew.rybchenko@oktetlabs.ru>; Viacheslav Ovsiienko
> > <viacheslavo@nvidia.com>; Thomas Monjalon <thomas@monjalon.net>; Lior
> > Margalit <lmargalit@nvidia.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; Ajit Khaparde
> > <ajit.khaparde@broadcom.com>; Li, Xiaoyun <xiaoyun.li@intel.com>
> > Subject: [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx
> > queue
> > 
> > Adds "--rxq-share=X" parameter to enable shared RxQ, share if device supports,
> > otherwise fallback to standard RxQ.
> > 
> > Share group number grows per X ports. X defaults to MAX, implies all ports join
> 
> X defaults to number of probed ports.

I will change to UINT32_MAX, thanks.

> 
> > share group 1. Queue ID is mapped equally with shared Rx queue ID.
> > 
> > Forwarding engine "shared-rxq" should be used which Rx only and update
> > stream statistics correctly.
> > 
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > ---
> >  app/test-pmd/config.c                 |  7 ++++++-
> >  app/test-pmd/parameters.c             | 13 +++++++++++++
> >  app/test-pmd/testpmd.c                | 20 +++++++++++++++++---
> >  app/test-pmd/testpmd.h                |  2 ++
> >  doc/guides/testpmd_app_ug/run_app.rst |  7 +++++++
> >  5 files changed, 45 insertions(+), 4 deletions(-)
> > 
> > diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c index
> > 2c1b06c544d..fa951a86704 100644
> > --- a/app/test-pmd/config.c
> > +++ b/app/test-pmd/config.c
> <snip>
> > @@ -1271,6 +1273,17 @@ launch_args_parse(int argc, char** argv)
> >  			}
> >  			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
> >  				txonly_multi_flow = 1;
> > +			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
> > +				if (optarg == NULL) {
> > +					rxq_share = UINT32_MAX;
> 
> Why not use "nb_ports" here? nb_ports is the number of probed ports.

Considering hotplug, nb_ports could grow later, so I think UINT32_MAX
is safe.

> 
> > +				} else {
> > +					n = atoi(optarg);
> > +					if (n >= 0)
> > +						rxq_share = (uint32_t)n;
> > +					else
> > +						rte_exit(EXIT_FAILURE, "rxq-
> > share must be >= 0\n");
> > +				}
> > +			}
> >  			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
> >  				no_flush_rx = 1;
> >  			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed"))
> <snip>
> > 
> > +*   ``--rxq-share=[X]``
> > +
> > +    Create queues in shared Rx queue mode if device supports.
> > +    Group number grows per X ports. X defaults to MAX, implies all ports
> 
> X defaults to number of probed ports.
> I suppose this is what you mean? Also, I agree with the other comments on the wording part
> 
> > +    join share group 1. Forwarding engine "shared-rxq" should be used
> > +    which Rx only and update stream statistics correctly.
> > +
> >  *   ``--eth-link-speed``
> > 
> >      Set a forced link speed to the ethernet port::
> > --
> > 2.33.0
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx queue
  2021-10-20 19:14       ` Thomas Monjalon
@ 2021-10-21  4:09         ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-21  4:09 UTC (permalink / raw)
  To: NBU-Contact-Thomas Monjalon
  Cc: konstantin.ananyev, jerinjacobk, yuying.zhang, Slava Ovsiienko,
	ferruh.yigit, ajit.khaparde, andrew.rybchenko, Lior Margalit,
	dev, xiaoyun.li

On Wed, 2021-10-20 at 21:14 +0200, Thomas Monjalon wrote:
> 20/10/2021 19:29, Ajit Khaparde:
> > On Wed, Oct 20, 2021 at 12:54 AM Xueming Li <xuemingl@nvidia.com> wrote:
> > > 
> > > Adds "--rxq-share=X" parameter to enable shared RxQ,
> 
> You should end the sentence here.
> 
> > > share if device
> > > supports, otherwise fallback to standard RxQ.
> > > 
> > > Share group number grows per X ports.
> 
> Do you mean "Shared queues are grouped per X ports." ?
> 
> > > X defaults to MAX, implies all
> > > ports join share group 1. Queue ID is mapped equally with shared Rx
> > > queue ID.
> > > 
> > > Forwarding engine "shared-rxq" should be used which Rx only and update
> > > stream statistics correctly.
> 
> I suggest this wording:
> "
> A new forwarding engine "shared-rxq" should be used for shared Rx queues.
> This engine does Rx only and updates stream statistics accordingly.
> "
> 
> > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> 
> [...]
> > +	printf("  --rxq-share: number of ports per shared rxq groups, defaults to MAX(1 group)\n");
> 
> rxq -> Rx queue
> Is MAX a special value? or should it be "all queues"?
> Note: space is missing before the parenthesis.
> 
> [...]
> > > +*   ``--rxq-share=[X]``
> > > +
> > > +    Create queues in shared Rx queue mode if device supports.
> > > +    Group number grows per X ports.
> 
> Again I suggest "Shared queues are grouped per X ports."
> 
> > > + X defaults to MAX, implies all ports
> > > +    join share group 1. Forwarding engine "shared-rxq" should be used
> > > +    which Rx only and update stream statistics correctly.
> > 
> > Did you mean "with Rx only"?
> > Something like this?
> > "shared-rxq" should be used in Rx only mode.
> > 
> > If you say - "the Forwarding engine should update stream statistics correctly",
> > I think that is expected anyway? So there is no need to mention that
> > in the guide.
> 
> I suggested a wording above.
> 

Looks good, thanks Ajit and Thomas!

> 
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 6/7] app/testpmd: force shared Rx queue polled on same core
  2021-10-21  3:24     ` Li, Xiaoyun
@ 2021-10-21  4:21       ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-21  4:21 UTC (permalink / raw)
  To: xiaoyun.li, yuying.zhang, dev
  Cc: konstantin.ananyev, jerinjacobk, NBU-Contact-Thomas Monjalon,
	Slava Ovsiienko, ajit.khaparde, ferruh.yigit, andrew.rybchenko,
	Lior Margalit

On Thu, 2021-10-21 at 03:24 +0000, Li, Xiaoyun wrote:
> Hi
> 
> > -----Original Message-----
> > From: Xueming Li <xuemingl@nvidia.com>
> > Sent: Wednesday, October 20, 2021 15:53
> > To: dev@dpdk.org; Zhang, Yuying <yuying.zhang@intel.com>
> > Cc: xuemingl@nvidia.com; Jerin Jacob <jerinjacobk@gmail.com>; Yigit, Ferruh
> > <ferruh.yigit@intel.com>; Andrew Rybchenko
> > <andrew.rybchenko@oktetlabs.ru>; Viacheslav Ovsiienko
> > <viacheslavo@nvidia.com>; Thomas Monjalon <thomas@monjalon.net>; Lior
> > Margalit <lmargalit@nvidia.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; Ajit Khaparde
> > <ajit.khaparde@broadcom.com>; Li, Xiaoyun <xiaoyun.li@intel.com>
> > Subject: [PATCH v11 6/7] app/testpmd: force shared Rx queue polled on same
> > core
> > 
> > Shared Rx queue must be polled on same core. This patch checks and stops
> > forwarding if shared RxQ being scheduled on multiple cores.
> > 
> > It's suggested to use same number of Rx queues and polling cores.
> > 
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > ---
> >  app/test-pmd/config.c  | 103
> > +++++++++++++++++++++++++++++++++++++++++
> >  app/test-pmd/testpmd.c |   4 +-
> >  app/test-pmd/testpmd.h |   2 +
> >  3 files changed, 108 insertions(+), 1 deletion(-)
> > 
> > diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c index
> > fa951a86704..1f1307178be 100644
> > --- a/app/test-pmd/config.c
> > +++ b/app/test-pmd/config.c
> > @@ -2915,6 +2915,109 @@ port_rss_hash_key_update(portid_t port_id, char
> > rss_type[], uint8_t *hash_key,
> >  	}
> >  }
> > 
> > +/*
> > + * Check whether a shared rxq scheduled on other lcores.
> > + */
> > +static bool
> > +fwd_stream_on_other_lcores(uint16_t domain_id, lcoreid_t src_lc,
> > +			   portid_t src_port, queueid_t src_rxq,
> > +			   uint32_t share_group, queueid_t share_rxq) {
> > +	streamid_t sm_id;
> > +	streamid_t nb_fs_per_lcore;
> > +	lcoreid_t  nb_fc;
> > +	lcoreid_t  lc_id;
> > +	struct fwd_stream *fs;
> > +	struct rte_port *port;
> > +	struct rte_eth_dev_info *dev_info;
> > +	struct rte_eth_rxconf *rxq_conf;
> > +
> > +	nb_fc = cur_fwd_config.nb_fwd_lcores;
> > +	/* Check remaining cores. */
> > +	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
> > +		sm_id = fwd_lcores[lc_id]->stream_idx;
> > +		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
> > +		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
> > +		     sm_id++) {
> > +			fs = fwd_streams[sm_id];
> > +			port = &ports[fs->rx_port];
> > +			dev_info = &port->dev_info;
> > +			rxq_conf = &port->rx_conf[fs->rx_queue];
> > +			if ((dev_info->dev_capa &
> > RTE_ETH_DEV_CAPA_RXQ_SHARE)
> > +			    == 0)
> > +				/* Not shared rxq. */
> > +				continue;
> > +			if (domain_id != port->dev_info.switch_info.domain_id)
> > +				continue;
> > +			if (rxq_conf->share_group != share_group)
> > +				continue;
> > +			if (rxq_conf->share_qid != share_rxq)
> > +				continue;
> > +			printf("Shared Rx queue group %u queue %hu can't be
> > scheduled on different cores:\n",
> > +			       share_group, share_rxq);
> > +			printf("  lcore %hhu Port %hu queue %hu\n",
> > +			       src_lc, src_port, src_rxq);
> > +			printf("  lcore %hhu Port %hu queue %hu\n",
> > +			       lc_id, fs->rx_port, fs->rx_queue);
> > +			printf("Please use --nb-cores=%hu to limit number of
> > forwarding cores\n",
> > +			       nb_rxq);
> > +			return true;
> > +		}
> > +	}
> > +	return false;
> > +}
> > +
> > +/*
> > + * Check shared rxq configuration.
> > + *
> > + * Shared group must not being scheduled on different core.
> > + */
> > +bool
> > +pkt_fwd_shared_rxq_check(void)
> > +{
> > +	streamid_t sm_id;
> > +	streamid_t nb_fs_per_lcore;
> > +	lcoreid_t  nb_fc;
> > +	lcoreid_t  lc_id;
> > +	struct fwd_stream *fs;
> > +	uint16_t domain_id;
> > +	struct rte_port *port;
> > +	struct rte_eth_dev_info *dev_info;
> > +	struct rte_eth_rxconf *rxq_conf;
> > +
> > +	nb_fc = cur_fwd_config.nb_fwd_lcores;
> > +	/*
> > +	 * Check streams on each core, make sure the same switch domain +
> > +	 * group + queue doesn't get scheduled on other cores.
> > +	 */
> > +	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
> > +		sm_id = fwd_lcores[lc_id]->stream_idx;
> > +		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
> > +		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
> > +		     sm_id++) {
> > +			fs = fwd_streams[sm_id];
> > +			/* Update lcore info stream being scheduled. */
> > +			fs->lcore = fwd_lcores[lc_id];
> > +			port = &ports[fs->rx_port];
> > +			dev_info = &port->dev_info;
> > +			rxq_conf = &port->rx_conf[fs->rx_queue];
> > +			if ((dev_info->dev_capa &
> > RTE_ETH_DEV_CAPA_RXQ_SHARE)
> > +			    == 0)
> > +				/* Not shared rxq. */
> > +				continue;
> > +			/* Check shared rxq not scheduled on remaining cores.
> 
> The check will be done anyway as long as the dev has the share_rxq capability.
> But what if the user wants a normal queue config while using a dev which has the share_rxq capability?

Good catch, thanks!

> You should only apply the check when "rxq_share > 0".

Yes, will add this at top of this function.

> 
> > */
> > +			domain_id = port->dev_info.switch_info.domain_id;
> > +			if (fwd_stream_on_other_lcores(domain_id, lc_id,
> > +						       fs->rx_port,
> > +						       fs->rx_queue,
> > +						       rxq_conf->share_group,
> > +						       rxq_conf->share_qid))
> > +				return false;
> > +		}
> > +	}
> > +	return true;
> > +}
> > +
> >  /*
> >   * Setup forwarding configuration for each logical core.
> >   */
> > diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index
> > 123142ed110..f3f81ef561f 100644
> > --- a/app/test-pmd/testpmd.c
> > +++ b/app/test-pmd/testpmd.c
> > @@ -2236,10 +2236,12 @@ start_packet_forwarding(int with_tx_first)
> > 
> >  	fwd_config_setup();
> > 
> > +	pkt_fwd_config_display(&cur_fwd_config);
> > +	if (!pkt_fwd_shared_rxq_check())
> 
> Same comment as above
> This check should only happen if the user enables "--rxq-share=[X]".
> You can limit the check here too.
> If (rxq_share > 0 && !pkt_fwd_shared_rxq_check())

I will add the rxq_share > 0 check at the beginning of
pkt_fwd_shared_rxq_check(), thanks!

> 
> > +		return;
> >  	if(!no_flush_rx)
> >  		flush_fwd_rx_queues();
> > 
> > -	pkt_fwd_config_display(&cur_fwd_config);
> >  	rxtx_config_display();
> > 
> >  	fwd_stats_reset();
> > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > 3dfaaad94c0..f121a2da90c 100644
> > --- a/app/test-pmd/testpmd.h
> > +++ b/app/test-pmd/testpmd.h
> > @@ -144,6 +144,7 @@ struct fwd_stream {
> >  	uint64_t     core_cycles; /**< used for RX and TX processing */
> >  	struct pkt_burst_stats rx_burst_stats;
> >  	struct pkt_burst_stats tx_burst_stats;
> > +	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
> >  };
> > 
> >  /**
> > @@ -795,6 +796,7 @@ void port_summary_header_display(void);
> >  void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);  void
> > tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);  void
> > fwd_lcores_config_display(void);
> > +bool pkt_fwd_shared_rxq_check(void);
> >  void pkt_fwd_config_display(struct fwd_config *cfg);  void
> > rxtx_config_display(void);  void fwd_config_setup(void);
> > --
> > 2.33.0
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 7/7] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-20 19:20     ` Thomas Monjalon
  2021-10-21  3:26       ` Li, Xiaoyun
@ 2021-10-21  4:39       ` Xueming(Steven) Li
  1 sibling, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-21  4:39 UTC (permalink / raw)
  To: NBU-Contact-Thomas Monjalon
  Cc: konstantin.ananyev, jerinjacobk, yuying.zhang, Slava Ovsiienko,
	ferruh.yigit, ajit.khaparde, andrew.rybchenko, Lior Margalit,
	dev, xiaoyun.li

On Wed, 2021-10-20 at 21:20 +0200, Thomas Monjalon wrote:
> 20/10/2021 09:53, Xueming Li:
> > To support shared Rx queue, this patch introduces dedicate forwarding
> > engine. The engine groups received packets by mbuf->port into sub-group,
> > updates stream statistics and simply frees packets.
> 
> Given this engine is mentioned in previous commits,
> shouldn't it be placed earlier in the series?

There would be a compilation issue, so I'll move the documentation
change into this patch.

> 
> > +#include <stdarg.h>
> > +#include <string.h>
> > +#include <stdio.h>
> > +#include <errno.h>
> > +#include <stdint.h>
> > +#include <unistd.h>
> > +#include <inttypes.h>
> > +
> > +#include <sys/queue.h>
> > +#include <sys/stat.h>
> > +
> > +#include <rte_common.h>
> > +#include <rte_byteorder.h>
> > +#include <rte_log.h>
> > +#include <rte_debug.h>
> > +#include <rte_cycles.h>
> > +#include <rte_memory.h>
> > +#include <rte_memcpy.h>
> > +#include <rte_launch.h>
> > +#include <rte_eal.h>
> > +#include <rte_per_lcore.h>
> > +#include <rte_lcore.h>
> > +#include <rte_atomic.h>
> > +#include <rte_branch_prediction.h>
> > +#include <rte_mempool.h>
> > +#include <rte_mbuf.h>
> > +#include <rte_pci.h>
> > +#include <rte_ether.h>
> > +#include <rte_ethdev.h>
> > +#include <rte_string_fns.h>
> > +#include <rte_ip.h>
> > +#include <rte_udp.h>
> > +#include <rte_net.h>
> > +#include <rte_flow.h>
> 
> Please do not include useless files.
> 
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v12 0/7] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (12 preceding siblings ...)
  2021-10-20  7:53 ` [dpdk-dev] [PATCH v11 0/7] ethdev: introduce " Xueming Li
@ 2021-10-21  5:08 ` Xueming Li
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 1/7] " Xueming Li
                     ` (6 more replies)
  2021-10-21 10:41 ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Xueming Li
                   ` (2 subsequent siblings)
  16 siblings, 7 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-21  5:08 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

In the current DPDK framework, all Rx queues are pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch series introduces shared Rx queues. A PF and representors in
the same Rx domain and switch domain can share an Rx queue set by
specifying a non-zero share group value in the Rx queue configuration.

All ports that share an Rx queue actually share the hardware descriptor
queue and feed all Rx queues from one descriptor supply, so memory is saved.

Polling any queue that uses the same shared Rx queue receives packets
from all member ports. The source port is identified by mbuf->port.

Multiple groups are supported via the group ID. The queue number of
each port in a shared group should be identical. Queue indexes are 1:1
mapped within a shared group. An example of two share groups:
 Group1, 4 shared Rx queues per member port: PF, repr0, repr1
 Group2, 2 shared Rx queues per member port: repr2, repr3, ... repr127
 Poll first port for each group:
  core	port	queue
  0	0	0
  1	0	1
  2	0	2
  3	0	3
  4	2	0
  5	2	1

A shared Rx queue must be polled on a single thread or core. If both
PF0 and representor0 join the same share group, pf0rxq0 can't be polled
on core1 and rep0rxq0 on core2. Actually, polling one port within a
share group is sufficient, since polling any port in the group returns
packets for all ports in the group.
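
As a minimal sketch of this model (not part of the series; the port ID,
descriptor count and mempool below are assumptions), an application
could request a shared Rx queue roughly like this:

#include <errno.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>

/*
 * Put queue 0 of one member port into share group 1. Calling this for
 * each member port (PF and representors in the same switch/Rx domain)
 * makes them share one hardware descriptor queue.
 */
static int
setup_shared_rxq(uint16_t port_id, struct rte_mempool *mp)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;
	int ret;

	ret = rte_eth_dev_info_get(port_id, &dev_info);
	if (ret != 0)
		return ret;
	if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0)
		return -ENOTSUP; /* fall back to a standard Rx queue */

	rxconf = dev_info.default_rxconf;
	rxconf.share_group = 1;	/* non-zero value enables sharing */
	rxconf.share_qid = 0;	/* shared Rx queue ID within the group */
	return rte_eth_rx_queue_setup(port_id, 0, 1024, rte_socket_id(),
				      &rxconf, mp);
}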

There was some discussion about aggregating member ports in the same
group into a dummy port, with several ways to achieve it. Since it is
optional, more feedback and requirements need to be collected from
users to make a better decision later.

v1:
  - initial version
v2:
  - add testpmd patches
v3:
  - change common forwarding api to macro for performance, thanks Jerin.
  - save global variable accessed in forwarding to flowstream to minimize
    cache miss
  - combined patches for each forwarding engine
  - support multiple groups in testpmd "--share-rxq" parameter
  - new api to aggregate shared rxq group
v4:
  - spelling fixes
  - remove shared-rxq support for all forwarding engines
  - add dedicate shared-rxq forwarding engine
v5:
 - fix grammars
 - remove aggregate api and leave it for later discussion
 - add release notes
 - add deployment example
v6:
 - replace RxQ offload flag with device offload capability flag
 - add Rx domain
 - RxQ is shared when share group > 0
 - update testpmd accordingly
v7:
 - fix testpmd share group id allocation
 - change rx_domain to 16bits
v8:
 - add new patch for testpmd to show device Rx domain ID and capability
 - new share_qid in RxQ configuration
v9:
 - fix some spelling
v10:
 - add device capability name api
v11:
 - remove macro from device capability name list
v12:
 - rephrase
  - in forwarding core check, add global flag and RxQ enabled check

Xueming Li (7):
  ethdev: introduce shared Rx queue
  ethdev: get device capability name as string
  app/testpmd: dump device capability and Rx domain info
  app/testpmd: new parameter to enable shared Rx queue
  app/testpmd: dump port info for shared Rx queue
  app/testpmd: force shared Rx queue polled on same core
  app/testpmd: add forwarding engine for shared Rx queue

 app/test-pmd/config.c                         | 141 +++++++++++++++++-
 app/test-pmd/meson.build                      |   1 +
 app/test-pmd/parameters.c                     |  13 ++
 app/test-pmd/shared_rxq_fwd.c                 | 113 ++++++++++++++
 app/test-pmd/testpmd.c                        |  26 +++-
 app/test-pmd/testpmd.h                        |   9 ++
 app/test-pmd/util.c                           |   3 +
 doc/guides/nics/features.rst                  |  13 ++
 doc/guides/nics/features/default.ini          |   1 +
 .../prog_guide/switch_representation.rst      |  11 ++
 doc/guides/rel_notes/release_21_11.rst        |   6 +
 doc/guides/testpmd_app_ug/run_app.rst         |   9 ++
 doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
 lib/ethdev/rte_ethdev.c                       |  33 ++++
 lib/ethdev/rte_ethdev.h                       |  38 +++++
 lib/ethdev/version.map                        |   1 +
 16 files changed, 417 insertions(+), 6 deletions(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

-- 
2.33.0
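
As a usage illustration for the testpmd patches in this series (the PCI
address and representor devargs below are hypothetical and the PMD must
advertise RXQ_SHARE), all ports can be put into share group 1 and
polled by the dedicated engine with:

dpdk-testpmd -l 0-4 -n 4 -a 0000:03:00.0,representor=[0-1] -- \
	--nb-cores=4 --rxq=4 --txq=4 --rxq-share --forward-mode=shared-rxq -i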


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v12 1/7] ethdev: introduce shared Rx queue
  2021-10-21  5:08 ` [dpdk-dev] [PATCH v12 0/7] ethdev: introduce " Xueming Li
@ 2021-10-21  5:08   ` Xueming Li
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 2/7] ethdev: get device capability name as string Xueming Li
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-21  5:08 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

In the current DPDK framework, each Rx queue is pre-loaded with mbufs
to save incoming packets. For some PMDs, when the number of
representors scales out in a switch domain, the memory consumption
becomes significant. Polling all ports also leads to high cache miss
rates, high latency and low throughput.

This patch introduces shared Rx queues. Ports in the same Rx domain and
switch domain can share an Rx queue set by specifying a non-zero share
group in the Rx queue configuration.

A shared Rx queue is identified by the share_qid field of the Rx queue
configuration. Port A RxQ X can share an RxQ with Port B RxQ Y by using
the same shared Rx queue ID.

No special API is defined to receive packets from a shared Rx queue.
Polling any member port of a shared Rx queue receives packets of that
queue for all member ports; the source port is identified by
mbuf->port. The PMD is responsible for resolving the shared Rx queue
from device and queue data.
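
For illustration only (not part of the patch), a receive loop over a
shared Rx queue could demultiplex by source port as sketched below;
here packets are just counted per source port and freed:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

static uint64_t rx_pkts_per_port[RTE_MAX_ETHPORTS];

/*
 * Polling any member port of the group returns packets of all member
 * ports; mbuf->port carries the real source port.
 */
static void
poll_shared_rxq(uint16_t member_port_id, uint16_t queue_id)
{
	struct rte_mbuf *pkts[32];
	uint16_t nb, i;

	nb = rte_eth_rx_burst(member_port_id, queue_id, pkts, 32);
	for (i = 0; i < nb; i++) {
		rx_pkts_per_port[pkts[i]->port]++;
		rte_pktmbuf_free(pkts[i]);
	}
}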

A shared Rx queue must be polled in the same thread or core; polling
the queue ID of any member port is essentially the same.

Multiple share groups are supported. The PMD should support mixed
configuration by allowing multiple share groups and non-shared Rx
queues on one port.

Example grouping and polling model to reflect service priority:
 Group1, 2 shared Rx queues per port: PF, rep0, rep1
 Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
 Core0: poll PF queue0
 Core1: poll PF queue1
 Core2: poll rep2 queue0

The PMD advertises the shared Rx queue capability via RTE_ETH_DEV_CAPA_RXQ_SHARE.

The PMD is responsible for shared Rx queue consistency checks to avoid
member ports' configurations contradicting each other.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 doc/guides/nics/features.rst                  | 13 ++++++++++
 doc/guides/nics/features/default.ini          |  1 +
 .../prog_guide/switch_representation.rst      | 11 +++++++++
 doc/guides/rel_notes/release_21_11.rst        |  6 +++++
 lib/ethdev/rte_ethdev.c                       |  8 +++++++
 lib/ethdev/rte_ethdev.h                       | 24 +++++++++++++++++++
 6 files changed, 63 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 8dd421ca013..d35751d5b5a 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -614,6 +614,19 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same Rx domain of a switch domain.
+
+* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
+* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
+* **[uses]     rte_eth_rxconf**: ``share_group``, ``share_qid``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 09914b1ad32..39d21fcd379 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c806..4f2532a91ea 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,17 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- For some PMDs, memory usage of representors is huge when the number
+  of representors grows, as mbufs are allocated for each descriptor of
+  an Rx queue. Polling a large number of ports brings more CPU load,
+  cache misses and latency. Shared Rx queue can be used to share an Rx
+  queue between PF and representors in the same Rx domain.
+  ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` in device info is used to indicate the
+  capability. Set a non-zero share group in the Rx queue configuration
+  to enable sharing; share_qid is used to identify the shared Rx queue
+  in the group. Polling any member port can receive packets of all
+  member ports in the group; the port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 74776ca0691..f4fb68e7408 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -75,6 +75,12 @@ New Features
     operations.
   * Added multi-process support.
 
+* **Added ethdev shared Rx queue support.**
+
+  * Added new device capability flag and Rx domain field to switch info.
+  * Added share group and share queue ID to Rx queue configuration.
+  * Added testpmd support and dedicate forwarding engine.
+
 * **Added support to get all MAC addresses of a device.**
 
   Added ``rte_eth_macaddrs_get`` to allow user to retrieve all Ethernet
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 1f18aa916cc..31a9cba065b 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -2175,6 +2175,14 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
 		return -EINVAL;
 	}
 
+	if (local_conf.share_group > 0 &&
+	    (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
+		RTE_ETHDEV_LOG(ERR,
+			"Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%hu while device doesn't support Rx queue share\n",
+			port_id, rx_queue_id, local_conf.share_group);
+		return -EINVAL;
+	}
+
 	/*
 	 * If LRO is enabled, check that the maximum aggregated packet
 	 * size is supported by the configured device.
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 014270d3167..40f88cc3d64 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1045,6 +1045,14 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	/**
+	 * Share group index in Rx domain and switch domain.
+	 * Non-zero value to enable Rx queue share, zero value to disable share.
+	 * PMD is responsible for Rx queue consistency checks to avoid member
+	 * ports' configurations contradicting each other.
+	 */
+	uint16_t share_group;
+	uint16_t share_qid; /**< Shared Rx queue ID in group. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1445,6 +1453,16 @@ struct rte_eth_conf {
 #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
 /** Device supports Tx queue setup after device started. */
 #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
+/**
+ * Device supports shared Rx queue among ports within Rx domain and
+ * switch domain. Mbufs are consumed by shared Rx queue instead of
+ * each queue. Multiple groups are supported by share_group of Rx
+ * queue configuration. Shared Rx queue is identified by PMD using
+ * share_qid of Rx queue configuration. Polling any port in the group
+ * receives packets of all member ports; the source port is identified
+ * by the mbuf->port field.
+ */
+#define RTE_ETH_DEV_CAPA_RXQ_SHARE              RTE_BIT64(2)
 /**@}*/
 
 /*
@@ -1488,6 +1506,12 @@ struct rte_eth_switch_info {
 	 * but each driver should explicitly define the mapping of switch
 	 * port identifier to that physical interconnect/switch
 	 */
+	/**
+	 * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
+	 * and switch domain can share Rx queue. Valid only if device advertised
+	 * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
+	 */
+	uint16_t rx_domain;
 };
 
 /**
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v12 2/7] ethdev: get device capability name as string
  2021-10-21  5:08 ` [dpdk-dev] [PATCH v12 0/7] ethdev: introduce " Xueming Li
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 1/7] " Xueming Li
@ 2021-10-21  5:08   ` Xueming Li
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 3/7] app/testpmd: dump device capability and Rx domain info Xueming Li
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-21  5:08 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Ray Kinsella

This patch adds an API to return the name of a device capability.

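For illustration, a minimal sketch of listing every advertised capability
with this API (dev_info is assumed to be filled by rte_eth_dev_info_get()):

	uint64_t capa = dev_info.dev_capa;
	int bit;

	for (bit = 0; bit < 64; bit++) {
		uint64_t flag = RTE_BIT64(bit);

		if (capa & flag)
			printf("%s\n", rte_eth_dev_capability_name(flag));
	}
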
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
---
 lib/ethdev/rte_ethdev.c | 25 +++++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h | 14 ++++++++++++++
 lib/ethdev/version.map  |  1 +
 3 files changed, 40 insertions(+)

diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 31a9cba065b..bfe5b0adbef 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -167,6 +167,15 @@ static const struct {
 
 #undef RTE_TX_OFFLOAD_BIT2STR
 
+static const struct {
+	uint64_t offload;
+	const char *name;
+} rte_eth_dev_capa_names[] = {
+	{RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP, "RUNTIME_RX_QUEUE_SETUP"},
+	{RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP, "RUNTIME_TX_QUEUE_SETUP"},
+	{RTE_ETH_DEV_CAPA_RXQ_SHARE, "RXQ_SHARE"},
+};
+
 /**
  * The user application callback description.
  *
@@ -1236,6 +1245,22 @@ rte_eth_dev_tx_offload_name(uint64_t offload)
 	return name;
 }
 
+const char *
+rte_eth_dev_capability_name(uint64_t capability)
+{
+	const char *name = "UNKNOWN";
+	unsigned int i;
+
+	for (i = 0; i < RTE_DIM(rte_eth_dev_capa_names); ++i) {
+		if (capability == rte_eth_dev_capa_names[i].offload) {
+			name = rte_eth_dev_capa_names[i].name;
+			break;
+		}
+	}
+
+	return name;
+}
+
 static inline int
 eth_dev_check_lro_pkt_size(uint16_t port_id, uint32_t config_size,
 		   uint32_t max_rx_pkt_len, uint32_t dev_info_size)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 40f88cc3d64..9baca39e97a 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -2109,6 +2109,20 @@ const char *rte_eth_dev_rx_offload_name(uint64_t offload);
  */
 const char *rte_eth_dev_tx_offload_name(uint64_t offload);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get RTE_ETH_DEV_CAPA_* flag name.
+ *
+ * @param capability
+ *   Capability flag.
+ * @return
+ *   Capability name or 'UNKNOWN' if the flag cannot be recognized.
+ */
+__rte_experimental
+const char *rte_eth_dev_capability_name(uint64_t capability);
+
 /**
  * Configure an Ethernet device.
  * This function must be invoked first before any other function in the
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index d552c955c94..e1abe997290 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -249,6 +249,7 @@ EXPERIMENTAL {
 	rte_mtr_meter_policy_validate;
 
 	# added in 21.11
+	rte_eth_dev_capability_name;
 	rte_eth_dev_conf_get;
 	rte_eth_macaddrs_get;
 	rte_eth_rx_metadata_negotiate;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v12 3/7] app/testpmd: dump device capability and Rx domain info
  2021-10-21  5:08 ` [dpdk-dev] [PATCH v12 0/7] ethdev: introduce " Xueming Li
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 1/7] " Xueming Li
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 2/7] ethdev: get device capability name as string Xueming Li
@ 2021-10-21  5:08   ` Xueming Li
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-21  5:08 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

Dump device capabilities and the Rx domain ID if shared Rx queue is
supported by the device.

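For a device advertising only RTE_ETH_DEV_CAPA_RXQ_SHARE (bit 2, value
0x4), a hypothetical excerpt of the resulting "show port info" output:

	Device capabilities: 0x4( RXQ_SHARE )
	Switch Rx domain: 0
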
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 app/test-pmd/config.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 23aa334cda0..db36ca41b94 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -644,6 +644,29 @@ device_infos_display(const char *identifier)
 	rte_devargs_reset(&da);
 }
 
+static void
+print_dev_capabilities(uint64_t capabilities)
+{
+	uint64_t single_capa;
+	int begin;
+	int end;
+	int bit;
+
+	if (capabilities == 0)
+		return;
+
+	begin = __builtin_ctzll(capabilities);
+	end = sizeof(capabilities) * CHAR_BIT - __builtin_clzll(capabilities);
+
+	single_capa = 1ULL << begin;
+	for (bit = begin; bit < end; bit++) {
+		if (capabilities & single_capa)
+			printf(" %s",
+			       rte_eth_dev_capability_name(single_capa));
+		single_capa <<= 1;
+	}
+}
+
 void
 port_infos_display(portid_t port_id)
 {
@@ -795,6 +818,9 @@ port_infos_display(portid_t port_id)
 	printf("Max segment number per MTU/TSO: %hu\n",
 		dev_info.tx_desc_lim.nb_mtu_seg_max);
 
+	printf("Device capabilities: 0x%"PRIx64"(", dev_info.dev_capa);
+	print_dev_capabilities(dev_info.dev_capa);
+	printf(" )\n");
 	/* Show switch info only if valid switch domain and port id is set */
 	if (dev_info.switch_info.domain_id !=
 		RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID) {
@@ -805,6 +831,9 @@ port_infos_display(portid_t port_id)
 			dev_info.switch_info.domain_id);
 		printf("Switch Port Id: %u\n",
 			dev_info.switch_info.port_id);
+		if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) != 0)
+			printf("Switch Rx domain: %u\n",
+			       dev_info.switch_info.rx_domain);
 	}
 }
 
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v12 4/7] app/testpmd: new parameter to enable shared Rx queue
  2021-10-21  5:08 ` [dpdk-dev] [PATCH v12 0/7] ethdev: introduce " Xueming Li
                     ` (2 preceding siblings ...)
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 3/7] app/testpmd: dump device capability and Rx domain info Xueming Li
@ 2021-10-21  5:08   ` Xueming Li
  2021-10-21  9:20     ` Thomas Monjalon
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 5/7] app/testpmd: dump port info for " Xueming Li
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-10-21  5:08 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

Adds the "--rxq-share=X" parameter to enable shared RxQ.

The Rx queue is shared if the device supports it; otherwise it falls
back to a standard RxQ.

Shared Rx queues are grouped per X ports. X defaults to UINT32_MAX,
which implies all ports join share group 1. The queue ID maps 1:1 to
the shared Rx queue ID.

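As a worked example of this mapping (share_group = pid / rxq_share + 1,
per the testpmd change below), 8 probed ports with --rxq-share=4 give:

	ports 0-3 -> share_group 1
	ports 4-7 -> share_group 2
	queue N of every member port -> share_qid N
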
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c                 |  7 ++++++-
 app/test-pmd/parameters.c             | 13 +++++++++++++
 app/test-pmd/testpmd.c                | 20 +++++++++++++++++---
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  6 ++++++
 5 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index db36ca41b94..e4bbf457916 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2890,7 +2890,12 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->share_group > 0)
+				printf(" share_group=%u share_qid=%u",
+				       rx_conf->share_group,
+				       rx_conf->share_qid);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 779a721fa05..afc75f6bd21 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -171,6 +171,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share=X: number of ports per shared Rx queue group, defaults to UINT32_MAX (1 group)\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -678,6 +679,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			2, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1352,6 +1354,17 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
+				if (optarg == NULL) {
+					rxq_share = UINT32_MAX;
+				} else {
+					n = atoi(optarg);
+					if (n >= 0)
+						rxq_share = (uint32_t)n;
+					else
+						rte_exit(EXIT_FAILURE, "rxq-share must be >= 0\n");
+				}
+			}
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index af0e79fe6d5..80337bad382 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -502,6 +502,11 @@ uint8_t record_core_cycles;
  */
 uint8_t record_burst_stats;
 
+/*
+ * Number of ports per shared Rx queue group; 0 disables sharing.
+ */
+uint32_t rxq_share;
+
 unsigned int num_sockets = 0;
 unsigned int socket_ids[RTE_MAX_NUMA_NODES];
 
@@ -3629,14 +3634,23 @@ dev_event_callback(const char *device_name, enum rte_dev_event_type type,
 }
 
 static void
-rxtx_port_config(struct rte_port *port)
+rxtx_port_config(portid_t pid)
 {
 	uint16_t qid;
 	uint64_t offloads;
+	struct rte_port *port = &ports[pid];
 
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)) {
+			/* Non-zero share group to enable RxQ share. */
+			port->rx_conf[qid].share_group = pid / rxq_share + 1;
+			port->rx_conf[qid].share_qid = qid; /* Equal mapping. */
+		}
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
@@ -3765,7 +3779,7 @@ init_port_config(void)
 			}
 		}
 
-		rxtx_port_config(port);
+		rxtx_port_config(pid);
 
 		ret = eth_macaddr_get_print_err(pid, &port->eth_addr);
 		if (ret != 0)
@@ -3977,7 +3991,7 @@ init_port_dcb_config(portid_t pid,
 
 	memcpy(&rte_port->dev_conf, &port_conf, sizeof(struct rte_eth_conf));
 
-	rxtx_port_config(rte_port);
+	rxtx_port_config(pid);
 	/* VLAN filter */
 	rte_port->dev_conf.rxmode.offloads |= DEV_RX_OFFLOAD_VLAN_FILTER;
 	for (i = 0; i < RTE_DIM(vlan_tags); i++)
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index e3995d24ab5..63f9913deb6 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -524,6 +524,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint32_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern int nb_flows_flowgen;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 8ff7ab85369..faa3efb902c 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -395,6 +395,12 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share=[X]``
+
+    Create queues in shared Rx queue mode if the device supports it.
+    Shared Rx queues are grouped per X ports. X defaults to UINT32_MAX,
+    which implies all ports join share group 1.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v12 5/7] app/testpmd: dump port info for shared Rx queue
  2021-10-21  5:08 ` [dpdk-dev] [PATCH v12 0/7] ethdev: introduce " Xueming Li
                     ` (3 preceding siblings ...)
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-10-21  5:08   ` Xueming Li
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 6/7] app/testpmd: force shared Rx queue polled on same core Xueming Li
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  6 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-21  5:08 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

With a shared Rx queue, polling any member port returns mbufs for all
members. This patch dumps mbuf->port for each packet.

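A minimal polling sketch of this semantic (handle_packet() is a
hypothetical application callback; any member port of the group may be
polled):

	nb_rx = rte_eth_rx_burst(member_port, queue_id, pkts, MAX_BURST);
	for (i = 0; i < nb_rx; i++)
		/* mbuf->port carries the real source, not member_port. */
		handle_packet(pkts[i]->port, pkts[i]);
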
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 app/test-pmd/util.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 26dc0c86406..f712f687287 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -101,6 +101,9 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_port *port = &ports[port_id];
 
 		mb = pkts[i];
+		if (rxq_share > 0)
+			MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ",
+				  mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v12 6/7] app/testpmd: force shared Rx queue polled on same core
  2021-10-21  5:08 ` [dpdk-dev] [PATCH v12 0/7] ethdev: introduce " Xueming Li
                     ` (4 preceding siblings ...)
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 5/7] app/testpmd: dump port info for " Xueming Li
@ 2021-10-21  5:08   ` Xueming Li
  2021-10-21  6:35     ` Li, Xiaoyun
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  6 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-10-21  5:08 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

A shared Rx queue must be polled on the same core. This patch checks for
this and stops forwarding if a shared RxQ is scheduled on multiple cores.

It's suggested to use the same number of Rx queues and polling cores.

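For example, a sketch of an invocation following this suggestion (PCI
addresses are placeholders; 4 Rx queues polled by 4 forwarding cores):

	dpdk-testpmd -l 0-4 -a <PCI_A> -a <PCI_B> -- -i --rxq=4 --txq=4 \
		--nb-cores=4 --rxq-share=2
	testpmd> set fwd shared-rxq
	testpmd> start
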
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/config.c  | 105 +++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |   5 +-
 app/test-pmd/testpmd.h |   2 +
 3 files changed, 111 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index e4bbf457916..cad78350dcc 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -3067,6 +3067,111 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared rxq is scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, lcoreid_t src_lc,
+			   portid_t src_port, queueid_t src_rxq,
+			   uint32_t share_group, queueid_t share_rxq)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/* Check remaining cores. */
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0 || rxq_conf->share_group == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (rxq_conf->share_group != share_group)
+				continue;
+			if (rxq_conf->share_qid != share_rxq)
+				continue;
+			printf("Shared Rx queue group %u queue %hu can't be scheduled on different cores:\n",
+			       share_group, share_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("Please use --nb-cores=%hu to limit number of forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * A shared group must not be scheduled on different cores.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	if (rxq_share == 0)
+		return true;
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Record the lcore this stream is scheduled on. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0 || rxq_conf->share_group == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check the shared rxq isn't scheduled on the remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, lc_id,
+						       fs->rx_port,
+						       fs->rx_queue,
+						       rxq_conf->share_group,
+						       rxq_conf->share_qid))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 80337bad382..d76d298a4b9 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2309,6 +2309,10 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
+
 	port_fwd_begin = cur_fwd_config.fwd_eng->port_fwd_begin;
 	if (port_fwd_begin != NULL) {
 		for (i = 0; i < cur_fwd_config.nb_fwd_ports; i++) {
@@ -2338,7 +2342,6 @@ start_packet_forwarding(int with_tx_first)
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 63f9913deb6..9482dab3071 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -147,6 +147,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -842,6 +843,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v12 7/7] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-21  5:08 ` [dpdk-dev] [PATCH v12 0/7] ethdev: introduce " Xueming Li
                     ` (5 preceding siblings ...)
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 6/7] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-10-21  5:08   ` Xueming Li
  2021-10-21  6:33     ` Li, Xiaoyun
  2021-10-21  9:28     ` Thomas Monjalon
  6 siblings, 2 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-21  5:08 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

To support shared Rx queue, this patch introduces a dedicated forwarding
engine. The engine groups received packets by mbuf->port into sub-bursts,
updates stream statistics and simply frees the packets.

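For instance, a burst received as [p0, p0, p1, p1, p0] is split into three
sub-bursts: (p0 x 2), (p1 x 2) and (p0 x 1). Each sub-burst is credited to
the stream owning that source port before the packets are freed.
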
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/meson.build                    |   1 +
 app/test-pmd/shared_rxq_fwd.c               | 113 ++++++++++++++++++++
 app/test-pmd/testpmd.c                      |   1 +
 app/test-pmd/testpmd.h                      |   5 +
 doc/guides/testpmd_app_ug/run_app.rst       |   5 +-
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
 6 files changed, 128 insertions(+), 2 deletions(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 1ad54caef2c..b5a0f7b6209 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -22,6 +22,7 @@ sources = files(
         'noisy_vnf.c',
         'parameters.c',
         'rxonly.c',
+        'shared_rxq_fwd.c',
         'testpmd.c',
         'txonly.c',
         'util.c',
diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-pmd/shared_rxq_fwd.c
new file mode 100644
index 00000000000..c4684893674
--- /dev/null
+++ b/app/test-pmd/shared_rxq_fwd.c
@@ -0,0 +1,113 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include "testpmd.h"
+
+/*
+ * Rx only sub-burst forwarding.
+ */
+static void
+forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst)
+{
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
+}
+
+/**
+ * Get packet source stream by source port and queue.
+ * All streams of the same shared Rx queue are located on the same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_sub_burst(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		  struct rte_mbuf **pkts)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		forward_rx_only(nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * The source port of each packet is identified by mbuf->port.
+ */
+static void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst)
+{
+	uint16_t i, nb_sub_burst, port, last_port;
+
+	nb_sub_burst = 0;
+	last_port = pkts_burst[0]->port;
+	/* Locate sub-burst boundaries according to mbuf->port. */
+	for (i = 0; i < nb_rx; ++i) {
+		if (i + 1 < nb_rx)
+			rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (last_port != port) {
+			/* Forward accumulated packets with same source port. */
+			forward_sub_burst(fs, last_port, nb_sub_burst,
+					  &pkts_burst[i - nb_sub_burst]);
+			nb_sub_burst = 0;
+			last_port = port;
+		}
+		nb_sub_burst++;
+	}
+	/* Forward the last sub-burst. */
+	forward_sub_burst(fs, last_port, nb_sub_burst,
+			  &pkts_burst[nb_rx - nb_sub_burst]);
+}
+
+static void
+shared_rxq_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];
+	uint16_t nb_rx;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
+				 nb_pkt_per_burst);
+	inc_rx_burst_stats(fs, nb_rx);
+	if (unlikely(nb_rx == 0))
+		return;
+	forward_shared_rxq(fs, nb_rx, pkts_burst);
+	get_end_cycles(fs, start_tsc);
+}
+
+struct fwd_engine shared_rxq_engine = {
+	.fwd_mode_name  = "shared_rxq",
+	.port_fwd_begin = NULL,
+	.port_fwd_end   = NULL,
+	.packet_fwd     = shared_rxq_fwd,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index d76d298a4b9..6d5bbc82404 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -188,6 +188,7 @@ struct fwd_engine * fwd_engines[] = {
 #ifdef RTE_LIBRTE_IEEE1588
 	&ieee1588_fwd_engine,
 #endif
+	&shared_rxq_engine,
 	NULL,
 };
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 9482dab3071..ef7a6199313 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -12,6 +12,10 @@
 #include <rte_gro.h>
 #include <rte_gso.h>
 #include <rte_os_shim.h>
+#include <rte_mbuf_dyn.h>
+#include <rte_flow.h>
+#include <rte_ethdev.h>
+
 #include <cmdline.h>
 #include <sys/queue.h>
 #ifdef RTE_HAS_JANSSON
@@ -339,6 +343,7 @@ extern struct fwd_engine five_tuple_swap_fwd_engine;
 #ifdef RTE_LIBRTE_IEEE1588
 extern struct fwd_engine ieee1588_fwd_engine;
 #endif
+extern struct fwd_engine shared_rxq_engine;
 
 extern struct fwd_engine * fwd_engines[]; /**< NULL terminated array. */
 extern cmdline_parse_inst_t cmd_set_raw;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index faa3efb902c..74412bb82ca 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -258,6 +258,7 @@ The command line options are:
        tm
        noisy
        5tswap
+       shared-rxq
 
 *   ``--rss-ip``
 
@@ -399,7 +400,9 @@ The command line options are:
 
     Create queues in shared Rx queue mode if the device supports it.
     Shared Rx queues are grouped per X ports. X defaults to UINT32_MAX,
-    which implies all ports join share group 1.
+    which implies all ports join share group 1. A new forwarding engine
+    "shared-rxq" should be used for shared Rx queues. This engine does
+    Rx only and updates stream statistics accordingly.
 
 *   ``--eth-link-speed``
 
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index 6d127d9a7bc..78d23429c42 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -314,7 +314,7 @@ set fwd
 Set the packet forwarding mode::
 
    testpmd> set fwd (io|mac|macswap|flowgen| \
-                     rxonly|txonly|csum|icmpecho|noisy|5tswap) (""|retry)
+                     rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
 
 ``retry`` can be specified for forwarding engines except ``rx_only``.
 
@@ -357,6 +357,9 @@ The available information categories are:
 
   L4 swaps the source port and destination port of transport layer (TCP and UDP).
 
+* ``shared-rxq``: Receive only for shared Rx queue.
+  Resolves the packet source port from the mbuf and updates stream statistics accordingly.
+
 Example::
 
    testpmd> set fwd rxonly
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx queue
  2021-10-21  3:58       ` Xueming(Steven) Li
@ 2021-10-21  5:15         ` Li, Xiaoyun
  0 siblings, 0 replies; 266+ messages in thread
From: Li, Xiaoyun @ 2021-10-21  5:15 UTC (permalink / raw)
  To: Xueming(Steven) Li, Zhang, Yuying, dev
  Cc: Ananyev, Konstantin, jerinjacobk, NBU-Contact-Thomas Monjalon,
	Slava Ovsiienko, ajit.khaparde, Yigit, Ferruh, andrew.rybchenko,
	Lior Margalit

> -----Original Message-----
> From: Xueming(Steven) Li <xuemingl@nvidia.com>
> Sent: Thursday, October 21, 2021 11:59
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Zhang, Yuying
> <yuying.zhang@intel.com>; dev@dpdk.org
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> jerinjacobk@gmail.com; NBU-Contact-Thomas Monjalon
> <thomas@monjalon.net>; Slava Ovsiienko <viacheslavo@nvidia.com>;
> ajit.khaparde@broadcom.com; Yigit, Ferruh <ferruh.yigit@intel.com>;
> andrew.rybchenko@oktetlabs.ru; Lior Margalit <lmargalit@nvidia.com>
> Subject: Re: [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx
> queue
> 
> On Thu, 2021-10-21 at 03:24 +0000, Li, Xiaoyun wrote:
> > Hi
> >
> > > -----Original Message-----
> > > From: Xueming Li <xuemingl@nvidia.com>
> > > Sent: Wednesday, October 20, 2021 15:53
> > > To: dev@dpdk.org; Zhang, Yuying <yuying.zhang@intel.com>
> > > Cc: xuemingl@nvidia.com; Jerin Jacob <jerinjacobk@gmail.com>; Yigit,
> > > Ferruh <ferruh.yigit@intel.com>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>; Viacheslav Ovsiienko
> > > <viacheslavo@nvidia.com>; Thomas Monjalon <thomas@monjalon.net>;
> > > Lior Margalit <lmargalit@nvidia.com>; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>; Ajit Khaparde
> > > <ajit.khaparde@broadcom.com>; Li, Xiaoyun <xiaoyun.li@intel.com>
> > > Subject: [PATCH v11 4/7] app/testpmd: new parameter to enable shared
> > > Rx queue
> > >
> > > Adds "--rxq-share=X" parameter to enable shared RxQ, share if device
> > > supports, otherwise fallback to standard RxQ.
> > >
> > > Share group number grows per X ports. X defaults to MAX, implies all
> > > ports join
> >
> > X defaults to number of probed ports.
> 
> I will change to UINT32_MAX, thanks.
> 
> >
> > > share group 1. Queue ID is mapped equally with shared Rx queue ID.
> > >
> > > Forwarding engine "shared-rxq" should be used which Rx only and
> > > update stream statistics correctly.
> > >
> > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > ---
> > >  app/test-pmd/config.c                 |  7 ++++++-
> > >  app/test-pmd/parameters.c             | 13 +++++++++++++
> > >  app/test-pmd/testpmd.c                | 20 +++++++++++++++++---
> > >  app/test-pmd/testpmd.h                |  2 ++
> > >  doc/guides/testpmd_app_ug/run_app.rst |  7 +++++++
> > >  5 files changed, 45 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c index
> > > 2c1b06c544d..fa951a86704 100644
> > > --- a/app/test-pmd/config.c
> > > +++ b/app/test-pmd/config.c
> > <snip>
> > > @@ -1271,6 +1273,17 @@ launch_args_parse(int argc, char** argv)
> > >  			}
> > >  			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
> > >  				txonly_multi_flow = 1;
> > > +			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
> > > +				if (optarg == NULL) {
> > > +					rxq_share = UINT32_MAX;
> >
> > Why not use "nb_ports" here? nb_ports is the number of probed ports.
> 
> Considering hotplug, nb_ports could grow later, I think UINT32_MAX is safe.

Yes. It will be safer if there's hotplug.
But I thought you wouldn't consider this case, since if you account for hotplug, your calculation of share_group using port_id is not correct:
		port->rx_conf[qid].share_group = pid / rxq_share + 1;
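
For example, with --rxq-share=2 and four ports probed at startup, ports 0/1
land in group 1 and ports 2/3 in group 2; a port hot-plugged later with
port_id 4 would silently start group 3, so the pid-based mapping only holds
for the initially probed set.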

> 
> >
> > > +				} else {
> > > +					n = atoi(optarg);
> > > +					if (n >= 0)
> > > +						rxq_share = (uint32_t)n;
> > > +					else
> > > +						rte_exit(EXIT_FAILURE, "rxq-
> > > share must be >= 0\n");
> > > +				}
> > > +			}
> > >  			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
> > >  				no_flush_rx = 1;
> > >  			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed"))
> > <snip>
> > >
> > > +*   ``--rxq-share=[X]``
> > > +
> > > +    Create queues in shared Rx queue mode if device supports.
> > > +    Group number grows per X ports. X defaults to MAX, implies all
> > > + ports
> >
> > X defaults to number of probed ports.
> > I suppose this is what you mean? Also, I agree with other comments
> > with the wording part
> >
> > > +    join share group 1. Forwarding engine "shared-rxq" should be used
> > > +    which Rx only and update stream statistics correctly.
> > > +
> > >  *   ``--eth-link-speed``
> > >
> > >      Set a forced link speed to the ethernet port::
> > > --
> > > 2.33.0
> >


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v12 7/7] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
@ 2021-10-21  6:33     ` Li, Xiaoyun
  2021-10-21  7:58       ` Xueming(Steven) Li
  2021-10-21  9:28     ` Thomas Monjalon
  1 sibling, 1 reply; 266+ messages in thread
From: Li, Xiaoyun @ 2021-10-21  6:33 UTC (permalink / raw)
  To: Xueming Li, dev, Zhang, Yuying
  Cc: Jerin Jacob, Yigit, Ferruh, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Ananyev,
	Konstantin, Ajit Khaparde

> -----Original Message-----
> From: Xueming Li <xuemingl@nvidia.com>
> Sent: Thursday, October 21, 2021 13:09
> To: dev@dpdk.org; Zhang, Yuying <yuying.zhang@intel.com>; Li, Xiaoyun
> <xiaoyun.li@intel.com>
> Cc: xuemingl@nvidia.com; Jerin Jacob <jerinjacobk@gmail.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>; Viacheslav Ovsiienko
> <viacheslavo@nvidia.com>; Thomas Monjalon <thomas@monjalon.net>; Lior
> Margalit <lmargalit@nvidia.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Ajit Khaparde
> <ajit.khaparde@broadcom.com>
> Subject: [PATCH v12 7/7] app/testpmd: add forwarding engine for shared Rx
> queue
> 
> To support shared Rx queue, this patch introduces dedicate forwarding engine.
> The engine groups received packets by mbuf->port into sub-group, updates
> stream statistics and simply frees packets.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>

I didn't ack you on this patch. I remember I added "+1" to the comment about your includes issue.
Carrying an ack that was never given will confuse reviewers into skipping review of new versions.

> Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>

I didn't see him ack this patch either.
Please remove these acks.

> ---
>  app/test-pmd/meson.build                    |   1 +
>  app/test-pmd/shared_rxq_fwd.c               | 113 ++++++++++++++++++++
>  app/test-pmd/testpmd.c                      |   1 +
>  app/test-pmd/testpmd.h                      |   5 +
>  doc/guides/testpmd_app_ug/run_app.rst       |   5 +-
>  doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
>  6 files changed, 128 insertions(+), 2 deletions(-)  create mode 100644 app/test-
> pmd/shared_rxq_fwd.c
> 
> diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build index
> 1ad54caef2c..b5a0f7b6209 100644
> --- a/app/test-pmd/meson.build
> +++ b/app/test-pmd/meson.build
> @@ -22,6 +22,7 @@ sources = files(
>          'noisy_vnf.c',
>          'parameters.c',
>          'rxonly.c',
> +        'shared_rxq_fwd.c',
>          'testpmd.c',
>          'txonly.c',
>          'util.c',
> diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-pmd/shared_rxq_fwd.c
> new file mode 100644 index 00000000000..c4684893674
> --- /dev/null
> +++ b/app/test-pmd/shared_rxq_fwd.c
> @@ -0,0 +1,113 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2021 NVIDIA Corporation & Affiliates  */
> +

Please add "#include <rte_ethdev.h>" here.
Your shared_rxq_fwd.c only needs this include.

> +#include "testpmd.h"
> +
> +/*
> + * Rx only sub-burst forwarding.
> + */
> +static void
> +forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst) {
> +	rte_pktmbuf_free_bulk(pkts_burst, nb_rx); }
> +
> +/**
> + * Get packet source stream by source port and queue.
> + * All streams of same shared Rx queue locates on same core.
> + */
> +static struct fwd_stream *
> +forward_stream_get(struct fwd_stream *fs, uint16_t port) {
<snip>
> diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> 9482dab3071..ef7a6199313 100644
> --- a/app/test-pmd/testpmd.h
> +++ b/app/test-pmd/testpmd.h
> @@ -12,6 +12,10 @@
>  #include <rte_gro.h>
>  #include <rte_gso.h>
>  #include <rte_os_shim.h>
> +#include <rte_mbuf_dyn.h>
> +#include <rte_flow.h>
> +#include <rte_ethdev.h>
> +

Please remove these includes and this blank line.
You only need to add the header your own file needs, like I said above.

>  #include <cmdline.h>
>  #include <sys/queue.h>
>  #ifdef RTE_HAS_JANSSON
> @@ -339,6 +343,7 @@ extern struct fwd_engine five_tuple_swap_fwd_engine;
> #ifdef RTE_LIBRTE_IEEE1588  extern struct fwd_engine ieee1588_fwd_engine;
> #endif
> +extern struct fwd_engine shared_rxq_engine;
> 
>  extern struct fwd_engine * fwd_engines[]; /**< NULL terminated array. */
> extern cmdline_parse_inst_t cmd_set_raw; diff --git
> a/doc/guides/testpmd_app_ug/run_app.rst
> b/doc/guides/testpmd_app_ug/run_app.rst
> index faa3efb902c..74412bb82ca 100644
> --- a/doc/guides/testpmd_app_ug/run_app.rst
> +++ b/doc/guides/testpmd_app_ug/run_app.rst
> @@ -258,6 +258,7 @@ The command line options are:
>         tm
>         noisy
>         5tswap
> +       shared-rxq
> 
>  *   ``--rss-ip``
> 
> @@ -399,7 +400,9 @@ The command line options are:
> 
>      Create queues in shared Rx queue mode if device supports.
>      Shared Rx queues are grouped per X ports. X defaults to UINT32_MAX,
> -    implies all ports join share group 1.
> +    implies all ports join share group 1. A new forwarding engine
> +    "shared-rxq" should be used for shared Rx queues. This engine does
> +    Rx only and update stream statistics accordingly.
> 
>  *   ``--eth-link-speed``
> 
> diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> index 6d127d9a7bc..78d23429c42 100644
> --- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> +++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> @@ -314,7 +314,7 @@ set fwd
>  Set the packet forwarding mode::
> 
>     testpmd> set fwd (io|mac|macswap|flowgen| \
> -                     rxonly|txonly|csum|icmpecho|noisy|5tswap) (""|retry)
> +
> + rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
> 
>  ``retry`` can be specified for forwarding engines except ``rx_only``.
> 
> @@ -357,6 +357,9 @@ The available information categories are:
> 
>    L4 swaps the source port and destination port of transport layer (TCP and UDP).
> 
> +* ``shared-rxq``: Receive only for shared Rx queue.
> +  Resolve packet source port from mbuf and update stream statistics
> accordingly.
> +
>  Example::
> 
>     testpmd> set fwd rxonly
> --
> 2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v12 6/7] app/testpmd: force shared Rx queue polled on same core
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 6/7] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-10-21  6:35     ` Li, Xiaoyun
  0 siblings, 0 replies; 266+ messages in thread
From: Li, Xiaoyun @ 2021-10-21  6:35 UTC (permalink / raw)
  To: Xueming Li, dev, Zhang, Yuying
  Cc: Jerin Jacob, Yigit, Ferruh, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit, Ananyev,
	Konstantin, Ajit Khaparde

> -----Original Message-----
> From: Xueming Li <xuemingl@nvidia.com>
> Sent: Thursday, October 21, 2021 13:09
> To: dev@dpdk.org; Zhang, Yuying <yuying.zhang@intel.com>; Li, Xiaoyun
> <xiaoyun.li@intel.com>
> Cc: xuemingl@nvidia.com; Jerin Jacob <jerinjacobk@gmail.com>; Yigit, Ferruh
> <ferruh.yigit@intel.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>; Viacheslav Ovsiienko
> <viacheslavo@nvidia.com>; Thomas Monjalon <thomas@monjalon.net>; Lior
> Margalit <lmargalit@nvidia.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Ajit Khaparde
> <ajit.khaparde@broadcom.com>
> Subject: [PATCH v12 6/7] app/testpmd: force shared Rx queue polled on same
> core
> 
> Shared Rx queue must be polled on same core. This patch checks and stops
> forwarding if shared RxQ being scheduled on multiple cores.
> 
> It's suggested to use same number of Rx queues and polling cores.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> ---
>  app/test-pmd/config.c  | 105
> +++++++++++++++++++++++++++++++++++++++++
>  app/test-pmd/testpmd.c |   5 +-
>  app/test-pmd/testpmd.h |   2 +
>  3 files changed, 111 insertions(+), 1 deletion(-)
> 
Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v12 7/7] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-21  6:33     ` Li, Xiaoyun
@ 2021-10-21  7:58       ` Xueming(Steven) Li
  2021-10-21  8:01         ` Li, Xiaoyun
  0 siblings, 1 reply; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-21  7:58 UTC (permalink / raw)
  To: xiaoyun.li, yuying.zhang, dev
  Cc: konstantin.ananyev, jerinjacobk, NBU-Contact-Thomas Monjalon,
	Slava Ovsiienko, ajit.khaparde, ferruh.yigit, andrew.rybchenko,
	Lior Margalit

On Thu, 2021-10-21 at 06:33 +0000, Li, Xiaoyun wrote:
> > -----Original Message-----
> > From: Xueming Li <xuemingl@nvidia.com>
> > Sent: Thursday, October 21, 2021 13:09
> > To: dev@dpdk.org; Zhang, Yuying <yuying.zhang@intel.com>; Li,
> > Xiaoyun
> > <xiaoyun.li@intel.com>
> > Cc: xuemingl@nvidia.com; Jerin Jacob <jerinjacobk@gmail.com>;
> > Yigit, Ferruh
> > <ferruh.yigit@intel.com>; Andrew Rybchenko
> > <andrew.rybchenko@oktetlabs.ru>; Viacheslav Ovsiienko
> > <viacheslavo@nvidia.com>; Thomas Monjalon <thomas@monjalon.net>;
> > Lior
> > Margalit <lmargalit@nvidia.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; Ajit Khaparde
> > <ajit.khaparde@broadcom.com>
> > Subject: [PATCH v12 7/7] app/testpmd: add forwarding engine for
> > shared Rx
> > queue
> > 
> > To support shared Rx queue, this patch introduces dedicate
> > forwarding engine.
> > The engine groups received packets by mbuf->port into sub-group,
> > updates
> > stream statistics and simply frees packets.
> > 
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>
> 
> I didn't ack you on this patch. I remember I added "+1" to the
> comment about your includes issue.
> It will confuse reviewers not to review new versions.

Yes, they are there by mistake.

> 
> > Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
> 
> I didn't see he ack this patch as well.
> Please remove these acks.
> 
> > ---
> >  app/test-pmd/meson.build                    |   1 +
> >  app/test-pmd/shared_rxq_fwd.c               | 113
> > ++++++++++++++++++++
> >  app/test-pmd/testpmd.c                      |   1 +
> >  app/test-pmd/testpmd.h                      |   5 +
> >  doc/guides/testpmd_app_ug/run_app.rst       |   5 +-
> >  doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
> >  6 files changed, 128 insertions(+), 2 deletions(-)  create mode
> > 100644 app/test-
> > pmd/shared_rxq_fwd.c
> > 
> > diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
> > index
> > 1ad54caef2c..b5a0f7b6209 100644
> > --- a/app/test-pmd/meson.build
> > +++ b/app/test-pmd/meson.build
> > @@ -22,6 +22,7 @@ sources = files(
> >          'noisy_vnf.c',
> >          'parameters.c',
> >          'rxonly.c',
> > +        'shared_rxq_fwd.c',
> >          'testpmd.c',
> >          'txonly.c',
> >          'util.c',
> > diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-
> > pmd/shared_rxq_fwd.c
> > new file mode 100644 index 00000000000..c4684893674
> > --- /dev/null
> > +++ b/app/test-pmd/shared_rxq_fwd.c
> > @@ -0,0 +1,113 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright (c) 2021 NVIDIA Corporation & Affiliates  */
> > +
> 
> Please add "#include <rte_ethdev.h>" here.
> Your shared_rxq_fwd.c only needs this include.

As explained below, testpmd relies on rte_ethdev.h.

> 
> > +#include "testpmd.h"
> > +
> > +/*
> > + * Rx only sub-burst forwarding.
> > + */
> > +static void
> > +forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst) {
> > +	rte_pktmbuf_free_bulk(pkts_burst, nb_rx); }
> > +
> > +/**
> > + * Get packet source stream by source port and queue.
> > + * All streams of same shared Rx queue locates on same core.
> > + */
> > +static struct fwd_stream *
> > +forward_stream_get(struct fwd_stream *fs, uint16_t port) {
> <snip>
> > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > 9482dab3071..ef7a6199313 100644
> > --- a/app/test-pmd/testpmd.h
> > +++ b/app/test-pmd/testpmd.h
> > @@ -12,6 +12,10 @@
> >  #include <rte_gro.h>
> >  #include <rte_gso.h>
> >  #include <rte_os_shim.h>
> > +#include <rte_mbuf_dyn.h>
> > +#include <rte_flow.h>
> > +#include <rte_ethdev.h>
> > +
> 
> Please remove these includes and this blank line.
> You only need to add the lib you need in your file like I said above.

From testing, testpmd.h itself uses these headers; without them, any fwd
engine that includes testpmd.h hits a compile error.

> 
> >  #include <cmdline.h>
> >  #include <sys/queue.h>
> >  #ifdef RTE_HAS_JANSSON
> > @@ -339,6 +343,7 @@ extern struct fwd_engine
> > five_tuple_swap_fwd_engine;
> > #ifdef RTE_LIBRTE_IEEE1588  extern struct fwd_engine
> > ieee1588_fwd_engine;
> > #endif
> > +extern struct fwd_engine shared_rxq_engine;
> > 
> >  extern struct fwd_engine * fwd_engines[]; /**< NULL terminated
> > array. */
> > extern cmdline_parse_inst_t cmd_set_raw; diff --git
> > a/doc/guides/testpmd_app_ug/run_app.rst
> > b/doc/guides/testpmd_app_ug/run_app.rst
> > index faa3efb902c..74412bb82ca 100644
> > --- a/doc/guides/testpmd_app_ug/run_app.rst
> > +++ b/doc/guides/testpmd_app_ug/run_app.rst
> > @@ -258,6 +258,7 @@ The command line options are:
> >         tm
> >         noisy
> >         5tswap
> > +       shared-rxq
> > 
> >  *   ``--rss-ip``
> > 
> > @@ -399,7 +400,9 @@ The command line options are:
> > 
> >      Create queues in shared Rx queue mode if device supports.
> >      Shared Rx queues are grouped per X ports. X defaults to
> > UINT32_MAX,
> > -    implies all ports join share group 1.
> > +    implies all ports join share group 1. A new forwarding engine
> > +    "shared-rxq" should be used for shared Rx queues. This engine
> > does
> > +    Rx only and update stream statistics accordingly.
> > 
> >  *   ``--eth-link-speed``
> > 
> > diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> > b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> > index 6d127d9a7bc..78d23429c42 100644
> > --- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> > +++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> > @@ -314,7 +314,7 @@ set fwd
> >  Set the packet forwarding mode::
> > 
> >     testpmd> set fwd (io|mac|macswap|flowgen| \
> > -                     rxonly|txonly|csum|icmpecho|noisy|5tswap)
> > (""|retry)
> > +
> > + rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
> > 
> >  ``retry`` can be specified for forwarding engines except
> > ``rx_only``.
> > 
> > @@ -357,6 +357,9 @@ The available information categories are:
> > 
> >    L4 swaps the source port and destination port of transport layer
> > (TCP and UDP).
> > 
> > +* ``shared-rxq``: Receive only for shared Rx queue.
> > +  Resolve packet source port from mbuf and update stream
> > statistics
> > accordingly.
> > +
> >  Example::
> > 
> >     testpmd> set fwd rxonly
> > --
> > 2.33.0
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v12 7/7] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-21  7:58       ` Xueming(Steven) Li
@ 2021-10-21  8:01         ` Li, Xiaoyun
  2021-10-21  8:22           ` Xueming(Steven) Li
  0 siblings, 1 reply; 266+ messages in thread
From: Li, Xiaoyun @ 2021-10-21  8:01 UTC (permalink / raw)
  To: Xueming(Steven) Li, Zhang, Yuying, dev
  Cc: Ananyev, Konstantin, jerinjacobk, NBU-Contact-Thomas Monjalon,
	Slava Ovsiienko, ajit.khaparde, Yigit, Ferruh, andrew.rybchenko,
	Lior Margalit



> -----Original Message-----
> From: Xueming(Steven) Li <xuemingl@nvidia.com>
> Sent: Thursday, October 21, 2021 15:59
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Zhang, Yuying
> <yuying.zhang@intel.com>; dev@dpdk.org
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> jerinjacobk@gmail.com; NBU-Contact-Thomas Monjalon
> <thomas@monjalon.net>; Slava Ovsiienko <viacheslavo@nvidia.com>;
> ajit.khaparde@broadcom.com; Yigit, Ferruh <ferruh.yigit@intel.com>;
> andrew.rybchenko@oktetlabs.ru; Lior Margalit <lmargalit@nvidia.com>
> Subject: Re: [PATCH v12 7/7] app/testpmd: add forwarding engine for shared Rx
> queue
> 
> On Thu, 2021-10-21 at 06:33 +0000, Li, Xiaoyun wrote:
> > > -----Original Message-----
> > > From: Xueming Li <xuemingl@nvidia.com>
> > > Sent: Thursday, October 21, 2021 13:09
> > > To: dev@dpdk.org; Zhang, Yuying <yuying.zhang@intel.com>; Li,
> > > Xiaoyun <xiaoyun.li@intel.com>
> > > Cc: xuemingl@nvidia.com; Jerin Jacob <jerinjacobk@gmail.com>; Yigit,
> > > Ferruh <ferruh.yigit@intel.com>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>; Viacheslav Ovsiienko
> > > <viacheslavo@nvidia.com>; Thomas Monjalon <thomas@monjalon.net>;
> > > Lior Margalit <lmargalit@nvidia.com>; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>; Ajit Khaparde
> > > <ajit.khaparde@broadcom.com>
> > > Subject: [PATCH v12 7/7] app/testpmd: add forwarding engine for
> > > shared Rx queue
> > >
> > > To support shared Rx queue, this patch introduces dedicate
> > > forwarding engine.
> > > The engine groups received packets by mbuf->port into sub-group,
> > > updates stream statistics and simply frees packets.
> > >
> > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>
> >
> > I didn't ack you on this patch. I remember I added "+1" to the comment
> > about your includes issue.
> > It will confuse reviewers not to review new versions.
> 
> Yes, they there by mistake.
> 
> >
> > > Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
> >
> > I didn't see he ack this patch as well.
> > Please remove these acks.
> >
> > > ---
> > >  app/test-pmd/meson.build                    |   1 +
> > >  app/test-pmd/shared_rxq_fwd.c               | 113
> > > ++++++++++++++++++++
> > >  app/test-pmd/testpmd.c                      |   1 +
> > >  app/test-pmd/testpmd.h                      |   5 +
> > >  doc/guides/testpmd_app_ug/run_app.rst       |   5 +-
> > >  doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
> > >  6 files changed, 128 insertions(+), 2 deletions(-)  create mode
> > > 100644 app/test-
> > > pmd/shared_rxq_fwd.c
> > >
> > > diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
> > > index
> > > 1ad54caef2c..b5a0f7b6209 100644
> > > --- a/app/test-pmd/meson.build
> > > +++ b/app/test-pmd/meson.build
> > > @@ -22,6 +22,7 @@ sources = files(
> > >          'noisy_vnf.c',
> > >          'parameters.c',
> > >          'rxonly.c',
> > > +        'shared_rxq_fwd.c',
> > >          'testpmd.c',
> > >          'txonly.c',
> > >          'util.c',
> > > diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-
> > > pmd/shared_rxq_fwd.c new file mode 100644 index
> > > 00000000000..c4684893674
> > > --- /dev/null
> > > +++ b/app/test-pmd/shared_rxq_fwd.c
> > > @@ -0,0 +1,113 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright (c) 2021 NVIDIA Corporation & Affiliates  */
> > > +
> >
> > Please add "#include <rte_ethdev.h>" here.
> > Your shared_rxq_fwd.c only needs this include.
> 
> As explained below, testpmd relies on rte_ethdev.h.
> 
> >
> > > +#include "testpmd.h"
> > > +
> > > +/*
> > > + * Rx only sub-burst forwarding.
> > > + */
> > > +static void
> > > +forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst) {
> > > +	rte_pktmbuf_free_bulk(pkts_burst, nb_rx); }
> > > +
> > > +/**
> > > + * Get packet source stream by source port and queue.
> > > + * All streams of same shared Rx queue locates on same core.
> > > + */
> > > +static struct fwd_stream *
> > > +forward_stream_get(struct fwd_stream *fs, uint16_t port) {
> > <snip>
> > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > > 9482dab3071..ef7a6199313 100644
> > > --- a/app/test-pmd/testpmd.h
> > > +++ b/app/test-pmd/testpmd.h
> > > @@ -12,6 +12,10 @@
> > >  #include <rte_gro.h>
> > >  #include <rte_gso.h>
> > >  #include <rte_os_shim.h>
> > > +#include <rte_mbuf_dyn.h>
> > > +#include <rte_flow.h>
> > > +#include <rte_ethdev.h>
> > > +
> >
> > Please remove these includes and this blank line.
> > You only need to add the lib you need in your file like I said above.
> 
> From test, testpmd.h used these headers, otherwise compile error if not
> included by fwd engine.

Have you tried my way? Include "#include <rte_ethdev.h>" in shared_rxq_fwd.c.
Please try this and see if there are any compiling issues.

> 
> >
> > >  #include <cmdline.h>
> > >  #include <sys/queue.h>
> > >  #ifdef RTE_HAS_JANSSON
> > > @@ -339,6 +343,7 @@ extern struct fwd_engine
> > > five_tuple_swap_fwd_engine; #ifdef RTE_LIBRTE_IEEE1588  extern
> > > struct fwd_engine ieee1588_fwd_engine; #endif
> > > +extern struct fwd_engine shared_rxq_engine;
> > >
> > >  extern struct fwd_engine * fwd_engines[]; /**< NULL terminated
> > > array. */ extern cmdline_parse_inst_t cmd_set_raw; diff --git
> > > a/doc/guides/testpmd_app_ug/run_app.rst
> > > b/doc/guides/testpmd_app_ug/run_app.rst
> > > index faa3efb902c..74412bb82ca 100644
> > > --- a/doc/guides/testpmd_app_ug/run_app.rst
> > > +++ b/doc/guides/testpmd_app_ug/run_app.rst
> > > @@ -258,6 +258,7 @@ The command line options are:
> > >         tm
> > >         noisy
> > >         5tswap
> > > +       shared-rxq
> > >
> > >  *   ``--rss-ip``
> > >
> > > @@ -399,7 +400,9 @@ The command line options are:
> > >
> > >      Create queues in shared Rx queue mode if device supports.
> > >      Shared Rx queues are grouped per X ports. X defaults to
> > > UINT32_MAX,
> > > -    implies all ports join share group 1.
> > > +    implies all ports join share group 1. A new forwarding engine
> > > +    "shared-rxq" should be used for shared Rx queues. This engine
> > > does
> > > +    Rx only and update stream statistics accordingly.
> > >
> > >  *   ``--eth-link-speed``
> > >
> > > diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> > > b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> > > index 6d127d9a7bc..78d23429c42 100644
> > > --- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> > > +++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> > > @@ -314,7 +314,7 @@ set fwd
> > >  Set the packet forwarding mode::
> > >
> > >     testpmd> set fwd (io|mac|macswap|flowgen| \
> > > -                     rxonly|txonly|csum|icmpecho|noisy|5tswap)
> > > (""|retry)
> > > +
> > > + rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
> > >
> > >  ``retry`` can be specified for forwarding engines except
> > > ``rx_only``.
> > >
> > > @@ -357,6 +357,9 @@ The available information categories are:
> > >
> > >    L4 swaps the source port and destination port of transport layer
> > > (TCP and UDP).
> > >
> > > +* ``shared-rxq``: Receive only for shared Rx queue.
> > > +  Resolve packet source port from mbuf and update stream
> > > statistics
> > > accordingly.
> > > +
> > >  Example::
> > >
> > >     testpmd> set fwd rxonly
> > > --
> > > 2.33.0
> >


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v12 7/7] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-21  8:01         ` Li, Xiaoyun
@ 2021-10-21  8:22           ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-21  8:22 UTC (permalink / raw)
  To: xiaoyun.li, yuying.zhang, dev
  Cc: konstantin.ananyev, jerinjacobk, NBU-Contact-Thomas Monjalon,
	Slava Ovsiienko, ajit.khaparde, ferruh.yigit, andrew.rybchenko,
	Lior Margalit

On Thu, 2021-10-21 at 08:01 +0000, Li, Xiaoyun wrote:
> 
> > -----Original Message-----
> > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Sent: Thursday, October 21, 2021 15:59
> > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Zhang, Yuying
> > <yuying.zhang@intel.com>; dev@dpdk.org
> > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > jerinjacobk@gmail.com; NBU-Contact-Thomas Monjalon
> > <thomas@monjalon.net>; Slava Ovsiienko <viacheslavo@nvidia.com>;
> > ajit.khaparde@broadcom.com; Yigit, Ferruh <ferruh.yigit@intel.com>;
> > andrew.rybchenko@oktetlabs.ru; Lior Margalit <lmargalit@nvidia.com>
> > Subject: Re: [PATCH v12 7/7] app/testpmd: add forwarding engine for shared Rx
> > queue
> > 
> > On Thu, 2021-10-21 at 06:33 +0000, Li, Xiaoyun wrote:
> > > > -----Original Message-----
> > > > From: Xueming Li <xuemingl@nvidia.com>
> > > > Sent: Thursday, October 21, 2021 13:09
> > > > To: dev@dpdk.org; Zhang, Yuying <yuying.zhang@intel.com>; Li,
> > > > Xiaoyun <xiaoyun.li@intel.com>
> > > > Cc: xuemingl@nvidia.com; Jerin Jacob <jerinjacobk@gmail.com>; Yigit,
> > > > Ferruh <ferruh.yigit@intel.com>; Andrew Rybchenko
> > > > <andrew.rybchenko@oktetlabs.ru>; Viacheslav Ovsiienko
> > > > <viacheslavo@nvidia.com>; Thomas Monjalon <thomas@monjalon.net>;
> > > > Lior Margalit <lmargalit@nvidia.com>; Ananyev, Konstantin
> > > > <konstantin.ananyev@intel.com>; Ajit Khaparde
> > > > <ajit.khaparde@broadcom.com>
> > > > Subject: [PATCH v12 7/7] app/testpmd: add forwarding engine for
> > > > shared Rx queue
> > > > 
> > > > To support shared Rx queue, this patch introduces a dedicated
> > > > forwarding engine.
> > > > The engine groups received packets by mbuf->port into sub-bursts,
> > > > updates stream statistics and simply frees the packets.
> > > > 
> > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>
> > > 
> > > I didn't ack you on this patch. I remember I added "+1" to the comment
> > > about your includes issue.
> > > It will confuse reviewers into not reviewing new versions.
> > 
> > Yes, they were there by mistake.
> > 
> > > 
> > > > Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
> > > 
> > > I didn't see him ack this patch either.
> > > Please remove these acks.
> > > 
> > > > ---
> > > >  app/test-pmd/meson.build                    |   1 +
> > > >  app/test-pmd/shared_rxq_fwd.c               | 113
> > > > ++++++++++++++++++++
> > > >  app/test-pmd/testpmd.c                      |   1 +
> > > >  app/test-pmd/testpmd.h                      |   5 +
> > > >  doc/guides/testpmd_app_ug/run_app.rst       |   5 +-
> > > >  doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
> > > >  6 files changed, 128 insertions(+), 2 deletions(-)  create mode
> > > > 100644 app/test-
> > > > pmd/shared_rxq_fwd.c
> > > > 
> > > > diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
> > > > index
> > > > 1ad54caef2c..b5a0f7b6209 100644
> > > > --- a/app/test-pmd/meson.build
> > > > +++ b/app/test-pmd/meson.build
> > > > @@ -22,6 +22,7 @@ sources = files(
> > > >          'noisy_vnf.c',
> > > >          'parameters.c',
> > > >          'rxonly.c',
> > > > +        'shared_rxq_fwd.c',
> > > >          'testpmd.c',
> > > >          'txonly.c',
> > > >          'util.c',
> > > > diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-
> > > > pmd/shared_rxq_fwd.c new file mode 100644 index
> > > > 00000000000..c4684893674
> > > > --- /dev/null
> > > > +++ b/app/test-pmd/shared_rxq_fwd.c
> > > > @@ -0,0 +1,113 @@
> > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > + * Copyright (c) 2021 NVIDIA Corporation & Affiliates  */
> > > > +
> > > 
> > > Please add "#include <rte_ethdev.h>" here.
> > > Your shared_rxq_fwd.c only needs this include.
> > 
> > As explained below, testpmd relies on rte_ethdev.h.
> > 
> > > 
> > > > +#include "testpmd.h"
> > > > +
> > > > +/*
> > > > + * Rx only sub-burst forwarding.
> > > > + */
> > > > +static void
> > > > +forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst) {
> > > > +	rte_pktmbuf_free_bulk(pkts_burst, nb_rx); }
> > > > +
> > > > +/**
> > > > + * Get packet source stream by source port and queue.
> > > > + * All streams of same shared Rx queue locates on same core.
> > > > + */
> > > > +static struct fwd_stream *
> > > > +forward_stream_get(struct fwd_stream *fs, uint16_t port) {
> > > <snip>
> > > > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > > > 9482dab3071..ef7a6199313 100644
> > > > --- a/app/test-pmd/testpmd.h
> > > > +++ b/app/test-pmd/testpmd.h
> > > > @@ -12,6 +12,10 @@
> > > >  #include <rte_gro.h>
> > > >  #include <rte_gso.h>
> > > >  #include <rte_os_shim.h>
> > > > +#include <rte_mbuf_dyn.h>
> > > > +#include <rte_flow.h>
> > > > +#include <rte_ethdev.h>
> > > > +
> > > 
> > > Please remove these includes and this blank line.
> > > You only need to add the lib you need in your file like I said above.
> > 
> > From testing, testpmd.h used these headers; otherwise there are compile
> > errors if they are not included by the fwd engine.
> 
> Have you tried my way? Include "#include <rte_ethdev.h>" in shared_rxq_fwd.c.
> Please try this and see if there are any compile issues.

It works; rte_ethdev.h seems to have everything needed, thanks!

> 
> > 
> > > 
> > > >  #include <cmdline.h>
> > > >  #include <sys/queue.h>
> > > >  #ifdef RTE_HAS_JANSSON
> > > > @@ -339,6 +343,7 @@ extern struct fwd_engine
> > > > five_tuple_swap_fwd_engine; #ifdef RTE_LIBRTE_IEEE1588  extern
> > > > struct fwd_engine ieee1588_fwd_engine; #endif
> > > > +extern struct fwd_engine shared_rxq_engine;
> > > > 
> > > >  extern struct fwd_engine * fwd_engines[]; /**< NULL terminated
> > > > array. */ extern cmdline_parse_inst_t cmd_set_raw; diff --git
> > > > a/doc/guides/testpmd_app_ug/run_app.rst
> > > > b/doc/guides/testpmd_app_ug/run_app.rst
> > > > index faa3efb902c..74412bb82ca 100644
> > > > --- a/doc/guides/testpmd_app_ug/run_app.rst
> > > > +++ b/doc/guides/testpmd_app_ug/run_app.rst
> > > > @@ -258,6 +258,7 @@ The command line options are:
> > > >         tm
> > > >         noisy
> > > >         5tswap
> > > > +       shared-rxq
> > > > 
> > > >  *   ``--rss-ip``
> > > > 
> > > > @@ -399,7 +400,9 @@ The command line options are:
> > > > 
> > > >      Create queues in shared Rx queue mode if device supports.
> > > >      Shared Rx queues are grouped per X ports. X defaults to
> > > > UINT32_MAX,
> > > > -    implies all ports join share group 1.
> > > > +    implies all ports join share group 1. A new forwarding engine
> > > > +    "shared-rxq" should be used for shared Rx queues. This engine
> > > > does
> > > > +    Rx only and update stream statistics accordingly.
> > > > 
> > > >  *   ``--eth-link-speed``
> > > > 
> > > > diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> > > > b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> > > > index 6d127d9a7bc..78d23429c42 100644
> > > > --- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> > > > +++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> > > > @@ -314,7 +314,7 @@ set fwd
> > > >  Set the packet forwarding mode::
> > > > 
> > > >     testpmd> set fwd (io|mac|macswap|flowgen| \
> > > > -                     rxonly|txonly|csum|icmpecho|noisy|5tswap)
> > > > (""|retry)
> > > > +
> > > > + rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
> > > > 
> > > >  ``retry`` can be specified for forwarding engines except
> > > > ``rx_only``.
> > > > 
> > > > @@ -357,6 +357,9 @@ The available information categories are:
> > > > 
> > > >    L4 swaps the source port and destination port of transport layer
> > > > (TCP and UDP).
> > > > 
> > > > +* ``shared-rxq``: Receive only for shared Rx queue.
> > > > +  Resolve packet source port from mbuf and update stream
> > > > statistics
> > > > accordingly.
> > > > +
> > > >  Example::
> > > > 
> > > >     testpmd> set fwd rxonly
> > > > --
> > > > 2.33.0
> > > 
> 


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v12 4/7] app/testpmd: new parameter to enable shared Rx queue
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-10-21  9:20     ` Thomas Monjalon
  0 siblings, 0 replies; 266+ messages in thread
From: Thomas Monjalon @ 2021-10-21  9:20 UTC (permalink / raw)
  To: Xueming Li
  Cc: dev, Zhang Yuying, Li Xiaoyun, Jerin Jacob, Ferruh Yigit,
	Andrew Rybchenko, Viacheslav Ovsiienko, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

21/10/2021 07:08, Xueming Li:
> Adds "--rxq-share=X" parameter to enable shared RxQ.
> 
> An Rx queue is shared if the device supports it; otherwise it falls
> back to a standard RxQ.
> 
> Shared Rx queues are grouped per X ports. X defaults to UINT32_MAX,
> which implies all ports join share group 1. Queue IDs are mapped 1:1
> to shared Rx queue IDs.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>

Acked-by: Thomas Monjalon <thomas@monjalon.net>




^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v12 7/7] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
  2021-10-21  6:33     ` Li, Xiaoyun
@ 2021-10-21  9:28     ` Thomas Monjalon
  1 sibling, 0 replies; 266+ messages in thread
From: Thomas Monjalon @ 2021-10-21  9:28 UTC (permalink / raw)
  To: Xueming Li
  Cc: dev, Zhang Yuying, Li Xiaoyun, Jerin Jacob, Ferruh Yigit,
	Andrew Rybchenko, Viacheslav Ovsiienko, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

21/10/2021 07:08, Xueming Li:
> +    implies all ports join share group 1. A new forwarding engine

Don't say "new" as it will not be new forever.
You can say a "specific" or "specialized".

> +    "shared-rxq" should be used for shared Rx queues. This engine does
> +    Rx only and update stream statistics accordingly.




^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v13 0/7] ethdev: introduce shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (13 preceding siblings ...)
  2021-10-21  5:08 ` [dpdk-dev] [PATCH v12 0/7] ethdev: introduce " Xueming Li
@ 2021-10-21 10:41 ` Xueming Li
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 1/7] " Xueming Li
                     ` (8 more replies)
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
  16 siblings, 9 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-21 10:41 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

In the current DPDK framework, all Rx queues are pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch series introduces shared Rx queue. A PF and representors in
the same Rx domain and switch domain can share an Rx queue set by
specifying a non-zero share group value in the Rx queue configuration.

All ports that share an Rx queue actually share one hardware descriptor
queue and feed all Rx queues from a single descriptor supply, so memory
is saved.

Polling any queue using the same shared Rx queue receives packets from
all member ports. The source port is identified by mbuf->port.

Multiple groups are supported by group ID. The queue number of each
port in a shared group should be identical. Queue indexes are mapped
1:1 within a shared group. An example of two share groups:
 Group1, 4 shared Rx queues per member port: PF, repr0, repr1
 Group2, 2 shared Rx queues per member port: repr2, repr3, ... repr127
 Poll first port for each group:
  core	port	queue
  0	0	0
  1	0	1
  2	0	2
  3	0	3
  4	2	0
  5	2	1

A shared Rx queue must be polled on a single thread or core. If both
PF0 and representor0 joined the same share group, pf0rxq0 cannot be
polled on core1 while rep0rxq0 is polled on core2. Actually, polling
one port within a share group is sufficient, since polling any port in
the group returns packets for all ports in the group.

There was some discussion about aggregating member ports in the same
group into a dummy port, with several ways to achieve it. Since it is
optional, more feedback and requirements need to be collected from
users before making a decision.
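
To make the polling model concrete, here is a minimal sketch (not part
of this series; port and queue IDs are illustrative) of polling one
member port of a shared Rx queue and recovering the real source port
from mbuf->port:

  #include <rte_ethdev.h>
  #include <rte_mbuf.h>

  #define BURST_SIZE 32

  static uint64_t rx_per_port[RTE_MAX_ETHPORTS];

  static void
  poll_shared_rxq(uint16_t member_port, uint16_t queue_id)
  {
      struct rte_mbuf *pkts[BURST_SIZE];
      uint16_t nb_rx, i;

      /* One poll returns packets for all member ports in the group. */
      nb_rx = rte_eth_rx_burst(member_port, queue_id, pkts, BURST_SIZE);
      for (i = 0; i < nb_rx; i++) {
          /* mbuf->port identifies the real source port. */
          rx_per_port[pkts[i]->port]++;
          rte_pktmbuf_free(pkts[i]);
      }
  }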

v1:
  - initial version
v2:
  - add testpmd patches
v3:
  - change common forwarding api to macro for performance, thanks Jerin.
  - save global variable accessed in forwarding to flowstream to minimize
    cache miss
  - combined patches for each forwarding engine
  - support multiple groups in testpmd "--share-rxq" parameter
  - new api to aggregate shared rxq group
v4:
  - spelling fixes
  - remove shared-rxq support for all forwarding engines
  - add dedicate shared-rxq forwarding engine
v5:
 - fix grammars
 - remove aggregate api and leave it for later discussion
 - add release notes
 - add deployment example
v6:
 - replace RxQ offload flag with device offload capability flag
 - add Rx domain
 - RxQ is shared when share group > 0
 - update testpmd accordingly
v7:
 - fix testpmd share group id allocation
 - change rx_domain to 16bits
v8:
 - add new patch for testpmd to show device Rx domain ID and capability
 - new share_qid in RxQ configuration
v9:
 - fix some spelling
v10:
 - add device capability name api
v11:
 - remove macro from device capability name list
v12:
 - rephrase
 - in forwarding core check, add global flag and RxQ enabled check
v13:
 - update imports of new forwarding engine
 - rephrase

Xueming Li (7):
  ethdev: introduce shared Rx queue
  ethdev: get device capability name as string
  app/testpmd: dump device capability and Rx domain info
  app/testpmd: new parameter to enable shared Rx queue
  app/testpmd: dump port info for shared Rx queue
  app/testpmd: force shared Rx queue polled on same core
  app/testpmd: add forwarding engine for shared Rx queue

 app/test-pmd/config.c                         | 141 +++++++++++++++++-
 app/test-pmd/meson.build                      |   1 +
 app/test-pmd/parameters.c                     |  13 ++
 app/test-pmd/shared_rxq_fwd.c                 | 115 ++++++++++++++
 app/test-pmd/testpmd.c                        |  26 +++-
 app/test-pmd/testpmd.h                        |   5 +
 app/test-pmd/util.c                           |   3 +
 doc/guides/nics/features.rst                  |  13 ++
 doc/guides/nics/features/default.ini          |   1 +
 .../prog_guide/switch_representation.rst      |  11 ++
 doc/guides/rel_notes/release_21_11.rst        |   6 +
 doc/guides/testpmd_app_ug/run_app.rst         |   9 ++
 doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
 lib/ethdev/rte_ethdev.c                       |  33 ++++
 lib/ethdev/rte_ethdev.h                       |  38 +++++
 lib/ethdev/version.map                        |   1 +
 16 files changed, 415 insertions(+), 6 deletions(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v13 1/7] ethdev: introduce shared Rx queue
  2021-10-21 10:41 ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Xueming Li
@ 2021-10-21 10:41   ` Xueming Li
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 2/7] ethdev: get device capability name as string Xueming Li
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-21 10:41 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

In the current DPDK framework, each Rx queue is pre-loaded with mbufs
to save incoming packets. For some PMDs, when the number of
representors scales out in a switch domain, the memory consumption
becomes significant. Polling all ports also leads to high cache miss
rates, high latency and low throughput.

This patch introduces shared Rx queue. Ports in the same Rx domain and
switch domain can share an Rx queue set by specifying a non-zero
sharing group in the Rx queue configuration.

A shared Rx queue is identified by the share_qid field of the Rx queue
configuration. Port A RxQ X can share an RxQ with Port B RxQ Y by using
the same shared Rx queue ID.

No special API is defined to receive packets from a shared Rx queue.
Polling any member port of a shared Rx queue receives packets of that
queue for all member ports; the source port_id is identified by
mbuf->port. The PMD is responsible for resolving the shared Rx queue
from the device and queue data.

A shared Rx queue must be polled in the same thread or core; polling
the queue ID of any member port is essentially the same.

Multiple share groups are supported. A PMD should support mixed
configurations by allowing multiple share groups and non-shared Rx
queues on one port.

Example grouping and polling model to reflect service priority:
 Group1, 2 shared Rx queues per port: PF, rep0, rep1
 Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
 Core0: poll PF queue0
 Core1: poll PF queue1
 Core2: poll rep2 queue0

A PMD advertises the shared Rx queue capability via
RTE_ETH_DEV_CAPA_RXQ_SHARE.

The PMD is responsible for shared Rx queue consistency checks to avoid
member ports' configurations contradicting each other.
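
As an illustration, a minimal configuration sketch (not part of this
patch; port_id, nb_rxd, socket_id and mbuf_pool are assumed to be set
up elsewhere, and error handling is omitted):

  struct rte_eth_dev_info info;
  struct rte_eth_rxconf rxconf;

  rte_eth_dev_info_get(port_id, &info);
  if (info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) {
      rxconf = info.default_rxconf;
      rxconf.share_group = 1; /* non-zero value enables sharing */
      rxconf.share_qid = 0;   /* shared Rx queue ID within the group */
      rte_eth_rx_queue_setup(port_id, 0, nb_rxd, socket_id,
                             &rxconf, mbuf_pool);
  }

Repeating the same setup on every member port with the same share_group
and share_qid makes those queues share one descriptor queue.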

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 doc/guides/nics/features.rst                  | 13 ++++++++++
 doc/guides/nics/features/default.ini          |  1 +
 .../prog_guide/switch_representation.rst      | 11 +++++++++
 doc/guides/rel_notes/release_21_11.rst        |  6 +++++
 lib/ethdev/rte_ethdev.c                       |  8 +++++++
 lib/ethdev/rte_ethdev.h                       | 24 +++++++++++++++++++
 6 files changed, 63 insertions(+)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 8dd421ca013..d35751d5b5a 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -614,6 +614,19 @@ Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same Rx domain of a switch domain.
+
+* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
+* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
+* **[uses]     rte_eth_rxconf**: ``share_group``, ``share_qid``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 09914b1ad32..39d21fcd379 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@ Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c806..4f2532a91ea 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,17 @@ thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- For some PMDs, memory usage of representors is huge when number of
+  representor grows, mbufs are allocated for each descriptor of Rx queue.
+  Polling large number of ports brings more CPU load, cache miss and
+  latency. Shared Rx queue can be used to share Rx queue between PF and
+  representors among same Rx domain. ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` in
+  device info is used to indicate the capability. Setting non-zero share
+  group in Rx queue configuration to enable share, share_qid is used to
+  identify the shared Rx queue in group. Polling any member port can
+  receive packets of all member ports in the group, port ID is saved in
+  ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 74776ca0691..f4fb68e7408 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -75,6 +75,12 @@ New Features
     operations.
   * Added multi-process support.
 
+* **Added ethdev shared Rx queue support.**
+
+  * Added new device capability flag and Rx domain field to switch info.
+  * Added share group and share queue ID to Rx queue configuration.
+  * Added testpmd support and dedicate forwarding engine.
+
 * **Added support to get all MAC addresses of a device.**
 
   Added ``rte_eth_macaddrs_get`` to allow user to retrieve all Ethernet
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 1f18aa916cc..31a9cba065b 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -2175,6 +2175,14 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
 		return -EINVAL;
 	}
 
+	if (local_conf.share_group > 0 &&
+	    (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
+		RTE_ETHDEV_LOG(ERR,
+			"Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%hu while device doesn't support Rx queue share\n",
+			port_id, rx_queue_id, local_conf.share_group);
+		return -EINVAL;
+	}
+
 	/*
 	 * If LRO is enabled, check that the maximum aggregated packet
 	 * size is supported by the configured device.
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 014270d3167..40f88cc3d64 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1045,6 +1045,14 @@ struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	/**
+	 * Share group index in Rx domain and switch domain.
+	 * Non-zero value to enable Rx queue share, zero value disable share.
+	 * PMD is responsible for Rx queue consistency checks to avoid member
+	 * port's configuration contradict to each other.
+	 */
+	uint16_t share_group;
+	uint16_t share_qid; /**< Shared Rx queue ID in group. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1445,6 +1453,16 @@ struct rte_eth_conf {
 #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
 /** Device supports Tx queue setup after device started. */
 #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
+/**
+ * Device supports shared Rx queue among ports within Rx domain and
+ * switch domain. Mbufs are consumed by shared Rx queue instead of
+ * each queue. Multiple groups are supported by share_group of Rx
+ * queue configuration. Shared Rx queue is identified by PMD using
+ * share_qid of Rx queue configuration. Polling any port in the group
+ * receive packets of all member ports, source port identified by
+ * mbuf->port field.
+ */
+#define RTE_ETH_DEV_CAPA_RXQ_SHARE              RTE_BIT64(2)
 /**@}*/
 
 /*
@@ -1488,6 +1506,12 @@ struct rte_eth_switch_info {
 	 * but each driver should explicitly define the mapping of switch
 	 * port identifier to that physical interconnect/switch
 	 */
+	/**
+	 * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
+	 * and switch domain can share Rx queue. Valid only if device advertised
+	 * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
+	 */
+	uint16_t rx_domain;
 };
 
 /**
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v13 2/7] ethdev: get device capability name as string
  2021-10-21 10:41 ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Xueming Li
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 1/7] " Xueming Li
@ 2021-10-21 10:41   ` Xueming Li
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 3/7] app/testpmd: dump device capability and Rx domain info Xueming Li
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-21 10:41 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde, Ray Kinsella

This patch adds an API to return the name of a device capability.
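
A usage sketch (illustrative fragment only):

  uint64_t capa = RTE_ETH_DEV_CAPA_RXQ_SHARE;

  /* Prints "RXQ_SHARE"; an unrecognized flag yields "UNKNOWN". */
  printf("capability: %s\n", rte_eth_dev_capability_name(capa));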

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
---
 lib/ethdev/rte_ethdev.c | 25 +++++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h | 14 ++++++++++++++
 lib/ethdev/version.map  |  1 +
 3 files changed, 40 insertions(+)

diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 31a9cba065b..bfe5b0adbef 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -167,6 +167,15 @@ static const struct {
 
 #undef RTE_TX_OFFLOAD_BIT2STR
 
+static const struct {
+	uint64_t offload;
+	const char *name;
+} rte_eth_dev_capa_names[] = {
+	{RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP, "RUNTIME_RX_QUEUE_SETUP"},
+	{RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP, "RUNTIME_TX_QUEUE_SETUP"},
+	{RTE_ETH_DEV_CAPA_RXQ_SHARE, "RXQ_SHARE"},
+};
+
 /**
  * The user application callback description.
  *
@@ -1236,6 +1245,22 @@ rte_eth_dev_tx_offload_name(uint64_t offload)
 	return name;
 }
 
+const char *
+rte_eth_dev_capability_name(uint64_t capability)
+{
+	const char *name = "UNKNOWN";
+	unsigned int i;
+
+	for (i = 0; i < RTE_DIM(rte_eth_dev_capa_names); ++i) {
+		if (capability == rte_eth_dev_capa_names[i].offload) {
+			name = rte_eth_dev_capa_names[i].name;
+			break;
+		}
+	}
+
+	return name;
+}
+
 static inline int
 eth_dev_check_lro_pkt_size(uint16_t port_id, uint32_t config_size,
 		   uint32_t max_rx_pkt_len, uint32_t dev_info_size)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 40f88cc3d64..9baca39e97a 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -2109,6 +2109,20 @@ const char *rte_eth_dev_rx_offload_name(uint64_t offload);
  */
 const char *rte_eth_dev_tx_offload_name(uint64_t offload);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get RTE_ETH_DEV_CAPA_* flag name.
+ *
+ * @param capability
+ *   Capability flag.
+ * @return
+ *   Capability name or 'UNKNOWN' if the flag cannot be recognized.
+ */
+__rte_experimental
+const char *rte_eth_dev_capability_name(uint64_t capability);
+
 /**
  * Configure an Ethernet device.
  * This function must be invoked first before any other function in the
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index d552c955c94..e1abe997290 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -249,6 +249,7 @@ EXPERIMENTAL {
 	rte_mtr_meter_policy_validate;
 
 	# added in 21.11
+	rte_eth_dev_capability_name;
 	rte_eth_dev_conf_get;
 	rte_eth_macaddrs_get;
 	rte_eth_rx_metadata_negotiate;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v13 3/7] app/testpmd: dump device capability and Rx domain info
  2021-10-21 10:41 ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Xueming Li
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 1/7] " Xueming Li
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 2/7] ethdev: get device capability name as string Xueming Li
@ 2021-10-21 10:41   ` Xueming Li
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-21 10:41 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

Dump device capabilities, and the Rx domain ID if shared Rx queue is
supported by the device.
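
As an illustration, for a hypothetical device reporting dev_capa = 0x7
(the three flags currently named), the port info output would contain
a line like:

  Device capabilities: 0x7( RUNTIME_RX_QUEUE_SETUP RUNTIME_TX_QUEUE_SETUP RXQ_SHARE )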

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/config.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 23aa334cda0..db36ca41b94 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -644,6 +644,29 @@ device_infos_display(const char *identifier)
 	rte_devargs_reset(&da);
 }
 
+static void
+print_dev_capabilities(uint64_t capabilities)
+{
+	uint64_t single_capa;
+	int begin;
+	int end;
+	int bit;
+
+	if (capabilities == 0)
+		return;
+
+	begin = __builtin_ctzll(capabilities);
+	end = sizeof(capabilities) * CHAR_BIT - __builtin_clzll(capabilities);
+
+	single_capa = 1ULL << begin;
+	for (bit = begin; bit < end; bit++) {
+		if (capabilities & single_capa)
+			printf(" %s",
+			       rte_eth_dev_capability_name(single_capa));
+		single_capa <<= 1;
+	}
+}
+
 void
 port_infos_display(portid_t port_id)
 {
@@ -795,6 +818,9 @@ port_infos_display(portid_t port_id)
 	printf("Max segment number per MTU/TSO: %hu\n",
 		dev_info.tx_desc_lim.nb_mtu_seg_max);
 
+	printf("Device capabilities: 0x%"PRIx64"(", dev_info.dev_capa);
+	print_dev_capabilities(dev_info.dev_capa);
+	printf(" )\n");
 	/* Show switch info only if valid switch domain and port id is set */
 	if (dev_info.switch_info.domain_id !=
 		RTE_ETH_DEV_SWITCH_DOMAIN_ID_INVALID) {
@@ -805,6 +831,9 @@ port_infos_display(portid_t port_id)
 			dev_info.switch_info.domain_id);
 		printf("Switch Port Id: %u\n",
 			dev_info.switch_info.port_id);
+		if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) != 0)
+			printf("Switch Rx domain: %u\n",
+			       dev_info.switch_info.rx_domain);
 	}
 }
 
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v13 4/7] app/testpmd: new parameter to enable shared Rx queue
  2021-10-21 10:41 ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Xueming Li
                     ` (2 preceding siblings ...)
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 3/7] app/testpmd: dump device capability and Rx domain info Xueming Li
@ 2021-10-21 10:41   ` Xueming Li
  2021-10-21 19:45     ` Ajit Khaparde
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 5/7] app/testpmd: dump port info for " Xueming Li
                     ` (4 subsequent siblings)
  8 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-10-21 10:41 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

Adds "--rxq-share=X" parameter to enable shared RxQ.

An Rx queue is shared if the device supports it; otherwise it falls
back to a standard RxQ.

Shared Rx queues are grouped per X ports. X defaults to UINT32_MAX,
which implies all ports join share group 1. Queue IDs are mapped 1:1
to shared Rx queue IDs.
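
For example, with four ports and "--rxq-share=2" (a hypothetical
invocation), the mapping used by this patch (share_group =
port_id / X + 1, share_qid = queue_id) gives:

  port 0, port 1 -> share group 1
  port 2, port 3 -> share group 2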

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>
---
 app/test-pmd/config.c                 |  7 ++++++-
 app/test-pmd/parameters.c             | 13 +++++++++++++
 app/test-pmd/testpmd.c                | 20 +++++++++++++++++---
 app/test-pmd/testpmd.h                |  2 ++
 doc/guides/testpmd_app_ug/run_app.rst |  6 ++++++
 5 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index db36ca41b94..e4bbf457916 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -2890,7 +2890,12 @@ rxtx_config_display(void)
 			printf("      RX threshold registers: pthresh=%d hthresh=%d "
 				" wthresh=%d\n",
 				pthresh_tmp, hthresh_tmp, wthresh_tmp);
-			printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
+			printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
+			if (rx_conf->share_group > 0)
+				printf(" share_group=%u share_qid=%u",
+				       rx_conf->share_group,
+				       rx_conf->share_qid);
+			printf("\n");
 		}
 
 		/* per tx queue config only for first queue to be less verbose */
diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 779a721fa05..afc75f6bd21 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -171,6 +171,7 @@ usage(char* progname)
 	printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
 	printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
 	printf("  --eth-link-speed: force link speed.\n");
+	printf("  --rxq-share=X: number of ports per shared Rx queue groups, defaults to UINT32_MAX (1 group)\n");
 	printf("  --disable-link-check: disable check on link status when "
 	       "starting/stopping ports.\n");
 	printf("  --disable-device-start: do not automatically start port\n");
@@ -678,6 +679,7 @@ launch_args_parse(int argc, char** argv)
 		{ "rxpkts",			1, 0, 0 },
 		{ "txpkts",			1, 0, 0 },
 		{ "txonly-multi-flow",		0, 0, 0 },
+		{ "rxq-share",			2, 0, 0 },
 		{ "eth-link-speed",		1, 0, 0 },
 		{ "disable-link-check",		0, 0, 0 },
 		{ "disable-device-start",	0, 0, 0 },
@@ -1352,6 +1354,17 @@ launch_args_parse(int argc, char** argv)
 			}
 			if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
 				txonly_multi_flow = 1;
+			if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
+				if (optarg == NULL) {
+					rxq_share = UINT32_MAX;
+				} else {
+					n = atoi(optarg);
+					if (n >= 0)
+						rxq_share = (uint32_t)n;
+					else
+						rte_exit(EXIT_FAILURE, "rxq-share must be >= 0\n");
+				}
+			}
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index af0e79fe6d5..80337bad382 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -502,6 +502,11 @@ uint8_t record_core_cycles;
  */
 uint8_t record_burst_stats;
 
+/*
+ * Number of ports per shared Rx queue group, 0 disable.
+ */
+uint32_t rxq_share;
+
 unsigned int num_sockets = 0;
 unsigned int socket_ids[RTE_MAX_NUMA_NODES];
 
@@ -3629,14 +3634,23 @@ dev_event_callback(const char *device_name, enum rte_dev_event_type type,
 }
 
 static void
-rxtx_port_config(struct rte_port *port)
+rxtx_port_config(portid_t pid)
 {
 	uint16_t qid;
 	uint64_t offloads;
+	struct rte_port *port = &ports[pid];
 
 	for (qid = 0; qid < nb_rxq; qid++) {
 		offloads = port->rx_conf[qid].offloads;
 		port->rx_conf[qid] = port->dev_info.default_rxconf;
+
+		if (rxq_share > 0 &&
+		    (port->dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)) {
+			/* Non-zero share group to enable RxQ share. */
+			port->rx_conf[qid].share_group = pid / rxq_share + 1;
+			port->rx_conf[qid].share_qid = qid; /* Equal mapping. */
+		}
+
 		if (offloads != 0)
 			port->rx_conf[qid].offloads = offloads;
 
@@ -3765,7 +3779,7 @@ init_port_config(void)
 			}
 		}
 
-		rxtx_port_config(port);
+		rxtx_port_config(pid);
 
 		ret = eth_macaddr_get_print_err(pid, &port->eth_addr);
 		if (ret != 0)
@@ -3977,7 +3991,7 @@ init_port_dcb_config(portid_t pid,
 
 	memcpy(&rte_port->dev_conf, &port_conf, sizeof(struct rte_eth_conf));
 
-	rxtx_port_config(rte_port);
+	rxtx_port_config(pid);
 	/* VLAN filter */
 	rte_port->dev_conf.rxmode.offloads |= DEV_RX_OFFLOAD_VLAN_FILTER;
 	for (i = 0; i < RTE_DIM(vlan_tags); i++)
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index e3995d24ab5..63f9913deb6 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -524,6 +524,8 @@ extern enum tx_pkt_split tx_pkt_split;
 
 extern uint8_t txonly_multi_flow;
 
+extern uint32_t rxq_share;
+
 extern uint16_t nb_pkt_per_burst;
 extern uint16_t nb_pkt_flowgen_clones;
 extern int nb_flows_flowgen;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 8ff7ab85369..faa3efb902c 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -395,6 +395,12 @@ The command line options are:
 
     Generate multiple flows in txonly mode.
 
+*   ``--rxq-share=[X]``
+
+    Create queues in shared Rx queue mode if device supports.
+    Shared Rx queues are grouped per X ports. X defaults to UINT32_MAX,
+    implies all ports join share group 1.
+
 *   ``--eth-link-speed``
 
     Set a forced link speed to the ethernet port::
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v13 5/7] app/testpmd: dump port info for shared Rx queue
  2021-10-21 10:41 ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Xueming Li
                     ` (3 preceding siblings ...)
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-10-21 10:41   ` Xueming Li
  2021-10-21 19:48     ` Ajit Khaparde
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 6/7] app/testpmd: force shared Rx queue polled on same core Xueming Li
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-10-21 10:41 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

In case of shared Rx queue, the source port of an mbuf from the polling
result isn't necessarily the Rx port of the forwarding stream. To
provide the original port ID, this patch dumps mbuf->port for each
packet in verbose mode if shared Rx queue is enabled.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 app/test-pmd/util.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index 26dc0c86406..f712f687287 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -101,6 +101,9 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
 		struct rte_port *port = &ports[port_id];
 
 		mb = pkts[i];
+		if (rxq_share > 0)
+			MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ",
+				  mb->port);
 		eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		packet_type = mb->packet_type;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v13 6/7] app/testpmd: force shared Rx queue polled on same core
  2021-10-21 10:41 ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Xueming Li
                     ` (4 preceding siblings ...)
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 5/7] app/testpmd: dump port info for " Xueming Li
@ 2021-10-21 10:41   ` Xueming Li
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-21 10:41 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

A shared Rx queue must be polled on the same core. This patch checks
and stops forwarding if a shared RxQ is being scheduled on multiple
cores.

It is suggested to use the same number of Rx queues and polling cores.
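
As an illustration, assuming PF0 and representor0 share group 1 with
two Rx queues each, the check accepts and rejects schedules like:

  valid:   core0 polls pf0 rxq0, core1 polls pf0 rxq1
  invalid: core0 polls pf0 rxq0, core1 polls rep0 rxq0
           (the same shared queue scheduled on two cores)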

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 app/test-pmd/config.c  | 105 +++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/testpmd.c |   5 +-
 app/test-pmd/testpmd.h |   2 +
 3 files changed, 111 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index e4bbf457916..cad78350dcc 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -3067,6 +3067,111 @@ port_rss_hash_key_update(portid_t port_id, char rss_type[], uint8_t *hash_key,
 	}
 }
 
+/*
+ * Check whether a shared rxq scheduled on other lcores.
+ */
+static bool
+fwd_stream_on_other_lcores(uint16_t domain_id, lcoreid_t src_lc,
+			   portid_t src_port, queueid_t src_rxq,
+			   uint32_t share_group, queueid_t share_rxq)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/* Check remaining cores. */
+	for (lc_id = src_lc + 1; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0 || rxq_conf->share_group == 0)
+				/* Not shared rxq. */
+				continue;
+			if (domain_id != port->dev_info.switch_info.domain_id)
+				continue;
+			if (rxq_conf->share_group != share_group)
+				continue;
+			if (rxq_conf->share_qid != share_rxq)
+				continue;
+			printf("Shared Rx queue group %u queue %hu can't be scheduled on different cores:\n",
+			       share_group, share_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       src_lc, src_port, src_rxq);
+			printf("  lcore %hhu Port %hu queue %hu\n",
+			       lc_id, fs->rx_port, fs->rx_queue);
+			printf("Please use --nb-cores=%hu to limit number of forwarding cores\n",
+			       nb_rxq);
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Check shared rxq configuration.
+ *
+ * Shared group must not being scheduled on different core.
+ */
+bool
+pkt_fwd_shared_rxq_check(void)
+{
+	streamid_t sm_id;
+	streamid_t nb_fs_per_lcore;
+	lcoreid_t  nb_fc;
+	lcoreid_t  lc_id;
+	struct fwd_stream *fs;
+	uint16_t domain_id;
+	struct rte_port *port;
+	struct rte_eth_dev_info *dev_info;
+	struct rte_eth_rxconf *rxq_conf;
+
+	if (rxq_share == 0)
+		return true;
+	nb_fc = cur_fwd_config.nb_fwd_lcores;
+	/*
+	 * Check streams on each core, make sure the same switch domain +
+	 * group + queue doesn't get scheduled on other cores.
+	 */
+	for (lc_id = 0; lc_id < nb_fc; lc_id++) {
+		sm_id = fwd_lcores[lc_id]->stream_idx;
+		nb_fs_per_lcore = fwd_lcores[lc_id]->stream_nb;
+		for (; sm_id < fwd_lcores[lc_id]->stream_idx + nb_fs_per_lcore;
+		     sm_id++) {
+			fs = fwd_streams[sm_id];
+			/* Update lcore info stream being scheduled. */
+			fs->lcore = fwd_lcores[lc_id];
+			port = &ports[fs->rx_port];
+			dev_info = &port->dev_info;
+			rxq_conf = &port->rx_conf[fs->rx_queue];
+			if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)
+			    == 0 || rxq_conf->share_group == 0)
+				/* Not shared rxq. */
+				continue;
+			/* Check shared rxq not scheduled on remaining cores. */
+			domain_id = port->dev_info.switch_info.domain_id;
+			if (fwd_stream_on_other_lcores(domain_id, lc_id,
+						       fs->rx_port,
+						       fs->rx_queue,
+						       rxq_conf->share_group,
+						       rxq_conf->share_qid))
+				return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Setup forwarding configuration for each logical core.
  */
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 80337bad382..d76d298a4b9 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2309,6 +2309,10 @@ start_packet_forwarding(int with_tx_first)
 
 	fwd_config_setup();
 
+	pkt_fwd_config_display(&cur_fwd_config);
+	if (!pkt_fwd_shared_rxq_check())
+		return;
+
 	port_fwd_begin = cur_fwd_config.fwd_eng->port_fwd_begin;
 	if (port_fwd_begin != NULL) {
 		for (i = 0; i < cur_fwd_config.nb_fwd_ports; i++) {
@@ -2338,7 +2342,6 @@ start_packet_forwarding(int with_tx_first)
 	if(!no_flush_rx)
 		flush_fwd_rx_queues();
 
-	pkt_fwd_config_display(&cur_fwd_config);
 	rxtx_config_display();
 
 	fwd_stats_reset();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 63f9913deb6..9482dab3071 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -147,6 +147,7 @@ struct fwd_stream {
 	uint64_t     core_cycles; /**< used for RX and TX processing */
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
+	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
 };
 
 /**
@@ -842,6 +843,7 @@ void port_summary_header_display(void);
 void rx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void tx_queue_infos_display(portid_t port_idi, uint16_t queue_id);
 void fwd_lcores_config_display(void);
+bool pkt_fwd_shared_rxq_check(void);
 void pkt_fwd_config_display(struct fwd_config *cfg);
 void rxtx_config_display(void);
 void fwd_config_setup(void);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v13 7/7] app/testpmd: add forwarding engine for shared Rx queue
  2021-10-21 10:41 ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Xueming Li
                     ` (5 preceding siblings ...)
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 6/7] app/testpmd: force shared Rx queue polled on same core Xueming Li
@ 2021-10-21 10:41   ` Xueming Li
  2021-10-21 23:41   ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Ferruh Yigit
  2021-11-04 15:52   ` Tom Barbette
  8 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-10-21 10:41 UTC (permalink / raw)
  To: dev, Zhang Yuying, Li Xiaoyun
  Cc: xuemingl, Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

To support shared Rx queue, this patch introduces a dedicated
forwarding engine. The engine groups received packets by mbuf->port
into sub-bursts, updates stream statistics and simply frees the
packets.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 app/test-pmd/meson.build                    |   1 +
 app/test-pmd/shared_rxq_fwd.c               | 115 ++++++++++++++++++++
 app/test-pmd/testpmd.c                      |   1 +
 app/test-pmd/testpmd.h                      |   1 +
 doc/guides/testpmd_app_ug/run_app.rst       |   5 +-
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |   5 +-
 6 files changed, 126 insertions(+), 2 deletions(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c

diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 1ad54caef2c..b5a0f7b6209 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -22,6 +22,7 @@ sources = files(
         'noisy_vnf.c',
         'parameters.c',
         'rxonly.c',
+        'shared_rxq_fwd.c',
         'testpmd.c',
         'txonly.c',
         'util.c',
diff --git a/app/test-pmd/shared_rxq_fwd.c b/app/test-pmd/shared_rxq_fwd.c
new file mode 100644
index 00000000000..da54a383fd5
--- /dev/null
+++ b/app/test-pmd/shared_rxq_fwd.c
@@ -0,0 +1,115 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <rte_ethdev.h>
+
+#include "testpmd.h"
+
+/*
+ * Rx only sub-burst forwarding.
+ */
+static void
+forward_rx_only(uint16_t nb_rx, struct rte_mbuf **pkts_burst)
+{
+	rte_pktmbuf_free_bulk(pkts_burst, nb_rx);
+}
+
+/**
+ * Get packet source stream by source port and queue.
+ * All streams of same shared Rx queue locates on same core.
+ */
+static struct fwd_stream *
+forward_stream_get(struct fwd_stream *fs, uint16_t port)
+{
+	streamid_t sm_id;
+	struct fwd_lcore *fc;
+	struct fwd_stream **fsm;
+	streamid_t nb_fs;
+
+	fc = fs->lcore;
+	fsm = &fwd_streams[fc->stream_idx];
+	nb_fs = fc->stream_nb;
+	for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+		if (fsm[sm_id]->rx_port == port &&
+		    fsm[sm_id]->rx_queue == fs->rx_queue)
+			return fsm[sm_id];
+	}
+	return NULL;
+}
+
+/**
+ * Forward packet by source port and queue.
+ */
+static void
+forward_sub_burst(struct fwd_stream *src_fs, uint16_t port, uint16_t nb_rx,
+		  struct rte_mbuf **pkts)
+{
+	struct fwd_stream *fs = forward_stream_get(src_fs, port);
+
+	if (fs != NULL) {
+		fs->rx_packets += nb_rx;
+		forward_rx_only(nb_rx, pkts);
+	} else {
+		/* Source stream not found, drop all packets. */
+		src_fs->fwd_dropped += nb_rx;
+		while (nb_rx > 0)
+			rte_pktmbuf_free(pkts[--nb_rx]);
+	}
+}
+
+/**
+ * Forward packets from shared Rx queue.
+ *
+ * Source port of packets are identified by mbuf->port.
+ */
+static void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst)
+{
+	uint16_t i, nb_sub_burst, port, last_port;
+
+	nb_sub_burst = 0;
+	last_port = pkts_burst[0]->port;
+	/* Locate sub-burst according to mbuf->port. */
+	for (i = 0; i < nb_rx - 1; ++i) {
+		rte_prefetch0(pkts_burst[i + 1]);
+		port = pkts_burst[i]->port;
+		if (i > 0 && last_port != port) {
+			/* Forward packets with same source port. */
+			forward_sub_burst(fs, last_port, nb_sub_burst,
+					  &pkts_burst[i - nb_sub_burst]);
+			nb_sub_burst = 0;
+			last_port = port;
+		}
+		nb_sub_burst++;
+	}
+	/* Last sub-burst. */
+	nb_sub_burst++;
+	forward_sub_burst(fs, last_port, nb_sub_burst,
+			  &pkts_burst[nb_rx - nb_sub_burst]);
+}
+
+static void
+shared_rxq_fwd(struct fwd_stream *fs)
+{
+	struct rte_mbuf *pkts_burst[nb_pkt_per_burst];
+	uint16_t nb_rx;
+	uint64_t start_tsc = 0;
+
+	get_start_cycles(&start_tsc);
+	nb_rx = rte_eth_rx_burst(fs->rx_port, fs->rx_queue, pkts_burst,
+				 nb_pkt_per_burst);
+	inc_rx_burst_stats(fs, nb_rx);
+	if (unlikely(nb_rx == 0))
+		return;
+	forward_shared_rxq(fs, nb_rx, pkts_burst);
+	get_end_cycles(fs, start_tsc);
+}
+
+struct fwd_engine shared_rxq_engine = {
+	.fwd_mode_name  = "shared_rxq",
+	.port_fwd_begin = NULL,
+	.port_fwd_end   = NULL,
+	.packet_fwd     = shared_rxq_fwd,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index d76d298a4b9..6d5bbc82404 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -188,6 +188,7 @@ struct fwd_engine * fwd_engines[] = {
 #ifdef RTE_LIBRTE_IEEE1588
 	&ieee1588_fwd_engine,
 #endif
+	&shared_rxq_engine,
 	NULL,
 };
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 9482dab3071..bf3669134aa 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -339,6 +339,7 @@ extern struct fwd_engine five_tuple_swap_fwd_engine;
 #ifdef RTE_LIBRTE_IEEE1588
 extern struct fwd_engine ieee1588_fwd_engine;
 #endif
+extern struct fwd_engine shared_rxq_engine;
 
 extern struct fwd_engine * fwd_engines[]; /**< NULL terminated array. */
 extern cmdline_parse_inst_t cmd_set_raw;
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index faa3efb902c..d23e0b6a7a2 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -258,6 +258,7 @@ The command line options are:
        tm
        noisy
        5tswap
+       shared-rxq
 
 *   ``--rss-ip``
 
@@ -399,7 +400,9 @@ The command line options are:
 
     Create queues in shared Rx queue mode if device supports.
     Shared Rx queues are grouped per X ports. X defaults to UINT32_MAX,
-    implies all ports join share group 1.
+    implies all ports join share group 1. Forwarding engine "shared-rxq"
+    should be used for shared Rx queues. This engine does Rx only and
+    update stream statistics accordingly.
 
 *   ``--eth-link-speed``
 
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index 6d127d9a7bc..78d23429c42 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -314,7 +314,7 @@ set fwd
 Set the packet forwarding mode::
 
    testpmd> set fwd (io|mac|macswap|flowgen| \
-                     rxonly|txonly|csum|icmpecho|noisy|5tswap) (""|retry)
+                     rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
 
 ``retry`` can be specified for forwarding engines except ``rx_only``.
 
@@ -357,6 +357,9 @@ The available information categories are:
 
   L4 swaps the source port and destination port of transport layer (TCP and UDP).
 
+* ``shared-rxq``: Receive only for shared Rx queue.
+  Resolve packet source port from mbuf and update stream statistics accordingly.
+
 Example::
 
    testpmd> set fwd rxonly
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v13 4/7] app/testpmd: new parameter to enable shared Rx queue
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
@ 2021-10-21 19:45     ` Ajit Khaparde
  0 siblings, 0 replies; 266+ messages in thread
From: Ajit Khaparde @ 2021-10-21 19:45 UTC (permalink / raw)
  To: Xueming Li
  Cc: dpdk-dev, Zhang Yuying, Li Xiaoyun, Jerin Jacob, Ferruh Yigit,
	Andrew Rybchenko, Viacheslav Ovsiienko, Thomas Monjalon,
	Lior Margalit, Ananyev Konstantin

On Thu, Oct 21, 2021 at 3:42 AM Xueming Li <xuemingl@nvidia.com> wrote:
>
> Adds "--rxq-share=X" parameter to enable shared RxQ.
>
> An Rx queue is shared if the device supports it; otherwise it falls
> back to a standard RxQ.
>
> Shared Rx queues are grouped per X ports. X defaults to UINT32_MAX,
> which implies all ports join share group 1. Queue IDs are mapped 1:1
> to shared Rx queue IDs.
>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Acked-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>

> ---
>  app/test-pmd/config.c                 |  7 ++++++-
>  app/test-pmd/parameters.c             | 13 +++++++++++++
>  app/test-pmd/testpmd.c                | 20 +++++++++++++++++---
>  app/test-pmd/testpmd.h                |  2 ++
>  doc/guides/testpmd_app_ug/run_app.rst |  6 ++++++
>  5 files changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
> index db36ca41b94..e4bbf457916 100644
> --- a/app/test-pmd/config.c
> +++ b/app/test-pmd/config.c
> @@ -2890,7 +2890,12 @@ rxtx_config_display(void)
>                         printf("      RX threshold registers: pthresh=%d hthresh=%d "
>                                 " wthresh=%d\n",
>                                 pthresh_tmp, hthresh_tmp, wthresh_tmp);
> -                       printf("      RX Offloads=0x%"PRIx64"\n", offloads_tmp);
> +                       printf("      RX Offloads=0x%"PRIx64, offloads_tmp);
> +                       if (rx_conf->share_group > 0)
> +                               printf(" share_group=%u share_qid=%u",
> +                                      rx_conf->share_group,
> +                                      rx_conf->share_qid);
> +                       printf("\n");
>                 }
>
>                 /* per tx queue config only for first queue to be less verbose */
> diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
> index 779a721fa05..afc75f6bd21 100644
> --- a/app/test-pmd/parameters.c
> +++ b/app/test-pmd/parameters.c
> @@ -171,6 +171,7 @@ usage(char* progname)
>         printf("  --tx-ip=src,dst: IP addresses in Tx-only mode\n");
>         printf("  --tx-udp=src[,dst]: UDP ports in Tx-only mode\n");
>         printf("  --eth-link-speed: force link speed.\n");
> +       printf("  --rxq-share=X: number of ports per shared Rx queue groups, defaults to UINT32_MAX (1 group)\n");
>         printf("  --disable-link-check: disable check on link status when "
>                "starting/stopping ports.\n");
>         printf("  --disable-device-start: do not automatically start port\n");
> @@ -678,6 +679,7 @@ launch_args_parse(int argc, char** argv)
>                 { "rxpkts",                     1, 0, 0 },
>                 { "txpkts",                     1, 0, 0 },
>                 { "txonly-multi-flow",          0, 0, 0 },
> +               { "rxq-share",                  2, 0, 0 },
>                 { "eth-link-speed",             1, 0, 0 },
>                 { "disable-link-check",         0, 0, 0 },
>                 { "disable-device-start",       0, 0, 0 },
> @@ -1352,6 +1354,17 @@ launch_args_parse(int argc, char** argv)
>                         }
>                         if (!strcmp(lgopts[opt_idx].name, "txonly-multi-flow"))
>                                 txonly_multi_flow = 1;
> +                       if (!strcmp(lgopts[opt_idx].name, "rxq-share")) {
> +                               if (optarg == NULL) {
> +                                       rxq_share = UINT32_MAX;
> +                               } else {
> +                                       n = atoi(optarg);
> +                                       if (n >= 0)
> +                                               rxq_share = (uint32_t)n;
> +                                       else
> +                                               rte_exit(EXIT_FAILURE, "rxq-share must be >= 0\n");
> +                               }
> +                       }
>                         if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
>                                 no_flush_rx = 1;
>                         if (!strcmp(lgopts[opt_idx].name, "eth-link-speed")) {
> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> index af0e79fe6d5..80337bad382 100644
> --- a/app/test-pmd/testpmd.c
> +++ b/app/test-pmd/testpmd.c
> @@ -502,6 +502,11 @@ uint8_t record_core_cycles;
>   */
>  uint8_t record_burst_stats;
>
> +/*
> + * Number of ports per shared Rx queue group; 0 disables sharing.
> + */
> +uint32_t rxq_share;
> +
>  unsigned int num_sockets = 0;
>  unsigned int socket_ids[RTE_MAX_NUMA_NODES];
>
> @@ -3629,14 +3634,23 @@ dev_event_callback(const char *device_name, enum rte_dev_event_type type,
>  }
>
>  static void
> -rxtx_port_config(struct rte_port *port)
> +rxtx_port_config(portid_t pid)
>  {
>         uint16_t qid;
>         uint64_t offloads;
> +       struct rte_port *port = &ports[pid];
>
>         for (qid = 0; qid < nb_rxq; qid++) {
>                 offloads = port->rx_conf[qid].offloads;
>                 port->rx_conf[qid] = port->dev_info.default_rxconf;
> +
> +               if (rxq_share > 0 &&
> +                   (port->dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE)) {
> +                       /* Non-zero share group to enable RxQ share. */
> +                       port->rx_conf[qid].share_group = pid / rxq_share + 1;
> +                       port->rx_conf[qid].share_qid = qid; /* Equal mapping. */
> +               }
> +
>                 if (offloads != 0)
>                         port->rx_conf[qid].offloads = offloads;
>
> @@ -3765,7 +3779,7 @@ init_port_config(void)
>                         }
>                 }
>
> -               rxtx_port_config(port);
> +               rxtx_port_config(pid);
>
>                 ret = eth_macaddr_get_print_err(pid, &port->eth_addr);
>                 if (ret != 0)
> @@ -3977,7 +3991,7 @@ init_port_dcb_config(portid_t pid,
>
>         memcpy(&rte_port->dev_conf, &port_conf, sizeof(struct rte_eth_conf));
>
> -       rxtx_port_config(rte_port);
> +       rxtx_port_config(pid);
>         /* VLAN filter */
>         rte_port->dev_conf.rxmode.offloads |= DEV_RX_OFFLOAD_VLAN_FILTER;
>         for (i = 0; i < RTE_DIM(vlan_tags); i++)
> diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> index e3995d24ab5..63f9913deb6 100644
> --- a/app/test-pmd/testpmd.h
> +++ b/app/test-pmd/testpmd.h
> @@ -524,6 +524,8 @@ extern enum tx_pkt_split tx_pkt_split;
>
>  extern uint8_t txonly_multi_flow;
>
> +extern uint32_t rxq_share;
> +
>  extern uint16_t nb_pkt_per_burst;
>  extern uint16_t nb_pkt_flowgen_clones;
>  extern int nb_flows_flowgen;
> diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
> index 8ff7ab85369..faa3efb902c 100644
> --- a/doc/guides/testpmd_app_ug/run_app.rst
> +++ b/doc/guides/testpmd_app_ug/run_app.rst
> @@ -395,6 +395,12 @@ The command line options are:
>
>      Generate multiple flows in txonly mode.
>
> +*   ``--rxq-share=[X]``
> +
> +    Create queues in shared Rx queue mode if the device supports it.
> +    Shared Rx queues are grouped per X ports. X defaults to UINT32_MAX,
> +    which implies all ports join share group 1.
> +
>  *   ``--eth-link-speed``
>
>      Set a forced link speed to the ethernet port::
> --
> 2.33.0
>
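
A usage illustration (hypothetical command line; the device allow-list
and representor devargs depend on the actual setup): with the option
above,

	dpdk-testpmd -a <PF_PCI_addr>,representor=[0-3] -- -i --rxq-share=2

places every two consecutive ports whose device reports
RTE_ETH_DEV_CAPA_RXQ_SHARE into the same share group (share_group =
pid / 2 + 1, per the rxtx_port_config() change above) with a 1:1 queue
mapping (share_qid = qid).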

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v13 5/7] app/testpmd: dump port info for shared Rx queue
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 5/7] app/testpmd: dump port info for " Xueming Li
@ 2021-10-21 19:48     ` Ajit Khaparde
  0 siblings, 0 replies; 266+ messages in thread
From: Ajit Khaparde @ 2021-10-21 19:48 UTC (permalink / raw)
  To: Xueming Li
  Cc: dpdk-dev, Zhang Yuying, Li Xiaoyun, Jerin Jacob, Ferruh Yigit,
	Andrew Rybchenko, Viacheslav Ovsiienko, Thomas Monjalon,
	Lior Margalit, Ananyev Konstantin

On Thu, Oct 21, 2021 at 3:42 AM Xueming Li <xuemingl@nvidia.com> wrote:
>
> In case of a shared Rx queue, the source port of an mbuf from the
> polling result isn't necessarily the Rx port of the forwarding stream.
> To provide the original port ID, this patch dumps mbuf->port for each
> packet in verbose mode if the shared Rx queue is enabled.
>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Acked-by: Xiaoyun Li <xiaoyun.li@intel.com>

Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>

> ---
>  app/test-pmd/util.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
> index 26dc0c86406..f712f687287 100644
> --- a/app/test-pmd/util.c
> +++ b/app/test-pmd/util.c
> @@ -101,6 +101,9 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
>                 struct rte_port *port = &ports[port_id];
>
>                 mb = pkts[i];
> +               if (rxq_share > 0)
> +                       MKDUMPSTR(print_buf, buf_size, cur_len, "port %u, ",
> +                                 mb->port);
>                 eth_hdr = rte_pktmbuf_read(mb, 0, sizeof(_eth_hdr), &_eth_hdr);
>                 eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
>                 packet_type = mb->packet_type;
> --
> 2.33.0
>
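
Outside testpmd, an application polling a shared Rx queue can recover
the packet origin the same way; a minimal sketch (member_port,
per_port_stats and MAX_PKT_BURST are illustrative names, not part of
this patch):

	struct rte_mbuf *pkts[MAX_PKT_BURST];
	uint16_t nb, i;

	nb = rte_eth_rx_burst(member_port, queue_id, pkts, MAX_PKT_BURST);
	for (i = 0; i < nb; i++) {
		/* With a shared Rx queue, pkts[i]->port is the real
		 * source port and may differ from member_port. */
		per_port_stats[pkts[i]->port].rx_packets++;
	}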

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v13 0/7] ethdev: introduce shared Rx queue
  2021-10-21 10:41 ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Xueming Li
                     ` (6 preceding siblings ...)
  2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
@ 2021-10-21 23:41   ` Ferruh Yigit
  2021-10-22  6:31     ` Xueming(Steven) Li
  2021-11-04 15:52   ` Tom Barbette
  8 siblings, 1 reply; 266+ messages in thread
From: Ferruh Yigit @ 2021-10-21 23:41 UTC (permalink / raw)
  To: Xueming Li, dev, Zhang Yuying, Li Xiaoyun
  Cc: Jerin Jacob, Andrew Rybchenko, Viacheslav Ovsiienko,
	Thomas Monjalon, Lior Margalit, Ananyev Konstantin,
	Ajit Khaparde

On 10/21/2021 11:41 AM, Xueming Li wrote:
> In the current DPDK framework, all Rx queues are pre-loaded with mbufs
> for incoming packets. When the number of representors scales out in a
> switch domain, the memory consumption becomes significant. Furthermore,
> polling all ports leads to high cache miss rates, high latency and low
> throughput.
> 
> This patch introduces shared Rx queue. PF and representors in the same
> Rx domain and switch domain can share an Rx queue set by specifying a
> non-zero share group value in the Rx queue configuration.
> 
> All ports that share an Rx queue actually share the hardware descriptor
> queue and feed all Rx queues from one descriptor supply, so memory is saved.
> 
> Polling any queue of the same shared Rx queue receives packets from all
> member ports. The source port is identified by mbuf->port.
> 
> Multiple groups are supported by group ID. The queue number of ports in
> a shared group should be identical. Queue indexes are 1:1 mapped in a
> shared group.
> An example of two share groups:
>   Group1, 4 shared Rx queues per member port: PF, repr0, repr1
>   Group2, 2 shared Rx queues per member port: repr2, repr3, ... repr127
>   Poll first port for each group:
>    core	port	queue
>    0	0	0
>    1	0	1
>    2	0	2
>    3	0	3
>    4	2	0
>    5	2	1
> 
> A shared Rx queue must be polled on a single thread or core. If both PF0
> and representor0 joined the same share group, pf0rxq0 can't be polled on
> core1 and rep0rxq0 on core2. Actually, polling one port within a share
> group is sufficient, since polling any port in the group will return
> packets for any port in the group.
> 
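For reference, from the application side the feature reduces to the two
new rte_eth_rxconf fields; a minimal setup sketch assuming the device
advertises RTE_ETH_DEV_CAPA_RXQ_SHARE (error handling omitted):

	struct rte_eth_dev_info info;
	struct rte_eth_rxconf rxconf;

	rte_eth_dev_info_get(port_id, &info);
	rxconf = info.default_rxconf;
	if (info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) {
		rxconf.share_group = 1; /* non-zero group enables sharing */
		rxconf.share_qid = qid; /* 1:1 queue mapping within group */
	}
	rte_eth_rx_queue_setup(port_id, qid, nb_desc, socket_id, &rxconf, mp);

All member ports of the group must then be polled from the same core, as
described above.
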
> There was some discussion about aggregating the member ports of the same
> group into a dummy port, with several ways to achieve it. Since it is
> optional, more feedback and requirements need to be collected from users
> to make a better decision later.
> 
> v1:
>    - initial version
> v2:
>    - add testpmd patches
> v3:
>    - change common forwarding api to macro for performance, thanks Jerin.
>    - save global variable accessed in forwarding to flowstream to minimize
>      cache miss
>    - combined patches for each forwarding engine
>    - support multiple groups in testpmd "--share-rxq" parameter
>    - new api to aggregate shared rxq group
> v4:
>    - spelling fixes
>    - remove shared-rxq support for all forwarding engines
>    - add dedicate shared-rxq forwarding engine
> v5:
>   - fix grammars
>   - remove aggregate api and leave it for later discussion
>   - add release notes
>   - add deployment example
> v6:
>   - replace RxQ offload flag with device offload capability flag
>   - add Rx domain
>   - RxQ is shared when share group > 0
>   - update testpmd accordingly
> v7:
>   - fix testpmd share group id allocation
>   - change rx_domain to 16bits
> v8:
>   - add new patch for testpmd to show device Rx domain ID and capability
>   - new share_qid in RxQ configuration
> v9:
>   - fix some spelling
> v10:
>   - add device capability name api
> v11:
>   - remove macro from device capability name list
> v12:
>   - rephrase
>   - in forwarding core check, add  global flag and RxQ enabled check
> v13:
>   - update imports of new forwarding engine
>   - rephrase
> 
> Xueming Li (7):
>    ethdev: introduce shared Rx queue
>    ethdev: get device capability name as string
>    app/testpmd: dump device capability and Rx domain info
>    app/testpmd: new parameter to enable shared Rx queue
>    app/testpmd: dump port info for shared Rx queue
>    app/testpmd: force shared Rx queue polled on same core
>    app/testpmd: add forwarding engine for shared Rx queue
> 

This patch is changing some common ethdev structs for a use case I am
not sure how common it is. I would like to see more reviews from more
vendors, but we didn't get them; at this stage I will proceed based on
Andrew's review.

Since only NVIDIA will be able to test this feature in this release, can
you please make sure the NVIDIA test report contains this feature? To be
sure the feature is tested by at least one vendor.


Series applied to dpdk-next-net/main, thanks.

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v13 0/7] ethdev: introduce shared Rx queue
  2021-10-21 23:41   ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Ferruh Yigit
@ 2021-10-22  6:31     ` Xueming(Steven) Li
  0 siblings, 0 replies; 266+ messages in thread
From: Xueming(Steven) Li @ 2021-10-22  6:31 UTC (permalink / raw)
  To: yuying.zhang, xiaoyun.li, Raslan Darawsheh, dev, ferruh.yigit,
	Ali Alnubani
  Cc: jerinjacobk, NBU-Contact-Thomas Monjalon, andrew.rybchenko,
	Slava Ovsiienko, konstantin.ananyev, ajit.khaparde,
	Lior Margalit

On Fri, 2021-10-22 at 00:41 +0100, Ferruh Yigit wrote:
> On 10/21/2021 11:41 AM, Xueming Li wrote:
> > [...]
> 
> This patch is changing some common ethdev structs for a use case I am
> not sure how common it is. I would like to see more reviews from more
> vendors, but we didn't get them; at this stage I will proceed based on
> Andrew's review.
> 
> Since only NVIDIA will be able to test this feature in this release, can
> you please make sure the NVIDIA test report contains this feature? To be
> sure the feature is tested by at least one vendor.
> 
> 
> Series applied to dpdk-next-net/main, thanks.

Hi Ferruh,

Thanks very much for your help!

+Raslan, Ali
Let's make sure the test report contains this feature.

Best Regards,
Xueming Li

^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 00/14] net/mlx5: support shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (14 preceding siblings ...)
  2021-10-21 10:41 ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Xueming Li
@ 2021-11-03  7:58 ` Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 01/14] common/mlx5: introduce user index field in completion Xueming Li
                     ` (13 more replies)
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
  16 siblings, 14 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit

Implementation of shared Rx queue.

Depends-on: series-20232 ("Flow entities behavior on port restart")

v1:
- initial version
v2:
- rebased on latest dependent series
- fully tested
v3:
- support share_qid of RxQ configuration
v4:
- internally reviewed
- removed MPRQ support
- fixed multi-segment support
- fixed configure not applied after port restart

Viacheslav Ovsiienko (1):
  net/mlx5: add shared Rx queue port datapath support

Xueming Li (13):
  common/mlx5: introduce user index field in completion
  net/mlx5: fix field reference for PPC
  common/mlx5: adds basic receive memory pool support
  common/mlx5: support receive memory pool
  net/mlx5: fix Rx queue memory allocation return value
  net/mlx5: clean Rx queue code
  net/mlx5: split Rx queue into shareable and private
  net/mlx5: move Rx queue reference count
  net/mlx5: move Rx queue hairpin info to private data
  net/mlx5: remove port info from shareable Rx queue
  net/mlx5: move Rx queue DevX resource
  net/mlx5: remove Rx queue data list from device
  net/mlx5: support shared Rx queue

 doc/guides/nics/features/mlx5.ini        |   1 +
 doc/guides/nics/mlx5.rst                 |   6 +
 drivers/common/mlx5/mlx5_common_devx.c   | 295 +++++++++--
 drivers/common/mlx5/mlx5_common_devx.h   |  19 +-
 drivers/common/mlx5/mlx5_devx_cmds.c     |  52 ++
 drivers/common/mlx5/mlx5_devx_cmds.h     |  16 +
 drivers/common/mlx5/mlx5_prm.h           |  93 +++-
 drivers/common/mlx5/version.map          |   1 +
 drivers/net/mlx5/linux/mlx5_os.c         |   2 +
 drivers/net/mlx5/linux/mlx5_verbs.c      | 169 +++---
 drivers/net/mlx5/mlx5.c                  |  10 +-
 drivers/net/mlx5/mlx5.h                  |  17 +-
 drivers/net/mlx5/mlx5_devx.c             | 270 +++++-----
 drivers/net/mlx5/mlx5_ethdev.c           |  21 +-
 drivers/net/mlx5/mlx5_flow.c             |  47 +-
 drivers/net/mlx5/mlx5_rss.c              |   6 +-
 drivers/net/mlx5/mlx5_rx.c               |  31 +-
 drivers/net/mlx5/mlx5_rx.h               |  45 +-
 drivers/net/mlx5/mlx5_rxq.c              | 631 +++++++++++++++++------
 drivers/net/mlx5/mlx5_rxtx.c             |   6 +-
 drivers/net/mlx5/mlx5_rxtx_vec.c         |   8 +-
 drivers/net/mlx5/mlx5_rxtx_vec_altivec.h |  14 +-
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h    |  12 +-
 drivers/net/mlx5/mlx5_rxtx_vec_sse.h     |   8 +-
 drivers/net/mlx5/mlx5_stats.c            |   9 +-
 drivers/net/mlx5/mlx5_trigger.c          | 155 +++---
 drivers/net/mlx5/mlx5_vlan.c             |  16 +-
 drivers/regex/mlx5/mlx5_regex_fastpath.c |   2 +-
 28 files changed, 1378 insertions(+), 584 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 01/14] common/mlx5: introduce user index field in completion
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
@ 2021-11-03  7:58   ` Xueming Li
  2021-11-04  9:14     ` Slava Ovsiienko
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 02/14] net/mlx5: fix field reference for PPC Xueming Li
                     ` (12 subsequent siblings)
  13 siblings, 1 reply; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Matan Azrad, Viacheslav Ovsiienko, Ori Kam

On ConnectX devices the completion entry provides a dedicated 24-bit
field that is filled with a static value assigned at Receive Queue
creation time. This patch declares this field. It is a preparation step
for supporting shared RQs; the field is supposed to provide the actual
port index while handling the shared receive queue(s).


Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 drivers/common/mlx5/mlx5_prm.h           | 8 +++++++-
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 2 +-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index eab80eaead9..53931ebf1cc 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -393,7 +393,13 @@ struct mlx5_cqe {
 	uint16_t hdr_type_etc;
 	uint16_t vlan_info;
 	uint8_t lro_num_seg;
-	uint8_t rsvd3[3];
+	union {
+		uint8_t user_index_bytes[3];
+		struct {
+			uint8_t user_index_hi;
+			uint16_t user_index_low;
+		} __rte_packed;
+	};
 	uint32_t flow_table_metadata;
 	uint8_t rsvd4[4];
 	uint32_t byte_cnt;
diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c b/drivers/regex/mlx5/mlx5_regex_fastpath.c
index adb5343a46b..6836203ecf2 100644
--- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
+++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
@@ -559,7 +559,7 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 		uint16_t wq_counter
 			= (rte_be_to_cpu_16(cqe->wqe_counter) + 1) &
 			  MLX5_REGEX_MAX_WQE_INDEX;
-		size_t hw_qpid = cqe->rsvd3[2];
+		size_t hw_qpid = cqe->user_index_bytes[2];
 		struct mlx5_regex_hw_qp *qp_obj = &queue->qps[hw_qpid];
 
 		/* UMR mode WQE counter move as WQE set(4 WQEBBS).*/
-- 
2.33.0
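
For context, a consumer of the new field would reassemble the 24-bit
value from the union members; a sketch (the helper name is hypothetical,
and big-endian byte order is assumed, as for other multi-byte CQE
fields):

	/* Hypothetical helper: rebuild the 24-bit user index of a CQE. */
	static inline uint32_t
	mlx5_cqe_user_index(volatile struct mlx5_cqe *cqe)
	{
		return ((uint32_t)cqe->user_index_hi << 16) |
		       rte_be_to_cpu_16(cqe->user_index_low);
	}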


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 02/14] net/mlx5: fix field reference for PPC
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 01/14] common/mlx5: introduce user index field in completion Xueming Li
@ 2021-11-03  7:58   ` Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 03/14] common/mlx5: adds basic receive memory pool support Xueming Li
                     ` (11 subsequent siblings)
  13 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Lior Margalit, viacheslavo, stable, David Christensen,
	Matan Azrad, Yongseok Koh

This patch fixes a stale field reference: the PPC vectorized Rx path
still addressed the old rsvd3 bytes instead of rsvd4.

Fixes: a18ac6113331 ("net/mlx5: add metadata support to Rx datapath")
Cc: viacheslavo@nvidia.com
Cc: stable@dpdk.org

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 drivers/net/mlx5/mlx5_rxtx_vec_altivec.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
index bcf487c34e9..1d00c1c43d1 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
@@ -974,10 +974,10 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 			(vector unsigned short)cqe_tmp1, cqe_sel_mask1);
 		cqe_tmp2 = (vector unsigned char)(vector unsigned long){
 			*(__rte_aligned(8) unsigned long *)
-			&cq[pos + p3].rsvd3[9], 0LL};
+			&cq[pos + p3].rsvd4[2], 0LL};
 		cqe_tmp1 = (vector unsigned char)(vector unsigned long){
 			*(__rte_aligned(8) unsigned long *)
-			&cq[pos + p2].rsvd3[9], 0LL};
+			&cq[pos + p2].rsvd4[2], 0LL};
 		cqes[3] = (vector unsigned char)
 			vec_sel((vector unsigned short)cqes[3],
 			(vector unsigned short)cqe_tmp2,
@@ -1037,10 +1037,10 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 			(vector unsigned short)cqe_tmp1, cqe_sel_mask1);
 		cqe_tmp2 = (vector unsigned char)(vector unsigned long){
 			*(__rte_aligned(8) unsigned long *)
-			&cq[pos + p1].rsvd3[9], 0LL};
+			&cq[pos + p1].rsvd4[2], 0LL};
 		cqe_tmp1 = (vector unsigned char)(vector unsigned long){
 			*(__rte_aligned(8) unsigned long *)
-			&cq[pos].rsvd3[9], 0LL};
+			&cq[pos].rsvd4[2], 0LL};
 		cqes[1] = (vector unsigned char)
 			vec_sel((vector unsigned short)cqes[1],
 			(vector unsigned short)cqe_tmp2, cqe_sel_mask2);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 03/14] common/mlx5: adds basic receive memory pool support
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 01/14] common/mlx5: introduce user index field in completion Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 02/14] net/mlx5: fix field reference for PPC Xueming Li
@ 2021-11-03  7:58   ` Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 04/14] common/mlx5: support receive memory pool Xueming Li
                     ` (10 subsequent siblings)
  13 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Lior Margalit, Matan Azrad, Viacheslav Ovsiienko, Ray Kinsella

The hardware Receive Memory Pool (RMP) object holds the destination for
incoming packets/messages that are routed to the RMP through RQs. RMP
enables sharing of memory across multiple Receive Queues. Multiple
Receive Queues can be attached to the same RMP and consume memory
from that shared pool. When using RMPs, completions are reported to the
CQ pointed to by the RQ, and this Completion Queue can be shared as
well.

This patch adds DevX support for the PRM RMP object.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 drivers/common/mlx5/mlx5_devx_cmds.c | 52 +++++++++++++++++
 drivers/common/mlx5/mlx5_devx_cmds.h | 16 ++++++
 drivers/common/mlx5/mlx5_prm.h       | 85 +++++++++++++++++++++++++++-
 drivers/common/mlx5/version.map      |  1 +
 4 files changed, 153 insertions(+), 1 deletion(-)

diff --git a/drivers/common/mlx5/mlx5_devx_cmds.c b/drivers/common/mlx5/mlx5_devx_cmds.c
index fb7c8e986f8..119641df470 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.c
+++ b/drivers/common/mlx5/mlx5_devx_cmds.c
@@ -766,6 +766,8 @@ mlx5_devx_cmd_query_hca_attr(void *ctx,
 			MLX5_GET(cmd_hca_cap, hcattr, flow_counter_bulk_alloc);
 	attr->flow_counters_dump = MLX5_GET(cmd_hca_cap, hcattr,
 					    flow_counters_dump);
+	attr->log_max_rmp = MLX5_GET(cmd_hca_cap, hcattr, log_max_rmp);
+	attr->mem_rq_rmp = MLX5_GET(cmd_hca_cap, hcattr, mem_rq_rmp);
 	attr->log_max_rqt_size = MLX5_GET(cmd_hca_cap, hcattr,
 					  log_max_rqt_size);
 	attr->eswitch_manager = MLX5_GET(cmd_hca_cap, hcattr, eswitch_manager);
@@ -1277,6 +1279,56 @@ mlx5_devx_cmd_modify_rq(struct mlx5_devx_obj *rq,
 }
 
 /**
+ * Create RMP using DevX API.
+ *
+ * @param[in] ctx
+ *   Context returned from mlx5 open_device() glue function.
+ * @param [in] rmp_attr
+ *   Pointer to create RMP attributes structure.
+ * @param [in] socket
+ *   CPU socket ID for allocations.
+ *
+ * @return
+ *   The DevX object created, NULL otherwise and rte_errno is set.
+ */
+struct mlx5_devx_obj *
+mlx5_devx_cmd_create_rmp(void *ctx,
+			 struct mlx5_devx_create_rmp_attr *rmp_attr,
+			 int socket)
+{
+	uint32_t in[MLX5_ST_SZ_DW(create_rmp_in)] = {0};
+	uint32_t out[MLX5_ST_SZ_DW(create_rmp_out)] = {0};
+	void *rmp_ctx, *wq_ctx;
+	struct mlx5_devx_wq_attr *wq_attr;
+	struct mlx5_devx_obj *rmp = NULL;
+
+	rmp = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rmp), 0, socket);
+	if (!rmp) {
+		DRV_LOG(ERR, "Failed to allocate RMP data");
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+	MLX5_SET(create_rmp_in, in, opcode, MLX5_CMD_OP_CREATE_RMP);
+	rmp_ctx = MLX5_ADDR_OF(create_rmp_in, in, ctx);
+	MLX5_SET(rmpc, rmp_ctx, state, rmp_attr->state);
+	MLX5_SET(rmpc, rmp_ctx, basic_cyclic_rcv_wqe,
+		 rmp_attr->basic_cyclic_rcv_wqe);
+	wq_ctx = MLX5_ADDR_OF(rmpc, rmp_ctx, wq);
+	wq_attr = &rmp_attr->wq_attr;
+	devx_cmd_fill_wq_data(wq_ctx, wq_attr);
+	rmp->obj = mlx5_glue->devx_obj_create(ctx, in, sizeof(in), out,
+					      sizeof(out));
+	if (!rmp->obj) {
+		DRV_LOG(ERR, "Failed to create RMP using DevX");
+		rte_errno = errno;
+		mlx5_free(rmp);
+		return NULL;
+	}
+	rmp->id = MLX5_GET(create_rmp_out, out, rmpn);
+	return rmp;
+}
+
 /**
  * Create TIR using DevX API.
  *
  * @param[in] ctx
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.h b/drivers/common/mlx5/mlx5_devx_cmds.h
index 80b5dca1eb4..5759c4c9473 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.h
+++ b/drivers/common/mlx5/mlx5_devx_cmds.h
@@ -93,6 +93,8 @@ struct mlx5_hca_flow_attr {
 struct mlx5_hca_attr {
 	uint32_t eswitch_manager:1;
 	uint32_t flow_counters_dump:1;
+	uint32_t mem_rq_rmp:1;
+	uint32_t log_max_rmp:5;
 	uint32_t log_max_rqt_size:5;
 	uint32_t parse_graph_flex_node:1;
 	uint8_t flow_counter_bulk_alloc_bitmap;
@@ -259,6 +261,17 @@ struct mlx5_devx_modify_rq_attr {
 	uint32_t lwm:16; /* Contained WQ lwm. */
 };
 
+/* Create RMP attributes structure, used by create RMP operation. */
+struct mlx5_devx_create_rmp_attr {
+	uint32_t rsvd0:8;
+	uint32_t state:4;
+	uint32_t rsvd1:20;
+	uint32_t basic_cyclic_rcv_wqe:1;
+	uint32_t rsvd4:31;
+	uint32_t rsvd8[10];
+	struct mlx5_devx_wq_attr wq_attr;
+};
+
 struct mlx5_rx_hash_field_select {
 	uint32_t l3_prot_type:1;
 	uint32_t l4_prot_type:1;
@@ -536,6 +549,9 @@ __rte_internal
 int mlx5_devx_cmd_modify_rq(struct mlx5_devx_obj *rq,
 			    struct mlx5_devx_modify_rq_attr *rq_attr);
 __rte_internal
+struct mlx5_devx_obj *mlx5_devx_cmd_create_rmp(void *ctx,
+			struct mlx5_devx_create_rmp_attr *rq_attr, int socket);
+__rte_internal
 struct mlx5_devx_obj *mlx5_devx_cmd_create_tir(void *ctx,
 					   struct mlx5_devx_tir_attr *tir_attr);
 __rte_internal
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 53931ebf1cc..7063b195ff4 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -1062,6 +1062,10 @@ enum {
 	MLX5_CMD_OP_CREATE_RQ = 0x908,
 	MLX5_CMD_OP_MODIFY_RQ = 0x909,
 	MLX5_CMD_OP_QUERY_RQ = 0x90b,
+	MLX5_CMD_OP_CREATE_RMP = 0x90c,
+	MLX5_CMD_OP_MODIFY_RMP = 0x90d,
+	MLX5_CMD_OP_DESTROY_RMP = 0x90e,
+	MLX5_CMD_OP_QUERY_RMP = 0x90f,
 	MLX5_CMD_OP_CREATE_TIS = 0x912,
 	MLX5_CMD_OP_QUERY_TIS = 0x915,
 	MLX5_CMD_OP_CREATE_RQT = 0x916,
@@ -1561,7 +1565,8 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8 reserved_at_378[0x3];
 	u8 log_max_tis[0x5];
 	u8 basic_cyclic_rcv_wqe[0x1];
-	u8 reserved_at_381[0x2];
+	u8 reserved_at_381[0x1];
+	u8 mem_rq_rmp[0x1];
 	u8 log_max_rmp[0x5];
 	u8 reserved_at_388[0x3];
 	u8 log_max_rqt[0x5];
@@ -2209,6 +2214,84 @@ struct mlx5_ifc_query_rq_in_bits {
 	u8 reserved_at_60[0x20];
 };
 
+enum {
+	MLX5_RMPC_STATE_RDY = 0x1,
+	MLX5_RMPC_STATE_ERR = 0x3,
+};
+
+struct mlx5_ifc_rmpc_bits {
+	u8 reserved_at_0[0x8];
+	u8 state[0x4];
+	u8 reserved_at_c[0x14];
+	u8 basic_cyclic_rcv_wqe[0x1];
+	u8 reserved_at_21[0x1f];
+	u8 reserved_at_40[0x140];
+	struct mlx5_ifc_wq_bits wq;
+};
+
+struct mlx5_ifc_query_rmp_out_bits {
+	u8 status[0x8];
+	u8 reserved_at_8[0x18];
+	u8 syndrome[0x20];
+	u8 reserved_at_40[0xc0];
+	struct mlx5_ifc_rmpc_bits rmp_context;
+};
+
+struct mlx5_ifc_query_rmp_in_bits {
+	u8 opcode[0x10];
+	u8 reserved_at_10[0x10];
+	u8 reserved_at_20[0x10];
+	u8 op_mod[0x10];
+	u8 reserved_at_40[0x8];
+	u8 rmpn[0x18];
+	u8 reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_modify_rmp_out_bits {
+	u8 status[0x8];
+	u8 reserved_at_8[0x18];
+	u8 syndrome[0x20];
+	u8 reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_rmp_bitmask_bits {
+	u8 reserved_at_0[0x20];
+	u8 reserved_at_20[0x1f];
+	u8 lwm[0x1];
+};
+
+struct mlx5_ifc_modify_rmp_in_bits {
+	u8 opcode[0x10];
+	u8 uid[0x10];
+	u8 reserved_at_20[0x10];
+	u8 op_mod[0x10];
+	u8 rmp_state[0x4];
+	u8 reserved_at_44[0x4];
+	u8 rmpn[0x18];
+	u8 reserved_at_60[0x20];
+	struct mlx5_ifc_rmp_bitmask_bits bitmask;
+	u8 reserved_at_c0[0x40];
+	struct mlx5_ifc_rmpc_bits ctx;
+};
+
+struct mlx5_ifc_create_rmp_out_bits {
+	u8 status[0x8];
+	u8 reserved_at_8[0x18];
+	u8 syndrome[0x20];
+	u8 reserved_at_40[0x8];
+	u8 rmpn[0x18];
+	u8 reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_create_rmp_in_bits {
+	u8 opcode[0x10];
+	u8 uid[0x10];
+	u8 reserved_at_20[0x10];
+	u8 op_mod[0x10];
+	u8 reserved_at_40[0xc0];
+	struct mlx5_ifc_rmpc_bits ctx;
+};
+
 struct mlx5_ifc_create_tis_out_bits {
 	u8 status[0x8];
 	u8 reserved_at_8[0x18];
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index 0ea8325f9ac..7265ff8c56f 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -30,6 +30,7 @@ INTERNAL {
 	mlx5_devx_cmd_create_geneve_tlv_option;
 	mlx5_devx_cmd_create_import_kek_obj;
 	mlx5_devx_cmd_create_qp;
+	mlx5_devx_cmd_create_rmp;
 	mlx5_devx_cmd_create_rq;
 	mlx5_devx_cmd_create_rqt;
 	mlx5_devx_cmd_create_sq;
-- 
2.33.0
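
A condensed usage sketch of the new command (hypothetical values; a real
caller must fill wq_attr with the shared WQ umem, doorbell and log sizes
exactly as for a regular RQ):

	struct mlx5_devx_create_rmp_attr rmp_attr = {
		.state = MLX5_RMPC_STATE_RDY,
		.basic_cyclic_rcv_wqe = 1,
		/* .wq_attr describes the shared WQ memory and doorbell. */
	};
	struct mlx5_devx_obj *rmp;

	rmp = mlx5_devx_cmd_create_rmp(ctx, &rmp_attr, SOCKET_ID_ANY);
	if (rmp == NULL)
		return -rte_errno; /* set by the command on failure */

RQs are attached to the pool afterwards by creating them with
mem_rq_type set to MLX5_RQC_MEM_RQ_TYPE_MEMORY_RQ_RMP and rmpn set to
rmp->id, as the next patch does.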


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 04/14] common/mlx5: support receive memory pool
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
                     ` (2 preceding siblings ...)
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 03/14] common/mlx5: adds basic receive memory pool support Xueming Li
@ 2021-11-03  7:58   ` Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 05/14] net/mlx5: fix Rx queue memory allocation return value Xueming Li
                     ` (9 subsequent siblings)
  13 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Matan Azrad, Viacheslav Ovsiienko

The hardware Receive Memory Pool (RMP) object holds the destination for
incoming packets/messages that are routed to the RMP through RQs. RMP
enables sharing of memory across multiple Receive Queues. Multiple
Receive Queues can be attached to the same RMP and consume memory
from that shared pool. When using RMPs, completions are reported to the
CQ pointed to by the RQ, and the user index set at RQ creation time is
carried in the completion entry.

This patch enables RMP-based RQs; the RMP is created when
mlx5_devx_rq.rmp is set.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 drivers/common/mlx5/mlx5_common_devx.c | 295 +++++++++++++++++++++----
 drivers/common/mlx5/mlx5_common_devx.h |  19 +-
 drivers/net/mlx5/mlx5_devx.c           |   4 +-
 3 files changed, 271 insertions(+), 47 deletions(-)

diff --git a/drivers/common/mlx5/mlx5_common_devx.c b/drivers/common/mlx5/mlx5_common_devx.c
index 825f84b1833..85b5282061a 100644
--- a/drivers/common/mlx5/mlx5_common_devx.c
+++ b/drivers/common/mlx5/mlx5_common_devx.c
@@ -271,6 +271,39 @@ mlx5_devx_sq_create(void *ctx, struct mlx5_devx_sq *sq_obj, uint16_t log_wqbb_n,
 	return -rte_errno;
 }
 
+/**
+ * Destroy DevX Receive Queue resources.
+ *
+ * @param[in] rq_res
+ *   DevX RQ resource to destroy.
+ */
+static void
+mlx5_devx_wq_res_destroy(struct mlx5_devx_wq_res *rq_res)
+{
+	if (rq_res->umem_obj)
+		claim_zero(mlx5_os_umem_dereg(rq_res->umem_obj));
+	if (rq_res->umem_buf)
+		mlx5_free((void *)(uintptr_t)rq_res->umem_buf);
+	memset(rq_res, 0, sizeof(*rq_res));
+}
+
+/**
+ * Destroy DevX Receive Memory Pool.
+ *
+ * @param[in] rmp
+ *   DevX RMP to destroy.
+ */
+static void
+mlx5_devx_rmp_destroy(struct mlx5_devx_rmp *rmp)
+{
+	MLX5_ASSERT(rmp->ref_cnt == 0);
+	if (rmp->rmp) {
+		claim_zero(mlx5_devx_cmd_destroy(rmp->rmp));
+		rmp->rmp = NULL;
+	}
+	mlx5_devx_wq_res_destroy(&rmp->wq);
+}
+
 /**
  * Destroy DevX Queue Pair.
  *
@@ -389,55 +422,48 @@ mlx5_devx_qp_create(void *ctx, struct mlx5_devx_qp *qp_obj, uint16_t log_wqbb_n,
 void
 mlx5_devx_rq_destroy(struct mlx5_devx_rq *rq)
 {
-	if (rq->rq)
+	if (rq->rq) {
 		claim_zero(mlx5_devx_cmd_destroy(rq->rq));
-	if (rq->umem_obj)
-		claim_zero(mlx5_os_umem_dereg(rq->umem_obj));
-	if (rq->umem_buf)
-		mlx5_free((void *)(uintptr_t)rq->umem_buf);
+		rq->rq = NULL;
+		if (rq->rmp)
+			rq->rmp->ref_cnt--;
+	}
+	if (rq->rmp == NULL) {
+		mlx5_devx_wq_res_destroy(&rq->wq);
+	} else {
+		if (rq->rmp->ref_cnt == 0)
+			mlx5_devx_rmp_destroy(rq->rmp);
+	}
 }
 
 /**
- * Create Receive Queue using DevX API.
- *
- * Get a pointer to partially initialized attributes structure, and updates the
- * following fields:
- *   wq_umem_valid
- *   wq_umem_id
- *   wq_umem_offset
- *   dbr_umem_valid
- *   dbr_umem_id
- *   dbr_addr
- *   log_wq_pg_sz
- * All other fields are updated by caller.
+ * Create WQ resources using DevX API.
  *
  * @param[in] ctx
  *   Context returned from mlx5 open_device() glue function.
- * @param[in/out] rq_obj
- *   Pointer to RQ to create.
  * @param[in] wqe_size
  *   Size of WQE structure.
  * @param[in] log_wqbb_n
  *   Log of number of WQBBs in queue.
- * @param[in] attr
- *   Pointer to RQ attributes structure.
  * @param[in] socket
  *   Socket to use for allocation.
+ * @param[out] wq_attr
+ *   Pointer to WQ attributes structure.
+ * @param[out] wq_res
+ *   Pointer to WQ resource to create.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-int
-mlx5_devx_rq_create(void *ctx, struct mlx5_devx_rq *rq_obj, uint32_t wqe_size,
-		    uint16_t log_wqbb_n,
-		    struct mlx5_devx_create_rq_attr *attr, int socket)
+static int
+mlx5_devx_wq_init(void *ctx, uint32_t wqe_size, uint16_t log_wqbb_n, int socket,
+		  struct mlx5_devx_wq_attr *wq_attr,
+		  struct mlx5_devx_wq_res *wq_res)
 {
-	struct mlx5_devx_obj *rq = NULL;
 	struct mlx5dv_devx_umem *umem_obj = NULL;
 	void *umem_buf = NULL;
 	size_t alignment = MLX5_WQE_BUF_ALIGNMENT;
 	uint32_t umem_size, umem_dbrec;
-	uint16_t rq_size = 1 << log_wqbb_n;
 	int ret;
 
 	if (alignment == (size_t)-1) {
@@ -446,7 +472,7 @@ mlx5_devx_rq_create(void *ctx, struct mlx5_devx_rq *rq_obj, uint32_t wqe_size,
 		return -rte_errno;
 	}
 	/* Allocate memory buffer for WQEs and doorbell record. */
-	umem_size = wqe_size * rq_size;
+	umem_size = wqe_size * (1 << log_wqbb_n);
 	umem_dbrec = RTE_ALIGN(umem_size, MLX5_DBR_SIZE);
 	umem_size += MLX5_DBR_SIZE;
 	umem_buf = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO, umem_size,
@@ -464,14 +490,60 @@ mlx5_devx_rq_create(void *ctx, struct mlx5_devx_rq *rq_obj, uint32_t wqe_size,
 		rte_errno = errno;
 		goto error;
 	}
+	/* Fill WQ attributes for RQ/RMP object creation. */
+	wq_attr->wq_umem_valid = 1;
+	wq_attr->wq_umem_id = mlx5_os_get_umem_id(umem_obj);
+	wq_attr->wq_umem_offset = 0;
+	wq_attr->dbr_umem_valid = 1;
+	wq_attr->dbr_umem_id = wq_attr->wq_umem_id;
+	wq_attr->dbr_addr = umem_dbrec;
+	wq_attr->log_wq_pg_sz = MLX5_LOG_PAGE_SIZE;
 	/* Fill attributes for RQ object creation. */
-	attr->wq_attr.wq_umem_valid = 1;
-	attr->wq_attr.wq_umem_id = mlx5_os_get_umem_id(umem_obj);
-	attr->wq_attr.wq_umem_offset = 0;
-	attr->wq_attr.dbr_umem_valid = 1;
-	attr->wq_attr.dbr_umem_id = attr->wq_attr.wq_umem_id;
-	attr->wq_attr.dbr_addr = umem_dbrec;
-	attr->wq_attr.log_wq_pg_sz = MLX5_LOG_PAGE_SIZE;
+	wq_res->umem_buf = umem_buf;
+	wq_res->umem_obj = umem_obj;
+	wq_res->db_rec = RTE_PTR_ADD(umem_buf, umem_dbrec);
+	return 0;
+error:
+	ret = rte_errno;
+	if (umem_obj)
+		claim_zero(mlx5_os_umem_dereg(umem_obj));
+	if (umem_buf)
+		mlx5_free((void *)(uintptr_t)umem_buf);
+	rte_errno = ret;
+	return -rte_errno;
+}
+
+/**
+ * Create standalone Receive Queue using DevX API.
+ *
+ * @param[in] ctx
+ *   Context returned from mlx5 open_device() glue function.
+ * @param[in/out] rq_obj
+ *   Pointer to RQ to create.
+ * @param[in] wqe_size
+ *   Size of WQE structure.
+ * @param[in] log_wqbb_n
+ *   Log of number of WQBBs in queue.
+ * @param[in] attr
+ *   Pointer to RQ attributes structure.
+ * @param[in] socket
+ *   Socket to use for allocation.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+mlx5_devx_rq_std_create(void *ctx, struct mlx5_devx_rq *rq_obj,
+			uint32_t wqe_size, uint16_t log_wqbb_n,
+			struct mlx5_devx_create_rq_attr *attr, int socket)
+{
+	struct mlx5_devx_obj *rq;
+	int ret;
+
+	ret = mlx5_devx_wq_init(ctx, wqe_size, log_wqbb_n, socket,
+				&attr->wq_attr, &rq_obj->wq);
+	if (ret != 0)
+		return ret;
 	/* Create receive queue object with DevX. */
 	rq = mlx5_devx_cmd_create_rq(ctx, attr, socket);
 	if (!rq) {
@@ -479,21 +551,160 @@ mlx5_devx_rq_create(void *ctx, struct mlx5_devx_rq *rq_obj, uint32_t wqe_size,
 		rte_errno = ENOMEM;
 		goto error;
 	}
-	rq_obj->umem_buf = umem_buf;
-	rq_obj->umem_obj = umem_obj;
 	rq_obj->rq = rq;
-	rq_obj->db_rec = RTE_PTR_ADD(rq_obj->umem_buf, umem_dbrec);
 	return 0;
 error:
 	ret = rte_errno;
-	if (umem_obj)
-		claim_zero(mlx5_os_umem_dereg(umem_obj));
-	if (umem_buf)
-		mlx5_free((void *)(uintptr_t)umem_buf);
+	mlx5_devx_wq_res_destroy(&rq_obj->wq);
+	rte_errno = ret;
+	return -rte_errno;
+}
+
+/**
+ * Create Receive Memory Pool using DevX API.
+ *
+ * @param[in] ctx
+ *   Context returned from mlx5 open_device() glue function.
+ * @param[in/out] rq_obj
+ *   Pointer to RQ to create.
+ * @param[in] wqe_size
+ *   Size of WQE structure.
+ * @param[in] log_wqbb_n
+ *   Log of number of WQBBs in queue.
+ * @param[in] attr
+ *   Pointer to RQ attributes structure.
+ * @param[in] socket
+ *   Socket to use for allocation.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+mlx5_devx_rmp_create(void *ctx, struct mlx5_devx_rmp *rmp_obj,
+		     uint32_t wqe_size, uint16_t log_wqbb_n,
+		     struct mlx5_devx_wq_attr *wq_attr, int socket)
+{
+	struct mlx5_devx_create_rmp_attr rmp_attr = { 0 };
+	int ret;
+
+	if (rmp_obj->rmp != NULL)
+		return 0;
+	rmp_attr.wq_attr = *wq_attr;
+	ret = mlx5_devx_wq_init(ctx, wqe_size, log_wqbb_n, socket,
+				&rmp_attr.wq_attr, &rmp_obj->wq);
+	if (ret != 0)
+		return ret;
+	rmp_attr.state = MLX5_RMPC_STATE_RDY;
+	rmp_attr.basic_cyclic_rcv_wqe =
+		wq_attr->wq_type != MLX5_WQ_TYPE_CYCLIC_STRIDING_RQ;
+	/* Create receive memory pool object with DevX. */
+	rmp_obj->rmp = mlx5_devx_cmd_create_rmp(ctx, &rmp_attr, socket);
+	if (rmp_obj->rmp == NULL) {
+		DRV_LOG(ERR, "Can't create DevX RMP object.");
+		rte_errno = ENOMEM;
+		goto error;
+	}
+	return 0;
+error:
+	ret = rte_errno;
+	mlx5_devx_wq_res_destroy(&rmp_obj->wq);
+	rte_errno = ret;
+	return -rte_errno;
+}
+
+/**
+ * Create Shared Receive Queue based on RMP using DevX API.
+ *
+ * @param[in] ctx
+ *   Context returned from mlx5 open_device() glue function.
+ * @param[in/out] rq_obj
+ *   Pointer to RQ to create.
+ * @param[in] wqe_size
+ *   Size of WQE structure.
+ * @param[in] log_wqbb_n
+ *   Log of number of WQBBs in queue.
+ * @param[in] attr
+ *   Pointer to RQ attributes structure.
+ * @param[in] socket
+ *   Socket to use for allocation.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+mlx5_devx_rq_shared_create(void *ctx, struct mlx5_devx_rq *rq_obj,
+			   uint32_t wqe_size, uint16_t log_wqbb_n,
+			   struct mlx5_devx_create_rq_attr *attr, int socket)
+{
+	struct mlx5_devx_obj *rq;
+	int ret;
+
+	ret = mlx5_devx_rmp_create(ctx, rq_obj->rmp, wqe_size, log_wqbb_n,
+				   &attr->wq_attr, socket);
+	if (ret != 0)
+		return ret;
+	attr->mem_rq_type = MLX5_RQC_MEM_RQ_TYPE_MEMORY_RQ_RMP;
+	attr->rmpn = rq_obj->rmp->rmp->id;
+	attr->flush_in_error_en = 0;
+	memset(&attr->wq_attr, 0, sizeof(attr->wq_attr));
+	/* Create receive queue object with DevX. */
+	rq = mlx5_devx_cmd_create_rq(ctx, attr, socket);
+	if (!rq) {
+		DRV_LOG(ERR, "Can't create DevX RMP RQ object.");
+		rte_errno = ENOMEM;
+		goto error;
+	}
+	rq_obj->rq = rq;
+	rq_obj->rmp->ref_cnt++;
+	return 0;
+error:
+	ret = rte_errno;
+	mlx5_devx_rq_destroy(rq_obj);
 	rte_errno = ret;
 	return -rte_errno;
 }
 
+/**
+ * Create Receive Queue using DevX API. Shared RQ is created only if rmp set.
+ *
+ * Get a pointer to partially initialized attributes structure, and updates the
+ * following fields:
+ *   wq_umem_valid
+ *   wq_umem_id
+ *   wq_umem_offset
+ *   dbr_umem_valid
+ *   dbr_umem_id
+ *   dbr_addr
+ *   log_wq_pg_sz
+ * All other fields are updated by caller.
+ *
+ * @param[in] ctx
+ *   Context returned from mlx5 open_device() glue function.
+ * @param[in/out] rq_obj
+ *   Pointer to RQ to create.
+ * @param[in] wqe_size
+ *   Size of WQE structure.
+ * @param[in] log_wqbb_n
+ *   Log of number of WQBBs in queue.
+ * @param[in] attr
+ *   Pointer to RQ attributes structure.
+ * @param[in] socket
+ *   Socket to use for allocation.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_devx_rq_create(void *ctx, struct mlx5_devx_rq *rq_obj,
+		    uint32_t wqe_size, uint16_t log_wqbb_n,
+		    struct mlx5_devx_create_rq_attr *attr, int socket)
+{
+	if (rq_obj->rmp == NULL)
+		return mlx5_devx_rq_std_create(ctx, rq_obj, wqe_size,
+					       log_wqbb_n, attr, socket);
+	return mlx5_devx_rq_shared_create(ctx, rq_obj, wqe_size,
+					  log_wqbb_n, attr, socket);
+}
 
 /**
  * Change QP state to RTS.
diff --git a/drivers/common/mlx5/mlx5_common_devx.h b/drivers/common/mlx5/mlx5_common_devx.h
index f699405f69b..7ceac040f8b 100644
--- a/drivers/common/mlx5/mlx5_common_devx.h
+++ b/drivers/common/mlx5/mlx5_common_devx.h
@@ -45,14 +45,27 @@ struct mlx5_devx_qp {
 	volatile uint32_t *db_rec; /* The QP doorbell record. */
 };
 
-/* DevX Receive Queue structure. */
-struct mlx5_devx_rq {
-	struct mlx5_devx_obj *rq; /* The RQ DevX object. */
+/* DevX Receive Queue resource structure. */
+struct mlx5_devx_wq_res {
 	void *umem_obj; /* The RQ umem object. */
 	volatile void *umem_buf;
 	volatile uint32_t *db_rec; /* The RQ doorbell record. */
 };
 
+/* DevX Receive Memory Pool structure. */
+struct mlx5_devx_rmp {
+	struct mlx5_devx_obj *rmp; /* The RMP DevX object. */
+	uint32_t ref_cnt; /* Reference count. */
+	struct mlx5_devx_wq_res wq;
+};
+
+/* DevX Receive Queue structure. */
+struct mlx5_devx_rq {
+	struct mlx5_devx_obj *rq; /* The RQ DevX object. */
+	struct mlx5_devx_rmp *rmp; /* Shared RQ RMP object. */
+	struct mlx5_devx_wq_res wq; /* WQ resource of standalone RQ. */
+};
+
 /* mlx5_common_devx.c */
 
 __rte_internal
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 424f77be790..443252df05d 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -515,8 +515,8 @@ mlx5_rxq_devx_obj_new(struct rte_eth_dev *dev, uint16_t idx)
 	ret = mlx5_devx_modify_rq(tmpl, MLX5_RXQ_MOD_RST2RDY);
 	if (ret)
 		goto error;
-	rxq_data->wqes = (void *)(uintptr_t)tmpl->rq_obj.umem_buf;
-	rxq_data->rq_db = (uint32_t *)(uintptr_t)tmpl->rq_obj.db_rec;
+	rxq_data->wqes = (void *)(uintptr_t)tmpl->rq_obj.wq.umem_buf;
+	rxq_data->rq_db = (uint32_t *)(uintptr_t)tmpl->rq_obj.wq.db_rec;
 	rxq_data->cq_arm_sn = 0;
 	rxq_data->cq_ci = 0;
 	mlx5_rxq_initialize(rxq_data);
-- 
2.33.0
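
With this change, whether an RQ is standalone or RMP-based is selected
purely by the caller; a condensed sketch (RQ attribute setup omitted):

	struct mlx5_devx_rmp shared_rmp = { 0 }; /* one per share group */
	struct mlx5_devx_rq rq_obj = { 0 };
	int ret;

	rq_obj.rmp = &shared_rmp; /* leave NULL for a standalone RQ */
	ret = mlx5_devx_rq_create(ctx, &rq_obj, wqe_size, log_wqbb_n,
				  &rq_attr, SOCKET_ID_ANY);

	/*
	 * The RMP is created on first use and reference-counted;
	 * mlx5_devx_rq_destroy() drops the reference and destroys the
	 * RMP together with the last shared RQ.
	 */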


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 05/14] net/mlx5: fix Rx queue memory allocation return value
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
                     ` (3 preceding siblings ...)
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 04/14] common/mlx5: support receive memory pool Xueming Li
@ 2021-11-03  7:58   ` Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 06/14] net/mlx5: clean Rx queue code Xueming Li
                     ` (8 subsequent siblings)
  13 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Lior Margalit, akozyrev, stable, Matan Azrad,
	Viacheslav Ovsiienko

If an error happened during Rx queue mbuf allocation, a boolean value
was returned. According to the function description, the return value
should be an error number.

This patch returns a negative error number instead.

Fixes: 0f20acbf5eda ("net/mlx5: implement vectorized MPRQ burst")
Cc: akozyrev@nvidia.com
Cc: stable@dpdk.org

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 drivers/net/mlx5/mlx5_rxq.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 9220bb2c15c..4567b43c1b6 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -129,7 +129,7 @@ rxq_alloc_elts_mprq(struct mlx5_rxq_ctrl *rxq_ctrl)
  *   Pointer to RX queue structure.
  *
  * @return
- *   0 on success, errno value on failure.
+ *   0 on success, negative errno value on failure.
  */
 static int
 rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
@@ -220,7 +220,7 @@ rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
  *   Pointer to RX queue structure.
  *
  * @return
- *   0 on success, errno value on failure.
+ *   0 on success, negative errno value on failure.
  */
 int
 rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
@@ -233,7 +233,9 @@ rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
 	 */
 	if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq))
 		ret = rxq_alloc_elts_mprq(rxq_ctrl);
-	return (ret || rxq_alloc_elts_sprq(rxq_ctrl));
+	if (ret == 0)
+		ret = rxq_alloc_elts_sprq(rxq_ctrl);
+	return ret;
 }
 
 /**
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 06/14] net/mlx5: clean Rx queue code
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
                     ` (4 preceding siblings ...)
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 05/14] net/mlx5: fix Rx queue memory allocation return value Xueming Li
@ 2021-11-03  7:58   ` Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 07/14] net/mlx5: split Rx queue into shareable and private Xueming Li
                     ` (7 subsequent siblings)
  13 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Matan Azrad, Viacheslav Ovsiienko

This patch removes unused Rx queue code.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 drivers/net/mlx5/mlx5_rxq.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 4567b43c1b6..b2e4389ad60 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -674,9 +674,7 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		    struct rte_mempool *mp)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_rxq_ctrl *rxq_ctrl;
 	struct rte_eth_rxseg_split *rx_seg =
 				(struct rte_eth_rxseg_split *)conf->rx_seg;
 	struct rte_eth_rxseg_split rx_single = {.mp = mp};
@@ -743,9 +741,7 @@ mlx5_rx_hairpin_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
 			    const struct rte_eth_hairpin_conf *hairpin_conf)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_rxq_ctrl *rxq_ctrl;
 	int res;
 
 	res = mlx5_rx_queue_pre_setup(dev, idx, &desc);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 07/14] net/mlx5: split Rx queue into shareable and private
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
                     ` (5 preceding siblings ...)
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 06/14] net/mlx5: clean Rx queue code Xueming Li
@ 2021-11-03  7:58   ` Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 08/14] net/mlx5: move Rx queue reference count Xueming Li
                     ` (6 subsequent siblings)
  13 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Matan Azrad, Viacheslav Ovsiienko

To prepare for shared Rx queues, this patch splits RxQ data into
shareable and private parts. Struct mlx5_rxq_priv holds per-queue data;
struct mlx5_rxq_ctrl holds shareable queue resources and data.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 drivers/net/mlx5/mlx5.c        |  4 +++
 drivers/net/mlx5/mlx5.h        |  5 ++-
 drivers/net/mlx5/mlx5_ethdev.c | 10 ++++++
 drivers/net/mlx5/mlx5_rx.h     | 17 +++++++--
 drivers/net/mlx5/mlx5_rxq.c    | 66 ++++++++++++++++++++++++++++------
 5 files changed, 88 insertions(+), 14 deletions(-)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 4ba850af263..d0fae518025 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -1699,6 +1699,10 @@ mlx5_dev_close(struct rte_eth_dev *dev)
 		mlx5_free(dev->intr_handle);
 		dev->intr_handle = NULL;
 	}
+	if (priv->rxq_privs != NULL) {
+		mlx5_free(priv->rxq_privs);
+		priv->rxq_privs = NULL;
+	}
 	if (priv->txqs != NULL) {
 		/* XXX race condition if mlx5_tx_burst() is still running. */
 		rte_delay_us_sleep(1000);
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 39c001aa1bf..3e008241ca8 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1317,6 +1317,8 @@ enum mlx5_txq_modify_type {
 	MLX5_TXQ_MOD_ERR2RDY, /* modify state from error to ready. */
 };
 
+struct mlx5_rxq_priv;
+
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
 	int (*rxq_obj_modify_vlan_strip)(struct mlx5_rxq_obj *rxq_obj, int on);
@@ -1380,7 +1382,8 @@ struct mlx5_priv {
 	/* RX/TX queues. */
 	unsigned int rxqs_n; /* RX queues array size. */
 	unsigned int txqs_n; /* TX queues array size. */
-	struct mlx5_rxq_data *(*rxqs)[]; /* RX queues. */
+	struct mlx5_rxq_priv *(*rxq_privs)[]; /* RX queue non-shared data. */
+	struct mlx5_rxq_data *(*rxqs)[]; /* (Shared) RX queues. */
 	struct mlx5_txq_data *(*txqs)[]; /* TX queues. */
 	struct rte_mempool *mprq_mp; /* Mempool for Multi-Packet RQ. */
 	struct rte_eth_rss_conf rss_conf; /* RSS configuration. */
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index 81fa8845bb5..cde505955df 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -104,6 +104,16 @@ mlx5_dev_configure(struct rte_eth_dev *dev)
 	       MLX5_RSS_HASH_KEY_LEN);
 	priv->rss_conf.rss_key_len = MLX5_RSS_HASH_KEY_LEN;
 	priv->rss_conf.rss_hf = dev->data->dev_conf.rx_adv_conf.rss_conf.rss_hf;
+	priv->rxq_privs = mlx5_realloc(priv->rxq_privs,
+				       MLX5_MEM_RTE | MLX5_MEM_ZERO,
+				       sizeof(void *) * rxqs_n, 0,
+				       SOCKET_ID_ANY);
+	if (priv->rxq_privs == NULL) {
+		DRV_LOG(ERR, "port %u cannot allocate rxq private data",
+			dev->data->port_id);
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
 	priv->rxqs = (void *)dev->data->rx_queues;
 	priv->txqs = (void *)dev->data->tx_queues;
 	if (txqs_n != priv->txqs_n) {
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 69b1263339e..fa24f5cdf3a 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -150,10 +150,14 @@ struct mlx5_rxq_ctrl {
 	struct mlx5_rxq_data rxq; /* Data path structure. */
 	LIST_ENTRY(mlx5_rxq_ctrl) next; /* Pointer to the next element. */
 	uint32_t refcnt; /* Reference counter. */
+	LIST_HEAD(priv, mlx5_rxq_priv) owners; /* Owner rxq list. */
 	struct mlx5_rxq_obj *obj; /* Verbs/DevX elements. */
+	struct mlx5_dev_ctx_shared *sh; /* Shared context. */
 	struct mlx5_priv *priv; /* Back pointer to private data. */
 	enum mlx5_rxq_type type; /* Rxq type. */
 	unsigned int socket; /* CPU socket ID for allocations. */
+	uint32_t share_group; /* Group ID of shared RXQ. */
+	uint16_t share_qid; /* Shared RxQ ID in group. */
 	unsigned int irq:1; /* Whether IRQ is enabled. */
 	uint32_t flow_mark_n; /* Number of Mark/Flag flows using this Queue. */
 	uint32_t flow_tunnels_n[MLX5_FLOW_TUNNEL]; /* Tunnels counters. */
@@ -163,6 +167,14 @@ struct mlx5_rxq_ctrl {
 	uint32_t hairpin_status; /* Hairpin binding status. */
 };
 
+/* RX queue private data. */
+struct mlx5_rxq_priv {
+	uint16_t idx; /* Queue index. */
+	struct mlx5_rxq_ctrl *ctrl; /* Shared Rx Queue. */
+	LIST_ENTRY(mlx5_rxq_priv) owner_entry; /* Entry in shared rxq_ctrl. */
+	struct mlx5_priv *priv; /* Back pointer to private data. */
+};
+
 /* mlx5_rxq.c */
 
 extern uint8_t rss_hash_default_key[];
@@ -186,13 +198,14 @@ void mlx5_rx_intr_vec_disable(struct rte_eth_dev *dev);
 int mlx5_rx_intr_enable(struct rte_eth_dev *dev, uint16_t rx_queue_id);
 int mlx5_rx_intr_disable(struct rte_eth_dev *dev, uint16_t rx_queue_id);
 int mlx5_rxq_obj_verify(struct rte_eth_dev *dev);
-struct mlx5_rxq_ctrl *mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx,
+struct mlx5_rxq_ctrl *mlx5_rxq_new(struct rte_eth_dev *dev,
+				   struct mlx5_rxq_priv *rxq,
 				   uint16_t desc, unsigned int socket,
 				   const struct rte_eth_rxconf *conf,
 				   const struct rte_eth_rxseg_split *rx_seg,
 				   uint16_t n_seg);
 struct mlx5_rxq_ctrl *mlx5_rxq_hairpin_new
-	(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
+	(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq, uint16_t desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 struct mlx5_rxq_ctrl *mlx5_rxq_get(struct rte_eth_dev *dev, uint16_t idx);
 int mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx);
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index b2e4389ad60..00df245a5c6 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -674,6 +674,7 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		    struct rte_mempool *mp)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_rxq_priv *rxq;
 	struct mlx5_rxq_ctrl *rxq_ctrl;
 	struct rte_eth_rxseg_split *rx_seg =
 				(struct rte_eth_rxseg_split *)conf->rx_seg;
@@ -708,10 +709,23 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 	res = mlx5_rx_queue_pre_setup(dev, idx, &desc);
 	if (res)
 		return res;
-	rxq_ctrl = mlx5_rxq_new(dev, idx, desc, socket, conf, rx_seg, n_seg);
+	rxq = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO, sizeof(*rxq), 0,
+			  SOCKET_ID_ANY);
+	if (!rxq) {
+		DRV_LOG(ERR, "port %u unable to allocate rx queue index %u private data",
+			dev->data->port_id, idx);
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
+	rxq->priv = priv;
+	rxq->idx = idx;
+	(*priv->rxq_privs)[idx] = rxq;
+	rxq_ctrl = mlx5_rxq_new(dev, rxq, desc, socket, conf, rx_seg, n_seg);
 	if (!rxq_ctrl) {
-		DRV_LOG(ERR, "port %u unable to allocate queue index %u",
+		DRV_LOG(ERR, "port %u unable to allocate rx queue index %u",
 			dev->data->port_id, idx);
+		mlx5_free(rxq);
+		(*priv->rxq_privs)[idx] = NULL;
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
@@ -741,6 +755,7 @@ mlx5_rx_hairpin_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
 			    const struct rte_eth_hairpin_conf *hairpin_conf)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_rxq_priv *rxq;
 	struct mlx5_rxq_ctrl *rxq_ctrl;
 	int res;
 
@@ -776,14 +791,27 @@ mlx5_rx_hairpin_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
 			return -rte_errno;
 		}
 	}
-	rxq_ctrl = mlx5_rxq_hairpin_new(dev, idx, desc, hairpin_conf);
+	rxq = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO, sizeof(*rxq), 0,
+			  SOCKET_ID_ANY);
+	if (!rxq) {
+		DRV_LOG(ERR, "port %u unable to allocate hairpin rx queue index %u private data",
+			dev->data->port_id, idx);
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
+	rxq->priv = priv;
+	rxq->idx = idx;
+	(*priv->rxq_privs)[idx] = rxq;
+	rxq_ctrl = mlx5_rxq_hairpin_new(dev, rxq, desc, hairpin_conf);
 	if (!rxq_ctrl) {
-		DRV_LOG(ERR, "port %u unable to allocate queue index %u",
+		DRV_LOG(ERR, "port %u unable to allocate hairpin queue index %u",
 			dev->data->port_id, idx);
+		mlx5_free(rxq);
+		(*priv->rxq_privs)[idx] = NULL;
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
-	DRV_LOG(DEBUG, "port %u adding Rx queue %u to list",
+	DRV_LOG(DEBUG, "port %u adding hairpin Rx queue %u to list",
 		dev->data->port_id, idx);
 	(*priv->rxqs)[idx] = &rxq_ctrl->rxq;
 	return 0;
@@ -1319,8 +1347,8 @@ mlx5_max_lro_msg_size_adjust(struct rte_eth_dev *dev, uint16_t idx,
  *
  * @param dev
  *   Pointer to Ethernet device.
- * @param idx
- *   RX queue index.
+ * @param rxq
+ *   RX queue private data.
  * @param desc
  *   Number of descriptors to configure in queue.
  * @param socket
@@ -1330,10 +1358,12 @@ mlx5_max_lro_msg_size_adjust(struct rte_eth_dev *dev, uint16_t idx,
  *   A DPDK queue object on success, NULL otherwise and rte_errno is set.
  */
 struct mlx5_rxq_ctrl *
-mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
+mlx5_rxq_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
+	     uint16_t desc,
 	     unsigned int socket, const struct rte_eth_rxconf *conf,
 	     const struct rte_eth_rxseg_split *rx_seg, uint16_t n_seg)
 {
+	uint16_t idx = rxq->idx;
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_rxq_ctrl *tmpl;
 	unsigned int mb_len = rte_pktmbuf_data_room_size(rx_seg[0].mp);
@@ -1377,6 +1407,9 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		rte_errno = ENOMEM;
 		return NULL;
 	}
+	LIST_INIT(&tmpl->owners);
+	rxq->ctrl = tmpl;
+	LIST_INSERT_HEAD(&tmpl->owners, rxq, owner_entry);
 	MLX5_ASSERT(n_seg && n_seg <= MLX5_MAX_RXQ_NSEG);
 	/*
 	 * Build the array of actual buffer offsets and lengths.
@@ -1610,6 +1643,7 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 	tmpl->rxq.rss_hash = !!priv->rss_conf.rss_hf &&
 		(!!(dev->data->dev_conf.rxmode.mq_mode & RTE_ETH_MQ_RX_RSS));
 	tmpl->rxq.port_id = dev->data->port_id;
+	tmpl->sh = priv->sh;
 	tmpl->priv = priv;
 	tmpl->rxq.mp = rx_seg[0].mp;
 	tmpl->rxq.elts_n = log2above(desc);
@@ -1637,8 +1671,8 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
  *
  * @param dev
  *   Pointer to Ethernet device.
- * @param idx
- *   RX queue index.
+ * @param rxq
+ *   RX queue.
  * @param desc
  *   Number of descriptors to configure in queue.
  * @param hairpin_conf
@@ -1648,9 +1682,11 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
  *   A DPDK queue object on success, NULL otherwise and rte_errno is set.
  */
 struct mlx5_rxq_ctrl *
-mlx5_rxq_hairpin_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
+mlx5_rxq_hairpin_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
+		     uint16_t desc,
 		     const struct rte_eth_hairpin_conf *hairpin_conf)
 {
+	uint16_t idx = rxq->idx;
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_rxq_ctrl *tmpl;
 
@@ -1660,10 +1696,14 @@ mlx5_rxq_hairpin_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		rte_errno = ENOMEM;
 		return NULL;
 	}
+	LIST_INIT(&tmpl->owners);
+	rxq->ctrl = tmpl;
+	LIST_INSERT_HEAD(&tmpl->owners, rxq, owner_entry);
 	tmpl->type = MLX5_RXQ_TYPE_HAIRPIN;
 	tmpl->socket = SOCKET_ID_ANY;
 	tmpl->rxq.rss_hash = 0;
 	tmpl->rxq.port_id = dev->data->port_id;
+	tmpl->sh = priv->sh;
 	tmpl->priv = priv;
 	tmpl->rxq.mp = NULL;
 	tmpl->rxq.elts_n = log2above(desc);
@@ -1717,6 +1757,7 @@ mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct mlx5_rxq_priv *rxq = (*priv->rxq_privs)[idx];
 
 	if (priv->rxqs == NULL || (*priv->rxqs)[idx] == NULL)
 		return 0;
@@ -1736,9 +1777,12 @@ mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 	if (!__atomic_load_n(&rxq_ctrl->refcnt, __ATOMIC_RELAXED)) {
 		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD)
 			mlx5_mr_btree_free(&rxq_ctrl->rxq.mr_ctrl.cache_bh);
+		LIST_REMOVE(rxq, owner_entry);
 		LIST_REMOVE(rxq_ctrl, next);
 		mlx5_free(rxq_ctrl);
 		(*priv->rxqs)[idx] = NULL;
+		mlx5_free(rxq);
+		(*priv->rxq_privs)[idx] = NULL;
 	}
 	return 0;
 }
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 08/14] net/mlx5: move Rx queue reference count
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
                     ` (6 preceding siblings ...)
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 07/14] net/mlx5: split Rx queue into shareable and private Xueming Li
@ 2021-11-03  7:58   ` Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 09/14] net/mlx5: move Rx queue hairpin info to private data Xueming Li
                     ` (5 subsequent siblings)
  13 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Matan Azrad, Viacheslav Ovsiienko

The Rx queue reference count tracks references taken on the RQ object.
To prepare for shared Rx queues, this patch moves the counter from
rxq_ctrl to the Rx queue private data, so that each port counts its own
references.
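
For illustration, the new pattern reduces to the standalone C sketch
below. Names are simplified stand-ins for mlx5_rxq_priv, mlx5_rxq_ref
and mlx5_rxq_deref, and it assumes the GCC __atomic builtins used by
the patch; it is a model of the pattern, not driver code.

	#include <stdint.h>
	#include <stddef.h>

	struct rxq_ctrl;                 /* shareable part, opaque here */

	struct rxq_priv {
		uint32_t refcnt;         /* counter, now in per-port data */
		struct rxq_ctrl *ctrl;   /* shared Rx queue control */
	};

	static struct rxq_priv *
	rxq_ref(struct rxq_priv *rxq)
	{
		if (rxq != NULL)
			__atomic_fetch_add(&rxq->refcnt, 1, __ATOMIC_RELAXED);
		return rxq;
	}

	static uint32_t
	rxq_deref(struct rxq_priv *rxq)
	{
		if (rxq == NULL)
			return 0;
		return __atomic_sub_fetch(&rxq->refcnt, 1, __ATOMIC_RELAXED);
	}

Relaxed ordering matches the patch and is presumably sufficient because
queue setup and release run on the serialized configuration path.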

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 drivers/net/mlx5/mlx5_rx.h      |   8 +-
 drivers/net/mlx5/mlx5_rxq.c     | 169 +++++++++++++++++++++-----------
 drivers/net/mlx5/mlx5_trigger.c |  57 +++++------
 3 files changed, 142 insertions(+), 92 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index fa24f5cdf3a..eccfbf1108d 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -149,7 +149,6 @@ enum mlx5_rxq_type {
 struct mlx5_rxq_ctrl {
 	struct mlx5_rxq_data rxq; /* Data path structure. */
 	LIST_ENTRY(mlx5_rxq_ctrl) next; /* Pointer to the next element. */
-	uint32_t refcnt; /* Reference counter. */
 	LIST_HEAD(priv, mlx5_rxq_priv) owners; /* Owner rxq list. */
 	struct mlx5_rxq_obj *obj; /* Verbs/DevX elements. */
 	struct mlx5_dev_ctx_shared *sh; /* Shared context. */
@@ -170,6 +169,7 @@ struct mlx5_rxq_ctrl {
 /* RX queue private data. */
 struct mlx5_rxq_priv {
 	uint16_t idx; /* Queue index. */
+	uint32_t refcnt; /* Reference counter. */
 	struct mlx5_rxq_ctrl *ctrl; /* Shared Rx Queue. */
 	LIST_ENTRY(mlx5_rxq_priv) owner_entry; /* Entry in shared rxq_ctrl. */
 	struct mlx5_priv *priv; /* Back pointer to private data. */
@@ -207,7 +207,11 @@ struct mlx5_rxq_ctrl *mlx5_rxq_new(struct rte_eth_dev *dev,
 struct mlx5_rxq_ctrl *mlx5_rxq_hairpin_new
 	(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq, uint16_t desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
-struct mlx5_rxq_ctrl *mlx5_rxq_get(struct rte_eth_dev *dev, uint16_t idx);
+struct mlx5_rxq_priv *mlx5_rxq_ref(struct rte_eth_dev *dev, uint16_t idx);
+uint32_t mlx5_rxq_deref(struct rte_eth_dev *dev, uint16_t idx);
+struct mlx5_rxq_priv *mlx5_rxq_get(struct rte_eth_dev *dev, uint16_t idx);
+struct mlx5_rxq_ctrl *mlx5_rxq_ctrl_get(struct rte_eth_dev *dev, uint16_t idx);
+struct mlx5_rxq_data *mlx5_rxq_data_get(struct rte_eth_dev *dev, uint16_t idx);
 int mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx);
 int mlx5_rxq_verify(struct rte_eth_dev *dev);
 int rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl);
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 00df245a5c6..8071ddbd61c 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -386,15 +386,13 @@ mlx5_get_rx_port_offloads(void)
 static int
 mlx5_rxq_releasable(struct rte_eth_dev *dev, uint16_t idx)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
 
-	if (!(*priv->rxqs)[idx]) {
+	if (rxq == NULL) {
 		rte_errno = EINVAL;
 		return -rte_errno;
 	}
-	rxq_ctrl = container_of((*priv->rxqs)[idx], struct mlx5_rxq_ctrl, rxq);
-	return (__atomic_load_n(&rxq_ctrl->refcnt, __ATOMIC_RELAXED) == 1);
+	return (__atomic_load_n(&rxq->refcnt, __ATOMIC_RELAXED) == 1);
 }
 
 /* Fetches and drops all SW-owned and error CQEs to synchronize CQ. */
@@ -874,8 +872,8 @@ mlx5_rx_intr_vec_enable(struct rte_eth_dev *dev)
 
 	for (i = 0; i != n; ++i) {
 		/* This rxq obj must not be released in this function. */
-		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_get(dev, i);
-		struct mlx5_rxq_obj *rxq_obj = rxq_ctrl ? rxq_ctrl->obj : NULL;
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, i);
+		struct mlx5_rxq_obj *rxq_obj = rxq ? rxq->ctrl->obj : NULL;
 		int rc;
 
 		/* Skip queues that cannot request interrupts. */
@@ -885,11 +883,9 @@ mlx5_rx_intr_vec_enable(struct rte_eth_dev *dev)
 			if (rte_intr_vec_list_index_set(intr_handle, i,
 			   RTE_INTR_VEC_RXTX_OFFSET + RTE_MAX_RXTX_INTR_VEC_ID))
 				return -rte_errno;
-			/* Decrease the rxq_ctrl's refcnt */
-			if (rxq_ctrl)
-				mlx5_rxq_release(dev, i);
 			continue;
 		}
+		mlx5_rxq_ref(dev, i);
 		if (count >= RTE_MAX_RXTX_INTR_VEC_ID) {
 			DRV_LOG(ERR,
 				"port %u too many Rx queues for interrupt"
@@ -954,7 +950,7 @@ mlx5_rx_intr_vec_disable(struct rte_eth_dev *dev)
 		 * Need to access directly the queue to release the reference
 		 * kept in mlx5_rx_intr_vec_enable().
 		 */
-		mlx5_rxq_release(dev, i);
+		mlx5_rxq_deref(dev, i);
 	}
 free:
 	rte_intr_free_epoll_fd(intr_handle);
@@ -1003,19 +999,14 @@ mlx5_arm_cq(struct mlx5_rxq_data *rxq, int sq_n_rxq)
 int
 mlx5_rx_intr_enable(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 {
-	struct mlx5_rxq_ctrl *rxq_ctrl;
-
-	rxq_ctrl = mlx5_rxq_get(dev, rx_queue_id);
-	if (!rxq_ctrl)
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, rx_queue_id);
+	if (!rxq)
 		goto error;
-	if (rxq_ctrl->irq) {
-		if (!rxq_ctrl->obj) {
-			mlx5_rxq_release(dev, rx_queue_id);
+	if (rxq->ctrl->irq) {
+		if (!rxq->ctrl->obj)
 			goto error;
-		}
-		mlx5_arm_cq(&rxq_ctrl->rxq, rxq_ctrl->rxq.cq_arm_sn);
+		mlx5_arm_cq(&rxq->ctrl->rxq, rxq->ctrl->rxq.cq_arm_sn);
 	}
-	mlx5_rxq_release(dev, rx_queue_id);
 	return 0;
 error:
 	rte_errno = EINVAL;
@@ -1037,23 +1028,21 @@ int
 mlx5_rx_intr_disable(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, rx_queue_id);
 	int ret = 0;
 
-	rxq_ctrl = mlx5_rxq_get(dev, rx_queue_id);
-	if (!rxq_ctrl) {
+	if (!rxq) {
 		rte_errno = EINVAL;
 		return -rte_errno;
 	}
-	if (!rxq_ctrl->obj)
+	if (!rxq->ctrl->obj)
 		goto error;
-	if (rxq_ctrl->irq) {
-		ret = priv->obj_ops.rxq_event_get(rxq_ctrl->obj);
+	if (rxq->ctrl->irq) {
+		ret = priv->obj_ops.rxq_event_get(rxq->ctrl->obj);
 		if (ret < 0)
 			goto error;
-		rxq_ctrl->rxq.cq_arm_sn++;
+		rxq->ctrl->rxq.cq_arm_sn++;
 	}
-	mlx5_rxq_release(dev, rx_queue_id);
 	return 0;
 error:
 	/**
@@ -1064,12 +1053,9 @@ mlx5_rx_intr_disable(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 		rte_errno = errno;
 	else
 		rte_errno = EINVAL;
-	ret = rte_errno; /* Save rte_errno before cleanup. */
-	mlx5_rxq_release(dev, rx_queue_id);
-	if (ret != EAGAIN)
+	if (rte_errno != EAGAIN)
 		DRV_LOG(WARNING, "port %u unable to disable interrupt on Rx queue %d",
 			dev->data->port_id, rx_queue_id);
-	rte_errno = ret; /* Restore rte_errno. */
 	return -rte_errno;
 }
 
@@ -1657,7 +1643,7 @@ mlx5_rxq_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
 	tmpl->rxq.uar_lock_cq = &priv->sh->uar_lock_cq;
 #endif
 	tmpl->rxq.idx = idx;
-	__atomic_fetch_add(&tmpl->refcnt, 1, __ATOMIC_RELAXED);
+	mlx5_rxq_ref(dev, idx);
 	LIST_INSERT_HEAD(&priv->rxqsctrl, tmpl, next);
 	return tmpl;
 error:
@@ -1711,11 +1697,53 @@ mlx5_rxq_hairpin_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
 	tmpl->rxq.mr_ctrl.cache_bh = (struct mlx5_mr_btree) { 0 };
 	tmpl->hairpin_conf = *hairpin_conf;
 	tmpl->rxq.idx = idx;
-	__atomic_fetch_add(&tmpl->refcnt, 1, __ATOMIC_RELAXED);
+	mlx5_rxq_ref(dev, idx);
 	LIST_INSERT_HEAD(&priv->rxqsctrl, tmpl, next);
 	return tmpl;
 }
 
+/**
+ * Increase Rx queue reference count.
+ *
+ * @param dev
+ *   Pointer to Ethernet device.
+ * @param idx
+ *   RX queue index.
+ *
+ * @return
+ *   A pointer to the queue if it exists, NULL otherwise.
+ */
+struct mlx5_rxq_priv *
+mlx5_rxq_ref(struct rte_eth_dev *dev, uint16_t idx)
+{
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
+
+	if (rxq != NULL)
+		__atomic_fetch_add(&rxq->refcnt, 1, __ATOMIC_RELAXED);
+	return rxq;
+}
+
+/**
+ * Dereference a Rx queue.
+ *
+ * @param dev
+ *   Pointer to Ethernet device.
+ * @param idx
+ *   RX queue index.
+ *
+ * @return
+ *   Updated reference count.
+ */
+uint32_t
+mlx5_rxq_deref(struct rte_eth_dev *dev, uint16_t idx)
+{
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
+
+	if (rxq == NULL)
+		return 0;
+	return __atomic_sub_fetch(&rxq->refcnt, 1, __ATOMIC_RELAXED);
+}
+
 /**
  * Get a Rx queue.
  *
@@ -1727,18 +1755,52 @@ mlx5_rxq_hairpin_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
  * @return
  *   A pointer to the queue if it exists, NULL otherwise.
  */
-struct mlx5_rxq_ctrl *
+struct mlx5_rxq_priv *
 mlx5_rxq_get(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq_data = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl = NULL;
 
-	if (rxq_data) {
-		rxq_ctrl = container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
-		__atomic_fetch_add(&rxq_ctrl->refcnt, 1, __ATOMIC_RELAXED);
-	}
-	return rxq_ctrl;
+	if (priv->rxq_privs == NULL)
+		return NULL;
+	return (*priv->rxq_privs)[idx];
+}
+
+/**
+ * Get Rx queue shareable control.
+ *
+ * @param dev
+ *   Pointer to Ethernet device.
+ * @param idx
+ *   RX queue index.
+ *
+ * @return
+ *   A pointer to the queue control if it exists, NULL otherwise.
+ */
+struct mlx5_rxq_ctrl *
+mlx5_rxq_ctrl_get(struct rte_eth_dev *dev, uint16_t idx)
+{
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
+
+	return rxq == NULL ? NULL : rxq->ctrl;
+}
+
+/**
+ * Get Rx queue shareable data.
+ *
+ * @param dev
+ *   Pointer to Ethernet device.
+ * @param idx
+ *   RX queue index.
+ *
+ * @return
+ *   A pointer to the queue data if it exists, NULL otherwise.
+ */
+struct mlx5_rxq_data *
+mlx5_rxq_data_get(struct rte_eth_dev *dev, uint16_t idx)
+{
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
+
+	return rxq == NULL ? NULL : &rxq->ctrl->rxq;
 }
 
 /**
@@ -1756,13 +1818,12 @@ int
 mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_ctrl *rxq_ctrl;
-	struct mlx5_rxq_priv *rxq = (*priv->rxq_privs)[idx];
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
 
 	if (priv->rxqs == NULL || (*priv->rxqs)[idx] == NULL)
 		return 0;
-	rxq_ctrl = container_of((*priv->rxqs)[idx], struct mlx5_rxq_ctrl, rxq);
-	if (__atomic_sub_fetch(&rxq_ctrl->refcnt, 1, __ATOMIC_RELAXED) > 1)
+	if (mlx5_rxq_deref(dev, idx) > 1)
 		return 1;
 	if (rxq_ctrl->obj) {
 		priv->obj_ops.rxq_obj_release(rxq_ctrl->obj);
@@ -1774,7 +1835,7 @@ mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 		rxq_free_elts(rxq_ctrl);
 		dev->data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_STOPPED;
 	}
-	if (!__atomic_load_n(&rxq_ctrl->refcnt, __ATOMIC_RELAXED)) {
+	if (!__atomic_load_n(&rxq->refcnt, __ATOMIC_RELAXED)) {
 		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD)
 			mlx5_mr_btree_free(&rxq_ctrl->rxq.mr_ctrl.cache_bh);
 		LIST_REMOVE(rxq, owner_entry);
@@ -1952,7 +2013,7 @@ mlx5_ind_table_obj_release(struct rte_eth_dev *dev,
 		return 1;
 	priv->obj_ops.ind_table_destroy(ind_tbl);
 	for (i = 0; i != ind_tbl->queues_n; ++i)
-		claim_nonzero(mlx5_rxq_release(dev, ind_tbl->queues[i]));
+		claim_nonzero(mlx5_rxq_deref(dev, ind_tbl->queues[i]));
 	mlx5_free(ind_tbl);
 	return 0;
 }
@@ -2009,7 +2070,7 @@ mlx5_ind_table_obj_setup(struct rte_eth_dev *dev,
 			       log2above(priv->config.ind_table_max_size);
 
 	for (i = 0; i != queues_n; ++i) {
-		if (!mlx5_rxq_get(dev, queues[i])) {
+		if (mlx5_rxq_ref(dev, queues[i]) == NULL) {
 			ret = -rte_errno;
 			goto error;
 		}
@@ -2022,7 +2083,7 @@ mlx5_ind_table_obj_setup(struct rte_eth_dev *dev,
 error:
 	err = rte_errno;
 	for (j = 0; j < i; j++)
-		mlx5_rxq_release(dev, ind_tbl->queues[j]);
+		mlx5_rxq_deref(dev, ind_tbl->queues[j]);
 	rte_errno = err;
 	DRV_LOG(DEBUG, "Port %u cannot setup indirection table.",
 		dev->data->port_id);
@@ -2118,7 +2179,7 @@ mlx5_ind_table_obj_modify(struct rte_eth_dev *dev,
 			  bool standalone)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	unsigned int i, j;
+	unsigned int i;
 	int ret = 0, err;
 	const unsigned int n = rte_is_power_of_2(queues_n) ?
 			       log2above(queues_n) :
@@ -2138,15 +2199,11 @@ mlx5_ind_table_obj_modify(struct rte_eth_dev *dev,
 	ret = priv->obj_ops.ind_table_modify(dev, n, queues, queues_n, ind_tbl);
 	if (ret)
 		goto error;
-	for (j = 0; j < ind_tbl->queues_n; j++)
-		mlx5_rxq_release(dev, ind_tbl->queues[j]);
 	ind_tbl->queues_n = queues_n;
 	ind_tbl->queues = queues;
 	return 0;
 error:
 	err = rte_errno;
-	for (j = 0; j < i; j++)
-		mlx5_rxq_release(dev, queues[j]);
 	rte_errno = err;
 	DRV_LOG(DEBUG, "Port %u cannot setup indirection table.",
 		dev->data->port_id);
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index ebeeae279e2..e5d74d275f8 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -201,10 +201,12 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 	DRV_LOG(DEBUG, "Port %u device_attr.max_sge is %d.",
 		dev->data->port_id, priv->sh->device_attr.max_sge);
 	for (i = 0; i != priv->rxqs_n; ++i) {
-		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_get(dev, i);
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_ref(dev, i);
+		struct mlx5_rxq_ctrl *rxq_ctrl;
 
-		if (!rxq_ctrl)
+		if (rxq == NULL)
 			continue;
+		rxq_ctrl = rxq->ctrl;
 		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
 			/*
 			 * Pre-register the mempools. Regardless of whether
@@ -266,6 +268,7 @@ mlx5_hairpin_auto_bind(struct rte_eth_dev *dev)
 	struct mlx5_devx_modify_sq_attr sq_attr = { 0 };
 	struct mlx5_devx_modify_rq_attr rq_attr = { 0 };
 	struct mlx5_txq_ctrl *txq_ctrl;
+	struct mlx5_rxq_priv *rxq;
 	struct mlx5_rxq_ctrl *rxq_ctrl;
 	struct mlx5_devx_obj *sq;
 	struct mlx5_devx_obj *rq;
@@ -310,9 +313,8 @@ mlx5_hairpin_auto_bind(struct rte_eth_dev *dev)
 			return -rte_errno;
 		}
 		sq = txq_ctrl->obj->sq;
-		rxq_ctrl = mlx5_rxq_get(dev,
-					txq_ctrl->hairpin_conf.peers[0].queue);
-		if (!rxq_ctrl) {
+		rxq = mlx5_rxq_get(dev, txq_ctrl->hairpin_conf.peers[0].queue);
+		if (rxq == NULL) {
 			mlx5_txq_release(dev, i);
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u no rxq object found: %d",
@@ -320,6 +322,7 @@ mlx5_hairpin_auto_bind(struct rte_eth_dev *dev)
 				txq_ctrl->hairpin_conf.peers[0].queue);
 			return -rte_errno;
 		}
+		rxq_ctrl = rxq->ctrl;
 		if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN ||
 		    rxq_ctrl->hairpin_conf.peers[0].queue != i) {
 			rte_errno = ENOMEM;
@@ -354,12 +357,10 @@ mlx5_hairpin_auto_bind(struct rte_eth_dev *dev)
 		rxq_ctrl->hairpin_status = 1;
 		txq_ctrl->hairpin_status = 1;
 		mlx5_txq_release(dev, i);
-		mlx5_rxq_release(dev, txq_ctrl->hairpin_conf.peers[0].queue);
 	}
 	return 0;
 error:
 	mlx5_txq_release(dev, i);
-	mlx5_rxq_release(dev, txq_ctrl->hairpin_conf.peers[0].queue);
 	return -rte_errno;
 }
 
@@ -432,27 +433,26 @@ mlx5_hairpin_queue_peer_update(struct rte_eth_dev *dev, uint16_t peer_queue,
 		peer_info->manual_bind = txq_ctrl->hairpin_conf.manual_bind;
 		mlx5_txq_release(dev, peer_queue);
 	} else { /* Peer port used as ingress. */
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, peer_queue);
 		struct mlx5_rxq_ctrl *rxq_ctrl;
 
-		rxq_ctrl = mlx5_rxq_get(dev, peer_queue);
-		if (rxq_ctrl == NULL) {
+		if (rxq == NULL) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "Failed to get port %u Rx queue %d",
 				dev->data->port_id, peer_queue);
 			return -rte_errno;
 		}
+		rxq_ctrl = rxq->ctrl;
 		if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u queue %d is not a hairpin Rxq",
 				dev->data->port_id, peer_queue);
-			mlx5_rxq_release(dev, peer_queue);
 			return -rte_errno;
 		}
 		if (rxq_ctrl->obj == NULL || rxq_ctrl->obj->rq == NULL) {
 			rte_errno = ENOMEM;
 			DRV_LOG(ERR, "port %u no Rxq object found: %d",
 				dev->data->port_id, peer_queue);
-			mlx5_rxq_release(dev, peer_queue);
 			return -rte_errno;
 		}
 		peer_info->qp_id = rxq_ctrl->obj->rq->id;
@@ -460,7 +460,6 @@ mlx5_hairpin_queue_peer_update(struct rte_eth_dev *dev, uint16_t peer_queue,
 		peer_info->peer_q = rxq_ctrl->hairpin_conf.peers[0].queue;
 		peer_info->tx_explicit = rxq_ctrl->hairpin_conf.tx_explicit;
 		peer_info->manual_bind = rxq_ctrl->hairpin_conf.manual_bind;
-		mlx5_rxq_release(dev, peer_queue);
 	}
 	return 0;
 }
@@ -559,34 +558,32 @@ mlx5_hairpin_queue_peer_bind(struct rte_eth_dev *dev, uint16_t cur_queue,
 			txq_ctrl->hairpin_status = 1;
 		mlx5_txq_release(dev, cur_queue);
 	} else {
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, cur_queue);
 		struct mlx5_rxq_ctrl *rxq_ctrl;
 		struct mlx5_devx_modify_rq_attr rq_attr = { 0 };
 
-		rxq_ctrl = mlx5_rxq_get(dev, cur_queue);
-		if (rxq_ctrl == NULL) {
+		if (rxq == NULL) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "Failed to get port %u Rx queue %d",
 				dev->data->port_id, cur_queue);
 			return -rte_errno;
 		}
+		rxq_ctrl = rxq->ctrl;
 		if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u queue %d not a hairpin Rxq",
 				dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return -rte_errno;
 		}
 		if (rxq_ctrl->obj == NULL || rxq_ctrl->obj->rq == NULL) {
 			rte_errno = ENOMEM;
 			DRV_LOG(ERR, "port %u no Rxq object found: %d",
 				dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return -rte_errno;
 		}
 		if (rxq_ctrl->hairpin_status != 0) {
 			DRV_LOG(DEBUG, "port %u Rx queue %d is already bound",
 				dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return 0;
 		}
 		if (peer_info->tx_explicit !=
@@ -594,7 +591,6 @@ mlx5_hairpin_queue_peer_bind(struct rte_eth_dev *dev, uint16_t cur_queue,
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u Rx queue %d and peer Tx rule mode"
 				" mismatch", dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return -rte_errno;
 		}
 		if (peer_info->manual_bind !=
@@ -602,7 +598,6 @@ mlx5_hairpin_queue_peer_bind(struct rte_eth_dev *dev, uint16_t cur_queue,
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u Rx queue %d and peer binding mode"
 				" mismatch", dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return -rte_errno;
 		}
 		rq_attr.state = MLX5_SQC_STATE_RDY;
@@ -612,7 +607,6 @@ mlx5_hairpin_queue_peer_bind(struct rte_eth_dev *dev, uint16_t cur_queue,
 		ret = mlx5_devx_cmd_modify_rq(rxq_ctrl->obj->rq, &rq_attr);
 		if (ret == 0)
 			rxq_ctrl->hairpin_status = 1;
-		mlx5_rxq_release(dev, cur_queue);
 	}
 	return ret;
 }
@@ -677,34 +671,32 @@ mlx5_hairpin_queue_peer_unbind(struct rte_eth_dev *dev, uint16_t cur_queue,
 			txq_ctrl->hairpin_status = 0;
 		mlx5_txq_release(dev, cur_queue);
 	} else {
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, cur_queue);
 		struct mlx5_rxq_ctrl *rxq_ctrl;
 		struct mlx5_devx_modify_rq_attr rq_attr = { 0 };
 
-		rxq_ctrl = mlx5_rxq_get(dev, cur_queue);
-		if (rxq_ctrl == NULL) {
+		if (rxq == NULL) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "Failed to get port %u Rx queue %d",
 				dev->data->port_id, cur_queue);
 			return -rte_errno;
 		}
+		rxq_ctrl = rxq->ctrl;
 		if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u queue %d not a hairpin Rxq",
 				dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return -rte_errno;
 		}
 		if (rxq_ctrl->hairpin_status == 0) {
 			DRV_LOG(DEBUG, "port %u Rx queue %d is already unbound",
 				dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return 0;
 		}
 		if (rxq_ctrl->obj == NULL || rxq_ctrl->obj->rq == NULL) {
 			rte_errno = ENOMEM;
 			DRV_LOG(ERR, "port %u no Rxq object found: %d",
 				dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return -rte_errno;
 		}
 		rq_attr.state = MLX5_SQC_STATE_RST;
@@ -712,7 +704,6 @@ mlx5_hairpin_queue_peer_unbind(struct rte_eth_dev *dev, uint16_t cur_queue,
 		ret = mlx5_devx_cmd_modify_rq(rxq_ctrl->obj->rq, &rq_attr);
 		if (ret == 0)
 			rxq_ctrl->hairpin_status = 0;
-		mlx5_rxq_release(dev, cur_queue);
 	}
 	return ret;
 }
@@ -1014,7 +1005,6 @@ mlx5_hairpin_get_peer_ports(struct rte_eth_dev *dev, uint16_t *peer_ports,
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_txq_ctrl *txq_ctrl;
-	struct mlx5_rxq_ctrl *rxq_ctrl;
 	uint32_t i;
 	uint16_t pp;
 	uint32_t bits[(RTE_MAX_ETHPORTS + 31) / 32] = {0};
@@ -1043,24 +1033,23 @@ mlx5_hairpin_get_peer_ports(struct rte_eth_dev *dev, uint16_t *peer_ports,
 		}
 	} else {
 		for (i = 0; i < priv->rxqs_n; i++) {
-			rxq_ctrl = mlx5_rxq_get(dev, i);
-			if (!rxq_ctrl)
+			struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, i);
+			struct mlx5_rxq_ctrl *rxq_ctrl;
+
+			if (rxq == NULL)
 				continue;
-			if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN) {
-				mlx5_rxq_release(dev, i);
+			rxq_ctrl = rxq->ctrl;
+			if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN)
 				continue;
-			}
 			pp = rxq_ctrl->hairpin_conf.peers[0].port;
 			if (pp >= RTE_MAX_ETHPORTS) {
 				rte_errno = ERANGE;
-				mlx5_rxq_release(dev, i);
 				DRV_LOG(ERR, "port %hu queue %u peer port "
 					"out of range %hu",
 					priv->dev_data->port_id, i, pp);
 				return -rte_errno;
 			}
 			bits[pp / 32] |= 1 << (pp % 32);
-			mlx5_rxq_release(dev, i);
 		}
 	}
 	for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 09/14] net/mlx5: move Rx queue hairpin info to private data
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
                     ` (7 preceding siblings ...)
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 08/14] net/mlx5: move Rx queue reference count Xueming Li
@ 2021-11-03  7:58   ` Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 10/14] net/mlx5: remove port info from shareable Rx queue Xueming Li
                     ` (4 subsequent siblings)
  13 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Matan Azrad, Viacheslav Ovsiienko

Hairpin configuration of an Rx queue cannot be shared between ports,
so this patch moves it to the per-port Rx queue private data.
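
The resulting split can be pictured with the simplified sketch below.
Field names are illustrative only; the real structures use struct
rte_eth_hairpin_conf and are defined in mlx5_rx.h.

	struct rxq_obj;                  /* shared HW objects */

	struct rxq_ctrl {                /* shareable across owner ports */
		struct rxq_obj *obj;     /* resources common to all owners */
	};

	struct rxq_priv {                /* one instance per owner port */
		struct rxq_ctrl *ctrl;
		struct hairpin_conf {
			uint16_t peer_queue;
			uint32_t tx_explicit:1;
			uint32_t manual_bind:1;
		} hairpin_conf;          /* per-port, cannot be shared */
		uint32_t hairpin_status; /* per-port binding status */
	};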

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 drivers/net/mlx5/mlx5_rx.h      |  4 ++--
 drivers/net/mlx5/mlx5_rxq.c     | 13 +++++--------
 drivers/net/mlx5/mlx5_trigger.c | 24 ++++++++++++------------
 3 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index eccfbf1108d..b21918223b8 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -162,8 +162,6 @@ struct mlx5_rxq_ctrl {
 	uint32_t flow_tunnels_n[MLX5_FLOW_TUNNEL]; /* Tunnels counters. */
 	uint32_t wqn; /* WQ number. */
 	uint16_t dump_file_n; /* Number of dump files. */
-	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
-	uint32_t hairpin_status; /* Hairpin binding status. */
 };
 
 /* RX queue private data. */
@@ -173,6 +171,8 @@ struct mlx5_rxq_priv {
 	struct mlx5_rxq_ctrl *ctrl; /* Shared Rx Queue. */
 	LIST_ENTRY(mlx5_rxq_priv) owner_entry; /* Entry in shared rxq_ctrl. */
 	struct mlx5_priv *priv; /* Back pointer to private data. */
+	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
+	uint32_t hairpin_status; /* Hairpin binding status. */
 };
 
 /* mlx5_rxq.c */
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 8071ddbd61c..7b637fda643 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1695,8 +1695,8 @@ mlx5_rxq_hairpin_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
 	tmpl->rxq.elts_n = log2above(desc);
 	tmpl->rxq.elts = NULL;
 	tmpl->rxq.mr_ctrl.cache_bh = (struct mlx5_mr_btree) { 0 };
-	tmpl->hairpin_conf = *hairpin_conf;
 	tmpl->rxq.idx = idx;
+	rxq->hairpin_conf = *hairpin_conf;
 	mlx5_rxq_ref(dev, idx);
 	LIST_INSERT_HEAD(&priv->rxqsctrl, tmpl, next);
 	return tmpl;
@@ -1913,14 +1913,11 @@ const struct rte_eth_hairpin_conf *
 mlx5_rxq_get_hairpin_conf(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_ctrl *rxq_ctrl = NULL;
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
 
-	if (idx < priv->rxqs_n && (*priv->rxqs)[idx]) {
-		rxq_ctrl = container_of((*priv->rxqs)[idx],
-					struct mlx5_rxq_ctrl,
-					rxq);
-		if (rxq_ctrl->type == MLX5_RXQ_TYPE_HAIRPIN)
-			return &rxq_ctrl->hairpin_conf;
+	if (idx < priv->rxqs_n && rxq != NULL) {
+		if (rxq->ctrl->type == MLX5_RXQ_TYPE_HAIRPIN)
+			return &rxq->hairpin_conf;
 	}
 	return NULL;
 }
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index e5d74d275f8..a124f74fcda 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -324,7 +324,7 @@ mlx5_hairpin_auto_bind(struct rte_eth_dev *dev)
 		}
 		rxq_ctrl = rxq->ctrl;
 		if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN ||
-		    rxq_ctrl->hairpin_conf.peers[0].queue != i) {
+		    rxq->hairpin_conf.peers[0].queue != i) {
 			rte_errno = ENOMEM;
 			DRV_LOG(ERR, "port %u Tx queue %d can't be binded to "
 				"Rx queue %d", dev->data->port_id,
@@ -354,7 +354,7 @@ mlx5_hairpin_auto_bind(struct rte_eth_dev *dev)
 		if (ret)
 			goto error;
 		/* Qs with auto-bind will be destroyed directly. */
-		rxq_ctrl->hairpin_status = 1;
+		rxq->hairpin_status = 1;
 		txq_ctrl->hairpin_status = 1;
 		mlx5_txq_release(dev, i);
 	}
@@ -457,9 +457,9 @@ mlx5_hairpin_queue_peer_update(struct rte_eth_dev *dev, uint16_t peer_queue,
 		}
 		peer_info->qp_id = rxq_ctrl->obj->rq->id;
 		peer_info->vhca_id = priv->config.hca_attr.vhca_id;
-		peer_info->peer_q = rxq_ctrl->hairpin_conf.peers[0].queue;
-		peer_info->tx_explicit = rxq_ctrl->hairpin_conf.tx_explicit;
-		peer_info->manual_bind = rxq_ctrl->hairpin_conf.manual_bind;
+		peer_info->peer_q = rxq->hairpin_conf.peers[0].queue;
+		peer_info->tx_explicit = rxq->hairpin_conf.tx_explicit;
+		peer_info->manual_bind = rxq->hairpin_conf.manual_bind;
 	}
 	return 0;
 }
@@ -581,20 +581,20 @@ mlx5_hairpin_queue_peer_bind(struct rte_eth_dev *dev, uint16_t cur_queue,
 				dev->data->port_id, cur_queue);
 			return -rte_errno;
 		}
-		if (rxq_ctrl->hairpin_status != 0) {
+		if (rxq->hairpin_status != 0) {
 			DRV_LOG(DEBUG, "port %u Rx queue %d is already bound",
 				dev->data->port_id, cur_queue);
 			return 0;
 		}
 		if (peer_info->tx_explicit !=
-		    rxq_ctrl->hairpin_conf.tx_explicit) {
+		    rxq->hairpin_conf.tx_explicit) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u Rx queue %d and peer Tx rule mode"
 				" mismatch", dev->data->port_id, cur_queue);
 			return -rte_errno;
 		}
 		if (peer_info->manual_bind !=
-		    rxq_ctrl->hairpin_conf.manual_bind) {
+		    rxq->hairpin_conf.manual_bind) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u Rx queue %d and peer binding mode"
 				" mismatch", dev->data->port_id, cur_queue);
@@ -606,7 +606,7 @@ mlx5_hairpin_queue_peer_bind(struct rte_eth_dev *dev, uint16_t cur_queue,
 		rq_attr.hairpin_peer_vhca = peer_info->vhca_id;
 		ret = mlx5_devx_cmd_modify_rq(rxq_ctrl->obj->rq, &rq_attr);
 		if (ret == 0)
-			rxq_ctrl->hairpin_status = 1;
+			rxq->hairpin_status = 1;
 	}
 	return ret;
 }
@@ -688,7 +688,7 @@ mlx5_hairpin_queue_peer_unbind(struct rte_eth_dev *dev, uint16_t cur_queue,
 				dev->data->port_id, cur_queue);
 			return -rte_errno;
 		}
-		if (rxq_ctrl->hairpin_status == 0) {
+		if (rxq->hairpin_status == 0) {
 			DRV_LOG(DEBUG, "port %u Rx queue %d is already unbound",
 				dev->data->port_id, cur_queue);
 			return 0;
@@ -703,7 +703,7 @@ mlx5_hairpin_queue_peer_unbind(struct rte_eth_dev *dev, uint16_t cur_queue,
 		rq_attr.rq_state = MLX5_SQC_STATE_RST;
 		ret = mlx5_devx_cmd_modify_rq(rxq_ctrl->obj->rq, &rq_attr);
 		if (ret == 0)
-			rxq_ctrl->hairpin_status = 0;
+			rxq->hairpin_status = 0;
 	}
 	return ret;
 }
@@ -1041,7 +1041,7 @@ mlx5_hairpin_get_peer_ports(struct rte_eth_dev *dev, uint16_t *peer_ports,
 			rxq_ctrl = rxq->ctrl;
 			if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN)
 				continue;
-			pp = rxq_ctrl->hairpin_conf.peers[0].port;
+			pp = rxq->hairpin_conf.peers[0].port;
 			if (pp >= RTE_MAX_ETHPORTS) {
 				rte_errno = ERANGE;
 				DRV_LOG(ERR, "port %hu queue %u peer port "
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 10/14] net/mlx5: remove port info from shareable Rx queue
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
                     ` (8 preceding siblings ...)
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 09/14] net/mlx5: move Rx queue hairpin info to private data Xueming Li
@ 2021-11-03  7:58   ` Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 11/14] net/mlx5: move Rx queue DevX resource Xueming Li
                     ` (3 subsequent siblings)
  13 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Matan Azrad, Viacheslav Ovsiienko

To prepare for shared Rx queues, this patch removes the per-port back
pointer from the shareable Rx queue control structure; the owning port
is now reached through the owner list.
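
With the back pointer gone, per-port data is recovered through the
owner list, which is what the RXQ_PORT() macro added by this patch
does. A minimal model (BSD <sys/queue.h>, simplified names; any owner
serves for lookups that merely need some port, e.g. logging, and the
first list entry is used by convention):

	#include <sys/queue.h>

	struct port_priv;                /* per-port private data */
	struct rxq_priv;

	struct rxq_ctrl {
		LIST_HEAD(, rxq_priv) owners;  /* ports owning this ctrl */
	};

	struct rxq_priv {
		LIST_ENTRY(rxq_priv) owner_entry;
		struct port_priv *priv;
	};

	#define RXQ_PORT(ctrl) (LIST_FIRST(&(ctrl)->owners)->priv)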

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 drivers/net/mlx5/mlx5_devx.c     |  2 +-
 drivers/net/mlx5/mlx5_rx.c       | 15 +++--------
 drivers/net/mlx5/mlx5_rx.h       |  7 ++++--
 drivers/net/mlx5/mlx5_rxq.c      | 43 ++++++++++++++++++++++----------
 drivers/net/mlx5/mlx5_rxtx_vec.c |  2 +-
 drivers/net/mlx5/mlx5_trigger.c  | 13 +++++-----
 6 files changed, 47 insertions(+), 35 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 443252df05d..8b3651f5034 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -918,7 +918,7 @@ mlx5_rxq_devx_obj_drop_create(struct rte_eth_dev *dev)
 	}
 	rxq->rxq_ctrl = rxq_ctrl;
 	rxq_ctrl->type = MLX5_RXQ_TYPE_STANDARD;
-	rxq_ctrl->priv = priv;
+	rxq_ctrl->sh = priv->sh;
 	rxq_ctrl->obj = rxq;
 	rxq_data = &rxq_ctrl->rxq;
 	/* Create CQ using DevX API. */
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 258a6453144..d41905a2a04 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -118,15 +118,7 @@ int
 mlx5_rx_descriptor_status(void *rx_queue, uint16_t offset)
 {
 	struct mlx5_rxq_data *rxq = rx_queue;
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-			container_of(rxq, struct mlx5_rxq_ctrl, rxq);
-	struct rte_eth_dev *dev = ETH_DEV(rxq_ctrl->priv);
 
-	if (dev->rx_pkt_burst == NULL ||
-	    dev->rx_pkt_burst == removed_rx_burst) {
-		rte_errno = ENOTSUP;
-		return -rte_errno;
-	}
 	if (offset >= (1 << rxq->cqe_n)) {
 		rte_errno = EINVAL;
 		return -rte_errno;
@@ -438,10 +430,10 @@ mlx5_rx_err_handle(struct mlx5_rxq_data *rxq, uint8_t vec)
 		sm.is_wq = 1;
 		sm.queue_id = rxq->idx;
 		sm.state = IBV_WQS_RESET;
-		if (mlx5_queue_state_modify(ETH_DEV(rxq_ctrl->priv), &sm))
+		if (mlx5_queue_state_modify(RXQ_DEV(rxq_ctrl), &sm))
 			return -1;
 		if (rxq_ctrl->dump_file_n <
-		    rxq_ctrl->priv->config.max_dump_files_num) {
+		    RXQ_PORT(rxq_ctrl)->config.max_dump_files_num) {
 			MKSTR(err_str, "Unexpected CQE error syndrome "
 			      "0x%02x CQN = %u RQN = %u wqe_counter = %u"
 			      " rq_ci = %u cq_ci = %u", u.err_cqe->syndrome,
@@ -478,8 +470,7 @@ mlx5_rx_err_handle(struct mlx5_rxq_data *rxq, uint8_t vec)
 			sm.is_wq = 1;
 			sm.queue_id = rxq->idx;
 			sm.state = IBV_WQS_RDY;
-			if (mlx5_queue_state_modify(ETH_DEV(rxq_ctrl->priv),
-						    &sm))
+			if (mlx5_queue_state_modify(RXQ_DEV(rxq_ctrl), &sm))
 				return -1;
 			if (vec) {
 				const uint32_t elts_n =
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index b21918223b8..c04c0c73349 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -22,6 +22,10 @@
 /* Support tunnel matching. */
 #define MLX5_FLOW_TUNNEL 10
 
+#define RXQ_PORT(rxq_ctrl) LIST_FIRST(&(rxq_ctrl)->owners)->priv
+#define RXQ_DEV(rxq_ctrl) ETH_DEV(RXQ_PORT(rxq_ctrl))
+#define RXQ_PORT_ID(rxq_ctrl) PORT_ID(RXQ_PORT(rxq_ctrl))
+
 /* First entry must be NULL for comparison. */
 #define mlx5_mr_btree_len(bt) ((bt)->len - 1)
 
@@ -152,7 +156,6 @@ struct mlx5_rxq_ctrl {
 	LIST_HEAD(priv, mlx5_rxq_priv) owners; /* Owner rxq list. */
 	struct mlx5_rxq_obj *obj; /* Verbs/DevX elements. */
 	struct mlx5_dev_ctx_shared *sh; /* Shared context. */
-	struct mlx5_priv *priv; /* Back pointer to private data. */
 	enum mlx5_rxq_type type; /* Rxq type. */
 	unsigned int socket; /* CPU socket ID for allocations. */
 	uint32_t share_group; /* Group ID of shared RXQ. */
@@ -318,7 +321,7 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 	 */
 	rxq_ctrl = container_of(rxq, struct mlx5_rxq_ctrl, rxq);
 	mp = mlx5_rxq_mprq_enabled(rxq) ? rxq->mprq_mp : rxq->mp;
-	return mlx5_mr_mempool2mr_bh(&rxq_ctrl->priv->sh->cdev->mr_scache,
+	return mlx5_mr_mempool2mr_bh(&rxq_ctrl->sh->cdev->mr_scache,
 				     mr_ctrl, mp, addr);
 }
 
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 7b637fda643..5a20966e2ca 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -148,8 +148,14 @@ rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
 
 		buf = rte_pktmbuf_alloc(seg->mp);
 		if (buf == NULL) {
-			DRV_LOG(ERR, "port %u empty mbuf pool",
-				PORT_ID(rxq_ctrl->priv));
+			if (rxq_ctrl->share_group == 0)
+				DRV_LOG(ERR, "port %u queue %u empty mbuf pool",
+					RXQ_PORT_ID(rxq_ctrl),
+					rxq_ctrl->rxq.idx);
+			else
+				DRV_LOG(ERR, "share group %u queue %u empty mbuf pool",
+					rxq_ctrl->share_group,
+					rxq_ctrl->share_qid);
 			rte_errno = ENOMEM;
 			goto error;
 		}
@@ -193,11 +199,16 @@ rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
 		for (j = 0; j < MLX5_VPMD_DESCS_PER_LOOP; ++j)
 			(*rxq->elts)[elts_n + j] = &rxq->fake_mbuf;
 	}
-	DRV_LOG(DEBUG,
-		"port %u SPRQ queue %u allocated and configured %u segments"
-		" (max %u packets)",
-		PORT_ID(rxq_ctrl->priv), rxq_ctrl->rxq.idx, elts_n,
-		elts_n / (1 << rxq_ctrl->rxq.sges_n));
+	if (rxq_ctrl->share_group == 0)
+		DRV_LOG(DEBUG,
+			"port %u SPRQ queue %u allocated and configured %u segments (max %u packets)",
+			RXQ_PORT_ID(rxq_ctrl), rxq_ctrl->rxq.idx, elts_n,
+			elts_n / (1 << rxq_ctrl->rxq.sges_n));
+	else
+		DRV_LOG(DEBUG,
+			"share group %u SPRQ queue %u allocated and configured %u segments (max %u packets)",
+			rxq_ctrl->share_group, rxq_ctrl->share_qid, elts_n,
+			elts_n / (1 << rxq_ctrl->rxq.sges_n));
 	return 0;
 error:
 	err = rte_errno; /* Save rte_errno before cleanup. */
@@ -207,8 +218,12 @@ rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
 			rte_pktmbuf_free_seg((*rxq_ctrl->rxq.elts)[i]);
 		(*rxq_ctrl->rxq.elts)[i] = NULL;
 	}
-	DRV_LOG(DEBUG, "port %u SPRQ queue %u failed, freed everything",
-		PORT_ID(rxq_ctrl->priv), rxq_ctrl->rxq.idx);
+	if (rxq_ctrl->share_group == 0)
+		DRV_LOG(DEBUG, "port %u SPRQ queue %u failed, freed everything",
+			RXQ_PORT_ID(rxq_ctrl), rxq_ctrl->rxq.idx);
+	else
+		DRV_LOG(DEBUG, "share group %u SPRQ queue %u failed, freed everything",
+			rxq_ctrl->share_group, rxq_ctrl->share_qid);
 	rte_errno = err; /* Restore rte_errno. */
 	return -rte_errno;
 }
@@ -284,8 +299,12 @@ rxq_free_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
 	uint16_t used = q_n - (elts_ci - rxq->rq_pi);
 	uint16_t i;
 
-	DRV_LOG(DEBUG, "port %u Rx queue %u freeing %d WRs",
-		PORT_ID(rxq_ctrl->priv), rxq->idx, q_n);
+	if (rxq_ctrl->share_group == 0)
+		DRV_LOG(DEBUG, "port %u Rx queue %u freeing %d WRs",
+			RXQ_PORT_ID(rxq_ctrl), rxq->idx, q_n);
+	else
+		DRV_LOG(DEBUG, "share group %u Rx queue %u freeing %d WRs",
+			rxq_ctrl->share_group, rxq_ctrl->share_qid, q_n);
 	if (rxq->elts == NULL)
 		return;
 	/**
@@ -1630,7 +1649,6 @@ mlx5_rxq_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
 		(!!(dev->data->dev_conf.rxmode.mq_mode & RTE_ETH_MQ_RX_RSS));
 	tmpl->rxq.port_id = dev->data->port_id;
 	tmpl->sh = priv->sh;
-	tmpl->priv = priv;
 	tmpl->rxq.mp = rx_seg[0].mp;
 	tmpl->rxq.elts_n = log2above(desc);
 	tmpl->rxq.rq_repl_thresh =
@@ -1690,7 +1708,6 @@ mlx5_rxq_hairpin_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
 	tmpl->rxq.rss_hash = 0;
 	tmpl->rxq.port_id = dev->data->port_id;
 	tmpl->sh = priv->sh;
-	tmpl->priv = priv;
 	tmpl->rxq.mp = NULL;
 	tmpl->rxq.elts_n = log2above(desc);
 	tmpl->rxq.elts = NULL;
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
index ecd273e00a8..511681841ca 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.c
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
@@ -550,7 +550,7 @@ mlx5_rxq_check_vec_support(struct mlx5_rxq_data *rxq)
 	struct mlx5_rxq_ctrl *ctrl =
 		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
 
-	if (!ctrl->priv->config.rx_vec_en || rxq->sges_n != 0)
+	if (!RXQ_PORT(ctrl)->config.rx_vec_en || rxq->sges_n != 0)
 		return -ENOTSUP;
 	if (rxq->lro)
 		return -ENOTSUP;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index a124f74fcda..caafdf27e8f 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -131,9 +131,11 @@ mlx5_rxq_mempool_register_cb(struct rte_mempool *mp, void *opaque,
  *   0 on success, (-1) on failure and rte_errno is set.
  */
 static int
-mlx5_rxq_mempool_register(struct mlx5_rxq_ctrl *rxq_ctrl)
+mlx5_rxq_mempool_register(struct rte_eth_dev *dev,
+			  struct mlx5_rxq_ctrl *rxq_ctrl)
 {
-	struct mlx5_priv *priv = rxq_ctrl->priv;
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_dev_ctx_shared *sh = rxq_ctrl->sh;
 	struct rte_mempool *mp;
 	uint32_t s;
 	int ret = 0;
@@ -148,9 +150,8 @@ mlx5_rxq_mempool_register(struct mlx5_rxq_ctrl *rxq_ctrl)
 	}
 	for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++) {
 		mp = rxq_ctrl->rxq.rxseg[s].mp;
-		ret = mlx5_mr_mempool_register(&priv->sh->cdev->mr_scache,
-					       priv->sh->cdev->pd, mp,
-					       &priv->mp_id);
+		ret = mlx5_mr_mempool_register(&sh->cdev->mr_scache,
+					       sh->cdev->pd, mp, &priv->mp_id);
 		if (ret < 0 && rte_errno != EEXIST)
 			return ret;
 		rte_mempool_mem_iter(mp, mlx5_rxq_mempool_register_cb,
@@ -213,7 +214,7 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 			 * the implicit registration is enabled or not,
 			 * Rx mempool destruction is tracked to free MRs.
 			 */
-			if (mlx5_rxq_mempool_register(rxq_ctrl) < 0)
+			if (mlx5_rxq_mempool_register(dev, rxq_ctrl) < 0)
 				goto error;
 			ret = rxq_alloc_elts(rxq_ctrl);
 			if (ret)
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 11/14] net/mlx5: move Rx queue DevX resource
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
                     ` (9 preceding siblings ...)
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 10/14] net/mlx5: remove port info from shareable Rx queue Xueming Li
@ 2021-11-03  7:58   ` Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 12/14] net/mlx5: remove Rx queue data list from device Xueming Li
                     ` (2 subsequent siblings)
  13 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Lior Margalit, Matan Azrad, Viacheslav Ovsiienko,
	Anatoly Burakov

To support shared Rx queues, this patch moves the DevX RQ, which is a
per-queue resource, into the Rx queue private data.
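
The shape of the interface change can be sketched as below (simplified
types; the real table is struct mlx5_obj_ops in mlx5.h): the Rx object
callbacks now take the per-port queue and reach shared state through
rxq->ctrl->obj when needed.

	struct rxq_priv;                 /* per-port queue, owns the DevX RQ */

	struct obj_ops {
		/* before: int (*rxq_obj_modify)(struct mlx5_rxq_obj *, uint8_t); */
		int (*rxq_obj_new)(struct rxq_priv *rxq);
		int (*rxq_obj_modify)(struct rxq_priv *rxq, uint8_t type);
		void (*rxq_obj_release)(struct rxq_priv *rxq);
	};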

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 drivers/net/mlx5/linux/mlx5_verbs.c | 154 +++++++++++--------
 drivers/net/mlx5/mlx5.h             |  11 +-
 drivers/net/mlx5/mlx5_devx.c        | 227 +++++++++++++---------------
 drivers/net/mlx5/mlx5_rx.h          |   1 +
 drivers/net/mlx5/mlx5_rxq.c         |  44 +++---
 drivers/net/mlx5/mlx5_rxtx.c        |   6 +-
 drivers/net/mlx5/mlx5_trigger.c     |   2 +-
 drivers/net/mlx5/mlx5_vlan.c        |  16 +-
 8 files changed, 240 insertions(+), 221 deletions(-)

diff --git a/drivers/net/mlx5/linux/mlx5_verbs.c b/drivers/net/mlx5/linux/mlx5_verbs.c
index 4779b37aa65..5d4ae3ea752 100644
--- a/drivers/net/mlx5/linux/mlx5_verbs.c
+++ b/drivers/net/mlx5/linux/mlx5_verbs.c
@@ -29,13 +29,13 @@
 /**
  * Modify Rx WQ vlan stripping offload
  *
- * @param rxq_obj
- *   Rx queue object.
+ * @param rxq
+ *   Rx queue.
  *
  * @return 0 on success, non-0 otherwise
  */
 static int
-mlx5_rxq_obj_modify_wq_vlan_strip(struct mlx5_rxq_obj *rxq_obj, int on)
+mlx5_rxq_obj_modify_wq_vlan_strip(struct mlx5_rxq_priv *rxq, int on)
 {
 	uint16_t vlan_offloads =
 		(on ? IBV_WQ_FLAGS_CVLAN_STRIPPING : 0) |
@@ -47,14 +47,14 @@ mlx5_rxq_obj_modify_wq_vlan_strip(struct mlx5_rxq_obj *rxq_obj, int on)
 		.flags = vlan_offloads,
 	};
 
-	return mlx5_glue->modify_wq(rxq_obj->wq, &mod);
+	return mlx5_glue->modify_wq(rxq->ctrl->obj->wq, &mod);
 }
 
 /**
  * Modifies the attributes for the specified WQ.
  *
- * @param rxq_obj
- *   Verbs Rx queue object.
+ * @param rxq
+ *   Verbs Rx queue.
  * @param type
  *   Type of change queue state.
  *
@@ -62,14 +62,14 @@ mlx5_rxq_obj_modify_wq_vlan_strip(struct mlx5_rxq_obj *rxq_obj, int on)
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_ibv_modify_wq(struct mlx5_rxq_obj *rxq_obj, uint8_t type)
+mlx5_ibv_modify_wq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
 	struct ibv_wq_attr mod = {
 		.attr_mask = IBV_WQ_ATTR_STATE,
 		.wq_state = (enum ibv_wq_state)type,
 	};
 
-	return mlx5_glue->modify_wq(rxq_obj->wq, &mod);
+	return mlx5_glue->modify_wq(rxq->ctrl->obj->wq, &mod);
 }
 
 /**
@@ -139,21 +139,18 @@ mlx5_ibv_modify_qp(struct mlx5_txq_obj *obj, enum mlx5_txq_modify_type type,
 /**
  * Create a CQ Verbs object.
  *
- * @param dev
- *   Pointer to Ethernet device.
- * @param idx
- *   Queue index in DPDK Rx queue array.
+ * @param rxq
+ *   Pointer to Rx queue.
  *
  * @return
  *   The Verbs CQ object initialized, NULL otherwise and rte_errno is set.
  */
 static struct ibv_cq *
-mlx5_rxq_ibv_cq_create(struct rte_eth_dev *dev, uint16_t idx)
+mlx5_rxq_ibv_cq_create(struct mlx5_rxq_priv *rxq)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq_data = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_priv *priv = rxq->priv;
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_data *rxq_data = &rxq_ctrl->rxq;
 	struct mlx5_rxq_obj *rxq_obj = rxq_ctrl->obj;
 	unsigned int cqe_n = mlx5_rxq_cqe_num(rxq_data);
 	struct {
@@ -199,7 +196,7 @@ mlx5_rxq_ibv_cq_create(struct rte_eth_dev *dev, uint16_t idx)
 		DRV_LOG(DEBUG,
 			"Port %u Rx CQE compression is disabled for HW"
 			" timestamp.",
-			dev->data->port_id);
+			priv->dev_data->port_id);
 	}
 #ifdef HAVE_IBV_MLX5_MOD_CQE_128B_PAD
 	if (RTE_CACHE_LINE_SIZE == 128) {
@@ -216,21 +213,18 @@ mlx5_rxq_ibv_cq_create(struct rte_eth_dev *dev, uint16_t idx)
 /**
  * Create a WQ Verbs object.
  *
- * @param dev
- *   Pointer to Ethernet device.
- * @param idx
- *   Queue index in DPDK Rx queue array.
+ * @param rxq
+ *   Pointer to Rx queue.
  *
  * @return
  *   The Verbs WQ object initialized, NULL otherwise and rte_errno is set.
  */
 static struct ibv_wq *
-mlx5_rxq_ibv_wq_create(struct rte_eth_dev *dev, uint16_t idx)
+mlx5_rxq_ibv_wq_create(struct mlx5_rxq_priv *rxq)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq_data = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_priv *priv = rxq->priv;
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_data *rxq_data = &rxq_ctrl->rxq;
 	struct mlx5_rxq_obj *rxq_obj = rxq_ctrl->obj;
 	unsigned int wqe_n = 1 << rxq_data->elts_n;
 	struct {
@@ -297,7 +291,7 @@ mlx5_rxq_ibv_wq_create(struct rte_eth_dev *dev, uint16_t idx)
 			DRV_LOG(ERR,
 				"Port %u Rx queue %u requested %u*%u but got"
 				" %u*%u WRs*SGEs.",
-				dev->data->port_id, idx,
+				priv->dev_data->port_id, rxq->idx,
 				wqe_n >> rxq_data->sges_n,
 				(1 << rxq_data->sges_n),
 				wq_attr.ibv.max_wr, wq_attr.ibv.max_sge);
@@ -312,21 +306,20 @@ mlx5_rxq_ibv_wq_create(struct rte_eth_dev *dev, uint16_t idx)
 /**
  * Create the Rx queue Verbs object.
  *
- * @param dev
- *   Pointer to Ethernet device.
- * @param idx
- *   Queue index in DPDK Rx queue array.
+ * @param rxq
+ *   Pointer to Rx queue.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_rxq_ibv_obj_new(struct rte_eth_dev *dev, uint16_t idx)
+mlx5_rxq_ibv_obj_new(struct mlx5_rxq_priv *rxq)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq_data = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
+	uint16_t idx = rxq->idx;
+	struct mlx5_priv *priv = rxq->priv;
+	uint16_t port_id = priv->dev_data->port_id;
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_data *rxq_data = &rxq_ctrl->rxq;
 	struct mlx5_rxq_obj *tmpl = rxq_ctrl->obj;
 	struct mlx5dv_cq cq_info;
 	struct mlx5dv_rwq rwq;
@@ -341,17 +334,17 @@ mlx5_rxq_ibv_obj_new(struct rte_eth_dev *dev, uint16_t idx)
 			mlx5_glue->create_comp_channel(priv->sh->cdev->ctx);
 		if (!tmpl->ibv_channel) {
 			DRV_LOG(ERR, "Port %u: comp channel creation failure.",
-				dev->data->port_id);
+				port_id);
 			rte_errno = ENOMEM;
 			goto error;
 		}
 		tmpl->fd = ((struct ibv_comp_channel *)(tmpl->ibv_channel))->fd;
 	}
 	/* Create CQ using Verbs API. */
-	tmpl->ibv_cq = mlx5_rxq_ibv_cq_create(dev, idx);
+	tmpl->ibv_cq = mlx5_rxq_ibv_cq_create(rxq);
 	if (!tmpl->ibv_cq) {
 		DRV_LOG(ERR, "Port %u Rx queue %u CQ creation failure.",
-			dev->data->port_id, idx);
+			port_id, idx);
 		rte_errno = ENOMEM;
 		goto error;
 	}
@@ -366,7 +359,7 @@ mlx5_rxq_ibv_obj_new(struct rte_eth_dev *dev, uint16_t idx)
 		DRV_LOG(ERR,
 			"Port %u wrong MLX5_CQE_SIZE environment "
 			"variable value: it should be set to %u.",
-			dev->data->port_id, RTE_CACHE_LINE_SIZE);
+			port_id, RTE_CACHE_LINE_SIZE);
 		rte_errno = EINVAL;
 		goto error;
 	}
@@ -377,19 +370,19 @@ mlx5_rxq_ibv_obj_new(struct rte_eth_dev *dev, uint16_t idx)
 	rxq_data->cq_uar = cq_info.cq_uar;
 	rxq_data->cqn = cq_info.cqn;
 	/* Create WQ (RQ) using Verbs API. */
-	tmpl->wq = mlx5_rxq_ibv_wq_create(dev, idx);
+	tmpl->wq = mlx5_rxq_ibv_wq_create(rxq);
 	if (!tmpl->wq) {
 		DRV_LOG(ERR, "Port %u Rx queue %u WQ creation failure.",
-			dev->data->port_id, idx);
+			port_id, idx);
 		rte_errno = ENOMEM;
 		goto error;
 	}
 	/* Change queue state to ready. */
-	ret = mlx5_ibv_modify_wq(tmpl, IBV_WQS_RDY);
+	ret = mlx5_ibv_modify_wq(rxq, IBV_WQS_RDY);
 	if (ret) {
 		DRV_LOG(ERR,
 			"Port %u Rx queue %u WQ state to IBV_WQS_RDY failed.",
-			dev->data->port_id, idx);
+			port_id, idx);
 		rte_errno = ret;
 		goto error;
 	}
@@ -405,7 +398,7 @@ mlx5_rxq_ibv_obj_new(struct rte_eth_dev *dev, uint16_t idx)
 	rxq_data->cq_arm_sn = 0;
 	mlx5_rxq_initialize(rxq_data);
 	rxq_data->cq_ci = 0;
-	dev->data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_STARTED;
+	priv->dev_data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_STARTED;
 	rxq_ctrl->wqn = ((struct ibv_wq *)(tmpl->wq))->wq_num;
 	return 0;
 error:
@@ -423,12 +416,14 @@ mlx5_rxq_ibv_obj_new(struct rte_eth_dev *dev, uint16_t idx)
 /**
  * Release an Rx verbs queue object.
  *
- * @param rxq_obj
- *   Verbs Rx queue object.
+ * @param rxq
+ *   Pointer to Rx queue.
  */
 static void
-mlx5_rxq_ibv_obj_release(struct mlx5_rxq_obj *rxq_obj)
+mlx5_rxq_ibv_obj_release(struct mlx5_rxq_priv *rxq)
 {
+	struct mlx5_rxq_obj *rxq_obj = rxq->ctrl->obj;
+
 	MLX5_ASSERT(rxq_obj);
 	MLX5_ASSERT(rxq_obj->wq);
 	MLX5_ASSERT(rxq_obj->ibv_cq);
@@ -652,12 +647,24 @@ static void
 mlx5_rxq_ibv_obj_drop_release(struct rte_eth_dev *dev)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_obj *rxq = priv->drop_queue.rxq;
+	struct mlx5_rxq_priv *rxq = priv->drop_queue.rxq;
+	struct mlx5_rxq_obj *rxq_obj;
 
-	if (rxq->wq)
-		claim_zero(mlx5_glue->destroy_wq(rxq->wq));
-	if (rxq->ibv_cq)
-		claim_zero(mlx5_glue->destroy_cq(rxq->ibv_cq));
+	if (rxq == NULL)
+		return;
+	if (rxq->ctrl == NULL)
+		goto free_priv;
+	rxq_obj = rxq->ctrl->obj;
+	if (rxq_obj == NULL)
+		goto free_ctrl;
+	if (rxq_obj->wq)
+		claim_zero(mlx5_glue->destroy_wq(rxq_obj->wq));
+	if (rxq_obj->ibv_cq)
+		claim_zero(mlx5_glue->destroy_cq(rxq_obj->ibv_cq));
+	mlx5_free(rxq_obj);
+free_ctrl:
+	mlx5_free(rxq->ctrl);
+free_priv:
 	mlx5_free(rxq);
 	priv->drop_queue.rxq = NULL;
 }
@@ -676,39 +683,58 @@ mlx5_rxq_ibv_obj_drop_create(struct rte_eth_dev *dev)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct ibv_context *ctx = priv->sh->cdev->ctx;
-	struct mlx5_rxq_obj *rxq = priv->drop_queue.rxq;
+	struct mlx5_rxq_priv *rxq = priv->drop_queue.rxq;
+	struct mlx5_rxq_ctrl *rxq_ctrl = NULL;
+	struct mlx5_rxq_obj *rxq_obj = NULL;
 
-	if (rxq)
+	if (rxq != NULL)
 		return 0;
 	rxq = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rxq), 0, SOCKET_ID_ANY);
-	if (!rxq) {
+	if (rxq == NULL) {
 		DRV_LOG(DEBUG, "Port %u cannot allocate drop Rx queue memory.",
 		      dev->data->port_id);
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
 	priv->drop_queue.rxq = rxq;
-	rxq->ibv_cq = mlx5_glue->create_cq(ctx, 1, NULL, NULL, 0);
-	if (!rxq->ibv_cq) {
+	rxq_ctrl = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rxq_ctrl), 0,
+			       SOCKET_ID_ANY);
+	if (rxq_ctrl == NULL) {
+		DRV_LOG(DEBUG, "Port %u cannot allocate drop Rx queue control memory.",
+		      dev->data->port_id);
+		rte_errno = ENOMEM;
+		goto error;
+	}
+	rxq->ctrl = rxq_ctrl;
+	rxq_obj = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rxq_obj), 0,
+			      SOCKET_ID_ANY);
+	if (rxq_obj == NULL) {
+		DRV_LOG(DEBUG, "Port %u cannot allocate drop Rx queue memory.",
+		      dev->data->port_id);
+		rte_errno = ENOMEM;
+		goto error;
+	}
+	rxq_ctrl->obj = rxq_obj;
+	rxq_obj->ibv_cq = mlx5_glue->create_cq(ctx, 1, NULL, NULL, 0);
+	if (!rxq_obj->ibv_cq) {
 		DRV_LOG(DEBUG, "Port %u cannot allocate CQ for drop queue.",
 		      dev->data->port_id);
 		rte_errno = errno;
 		goto error;
 	}
-	rxq->wq = mlx5_glue->create_wq(ctx, &(struct ibv_wq_init_attr){
+	rxq_obj->wq = mlx5_glue->create_wq(ctx, &(struct ibv_wq_init_attr){
 						    .wq_type = IBV_WQT_RQ,
 						    .max_wr = 1,
 						    .max_sge = 1,
 						    .pd = priv->sh->cdev->pd,
-						    .cq = rxq->ibv_cq,
+						    .cq = rxq_obj->ibv_cq,
 					      });
-	if (!rxq->wq) {
+	if (!rxq_obj->wq) {
 		DRV_LOG(DEBUG, "Port %u cannot allocate WQ for drop queue.",
 		      dev->data->port_id);
 		rte_errno = errno;
 		goto error;
 	}
-	priv->drop_queue.rxq = rxq;
 	return 0;
 error:
 	mlx5_rxq_ibv_obj_drop_release(dev);
@@ -737,7 +763,7 @@ mlx5_ibv_drop_action_create(struct rte_eth_dev *dev)
 	ret = mlx5_rxq_ibv_obj_drop_create(dev);
 	if (ret < 0)
 		goto error;
-	rxq = priv->drop_queue.rxq;
+	rxq = priv->drop_queue.rxq->ctrl->obj;
 	ind_tbl = mlx5_glue->create_rwq_ind_table
 				(priv->sh->cdev->ctx,
 				 &(struct ibv_rwq_ind_table_init_attr){
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 3e008241ca8..bc1b6b96cda 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -294,7 +294,7 @@ struct mlx5_vf_vlan {
 /* Flow drop context necessary due to Verbs API. */
 struct mlx5_drop {
 	struct mlx5_hrxq *hrxq; /* Hash Rx queue queue. */
-	struct mlx5_rxq_obj *rxq; /* Rx queue object. */
+	struct mlx5_rxq_priv *rxq; /* Rx queue. */
 };
 
 /* Loopback dummy queue resources required due to Verbs API. */
@@ -1239,7 +1239,6 @@ struct mlx5_rxq_obj {
 		};
 		struct mlx5_devx_obj *rq; /* DevX RQ object for hairpin. */
 		struct {
-			struct mlx5_devx_rq rq_obj; /* DevX RQ object. */
 			struct mlx5_devx_cq cq_obj; /* DevX CQ object. */
 			void *devx_channel;
 		};
@@ -1321,11 +1320,11 @@ struct mlx5_rxq_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
-	int (*rxq_obj_modify_vlan_strip)(struct mlx5_rxq_obj *rxq_obj, int on);
-	int (*rxq_obj_new)(struct rte_eth_dev *dev, uint16_t idx);
+	int (*rxq_obj_modify_vlan_strip)(struct mlx5_rxq_priv *rxq, int on);
+	int (*rxq_obj_new)(struct mlx5_rxq_priv *rxq);
 	int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
-	int (*rxq_obj_modify)(struct mlx5_rxq_obj *rxq_obj, uint8_t type);
-	void (*rxq_obj_release)(struct mlx5_rxq_obj *rxq_obj);
+	int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
+	void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
 	int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 			     struct mlx5_ind_table_obj *ind_tbl);
 	int (*ind_table_modify)(struct rte_eth_dev *dev,
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 8b3651f5034..b90a5d82458 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -30,14 +30,16 @@
 /**
  * Modify RQ vlan stripping offload
  *
- * @param rxq_obj
- *   Rx queue object.
+ * @param rxq
+ *   Rx queue.
+ * @param on
+ *   Enable/disable VLAN stripping.
  *
  * @return
  *   0 on success, non-0 otherwise
  */
 static int
-mlx5_rxq_obj_modify_rq_vlan_strip(struct mlx5_rxq_obj *rxq_obj, int on)
+mlx5_rxq_obj_modify_rq_vlan_strip(struct mlx5_rxq_priv *rxq, int on)
 {
 	struct mlx5_devx_modify_rq_attr rq_attr;
 
@@ -46,14 +48,14 @@ mlx5_rxq_obj_modify_rq_vlan_strip(struct mlx5_rxq_obj *rxq_obj, int on)
 	rq_attr.state = MLX5_RQC_STATE_RDY;
 	rq_attr.vsd = (on ? 0 : 1);
 	rq_attr.modify_bitmask = MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_VSD;
-	return mlx5_devx_cmd_modify_rq(rxq_obj->rq_obj.rq, &rq_attr);
+	return mlx5_devx_cmd_modify_rq(rxq->devx_rq.rq, &rq_attr);
 }
 
 /**
  * Modify RQ using DevX API.
  *
- * @param rxq_obj
- *   DevX Rx queue object.
+ * @param rxq
+ *   DevX rx queue.
  * @param type
  *   Type of change queue state.
  *
@@ -61,7 +63,7 @@ mlx5_rxq_obj_modify_rq_vlan_strip(struct mlx5_rxq_obj *rxq_obj, int on)
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_devx_modify_rq(struct mlx5_rxq_obj *rxq_obj, uint8_t type)
+mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
 	struct mlx5_devx_modify_rq_attr rq_attr;
 
@@ -86,7 +88,7 @@ mlx5_devx_modify_rq(struct mlx5_rxq_obj *rxq_obj, uint8_t type)
 	default:
 		break;
 	}
-	return mlx5_devx_cmd_modify_rq(rxq_obj->rq_obj.rq, &rq_attr);
+	return mlx5_devx_cmd_modify_rq(rxq->devx_rq.rq, &rq_attr);
 }
 
 /**
@@ -145,42 +147,34 @@ mlx5_txq_devx_modify(struct mlx5_txq_obj *obj, enum mlx5_txq_modify_type type,
 	return 0;
 }
 
-/**
- * Destroy the Rx queue DevX object.
- *
- * @param rxq_obj
- *   Rxq object to destroy.
- */
-static void
-mlx5_rxq_release_devx_resources(struct mlx5_rxq_obj *rxq_obj)
-{
-	mlx5_devx_rq_destroy(&rxq_obj->rq_obj);
-	memset(&rxq_obj->rq_obj, 0, sizeof(rxq_obj->rq_obj));
-	mlx5_devx_cq_destroy(&rxq_obj->cq_obj);
-	memset(&rxq_obj->cq_obj, 0, sizeof(rxq_obj->cq_obj));
-}
-
 /**
  * Release an Rx DevX queue object.
  *
- * @param rxq_obj
- *   DevX Rx queue object.
+ * @param rxq
+ *   DevX Rx queue.
  */
 static void
-mlx5_rxq_devx_obj_release(struct mlx5_rxq_obj *rxq_obj)
+mlx5_rxq_devx_obj_release(struct mlx5_rxq_priv *rxq)
 {
-	MLX5_ASSERT(rxq_obj);
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_obj *rxq_obj = rxq_ctrl->obj;
+
+	MLX5_ASSERT(rxq != NULL);
+	MLX5_ASSERT(rxq_ctrl != NULL);
 	if (rxq_obj->rxq_ctrl->type == MLX5_RXQ_TYPE_HAIRPIN) {
 		MLX5_ASSERT(rxq_obj->rq);
-		mlx5_devx_modify_rq(rxq_obj, MLX5_RXQ_MOD_RDY2RST);
+		mlx5_devx_modify_rq(rxq, MLX5_RXQ_MOD_RDY2RST);
 		claim_zero(mlx5_devx_cmd_destroy(rxq_obj->rq));
 	} else {
-		MLX5_ASSERT(rxq_obj->cq_obj.cq);
-		MLX5_ASSERT(rxq_obj->rq_obj.rq);
-		mlx5_rxq_release_devx_resources(rxq_obj);
-		if (rxq_obj->devx_channel)
+		mlx5_devx_rq_destroy(&rxq->devx_rq);
+		memset(&rxq->devx_rq, 0, sizeof(rxq->devx_rq));
+		mlx5_devx_cq_destroy(&rxq_obj->cq_obj);
+		memset(&rxq_obj->cq_obj, 0, sizeof(rxq_obj->cq_obj));
+		if (rxq_obj->devx_channel) {
 			mlx5_os_devx_destroy_event_channel
 							(rxq_obj->devx_channel);
+			rxq_obj->devx_channel = NULL;
+		}
 	}
 }
 
@@ -224,22 +218,19 @@ mlx5_rx_devx_get_event(struct mlx5_rxq_obj *rxq_obj)
 /**
  * Create a RQ object using DevX.
  *
- * @param dev
- *   Pointer to Ethernet device.
- * @param rxq_data
- *   RX queue data.
+ * @param rxq
+ *   Pointer to Rx queue.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_rxq_create_devx_rq_resources(struct rte_eth_dev *dev,
-				  struct mlx5_rxq_data *rxq_data)
+mlx5_rxq_create_devx_rq_resources(struct mlx5_rxq_priv *rxq)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_priv *priv = rxq->priv;
 	struct mlx5_common_device *cdev = priv->sh->cdev;
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
 	struct mlx5_devx_create_rq_attr rq_attr = { 0 };
 	uint16_t log_desc_n = rxq_data->elts_n - rxq_data->sges_n;
 	uint32_t wqe_size, log_wqe_size;
@@ -281,31 +272,29 @@ mlx5_rxq_create_devx_rq_resources(struct rte_eth_dev *dev,
 	rq_attr.wq_attr.pd = cdev->pdn;
 	rq_attr.counter_set_id = priv->counter_set_id;
 	/* Create RQ using DevX API. */
-	return mlx5_devx_rq_create(cdev->ctx, &rxq_ctrl->obj->rq_obj, wqe_size,
+	return mlx5_devx_rq_create(cdev->ctx, &rxq->devx_rq, wqe_size,
 				   log_desc_n, &rq_attr, rxq_ctrl->socket);
 }
 
 /**
  * Create a DevX CQ object for an Rx queue.
  *
- * @param dev
- *   Pointer to Ethernet device.
- * @param rxq_data
- *   RX queue data.
+ * @param rxq
+ *   Pointer to Rx queue.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_rxq_create_devx_cq_resources(struct rte_eth_dev *dev,
-				  struct mlx5_rxq_data *rxq_data)
+mlx5_rxq_create_devx_cq_resources(struct mlx5_rxq_priv *rxq)
 {
 	struct mlx5_devx_cq *cq_obj = 0;
 	struct mlx5_devx_cq_attr cq_attr = { 0 };
-	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_priv *priv = rxq->priv;
 	struct mlx5_dev_ctx_shared *sh = priv->sh;
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
+	uint16_t port_id = priv->dev_data->port_id;
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_data *rxq_data = &rxq_ctrl->rxq;
 	unsigned int cqe_n = mlx5_rxq_cqe_num(rxq_data);
 	uint32_t log_cqe_n;
 	uint16_t event_nums[1] = { 0 };
@@ -346,7 +335,7 @@ mlx5_rxq_create_devx_cq_resources(struct rte_eth_dev *dev,
 		}
 		DRV_LOG(DEBUG,
 			"Port %u Rx CQE compression is enabled, format %d.",
-			dev->data->port_id, priv->config.cqe_comp_fmt);
+			port_id, priv->config.cqe_comp_fmt);
 		/*
 		 * For vectorized Rx, it must not be doubled in order to
 		 * make cq_ci and rq_ci aligned.
@@ -355,13 +344,12 @@ mlx5_rxq_create_devx_cq_resources(struct rte_eth_dev *dev,
 			cqe_n *= 2;
 	} else if (priv->config.cqe_comp && rxq_data->hw_timestamp) {
 		DRV_LOG(DEBUG,
-			"Port %u Rx CQE compression is disabled for HW"
-			" timestamp.",
-			dev->data->port_id);
+			"Port %u Rx CQE compression is disabled for HW timestamp.",
+			port_id);
 	} else if (priv->config.cqe_comp && rxq_data->lro) {
 		DRV_LOG(DEBUG,
 			"Port %u Rx CQE compression is disabled for LRO.",
-			dev->data->port_id);
+			port_id);
 	}
 	cq_attr.uar_page_id = mlx5_os_get_devx_uar_page_id(sh->devx_rx_uar);
 	log_cqe_n = log2above(cqe_n);
@@ -399,27 +387,23 @@ mlx5_rxq_create_devx_cq_resources(struct rte_eth_dev *dev,
 /**
  * Create the Rx hairpin queue object.
  *
- * @param dev
- *   Pointer to Ethernet device.
- * @param idx
- *   Queue index in DPDK Rx queue array.
+ * @param rxq
+ *   Pointer to Rx queue.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_rxq_obj_hairpin_new(struct rte_eth_dev *dev, uint16_t idx)
+mlx5_rxq_obj_hairpin_new(struct mlx5_rxq_priv *rxq)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq_data = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
+	uint16_t idx = rxq->idx;
+	struct mlx5_priv *priv = rxq->priv;
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
 	struct mlx5_devx_create_rq_attr attr = { 0 };
 	struct mlx5_rxq_obj *tmpl = rxq_ctrl->obj;
 	uint32_t max_wq_data;
 
-	MLX5_ASSERT(rxq_data);
-	MLX5_ASSERT(tmpl);
+	MLX5_ASSERT(rxq != NULL && rxq->ctrl != NULL && tmpl != NULL);
 	tmpl->rxq_ctrl = rxq_ctrl;
 	attr.hairpin = 1;
 	max_wq_data = priv->config.hca_attr.log_max_hairpin_wq_data_sz;
@@ -448,39 +432,36 @@ mlx5_rxq_obj_hairpin_new(struct rte_eth_dev *dev, uint16_t idx)
 	if (!tmpl->rq) {
 		DRV_LOG(ERR,
 			"Port %u Rx hairpin queue %u can't create rq object.",
-			dev->data->port_id, idx);
+			priv->dev_data->port_id, idx);
 		rte_errno = errno;
 		return -rte_errno;
 	}
-	dev->data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_HAIRPIN;
+	priv->dev_data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_HAIRPIN;
 	return 0;
 }
 
 /**
  * Create the Rx queue DevX object.
  *
- * @param dev
- *   Pointer to Ethernet device.
- * @param idx
- *   Queue index in DPDK Rx queue array.
+ * @param rxq
+ *   Pointer to Rx queue.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_rxq_devx_obj_new(struct rte_eth_dev *dev, uint16_t idx)
+mlx5_rxq_devx_obj_new(struct mlx5_rxq_priv *rxq)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq_data = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_priv *priv = rxq->priv;
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_data *rxq_data = &rxq_ctrl->rxq;
 	struct mlx5_rxq_obj *tmpl = rxq_ctrl->obj;
 	int ret = 0;
 
 	MLX5_ASSERT(rxq_data);
 	MLX5_ASSERT(tmpl);
 	if (rxq_ctrl->type == MLX5_RXQ_TYPE_HAIRPIN)
-		return mlx5_rxq_obj_hairpin_new(dev, idx);
+		return mlx5_rxq_obj_hairpin_new(rxq);
 	tmpl->rxq_ctrl = rxq_ctrl;
 	if (rxq_ctrl->irq) {
 		int devx_ev_flag =
@@ -498,34 +479,32 @@ mlx5_rxq_devx_obj_new(struct rte_eth_dev *dev, uint16_t idx)
 		tmpl->fd = mlx5_os_get_devx_channel_fd(tmpl->devx_channel);
 	}
 	/* Create CQ using DevX API. */
-	ret = mlx5_rxq_create_devx_cq_resources(dev, rxq_data);
+	ret = mlx5_rxq_create_devx_cq_resources(rxq);
 	if (ret) {
 		DRV_LOG(ERR, "Failed to create CQ.");
 		goto error;
 	}
 	/* Create RQ using DevX API. */
-	ret = mlx5_rxq_create_devx_rq_resources(dev, rxq_data);
+	ret = mlx5_rxq_create_devx_rq_resources(rxq);
 	if (ret) {
 		DRV_LOG(ERR, "Port %u Rx queue %u RQ creation failure.",
-			dev->data->port_id, idx);
+			priv->dev_data->port_id, rxq->idx);
 		rte_errno = ENOMEM;
 		goto error;
 	}
 	/* Change queue state to ready. */
-	ret = mlx5_devx_modify_rq(tmpl, MLX5_RXQ_MOD_RST2RDY);
+	ret = mlx5_devx_modify_rq(rxq, MLX5_RXQ_MOD_RST2RDY);
 	if (ret)
 		goto error;
-	rxq_data->wqes = (void *)(uintptr_t)tmpl->rq_obj.wq.umem_buf;
-	rxq_data->rq_db = (uint32_t *)(uintptr_t)tmpl->rq_obj.wq.db_rec;
-	rxq_data->cq_arm_sn = 0;
-	rxq_data->cq_ci = 0;
+	rxq_data->wqes = (void *)(uintptr_t)rxq->devx_rq.wq.umem_buf;
+	rxq_data->rq_db = (uint32_t *)(uintptr_t)rxq->devx_rq.wq.db_rec;
 	mlx5_rxq_initialize(rxq_data);
-	dev->data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_STARTED;
-	rxq_ctrl->wqn = tmpl->rq_obj.rq->id;
+	priv->dev_data->rx_queue_state[rxq->idx] = RTE_ETH_QUEUE_STATE_STARTED;
+	rxq_ctrl->wqn = rxq->devx_rq.rq->id;
 	return 0;
 error:
 	ret = rte_errno; /* Save rte_errno before cleanup. */
-	mlx5_rxq_devx_obj_release(tmpl);
+	mlx5_rxq_devx_obj_release(rxq);
 	rte_errno = ret; /* Restore rte_errno. */
 	return -rte_errno;
 }
@@ -571,15 +550,15 @@ mlx5_devx_ind_table_create_rqt_attr(struct rte_eth_dev *dev,
 	rqt_attr->rqt_actual_size = rqt_n;
 	if (queues == NULL) {
 		for (i = 0; i < rqt_n; i++)
-			rqt_attr->rq_list[i] = priv->drop_queue.rxq->rq->id;
+			rqt_attr->rq_list[i] =
+					priv->drop_queue.rxq->devx_rq.rq->id;
 		return rqt_attr;
 	}
 	for (i = 0; i != queues_n; ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[queues[i]];
-		struct mlx5_rxq_ctrl *rxq_ctrl =
-				container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, queues[i]);
 
-		rqt_attr->rq_list[i] = rxq_ctrl->obj->rq_obj.rq->id;
+		MLX5_ASSERT(rxq != NULL);
+		rqt_attr->rq_list[i] = rxq->devx_rq.rq->id;
 	}
 	MLX5_ASSERT(i > 0);
 	for (j = 0; i != rqt_n; ++j, ++i)
@@ -719,7 +698,7 @@ mlx5_devx_tir_attr_set(struct rte_eth_dev *dev, const uint8_t *rss_key,
 			}
 		}
 	} else {
-		rxq_obj_type = priv->drop_queue.rxq->rxq_ctrl->type;
+		rxq_obj_type = priv->drop_queue.rxq->ctrl->type;
 	}
 	memset(tir_attr, 0, sizeof(*tir_attr));
 	tir_attr->disp_type = MLX5_TIRC_DISP_TYPE_INDIRECT;
@@ -891,9 +870,9 @@ mlx5_rxq_devx_obj_drop_create(struct rte_eth_dev *dev)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	int socket_id = dev->device->numa_node;
-	struct mlx5_rxq_ctrl *rxq_ctrl;
-	struct mlx5_rxq_data *rxq_data;
-	struct mlx5_rxq_obj *rxq = NULL;
+	struct mlx5_rxq_priv *rxq;
+	struct mlx5_rxq_ctrl *rxq_ctrl = NULL;
+	struct mlx5_rxq_obj *rxq_obj = NULL;
 	int ret;
 
 	/*
@@ -901,6 +880,13 @@ mlx5_rxq_devx_obj_drop_create(struct rte_eth_dev *dev)
 	 * They are required to hold pointers for cleanup
 	 * and are only accessible via drop queue DevX objects.
 	 */
+	rxq = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rxq), 0, socket_id);
+	if (rxq == NULL) {
+		DRV_LOG(ERR, "Port %u could not allocate drop queue private",
+			dev->data->port_id);
+		rte_errno = ENOMEM;
+		goto error;
+	}
 	rxq_ctrl = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rxq_ctrl),
 			       0, socket_id);
 	if (rxq_ctrl == NULL) {
@@ -909,27 +895,29 @@ mlx5_rxq_devx_obj_drop_create(struct rte_eth_dev *dev)
 		rte_errno = ENOMEM;
 		goto error;
 	}
-	rxq = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rxq), 0, socket_id);
-	if (rxq == NULL) {
+	rxq_obj = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rxq_obj), 0, socket_id);
+	if (rxq_obj == NULL) {
 		DRV_LOG(ERR, "Port %u could not allocate drop queue object",
 			dev->data->port_id);
 		rte_errno = ENOMEM;
 		goto error;
 	}
-	rxq->rxq_ctrl = rxq_ctrl;
+	rxq_obj->rxq_ctrl = rxq_ctrl;
 	rxq_ctrl->type = MLX5_RXQ_TYPE_STANDARD;
 	rxq_ctrl->sh = priv->sh;
-	rxq_ctrl->obj = rxq;
-	rxq_data = &rxq_ctrl->rxq;
+	rxq_ctrl->obj = rxq_obj;
+	rxq->ctrl = rxq_ctrl;
+	rxq->priv = priv;
+	LIST_INSERT_HEAD(&rxq_ctrl->owners, rxq, owner_entry);
 	/* Create CQ using DevX API. */
-	ret = mlx5_rxq_create_devx_cq_resources(dev, rxq_data);
+	ret = mlx5_rxq_create_devx_cq_resources(rxq);
 	if (ret != 0) {
 		DRV_LOG(ERR, "Port %u drop queue CQ creation failed.",
 			dev->data->port_id);
 		goto error;
 	}
 	/* Create RQ using DevX API. */
-	ret = mlx5_rxq_create_devx_rq_resources(dev, rxq_data);
+	ret = mlx5_rxq_create_devx_rq_resources(rxq);
 	if (ret != 0) {
 		DRV_LOG(ERR, "Port %u drop queue RQ creation failed.",
 			dev->data->port_id);
@@ -945,18 +933,20 @@ mlx5_rxq_devx_obj_drop_create(struct rte_eth_dev *dev)
 	return 0;
 error:
 	ret = rte_errno; /* Save rte_errno before cleanup. */
-	if (rxq != NULL) {
-		if (rxq->rq_obj.rq != NULL)
-			mlx5_devx_rq_destroy(&rxq->rq_obj);
-		if (rxq->cq_obj.cq != NULL)
-			mlx5_devx_cq_destroy(&rxq->cq_obj);
-		if (rxq->devx_channel)
+	if (rxq != NULL && rxq->devx_rq.rq != NULL)
+		mlx5_devx_rq_destroy(&rxq->devx_rq);
+	if (rxq_obj != NULL) {
+		if (rxq_obj->cq_obj.cq != NULL)
+			mlx5_devx_cq_destroy(&rxq_obj->cq_obj);
+		if (rxq_obj->devx_channel)
 			mlx5_os_devx_destroy_event_channel
-							(rxq->devx_channel);
-		mlx5_free(rxq);
+							(rxq_obj->devx_channel);
+		mlx5_free(rxq_obj);
 	}
 	if (rxq_ctrl != NULL)
 		mlx5_free(rxq_ctrl);
+	if (rxq != NULL)
+		mlx5_free(rxq);
 	rte_errno = ret; /* Restore rte_errno. */
 	return -rte_errno;
 }
@@ -971,12 +961,13 @@ static void
 mlx5_rxq_devx_obj_drop_release(struct rte_eth_dev *dev)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_obj *rxq = priv->drop_queue.rxq;
-	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->rxq_ctrl;
+	struct mlx5_rxq_priv *rxq = priv->drop_queue.rxq;
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
 
 	mlx5_rxq_devx_obj_release(rxq);
-	mlx5_free(rxq);
+	mlx5_free(rxq_ctrl->obj);
 	mlx5_free(rxq_ctrl);
+	mlx5_free(rxq);
 	priv->drop_queue.rxq = NULL;
 }
 
@@ -996,7 +987,7 @@ mlx5_devx_drop_action_destroy(struct rte_eth_dev *dev)
 		mlx5_devx_tir_destroy(hrxq);
 	if (hrxq->ind_table->ind_table != NULL)
 		mlx5_devx_ind_table_destroy(hrxq->ind_table);
-	if (priv->drop_queue.rxq->rq != NULL)
+	if (priv->drop_queue.rxq->devx_rq.rq != NULL)
 		mlx5_rxq_devx_obj_drop_release(dev);
 }
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index c04c0c73349..337dcca59fb 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -174,6 +174,7 @@ struct mlx5_rxq_priv {
 	struct mlx5_rxq_ctrl *ctrl; /* Shared Rx Queue. */
 	LIST_ENTRY(mlx5_rxq_priv) owner_entry; /* Entry in shared rxq_ctrl. */
 	struct mlx5_priv *priv; /* Back pointer to private data. */
+	struct mlx5_devx_rq devx_rq;
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
 };
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 5a20966e2ca..2850a220399 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -471,13 +471,13 @@ int
 mlx5_rx_queue_stop_primary(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-			container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
 	int ret;
 
+	MLX5_ASSERT(rxq != NULL && rxq_ctrl != NULL);
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
-	ret = priv->obj_ops.rxq_obj_modify(rxq_ctrl->obj, MLX5_RXQ_MOD_RDY2RST);
+	ret = priv->obj_ops.rxq_obj_modify(rxq, MLX5_RXQ_MOD_RDY2RST);
 	if (ret) {
 		DRV_LOG(ERR, "Cannot change Rx WQ state to RESET:  %s",
 			strerror(errno));
@@ -485,7 +485,7 @@ mlx5_rx_queue_stop_primary(struct rte_eth_dev *dev, uint16_t idx)
 		return ret;
 	}
 	/* Remove all processes CQEs. */
-	rxq_sync_cq(rxq);
+	rxq_sync_cq(&rxq_ctrl->rxq);
 	/* Free all involved mbufs. */
 	rxq_free_elts(rxq_ctrl);
 	/* Set the actual queue state. */
@@ -557,26 +557,26 @@ int
 mlx5_rx_queue_start_primary(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-			container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
+	struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
 	int ret;
 
-	MLX5_ASSERT(rte_eal_process_type() ==  RTE_PROC_PRIMARY);
+	MLX5_ASSERT(rxq != NULL && rxq->ctrl != NULL);
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	/* Allocate needed buffers. */
-	ret = rxq_alloc_elts(rxq_ctrl);
+	ret = rxq_alloc_elts(rxq->ctrl);
 	if (ret) {
 		DRV_LOG(ERR, "Cannot reallocate buffers for Rx WQ");
 		rte_errno = errno;
 		return ret;
 	}
 	rte_io_wmb();
-	*rxq->cq_db = rte_cpu_to_be_32(rxq->cq_ci);
+	*rxq_data->cq_db = rte_cpu_to_be_32(rxq_data->cq_ci);
 	rte_io_wmb();
 	/* Reset RQ consumer before moving queue to READY state. */
-	*rxq->rq_db = rte_cpu_to_be_32(0);
+	*rxq_data->rq_db = rte_cpu_to_be_32(0);
 	rte_io_wmb();
-	ret = priv->obj_ops.rxq_obj_modify(rxq_ctrl->obj, MLX5_RXQ_MOD_RST2RDY);
+	ret = priv->obj_ops.rxq_obj_modify(rxq, MLX5_RXQ_MOD_RST2RDY);
 	if (ret) {
 		DRV_LOG(ERR, "Cannot change Rx WQ state to READY:  %s",
 			strerror(errno));
@@ -584,8 +584,8 @@ mlx5_rx_queue_start_primary(struct rte_eth_dev *dev, uint16_t idx)
 		return ret;
 	}
 	/* Reinitialize RQ - set WQEs. */
-	mlx5_rxq_initialize(rxq);
-	rxq->err_state = MLX5_RXQ_ERR_STATE_NO_ERROR;
+	mlx5_rxq_initialize(rxq_data);
+	rxq_data->err_state = MLX5_RXQ_ERR_STATE_NO_ERROR;
 	/* Set actual queue state. */
 	dev->data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_STARTED;
 	return 0;
@@ -1835,15 +1835,19 @@ int
 mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
-	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_priv *rxq;
+	struct mlx5_rxq_ctrl *rxq_ctrl;
 
-	if (priv->rxqs == NULL || (*priv->rxqs)[idx] == NULL)
+	if (priv->rxq_privs == NULL)
+		return 0;
+	rxq = mlx5_rxq_get(dev, idx);
+	if (rxq == NULL)
 		return 0;
 	if (mlx5_rxq_deref(dev, idx) > 1)
 		return 1;
-	if (rxq_ctrl->obj) {
-		priv->obj_ops.rxq_obj_release(rxq_ctrl->obj);
+	rxq_ctrl = rxq->ctrl;
+	if (rxq_ctrl->obj != NULL) {
+		priv->obj_ops.rxq_obj_release(rxq);
 		LIST_REMOVE(rxq_ctrl->obj, next);
 		mlx5_free(rxq_ctrl->obj);
 		rxq_ctrl->obj = NULL;
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 0bcdff1b116..54d410b513b 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -373,11 +373,9 @@ mlx5_queue_state_modify_primary(struct rte_eth_dev *dev,
 	struct mlx5_priv *priv = dev->data->dev_private;
 
 	if (sm->is_wq) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[sm->queue_id];
-		struct mlx5_rxq_ctrl *rxq_ctrl =
-			container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, sm->queue_id);
 
-		ret = priv->obj_ops.rxq_obj_modify(rxq_ctrl->obj, sm->state);
+		ret = priv->obj_ops.rxq_obj_modify(rxq, sm->state);
 		if (ret) {
 			DRV_LOG(ERR, "Cannot change Rx WQ state to %u  - %s",
 					sm->state, strerror(errno));
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index caafdf27e8f..2cf62a9780d 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -231,7 +231,7 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 			rte_errno = ENOMEM;
 			goto error;
 		}
-		ret = priv->obj_ops.rxq_obj_new(dev, i);
+		ret = priv->obj_ops.rxq_obj_new(rxq);
 		if (ret) {
 			mlx5_free(rxq_ctrl->obj);
 			rxq_ctrl->obj = NULL;
diff --git a/drivers/net/mlx5/mlx5_vlan.c b/drivers/net/mlx5/mlx5_vlan.c
index 07792fc5d94..ea841bb32fb 100644
--- a/drivers/net/mlx5/mlx5_vlan.c
+++ b/drivers/net/mlx5/mlx5_vlan.c
@@ -91,11 +91,11 @@ void
 mlx5_vlan_strip_queue_set(struct rte_eth_dev *dev, uint16_t queue, int on)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq = (*priv->rxqs)[queue];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, queue);
+	struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
 	int ret = 0;
 
+	MLX5_ASSERT(rxq != NULL && rxq->ctrl != NULL);
 	/* Validate hw support */
 	if (!priv->config.hw_vlan_strip) {
 		DRV_LOG(ERR, "port %u VLAN stripping is not supported",
@@ -109,20 +109,20 @@ mlx5_vlan_strip_queue_set(struct rte_eth_dev *dev, uint16_t queue, int on)
 		return;
 	}
 	DRV_LOG(DEBUG, "port %u set VLAN stripping offloads %d for port %uqueue %d",
-		dev->data->port_id, on, rxq->port_id, queue);
-	if (!rxq_ctrl->obj) {
+		dev->data->port_id, on, rxq_data->port_id, queue);
+	if (rxq->ctrl->obj == NULL) {
 		/* Update related bits in RX queue. */
-		rxq->vlan_strip = !!on;
+		rxq_data->vlan_strip = !!on;
 		return;
 	}
-	ret = priv->obj_ops.rxq_obj_modify_vlan_strip(rxq_ctrl->obj, on);
+	ret = priv->obj_ops.rxq_obj_modify_vlan_strip(rxq, on);
 	if (ret) {
 		DRV_LOG(ERR, "Port %u failed to modify object stripping mode:"
 			" %s", dev->data->port_id, strerror(rte_errno));
 		return;
 	}
 	/* Update related bits in RX queue. */
-	rxq->vlan_strip = !!on;
+	rxq_data->vlan_strip = !!on;
 }
 
 /**
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 12/14] net/mlx5: remove Rx queue data list from device
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
                     ` (10 preceding siblings ...)
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 11/14] net/mlx5: move Rx queue DevX resource Xueming Li
@ 2021-11-03  7:58   ` Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 13/14] net/mlx5: support shared Rx queue Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 14/14] net/mlx5: add shared Rx queue port datapath support Xueming Li
  13 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Matan Azrad, Viacheslav Ovsiienko

The Rx queue data list (priv->rxqs) is redundant now that the Rx queue
list (priv->rxq_privs) exists. This patch removes it and replaces all
direct accesses with the universal wrapper API.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
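Note for reviewers: a minimal sketch of the access pattern behind the
wrapper API (illustrative only, not part of the diff below).
mlx5_rxq_get() and the rxq->ctrl->rxq layout come from this series;
the helper name example_rxq_data_get() is hypothetical.

	/* Illustrative only: replaces direct (*priv->rxqs)[idx] access. */
	static inline struct mlx5_rxq_data *
	example_rxq_data_get(struct rte_eth_dev *dev, uint16_t idx)
	{
		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);

		if (rxq == NULL || rxq->ctrl == NULL)
			return NULL;
		/* Datapath data lives in the (possibly shared) ctrl. */
		return &rxq->ctrl->rxq;
	}
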
 drivers/net/mlx5/linux/mlx5_verbs.c |  7 ++---
 drivers/net/mlx5/mlx5.c             | 10 +-----
 drivers/net/mlx5/mlx5.h             |  1 -
 drivers/net/mlx5/mlx5_devx.c        | 12 +++++---
 drivers/net/mlx5/mlx5_ethdev.c      |  6 +---
 drivers/net/mlx5/mlx5_flow.c        | 47 +++++++++++++++--------------
 drivers/net/mlx5/mlx5_rss.c         |  6 ++--
 drivers/net/mlx5/mlx5_rx.c          | 15 +++++----
 drivers/net/mlx5/mlx5_rx.h          |  9 +++---
 drivers/net/mlx5/mlx5_rxq.c         | 43 ++++++++++++--------------
 drivers/net/mlx5/mlx5_rxtx_vec.c    |  6 ++--
 drivers/net/mlx5/mlx5_stats.c       |  9 +++---
 drivers/net/mlx5/mlx5_trigger.c     |  2 +-
 13 files changed, 79 insertions(+), 94 deletions(-)

diff --git a/drivers/net/mlx5/linux/mlx5_verbs.c b/drivers/net/mlx5/linux/mlx5_verbs.c
index 5d4ae3ea752..f78916c868f 100644
--- a/drivers/net/mlx5/linux/mlx5_verbs.c
+++ b/drivers/net/mlx5/linux/mlx5_verbs.c
@@ -486,11 +486,10 @@ mlx5_ibv_ind_table_new(struct rte_eth_dev *dev, const unsigned int log_n,
 
 	MLX5_ASSERT(ind_tbl);
 	for (i = 0; i != ind_tbl->queues_n; ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[ind_tbl->queues[i]];
-		struct mlx5_rxq_ctrl *rxq_ctrl =
-				container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev,
+							 ind_tbl->queues[i]);
 
-		wq[i] = rxq_ctrl->obj->wq;
+		wq[i] = rxq->ctrl->obj->wq;
 	}
 	MLX5_ASSERT(i > 0);
 	/* Finalise indirection table. */
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index d0fae518025..b81c959d642 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -1686,20 +1686,12 @@ mlx5_dev_close(struct rte_eth_dev *dev)
 	mlx5_mp_os_req_stop_rxtx(dev);
 	/* Free the eCPRI flex parser resource. */
 	mlx5_flex_parser_ecpri_release(dev);
-	if (priv->rxqs != NULL) {
+	if (priv->rxq_privs != NULL) {
 		/* XXX race condition if mlx5_rx_burst() is still running. */
 		rte_delay_us_sleep(1000);
 		for (i = 0; (i != priv->rxqs_n); ++i)
 			mlx5_rxq_release(dev, i);
 		priv->rxqs_n = 0;
-		priv->rxqs = NULL;
-	}
-	if (priv->representor) {
-		/* Each representor has a dedicated interrupts handler */
-		mlx5_free(dev->intr_handle);
-		dev->intr_handle = NULL;
-	}
-	if (priv->rxq_privs != NULL) {
 		mlx5_free(priv->rxq_privs);
 		priv->rxq_privs = NULL;
 	}
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index bc1b6b96cda..75c58b93f91 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1382,7 +1382,6 @@ struct mlx5_priv {
 	unsigned int rxqs_n; /* RX queues array size. */
 	unsigned int txqs_n; /* TX queues array size. */
 	struct mlx5_rxq_priv *(*rxq_privs)[]; /* RX queue non-shared data. */
-	struct mlx5_rxq_data *(*rxqs)[]; /* (Shared) RX queues. */
 	struct mlx5_txq_data *(*txqs)[]; /* TX queues. */
 	struct rte_mempool *mprq_mp; /* Mempool for Multi-Packet RQ. */
 	struct rte_eth_rss_conf rss_conf; /* RSS configuration. */
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index b90a5d82458..668d47025e8 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -684,15 +684,17 @@ mlx5_devx_tir_attr_set(struct rte_eth_dev *dev, const uint8_t *rss_key,
 
 	/* NULL queues designate drop queue. */
 	if (ind_tbl->queues != NULL) {
-		struct mlx5_rxq_data *rxq_data =
-					(*priv->rxqs)[ind_tbl->queues[0]];
 		struct mlx5_rxq_ctrl *rxq_ctrl =
-			container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
-		rxq_obj_type = rxq_ctrl->type;
+				mlx5_rxq_ctrl_get(dev, ind_tbl->queues[0]);
+		rxq_obj_type = rxq_ctrl != NULL ? rxq_ctrl->type :
+						  MLX5_RXQ_TYPE_STANDARD;
 
 		/* Enable TIR LRO only if all the queues were configured for. */
 		for (i = 0; i < ind_tbl->queues_n; ++i) {
-			if (!(*priv->rxqs)[ind_tbl->queues[i]]->lro) {
+			struct mlx5_rxq_data *rxq_i =
+				mlx5_rxq_data_get(dev, ind_tbl->queues[i]);
+
+			if (rxq_i != NULL && !rxq_i->lro) {
 				lro = false;
 				break;
 			}
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index cde505955df..bb38d5d2ade 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -114,7 +114,6 @@ mlx5_dev_configure(struct rte_eth_dev *dev)
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
-	priv->rxqs = (void *)dev->data->rx_queues;
 	priv->txqs = (void *)dev->data->tx_queues;
 	if (txqs_n != priv->txqs_n) {
 		DRV_LOG(INFO, "port %u Tx queues number update: %u -> %u",
@@ -171,11 +170,8 @@ mlx5_dev_configure_rss_reta(struct rte_eth_dev *dev)
 		return -rte_errno;
 	}
 	for (i = 0, j = 0; i < rxqs_n; i++) {
-		struct mlx5_rxq_data *rxq_data;
-		struct mlx5_rxq_ctrl *rxq_ctrl;
+		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, i);
 
-		rxq_data = (*priv->rxqs)[i];
-		rxq_ctrl = container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
 		if (rxq_ctrl && rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD)
 			rss_queue_arr[j++] = i;
 	}
diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index 9904bc5863d..f3a39c50581 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -1200,10 +1200,11 @@ flow_drv_rxq_flags_set(struct rte_eth_dev *dev,
 		return;
 	for (i = 0; i != ind_tbl->queues_n; ++i) {
 		int idx = ind_tbl->queues[i];
-		struct mlx5_rxq_ctrl *rxq_ctrl =
-			container_of((*priv->rxqs)[idx],
-				     struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, idx);
 
+		MLX5_ASSERT(rxq_ctrl != NULL);
+		if (rxq_ctrl == NULL)
+			continue;
 		/*
 		 * To support metadata register copy on Tx loopback,
 		 * this must be always enabled (metadata may arive
@@ -1295,10 +1296,11 @@ flow_drv_rxq_flags_trim(struct rte_eth_dev *dev,
 	MLX5_ASSERT(dev->data->dev_started);
 	for (i = 0; i != ind_tbl->queues_n; ++i) {
 		int idx = ind_tbl->queues[i];
-		struct mlx5_rxq_ctrl *rxq_ctrl =
-			container_of((*priv->rxqs)[idx],
-				     struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, idx);
 
+		MLX5_ASSERT(rxq_ctrl != NULL);
+		if (rxq_ctrl == NULL)
+			continue;
 		if (priv->config.dv_flow_en &&
 		    priv->config.dv_xmeta_en != MLX5_XMETA_MODE_LEGACY &&
 		    mlx5_flow_ext_mreg_supported(dev)) {
@@ -1359,18 +1361,16 @@ flow_rxq_flags_clear(struct rte_eth_dev *dev)
 	unsigned int i;
 
 	for (i = 0; i != priv->rxqs_n; ++i) {
-		struct mlx5_rxq_ctrl *rxq_ctrl;
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, i);
 		unsigned int j;
 
-		if (!(*priv->rxqs)[i])
+		if (rxq == NULL || rxq->ctrl == NULL)
 			continue;
-		rxq_ctrl = container_of((*priv->rxqs)[i],
-					struct mlx5_rxq_ctrl, rxq);
-		rxq_ctrl->flow_mark_n = 0;
-		rxq_ctrl->rxq.mark = 0;
+		rxq->ctrl->flow_mark_n = 0;
+		rxq->ctrl->rxq.mark = 0;
 		for (j = 0; j != MLX5_FLOW_TUNNEL; ++j)
-			rxq_ctrl->flow_tunnels_n[j] = 0;
-		rxq_ctrl->rxq.tunnel = 0;
+			rxq->ctrl->flow_tunnels_n[j] = 0;
+		rxq->ctrl->rxq.tunnel = 0;
 	}
 }
 
@@ -1384,13 +1384,15 @@ void
 mlx5_flow_rxq_dynf_metadata_set(struct rte_eth_dev *dev)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *data;
 	unsigned int i;
 
 	for (i = 0; i != priv->rxqs_n; ++i) {
-		if (!(*priv->rxqs)[i])
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, i);
+		struct mlx5_rxq_data *data;
+
+		if (rxq == NULL || rxq->ctrl == NULL)
 			continue;
-		data = (*priv->rxqs)[i];
+		data = &rxq->ctrl->rxq;
 		if (!rte_flow_dynf_metadata_avail()) {
 			data->dynf_meta = 0;
 			data->flow_meta_mask = 0;
@@ -1581,7 +1583,7 @@ mlx5_flow_validate_action_queue(const struct rte_flow_action *action,
 					  RTE_FLOW_ERROR_TYPE_ACTION_CONF,
 					  &queue->index,
 					  "queue index out of range");
-	if (!(*priv->rxqs)[queue->index])
+	if (mlx5_rxq_get(dev, queue->index) == NULL)
 		return rte_flow_error_set(error, EINVAL,
 					  RTE_FLOW_ERROR_TYPE_ACTION_CONF,
 					  &queue->index,
@@ -1612,7 +1614,7 @@ mlx5_flow_validate_action_queue(const struct rte_flow_action *action,
  *   0 on success, a negative errno code on error.
  */
 static int
-mlx5_validate_rss_queues(const struct rte_eth_dev *dev,
+mlx5_validate_rss_queues(struct rte_eth_dev *dev,
 			 const uint16_t *queues, uint32_t queues_n,
 			 const char **error, uint32_t *queue_idx)
 {
@@ -1621,20 +1623,19 @@ mlx5_validate_rss_queues(const struct rte_eth_dev *dev,
 	uint32_t i;
 
 	for (i = 0; i != queues_n; ++i) {
-		struct mlx5_rxq_ctrl *rxq_ctrl;
+		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev,
+								   queues[i]);
 
 		if (queues[i] >= priv->rxqs_n) {
 			*error = "queue index out of range";
 			*queue_idx = i;
 			return -EINVAL;
 		}
-		if (!(*priv->rxqs)[queues[i]]) {
+		if (rxq_ctrl == NULL) {
 			*error =  "queue is not configured";
 			*queue_idx = i;
 			return -EINVAL;
 		}
-		rxq_ctrl = container_of((*priv->rxqs)[queues[i]],
-					struct mlx5_rxq_ctrl, rxq);
 		if (i == 0)
 			rxq_type = rxq_ctrl->type;
 		if (rxq_type != rxq_ctrl->type) {
diff --git a/drivers/net/mlx5/mlx5_rss.c b/drivers/net/mlx5/mlx5_rss.c
index a04e22398db..75af05b7b02 100644
--- a/drivers/net/mlx5/mlx5_rss.c
+++ b/drivers/net/mlx5/mlx5_rss.c
@@ -65,9 +65,11 @@ mlx5_rss_hash_update(struct rte_eth_dev *dev,
 	priv->rss_conf.rss_hf = rss_conf->rss_hf;
 	/* Enable the RSS hash in all Rx queues. */
 	for (i = 0, idx = 0; idx != priv->rxqs_n; ++i) {
-		if (!(*priv->rxqs)[i])
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, i);
+
+		if (rxq == NULL || rxq->ctrl == NULL)
 			continue;
-		(*priv->rxqs)[i]->rss_hash = !!rss_conf->rss_hf &&
+		rxq->ctrl->rxq.rss_hash = !!rss_conf->rss_hf &&
 			!!(dev->data->dev_conf.rxmode.mq_mode & RTE_ETH_MQ_RX_RSS);
 		++idx;
 	}
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index d41905a2a04..1ffa1b95b88 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -148,10 +148,8 @@ void
 mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 		  struct rte_eth_rxq_info *qinfo)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq = (*priv->rxqs)[rx_queue_id];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
+	struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
 
 	if (!rxq)
 		return;
@@ -162,7 +160,10 @@ mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 	qinfo->conf.rx_thresh.wthresh = 0;
 	qinfo->conf.rx_free_thresh = rxq->rq_repl_thresh;
 	qinfo->conf.rx_drop_en = 1;
-	qinfo->conf.rx_deferred_start = rxq_ctrl ? 0 : 1;
+	if (rxq_ctrl == NULL || rxq_ctrl->obj == NULL)
+		qinfo->conf.rx_deferred_start = 0;
+	else
+		qinfo->conf.rx_deferred_start = 1;
 	qinfo->conf.offloads = dev->data->dev_conf.rxmode.offloads;
 	qinfo->scattered_rx = dev->data->scattered_rx;
 	qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
@@ -191,10 +192,8 @@ mlx5_rx_burst_mode_get(struct rte_eth_dev *dev,
 		       struct rte_eth_burst_mode *mode)
 {
 	eth_rx_burst_t pkt_burst = dev->rx_pkt_burst;
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq;
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, rx_queue_id);
 
-	rxq = (*priv->rxqs)[rx_queue_id];
 	if (!rxq) {
 		rte_errno = EINVAL;
 		return -rte_errno;
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 337dcca59fb..413e36f6d8d 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -603,14 +603,13 @@ mlx5_mprq_enabled(struct rte_eth_dev *dev)
 		return 0;
 	/* All the configured queues should be enabled. */
 	for (i = 0; i < priv->rxqs_n; ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
-		struct mlx5_rxq_ctrl *rxq_ctrl = container_of
-			(rxq, struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, i);
 
-		if (rxq == NULL || rxq_ctrl->type != MLX5_RXQ_TYPE_STANDARD)
+		if (rxq_ctrl == NULL ||
+		    rxq_ctrl->type != MLX5_RXQ_TYPE_STANDARD)
 			continue;
 		n_ibv++;
-		if (mlx5_rxq_mprq_enabled(rxq))
+		if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq))
 			++n;
 	}
 	/* Multi-Packet RQ can't be partially configured. */
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 2850a220399..f3fc618ed2c 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -748,7 +748,7 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 	}
 	DRV_LOG(DEBUG, "port %u adding Rx queue %u to list",
 		dev->data->port_id, idx);
-	(*priv->rxqs)[idx] = &rxq_ctrl->rxq;
+	dev->data->rx_queues[idx] = &rxq_ctrl->rxq;
 	return 0;
 }
 
@@ -830,7 +830,7 @@ mlx5_rx_hairpin_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
 	}
 	DRV_LOG(DEBUG, "port %u adding hairpin Rx queue %u to list",
 		dev->data->port_id, idx);
-	(*priv->rxqs)[idx] = &rxq_ctrl->rxq;
+	dev->data->rx_queues[idx] = &rxq_ctrl->rxq;
 	return 0;
 }
 
@@ -1163,7 +1163,7 @@ mlx5_mprq_free_mp(struct rte_eth_dev *dev)
 	rte_mempool_free(mp);
 	/* Unset mempool for each Rx queue. */
 	for (i = 0; i != priv->rxqs_n; ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
+		struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, i);
 
 		if (rxq == NULL)
 			continue;
@@ -1204,12 +1204,13 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 		return 0;
 	/* Count the total number of descriptors configured. */
 	for (i = 0; i != priv->rxqs_n; ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
-		struct mlx5_rxq_ctrl *rxq_ctrl = container_of
-			(rxq, struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, i);
+		struct mlx5_rxq_data *rxq;
 
-		if (rxq == NULL || rxq_ctrl->type != MLX5_RXQ_TYPE_STANDARD)
+		if (rxq_ctrl == NULL ||
+		    rxq_ctrl->type != MLX5_RXQ_TYPE_STANDARD)
 			continue;
+		rxq = &rxq_ctrl->rxq;
 		n_ibv++;
 		desc += 1 << rxq->elts_n;
 		/* Get the max number of strides. */
@@ -1292,13 +1293,12 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 exit:
 	/* Set mempool for each Rx queue. */
 	for (i = 0; i != priv->rxqs_n; ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
-		struct mlx5_rxq_ctrl *rxq_ctrl = container_of
-			(rxq, struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, i);
 
-		if (rxq == NULL || rxq_ctrl->type != MLX5_RXQ_TYPE_STANDARD)
+		if (rxq_ctrl == NULL ||
+		    rxq_ctrl->type != MLX5_RXQ_TYPE_STANDARD)
 			continue;
-		rxq->mprq_mp = mp;
+		rxq_ctrl->rxq.mprq_mp = mp;
 	}
 	DRV_LOG(INFO, "port %u Multi-Packet RQ is configured",
 		dev->data->port_id);
@@ -1777,8 +1777,7 @@ mlx5_rxq_get(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 
-	if (priv->rxq_privs == NULL)
-		return NULL;
+	MLX5_ASSERT(priv->rxq_privs != NULL);
 	return (*priv->rxq_privs)[idx];
 }
 
@@ -1862,7 +1861,7 @@ mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 		LIST_REMOVE(rxq, owner_entry);
 		LIST_REMOVE(rxq_ctrl, next);
 		mlx5_free(rxq_ctrl);
-		(*priv->rxqs)[idx] = NULL;
+		dev->data->rx_queues[idx] = NULL;
 		mlx5_free(rxq);
 		(*priv->rxq_privs)[idx] = NULL;
 	}
@@ -1908,14 +1907,10 @@ enum mlx5_rxq_type
 mlx5_rxq_get_type(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_ctrl *rxq_ctrl = NULL;
+	struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, idx);
 
-	if (idx < priv->rxqs_n && (*priv->rxqs)[idx]) {
-		rxq_ctrl = container_of((*priv->rxqs)[idx],
-					struct mlx5_rxq_ctrl,
-					rxq);
+	if (idx < priv->rxqs_n && rxq_ctrl != NULL)
 		return rxq_ctrl->type;
-	}
 	return MLX5_RXQ_TYPE_UNDEFINED;
 }
 
@@ -2682,13 +2677,13 @@ mlx5_rxq_timestamp_set(struct rte_eth_dev *dev)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_dev_ctx_shared *sh = priv->sh;
-	struct mlx5_rxq_data *data;
 	unsigned int i;
 
 	for (i = 0; i != priv->rxqs_n; ++i) {
-		if (!(*priv->rxqs)[i])
+		struct mlx5_rxq_data *data = mlx5_rxq_data_get(dev, i);
+
+		if (data == NULL)
 			continue;
-		data = (*priv->rxqs)[i];
 		data->sh = sh;
 		data->rt_timestamp = priv->config.rt_timestamp;
 	}
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
index 511681841ca..6212ce8247d 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.c
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
@@ -578,11 +578,11 @@ mlx5_check_vec_rx_support(struct rte_eth_dev *dev)
 		return -ENOTSUP;
 	/* All the configured queues should support. */
 	for (i = 0; i < priv->rxqs_n; ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
+		struct mlx5_rxq_data *rxq_data = mlx5_rxq_data_get(dev, i);
 
-		if (!rxq)
+		if (!rxq_data)
 			continue;
-		if (mlx5_rxq_check_vec_support(rxq) < 0)
+		if (mlx5_rxq_check_vec_support(rxq_data) < 0)
 			break;
 	}
 	if (i != priv->rxqs_n)
diff --git a/drivers/net/mlx5/mlx5_stats.c b/drivers/net/mlx5/mlx5_stats.c
index ae2f5668a74..732775954ad 100644
--- a/drivers/net/mlx5/mlx5_stats.c
+++ b/drivers/net/mlx5/mlx5_stats.c
@@ -107,7 +107,7 @@ mlx5_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
 	memset(&tmp, 0, sizeof(tmp));
 	/* Add software counters. */
 	for (i = 0; (i != priv->rxqs_n); ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
+		struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, i);
 
 		if (rxq == NULL)
 			continue;
@@ -181,10 +181,11 @@ mlx5_stats_reset(struct rte_eth_dev *dev)
 	unsigned int i;
 
 	for (i = 0; (i != priv->rxqs_n); ++i) {
-		if ((*priv->rxqs)[i] == NULL)
+		struct mlx5_rxq_data *rxq_data = mlx5_rxq_data_get(dev, i);
+
+		if (rxq_data == NULL)
 			continue;
-		memset(&(*priv->rxqs)[i]->stats, 0,
-		       sizeof(struct mlx5_rxq_stats));
+		memset(&rxq_data->stats, 0, sizeof(struct mlx5_rxq_stats));
 	}
 	for (i = 0; (i != priv->txqs_n); ++i) {
 		if ((*priv->txqs)[i] == NULL)
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 2cf62a9780d..72475e4b5b5 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -227,7 +227,7 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 		if (!rxq_ctrl->obj) {
 			DRV_LOG(ERR,
 				"Port %u Rx queue %u can't allocate resources.",
-				dev->data->port_id, (*priv->rxqs)[i]->idx);
+				dev->data->port_id, i);
 			rte_errno = ENOMEM;
 			goto error;
 		}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 13/14] net/mlx5: support shared Rx queue
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
                     ` (11 preceding siblings ...)
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 12/14] net/mlx5: remove Rx queue data list from device Xueming Li
@ 2021-11-03  7:58   ` Xueming Li
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 14/14] net/mlx5: add shared Rx queue port datapath support Xueming Li
  13 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Matan Azrad, Viacheslav Ovsiienko

This patch introduces shared RxQ. All shared Rx queues with the same
group and queue ID share the same rxq_ctrl. Because rxq_ctrl and
rxq_data are shared, queues from different member ports use the same
WQ and CQ: there is essentially one Rx WQ, and mbufs are filled into
this singleton WQ.

The shared rxq_data is installed as the RxQ object in the device Rx
queues of all member ports and is used for receiving packets. Polling
the queue of any member port returns packets of any member; the
mbuf->port field identifies the source port.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
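Note for reviewers: a minimal sketch of the lookup-or-create flow at
Rx queue setup (illustrative only, not part of the diff below).
mlx5_shared_rxq_get() and mlx5_shared_rxq_match() are added by this
patch; the share_group/share_qid rxconf fields follow this series and
error handling is elided.

	/* Illustrative only: find or create the shared rxq_ctrl. */
	struct mlx5_rxq_ctrl *ctrl =
		mlx5_shared_rxq_get(dev, conf->share_group, conf->share_qid);

	if (ctrl != NULL) {
		/* Configuration must match the first owner exactly. */
		if (!mlx5_shared_rxq_match(ctrl, dev, idx, desc, socket,
					   conf, mp))
			return -rte_errno;
		/* Join: reuse the singleton WQ/CQ, link into owners list. */
	} else {
		/*
		 * First owner: allocate rxq_ctrl, set rxq.shared = 1 and
		 * insert it into sh->shared_rxqs for later lookups.
		 */
	}
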
 doc/guides/nics/features/mlx5.ini   |   1 +
 doc/guides/nics/mlx5.rst            |   6 +
 drivers/net/mlx5/linux/mlx5_os.c    |   2 +
 drivers/net/mlx5/linux/mlx5_verbs.c |   8 +-
 drivers/net/mlx5/mlx5.h             |   2 +
 drivers/net/mlx5/mlx5_devx.c        |  46 +++--
 drivers/net/mlx5/mlx5_ethdev.c      |   5 +
 drivers/net/mlx5/mlx5_rx.h          |   3 +
 drivers/net/mlx5/mlx5_rxq.c         | 274 ++++++++++++++++++++++++----
 drivers/net/mlx5/mlx5_trigger.c     |  61 ++++---
 10 files changed, 330 insertions(+), 78 deletions(-)

diff --git a/doc/guides/nics/features/mlx5.ini b/doc/guides/nics/features/mlx5.ini
index 403f58cd7e2..7cbd11bb160 100644
--- a/doc/guides/nics/features/mlx5.ini
+++ b/doc/guides/nics/features/mlx5.ini
@@ -11,6 +11,7 @@ Removal event        = Y
 Rx interrupt         = Y
 Fast mbuf free       = Y
 Queue start/stop     = Y
+Shared Rx queue      = Y
 Burst mode info      = Y
 Power mgmt address monitor = Y
 MTU update           = Y
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index bb92520dff4..824971d89ae 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -113,6 +113,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Shared Rx queue.
 
 
 Limitations
@@ -465,6 +466,11 @@ Limitations
   - In order to achieve best insertion rate, application should manage the flows per lcore.
   - Better to disable memory reclaim by setting ``reclaim_mem_mode`` to 0 to accelerate the flow object allocation and release with cache.
 
+- Shared Rx queue:
+
+  - Counters of received packets and bytes of all member devices in the same share group are identical.
+  - Counters of received packets and bytes of all queues with the same group and queue ID are identical.
+
 - HW hashed bonding
 
   - TXQ affinity subjects to HW hash once enabled.
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index dd4fc0c7165..48acae65133 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -410,6 +410,7 @@ mlx5_alloc_shared_dr(struct mlx5_priv *priv)
 			mlx5_glue->dr_create_flow_action_default_miss();
 	if (!sh->default_miss_action)
 		DRV_LOG(WARNING, "Default miss action is not supported.");
+	LIST_INIT(&sh->shared_rxqs);
 	return 0;
 error:
 	/* Rollback the created objects. */
@@ -484,6 +485,7 @@ mlx5_os_free_shared_dr(struct mlx5_priv *priv)
 	MLX5_ASSERT(sh && sh->refcnt);
 	if (sh->refcnt > 1)
 		return;
+	MLX5_ASSERT(LIST_EMPTY(&sh->shared_rxqs));
 #ifdef HAVE_MLX5DV_DR
 	if (sh->rx_domain) {
 		mlx5_glue->dr_destroy_domain(sh->rx_domain);
diff --git a/drivers/net/mlx5/linux/mlx5_verbs.c b/drivers/net/mlx5/linux/mlx5_verbs.c
index f78916c868f..9d299542614 100644
--- a/drivers/net/mlx5/linux/mlx5_verbs.c
+++ b/drivers/net/mlx5/linux/mlx5_verbs.c
@@ -424,14 +424,16 @@ mlx5_rxq_ibv_obj_release(struct mlx5_rxq_priv *rxq)
 {
 	struct mlx5_rxq_obj *rxq_obj = rxq->ctrl->obj;
 
-	MLX5_ASSERT(rxq_obj);
-	MLX5_ASSERT(rxq_obj->wq);
-	MLX5_ASSERT(rxq_obj->ibv_cq);
+	if (rxq_obj == NULL || rxq_obj->wq == NULL)
+		return;
 	claim_zero(mlx5_glue->destroy_wq(rxq_obj->wq));
+	rxq_obj->wq = NULL;
+	MLX5_ASSERT(rxq_obj->ibv_cq);
 	claim_zero(mlx5_glue->destroy_cq(rxq_obj->ibv_cq));
 	if (rxq_obj->ibv_channel)
 		claim_zero(mlx5_glue->destroy_comp_channel
 							(rxq_obj->ibv_channel));
+	rxq->ctrl->started = false;
 }
 
 /**
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 75c58b93f91..3950f0dabb0 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1172,6 +1172,7 @@ struct mlx5_dev_ctx_shared {
 	struct mlx5_flex_parser_profiles fp[MLX5_FLEX_PARSER_MAX];
 	/* Flex parser profiles information. */
 	void *devx_rx_uar; /* DevX UAR for Rx. */
+	LIST_HEAD(shared_rxqs, mlx5_rxq_ctrl) shared_rxqs; /* Shared RXQs. */
 	struct mlx5_aso_age_mng *aso_age_mng;
 	/* Management data for aging mechanism using ASO Flow Hit. */
 	struct mlx5_geneve_tlv_option_resource *geneve_tlv_option_resource;
@@ -1239,6 +1240,7 @@ struct mlx5_rxq_obj {
 		};
 		struct mlx5_devx_obj *rq; /* DevX RQ object for hairpin. */
 		struct {
+			struct mlx5_devx_rmp devx_rmp; /* RMP for shared RQ. */
 			struct mlx5_devx_cq cq_obj; /* DevX CQ object. */
 			void *devx_channel;
 		};
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 668d47025e8..d3d189ab7f2 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -88,6 +88,8 @@ mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 	default:
 		break;
 	}
+	if (rxq->ctrl->type == MLX5_RXQ_TYPE_HAIRPIN)
+		return mlx5_devx_cmd_modify_rq(rxq->ctrl->obj->rq, &rq_attr);
 	return mlx5_devx_cmd_modify_rq(rxq->devx_rq.rq, &rq_attr);
 }
 
@@ -156,18 +158,21 @@ mlx5_txq_devx_modify(struct mlx5_txq_obj *obj, enum mlx5_txq_modify_type type,
 static void
 mlx5_rxq_devx_obj_release(struct mlx5_rxq_priv *rxq)
 {
-	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
-	struct mlx5_rxq_obj *rxq_obj = rxq_ctrl->obj;
+	struct mlx5_rxq_obj *rxq_obj = rxq->ctrl->obj;
 
-	MLX5_ASSERT(rxq != NULL);
-	MLX5_ASSERT(rxq_ctrl != NULL);
+	if (rxq_obj == NULL)
+		return;
 	if (rxq_obj->rxq_ctrl->type == MLX5_RXQ_TYPE_HAIRPIN) {
-		MLX5_ASSERT(rxq_obj->rq);
+		if (rxq_obj->rq == NULL)
+			return;
 		mlx5_devx_modify_rq(rxq, MLX5_RXQ_MOD_RDY2RST);
 		claim_zero(mlx5_devx_cmd_destroy(rxq_obj->rq));
 	} else {
+		if (rxq->devx_rq.rq == NULL)
+			return;
 		mlx5_devx_rq_destroy(&rxq->devx_rq);
-		memset(&rxq->devx_rq, 0, sizeof(rxq->devx_rq));
+		if (rxq->devx_rq.rmp != NULL && rxq->devx_rq.rmp->ref_cnt > 0)
+			return;
 		mlx5_devx_cq_destroy(&rxq_obj->cq_obj);
 		memset(&rxq_obj->cq_obj, 0, sizeof(rxq_obj->cq_obj));
 		if (rxq_obj->devx_channel) {
@@ -176,6 +181,7 @@ mlx5_rxq_devx_obj_release(struct mlx5_rxq_priv *rxq)
 			rxq_obj->devx_channel = NULL;
 		}
 	}
+	rxq->ctrl->started = false;
 }
 
 /**
@@ -271,6 +277,8 @@ mlx5_rxq_create_devx_rq_resources(struct mlx5_rxq_priv *rxq)
 						MLX5_WQ_END_PAD_MODE_NONE;
 	rq_attr.wq_attr.pd = cdev->pdn;
 	rq_attr.counter_set_id = priv->counter_set_id;
+	if (rxq_data->shared) /* Create RMP based RQ. */
+		rxq->devx_rq.rmp = &rxq_ctrl->obj->devx_rmp;
 	/* Create RQ using DevX API. */
 	return mlx5_devx_rq_create(cdev->ctx, &rxq->devx_rq, wqe_size,
 				   log_desc_n, &rq_attr, rxq_ctrl->socket);
@@ -300,6 +308,8 @@ mlx5_rxq_create_devx_cq_resources(struct mlx5_rxq_priv *rxq)
 	uint16_t event_nums[1] = { 0 };
 	int ret = 0;
 
+	if (rxq_ctrl->started)
+		return 0;
 	if (priv->config.cqe_comp && !rxq_data->hw_timestamp &&
 	    !rxq_data->lro) {
 		cq_attr.cqe_comp_en = 1u;
@@ -365,6 +375,7 @@ mlx5_rxq_create_devx_cq_resources(struct mlx5_rxq_priv *rxq)
 	rxq_data->cq_uar = mlx5_os_get_devx_uar_base_addr(sh->devx_rx_uar);
 	rxq_data->cqe_n = log_cqe_n;
 	rxq_data->cqn = cq_obj->cq->id;
+	rxq_data->cq_ci = 0;
 	if (rxq_ctrl->obj->devx_channel) {
 		ret = mlx5_os_devx_subscribe_devx_event
 					      (rxq_ctrl->obj->devx_channel,
@@ -463,7 +474,7 @@ mlx5_rxq_devx_obj_new(struct mlx5_rxq_priv *rxq)
 	if (rxq_ctrl->type == MLX5_RXQ_TYPE_HAIRPIN)
 		return mlx5_rxq_obj_hairpin_new(rxq);
 	tmpl->rxq_ctrl = rxq_ctrl;
-	if (rxq_ctrl->irq) {
+	if (rxq_ctrl->irq && !rxq_ctrl->started) {
 		int devx_ev_flag =
 			  MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA;
 
@@ -496,11 +507,19 @@ mlx5_rxq_devx_obj_new(struct mlx5_rxq_priv *rxq)
 	ret = mlx5_devx_modify_rq(rxq, MLX5_RXQ_MOD_RST2RDY);
 	if (ret)
 		goto error;
-	rxq_data->wqes = (void *)(uintptr_t)rxq->devx_rq.wq.umem_buf;
-	rxq_data->rq_db = (uint32_t *)(uintptr_t)rxq->devx_rq.wq.db_rec;
-	mlx5_rxq_initialize(rxq_data);
+	if (!rxq_data->shared) {
+		rxq_data->wqes = (void *)(uintptr_t)rxq->devx_rq.wq.umem_buf;
+		rxq_data->rq_db = (uint32_t *)(uintptr_t)rxq->devx_rq.wq.db_rec;
+	} else if (!rxq_ctrl->started) {
+		rxq_data->wqes = (void *)(uintptr_t)tmpl->devx_rmp.wq.umem_buf;
+		rxq_data->rq_db =
+				(uint32_t *)(uintptr_t)tmpl->devx_rmp.wq.db_rec;
+	}
+	if (!rxq_ctrl->started) {
+		mlx5_rxq_initialize(rxq_data);
+		rxq_ctrl->wqn = rxq->devx_rq.rq->id;
+	}
 	priv->dev_data->rx_queue_state[rxq->idx] = RTE_ETH_QUEUE_STATE_STARTED;
-	rxq_ctrl->wqn = rxq->devx_rq.rq->id;
 	return 0;
 error:
 	ret = rte_errno; /* Save rte_errno before cleanup. */
@@ -558,7 +577,10 @@ mlx5_devx_ind_table_create_rqt_attr(struct rte_eth_dev *dev,
 		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, queues[i]);
 
 		MLX5_ASSERT(rxq != NULL);
-		rqt_attr->rq_list[i] = rxq->devx_rq.rq->id;
+		if (rxq->ctrl->type == MLX5_RXQ_TYPE_HAIRPIN)
+			rqt_attr->rq_list[i] = rxq->ctrl->obj->rq->id;
+		else
+			rqt_attr->rq_list[i] = rxq->devx_rq.rq->id;
 	}
 	MLX5_ASSERT(i > 0);
 	for (j = 0; i != rqt_n; ++j, ++i)
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index bb38d5d2ade..dc647d5580c 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -26,6 +26,7 @@
 #include "mlx5_rx.h"
 #include "mlx5_tx.h"
 #include "mlx5_autoconf.h"
+#include "mlx5_devx.h"
 
 /**
  * Get the interface index from device name.
@@ -336,9 +337,13 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
 	info->flow_type_rss_offloads = ~MLX5_RSS_HF_MASK;
 	mlx5_set_default_params(dev, info);
 	mlx5_set_txlimit_params(dev, info);
+	if (priv->config.hca_attr.mem_rq_rmp &&
+	    priv->obj_ops.rxq_obj_new == devx_obj_ops.rxq_obj_new)
+		info->dev_capa |= RTE_ETH_DEV_CAPA_RXQ_SHARE;
 	info->switch_info.name = dev->data->name;
 	info->switch_info.domain_id = priv->domain_id;
 	info->switch_info.port_id = priv->representor_id;
+	info->switch_info.rx_domain = 0; /* No sub Rx domains. */
 	if (priv->representor) {
 		uint16_t port_id;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 413e36f6d8d..eda6eca8dea 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -96,6 +96,7 @@ struct mlx5_rxq_data {
 	unsigned int lro:1; /* Enable LRO. */
 	unsigned int dynf_meta:1; /* Dynamic metadata is configured. */
 	unsigned int mcqe_format:3; /* CQE compression format. */
+	unsigned int shared:1; /* Shared RXQ. */
 	volatile uint32_t *rq_db;
 	volatile uint32_t *cq_db;
 	uint16_t port_id;
@@ -158,8 +159,10 @@ struct mlx5_rxq_ctrl {
 	struct mlx5_dev_ctx_shared *sh; /* Shared context. */
 	enum mlx5_rxq_type type; /* Rxq type. */
 	unsigned int socket; /* CPU socket ID for allocations. */
+	LIST_ENTRY(mlx5_rxq_ctrl) share_entry; /* Entry in shared RXQ list. */
 	uint32_t share_group; /* Group ID of shared RXQ. */
 	uint16_t share_qid; /* Shared RxQ ID in group. */
+	unsigned int started:1; /* Whether (shared) RXQ has been started. */
 	unsigned int irq:1; /* Whether IRQ is enabled. */
 	uint32_t flow_mark_n; /* Number of Mark/Flag flows using this Queue. */
 	uint32_t flow_tunnels_n[MLX5_FLOW_TUNNEL]; /* Tunnels counters. */
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index f3fc618ed2c..0f1f4660bc7 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -29,6 +29,7 @@
 #include "mlx5_rx.h"
 #include "mlx5_utils.h"
 #include "mlx5_autoconf.h"
+#include "mlx5_devx.h"
 
 
 /* Default RSS hash key also used for ConnectX-3. */
@@ -633,14 +634,19 @@ mlx5_rx_queue_start(struct rte_eth_dev *dev, uint16_t idx)
  *   RX queue index.
  * @param desc
  *   Number of descriptors to configure in queue.
+ * @param[out] rxq_ctrl
+ *   Address of pointer to shared Rx queue control.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_rx_queue_pre_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t *desc)
+mlx5_rx_queue_pre_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t *desc,
+			struct mlx5_rxq_ctrl **rxq_ctrl)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_rxq_priv *rxq;
+	bool empty;
 
 	if (!rte_is_power_of_2(*desc)) {
 		*desc = 1 << log2above(*desc);
@@ -657,16 +663,143 @@ mlx5_rx_queue_pre_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t *desc)
 		rte_errno = EOVERFLOW;
 		return -rte_errno;
 	}
-	if (!mlx5_rxq_releasable(dev, idx)) {
-		DRV_LOG(ERR, "port %u unable to release queue index %u",
-			dev->data->port_id, idx);
-		rte_errno = EBUSY;
-		return -rte_errno;
+	if (rxq_ctrl == NULL || *rxq_ctrl == NULL)
+		return 0;
+	if (!(*rxq_ctrl)->rxq.shared) {
+		if (!mlx5_rxq_releasable(dev, idx)) {
+			DRV_LOG(ERR, "port %u unable to release queue index %u",
+				dev->data->port_id, idx);
+			rte_errno = EBUSY;
+			return -rte_errno;
+		}
+		mlx5_rxq_release(dev, idx);
+	} else {
+		if ((*rxq_ctrl)->obj != NULL)
+			/* Some port using shared Rx queue has been started. */
+			return 0;
+		/* Release all owner RxQ to reconfigure Shared RxQ. */
+		do {
+			rxq = LIST_FIRST(&(*rxq_ctrl)->owners);
+			LIST_REMOVE(rxq, owner_entry);
+			empty = LIST_EMPTY(&(*rxq_ctrl)->owners);
+			mlx5_rxq_release(ETH_DEV(rxq->priv), rxq->idx);
+		} while (!empty);
+		*rxq_ctrl = NULL;
 	}
-	mlx5_rxq_release(dev, idx);
 	return 0;
 }
 
+/**
+ * Get the shared Rx queue object that matches group and queue index.
+ *
+ * @param dev
+ *   Pointer to Ethernet device structure.
+ * @param group
+ *   Shared RXQ group.
+ * @param share_qid
+ *   Shared RX queue index.
+ *
+ * @return
+ *   Shared RXQ object that matches, or NULL if not found.
+ */
+static struct mlx5_rxq_ctrl *
+mlx5_shared_rxq_get(struct rte_eth_dev *dev, uint32_t group, uint16_t share_qid)
+{
+	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct mlx5_priv *priv = dev->data->dev_private;
+
+	LIST_FOREACH(rxq_ctrl, &priv->sh->shared_rxqs, share_entry) {
+		if (rxq_ctrl->share_group == group &&
+		    rxq_ctrl->share_qid == share_qid)
+			return rxq_ctrl;
+	}
+	return NULL;
+}
+
+/**
+ * Check whether requested Rx queue configuration matches shared RXQ.
+ *
+ * @param rxq_ctrl
+ *   Pointer to shared RXQ.
+ * @param dev
+ *   Pointer to Ethernet device structure.
+ * @param idx
+ *   Queue index.
+ * @param desc
+ *   Number of descriptors to configure in queue.
+ * @param socket
+ *   NUMA socket on which memory must be allocated.
+ * @param[in] conf
+ *   Thresholds parameters.
+ * @param mp
+ *   Memory pool for buffer allocations.
+ *
+ * @return
+ *   true if the queue configuration matches the shared RXQ, false otherwise.
+ */
+static bool
+mlx5_shared_rxq_match(struct mlx5_rxq_ctrl *rxq_ctrl, struct rte_eth_dev *dev,
+		      uint16_t idx, uint16_t desc, unsigned int socket,
+		      const struct rte_eth_rxconf *conf,
+		      struct rte_mempool *mp)
+{
+	struct mlx5_priv *spriv = LIST_FIRST(&rxq_ctrl->owners)->priv;
+	struct mlx5_priv *priv = dev->data->dev_private;
+	unsigned int i;
+
+	RTE_SET_USED(conf);
+	if (rxq_ctrl->socket != socket) {
+		DRV_LOG(ERR, "port %u queue index %u failed to join shared group: socket mismatch",
+			dev->data->port_id, idx);
+		return false;
+	}
+	if (rxq_ctrl->rxq.elts_n != log2above(desc)) {
+		DRV_LOG(ERR, "port %u queue index %u failed to join shared group: descriptor number mismatch",
+			dev->data->port_id, idx);
+		return false;
+	}
+	if (priv->mtu != spriv->mtu) {
+		DRV_LOG(ERR, "port %u queue index %u failed to join shared group: mtu mismatch",
+			dev->data->port_id, idx);
+		return false;
+	}
+	if (priv->dev_data->dev_conf.intr_conf.rxq !=
+	    spriv->dev_data->dev_conf.intr_conf.rxq) {
+		DRV_LOG(ERR, "port %u queue index %u failed to join shared group: interrupt mismatch",
+			dev->data->port_id, idx);
+		return false;
+	}
+	if (mp != NULL && rxq_ctrl->rxq.mp != mp) {
+		DRV_LOG(ERR, "port %u queue index %u failed to join shared group: mempool mismatch",
+			dev->data->port_id, idx);
+		return false;
+	} else if (mp == NULL) {
+		for (i = 0; i < conf->rx_nseg; i++) {
+			if (conf->rx_seg[i].split.mp !=
+			    rxq_ctrl->rxq.rxseg[i].mp ||
+			    conf->rx_seg[i].split.length !=
+			    rxq_ctrl->rxq.rxseg[i].length) {
+				DRV_LOG(ERR, "port %u queue index %u failed to join shared group: segment %u configuration mismatch",
+					dev->data->port_id, idx, i);
+				return false;
+			}
+		}
+	}
+	if (priv->config.hw_padding != spriv->config.hw_padding) {
+		DRV_LOG(ERR, "port %u queue index %u failed to join shared group: padding mismatch",
+			dev->data->port_id, idx);
+		return false;
+	}
+	if (priv->config.cqe_comp != spriv->config.cqe_comp ||
+	    (priv->config.cqe_comp &&
+	     priv->config.cqe_comp_fmt != spriv->config.cqe_comp_fmt)) {
+		DRV_LOG(ERR, "port %u queue index %u failed to join shared group: CQE compression mismatch",
+			dev->data->port_id, idx);
+		return false;
+	}
+	return true;
+}
+
 /**
  *
  * @param dev
@@ -692,12 +825,14 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_rxq_priv *rxq;
-	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct mlx5_rxq_ctrl *rxq_ctrl = NULL;
 	struct rte_eth_rxseg_split *rx_seg =
 				(struct rte_eth_rxseg_split *)conf->rx_seg;
 	struct rte_eth_rxseg_split rx_single = {.mp = mp};
 	uint16_t n_seg = conf->rx_nseg;
 	int res;
+	uint64_t offloads = conf->offloads |
+			    dev->data->dev_conf.rxmode.offloads;
 
 	if (mp) {
 		/*
@@ -709,9 +844,6 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		n_seg = 1;
 	}
 	if (n_seg > 1) {
-		uint64_t offloads = conf->offloads |
-				    dev->data->dev_conf.rxmode.offloads;
-
 		/* The offloads should be checked on rte_eth_dev layer. */
 		MLX5_ASSERT(offloads & RTE_ETH_RX_OFFLOAD_SCATTER);
 		if (!(offloads & RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT)) {
@@ -723,9 +855,46 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		}
 		MLX5_ASSERT(n_seg < MLX5_MAX_RXQ_NSEG);
 	}
-	res = mlx5_rx_queue_pre_setup(dev, idx, &desc);
+	if (conf->share_group > 0) {
+		if (!priv->config.hca_attr.mem_rq_rmp) {
+			DRV_LOG(ERR, "port %u queue index %u shared Rx queue not supported by fw",
+				     dev->data->port_id, idx);
+			rte_errno = EINVAL;
+			return -rte_errno;
+		}
+		if (priv->obj_ops.rxq_obj_new != devx_obj_ops.rxq_obj_new) {
+			DRV_LOG(ERR, "port %u queue index %u shared Rx queue needs DevX api",
+				     dev->data->port_id, idx);
+			rte_errno = EINVAL;
+			return -rte_errno;
+		}
+		if (conf->share_qid >= priv->rxqs_n) {
+			DRV_LOG(ERR, "port %u shared Rx queue index %u > number of Rx queues %u",
+				dev->data->port_id, conf->share_qid,
+				priv->rxqs_n);
+			rte_errno = EINVAL;
+			return -rte_errno;
+		}
+		if (priv->config.mprq.enabled) {
+			DRV_LOG(ERR, "port %u shared Rx queue index %u: not supported when MPRQ enabled",
+				dev->data->port_id, conf->share_qid);
+			rte_errno = EINVAL;
+			return -rte_errno;
+		}
+		/* Try to reuse shared RXQ. */
+		rxq_ctrl = mlx5_shared_rxq_get(dev, conf->share_group,
+					       conf->share_qid);
+		if (rxq_ctrl != NULL &&
+		    !mlx5_shared_rxq_match(rxq_ctrl, dev, idx, desc, socket,
+					   conf, mp)) {
+			rte_errno = EINVAL;
+			return -rte_errno;
+		}
+	}
+	res = mlx5_rx_queue_pre_setup(dev, idx, &desc, &rxq_ctrl);
 	if (res)
 		return res;
+	/* Allocate RXQ. */
 	rxq = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO, sizeof(*rxq), 0,
 			  SOCKET_ID_ANY);
 	if (!rxq) {
@@ -737,15 +906,23 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 	rxq->priv = priv;
 	rxq->idx = idx;
 	(*priv->rxq_privs)[idx] = rxq;
-	rxq_ctrl = mlx5_rxq_new(dev, rxq, desc, socket, conf, rx_seg, n_seg);
-	if (!rxq_ctrl) {
-		DRV_LOG(ERR, "port %u unable to allocate rx queue index %u",
-			dev->data->port_id, idx);
-		mlx5_free(rxq);
-		(*priv->rxq_privs)[idx] = NULL;
-		rte_errno = ENOMEM;
-		return -rte_errno;
+	if (rxq_ctrl != NULL) {
+		/* Join owner list. */
+		LIST_INSERT_HEAD(&rxq_ctrl->owners, rxq, owner_entry);
+		rxq->ctrl = rxq_ctrl;
+	} else {
+		rxq_ctrl = mlx5_rxq_new(dev, rxq, desc, socket, conf, rx_seg,
+					n_seg);
+		if (rxq_ctrl == NULL) {
+			DRV_LOG(ERR, "port %u unable to allocate rx queue index %u",
+				dev->data->port_id, idx);
+			mlx5_free(rxq);
+			(*priv->rxq_privs)[idx] = NULL;
+			rte_errno = ENOMEM;
+			return -rte_errno;
+		}
 	}
+	mlx5_rxq_ref(dev, idx);
 	DRV_LOG(DEBUG, "port %u adding Rx queue %u to list",
 		dev->data->port_id, idx);
 	dev->data->rx_queues[idx] = &rxq_ctrl->rxq;
@@ -776,7 +953,7 @@ mlx5_rx_hairpin_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
 	struct mlx5_rxq_ctrl *rxq_ctrl;
 	int res;
 
-	res = mlx5_rx_queue_pre_setup(dev, idx, &desc);
+	res = mlx5_rx_queue_pre_setup(dev, idx, &desc, NULL);
 	if (res)
 		return res;
 	if (hairpin_conf->peer_count != 1) {
@@ -1095,6 +1272,9 @@ mlx5_rxq_obj_verify(struct rte_eth_dev *dev)
 	struct mlx5_rxq_obj *rxq_obj;
 
 	LIST_FOREACH(rxq_obj, &priv->rxqsobj, next) {
+		if (rxq_obj->rxq_ctrl->rxq.shared &&
+		    !LIST_EMPTY(&rxq_obj->rxq_ctrl->owners))
+			continue;
 		DRV_LOG(DEBUG, "port %u Rx queue %u still referenced",
 			dev->data->port_id, rxq_obj->rxq_ctrl->rxq.idx);
 		++ret;
@@ -1413,6 +1593,11 @@ mlx5_rxq_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
 		return NULL;
 	}
 	LIST_INIT(&tmpl->owners);
+	if (conf->share_group > 0) {
+		tmpl->rxq.shared = 1;
+		tmpl->share_group = conf->share_group;
+		LIST_INSERT_HEAD(&priv->sh->shared_rxqs, tmpl, share_entry);
+	}
 	rxq->ctrl = tmpl;
 	LIST_INSERT_HEAD(&tmpl->owners, rxq, owner_entry);
 	MLX5_ASSERT(n_seg && n_seg <= MLX5_MAX_RXQ_NSEG);
@@ -1660,8 +1845,9 @@ mlx5_rxq_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
 #ifndef RTE_ARCH_64
 	tmpl->rxq.uar_lock_cq = &priv->sh->uar_lock_cq;
 #endif
+	if (conf->share_group > 0)
+		tmpl->share_qid = conf->share_qid;
 	tmpl->rxq.idx = idx;
-	mlx5_rxq_ref(dev, idx);
 	LIST_INSERT_HEAD(&priv->rxqsctrl, tmpl, next);
 	return tmpl;
 error:
@@ -1836,31 +2022,41 @@ mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_rxq_priv *rxq;
 	struct mlx5_rxq_ctrl *rxq_ctrl;
+	uint32_t refcnt;
 
 	if (priv->rxq_privs == NULL)
 		return 0;
 	rxq = mlx5_rxq_get(dev, idx);
-	if (rxq == NULL)
+	if (rxq == NULL || rxq->refcnt == 0)
 		return 0;
-	if (mlx5_rxq_deref(dev, idx) > 1)
-		return 1;
 	rxq_ctrl = rxq->ctrl;
-	if (rxq_ctrl->obj != NULL) {
+	refcnt = mlx5_rxq_deref(dev, idx);
+	if (refcnt > 1) {
+		return 1;
+	} else if (refcnt == 1) { /* RxQ stopped. */
 		priv->obj_ops.rxq_obj_release(rxq);
-		LIST_REMOVE(rxq_ctrl->obj, next);
-		mlx5_free(rxq_ctrl->obj);
-		rxq_ctrl->obj = NULL;
-	}
-	if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
-		rxq_free_elts(rxq_ctrl);
-		dev->data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_STOPPED;
-	}
-	if (!__atomic_load_n(&rxq->refcnt, __ATOMIC_RELAXED)) {
-		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD)
-			mlx5_mr_btree_free(&rxq_ctrl->rxq.mr_ctrl.cache_bh);
+		if (!rxq_ctrl->started && rxq_ctrl->obj != NULL) {
+			LIST_REMOVE(rxq_ctrl->obj, next);
+			mlx5_free(rxq_ctrl->obj);
+			rxq_ctrl->obj = NULL;
+		}
+		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
+			if (!rxq_ctrl->started)
+				rxq_free_elts(rxq_ctrl);
+			dev->data->rx_queue_state[idx] =
+					RTE_ETH_QUEUE_STATE_STOPPED;
+		}
+	} else { /* Refcnt zero, closing device. */
 		LIST_REMOVE(rxq, owner_entry);
-		LIST_REMOVE(rxq_ctrl, next);
-		mlx5_free(rxq_ctrl);
+		if (LIST_EMPTY(&rxq_ctrl->owners)) {
+			if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD)
+				mlx5_mr_btree_free
+					(&rxq_ctrl->rxq.mr_ctrl.cache_bh);
+			if (rxq_ctrl->rxq.shared)
+				LIST_REMOVE(rxq_ctrl, share_entry);
+			LIST_REMOVE(rxq_ctrl, next);
+			mlx5_free(rxq_ctrl);
+		}
 		dev->data->rx_queues[idx] = NULL;
 		mlx5_free(rxq);
 		(*priv->rxq_privs)[idx] = NULL;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 72475e4b5b5..a3e62e95335 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -176,6 +176,39 @@ mlx5_rxq_stop(struct rte_eth_dev *dev)
 		mlx5_rxq_release(dev, i);
 }
 
+static int
+mlx5_rxq_ctrl_prepare(struct rte_eth_dev *dev, struct mlx5_rxq_ctrl *rxq_ctrl,
+		      unsigned int idx)
+{
+	int ret = 0;
+
+	if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
+		/*
+		 * Pre-register the mempools. Regardless of whether
+		 * the implicit registration is enabled or not,
+		 * Rx mempool destruction is tracked to free MRs.
+		 */
+		if (mlx5_rxq_mempool_register(dev, rxq_ctrl) < 0)
+			return -rte_errno;
+		ret = rxq_alloc_elts(rxq_ctrl);
+		if (ret)
+			return ret;
+	}
+	MLX5_ASSERT(!rxq_ctrl->obj);
+	rxq_ctrl->obj = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
+				    sizeof(*rxq_ctrl->obj), 0,
+				    rxq_ctrl->socket);
+	if (!rxq_ctrl->obj) {
+		DRV_LOG(ERR, "Port %u Rx queue %u can't allocate resources.",
+			dev->data->port_id, idx);
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
+	DRV_LOG(DEBUG, "Port %u rxq %u updated with %p.", dev->data->port_id,
+		idx, (void *)&rxq_ctrl->obj);
+	return 0;
+}
+
 /**
  * Start traffic on Rx queues.
  *
@@ -208,28 +241,10 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 		if (rxq == NULL)
 			continue;
 		rxq_ctrl = rxq->ctrl;
-		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
-			/*
-			 * Pre-register the mempools. Regardless of whether
-			 * the implicit registration is enabled or not,
-			 * Rx mempool destruction is tracked to free MRs.
-			 */
-			if (mlx5_rxq_mempool_register(dev, rxq_ctrl) < 0)
-				goto error;
-			ret = rxq_alloc_elts(rxq_ctrl);
-			if (ret)
+		if (!rxq_ctrl->started) {
+			if (mlx5_rxq_ctrl_prepare(dev, rxq_ctrl, i) < 0)
 				goto error;
-		}
-		MLX5_ASSERT(!rxq_ctrl->obj);
-		rxq_ctrl->obj = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
-					    sizeof(*rxq_ctrl->obj), 0,
-					    rxq_ctrl->socket);
-		if (!rxq_ctrl->obj) {
-			DRV_LOG(ERR,
-				"Port %u Rx queue %u can't allocate resources.",
-				dev->data->port_id, i);
-			rte_errno = ENOMEM;
-			goto error;
+			LIST_INSERT_HEAD(&priv->rxqsobj, rxq_ctrl->obj, next);
 		}
 		ret = priv->obj_ops.rxq_obj_new(rxq);
 		if (ret) {
@@ -237,9 +252,7 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 			rxq_ctrl->obj = NULL;
 			goto error;
 		}
-		DRV_LOG(DEBUG, "Port %u rxq %u updated with %p.",
-			dev->data->port_id, i, (void *)&rxq_ctrl->obj);
-		LIST_INSERT_HEAD(&priv->rxqsobj, rxq_ctrl->obj, next);
+		rxq_ctrl->started = true;
 	}
 	return 0;
 error:
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v3 14/14] net/mlx5: add shared Rx queue port datapath support
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
                     ` (12 preceding siblings ...)
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 13/14] net/mlx5: support shared Rx queue Xueming Li
@ 2021-11-03  7:58   ` Xueming Li
  13 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-03  7:58 UTC (permalink / raw)
  To: dev
  Cc: Viacheslav Ovsiienko, xuemingl, Lior Margalit, Matan Azrad,
	David Christensen, Ruifeng Wang, Bruce Richardson,
	Konstantin Ananyev

From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

When receiving a packet, the mlx5 PMD takes the mbuf port number from
RxQ data.

To support shared RxQ, save the port number into the RQ context as the
user index. On receive, the port number is then resolved from the CQE
user index, which is derived from the RQ context.

The legacy Verbs API doesn't support setting the RQ user index, so in
that case the port number is still read from RxQ data.
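
In sketch form the per-packet decision reduces to the following (this
mirrors the rxq_cq_to_mbuf() hunk below; the standalone helper itself
is illustrative only):

    /* Resolve the source port of a received packet: for a shared
     * RxQ the port index was programmed as the RQ user index at
     * queue creation and comes back in the CQE; otherwise the
     * queue's own port_id is used. */
    static inline uint16_t
    rx_resolve_port(const struct mlx5_rxq_data *rxq,
                    const volatile struct mlx5_cqe *cqe)
    {
        return unlikely(rxq->shared) ? cqe->user_index_low : rxq->port_id;
    }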

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/mlx5_devx.c             |  1 +
 drivers/net/mlx5/mlx5_rx.c               |  1 +
 drivers/net/mlx5/mlx5_rxq.c              |  3 ++-
 drivers/net/mlx5/mlx5_rxtx_vec_altivec.h |  6 ++++++
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h    | 12 +++++++++++-
 drivers/net/mlx5/mlx5_rxtx_vec_sse.h     |  8 +++++++-
 6 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index d3d189ab7f2..a9f9f4af700 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -277,6 +277,7 @@ mlx5_rxq_create_devx_rq_resources(struct mlx5_rxq_priv *rxq)
 						MLX5_WQ_END_PAD_MODE_NONE;
 	rq_attr.wq_attr.pd = cdev->pdn;
 	rq_attr.counter_set_id = priv->counter_set_id;
+	rq_attr.user_index = rte_cpu_to_be_16(priv->dev_data->port_id);
 	if (rxq_data->shared) /* Create RMP based RQ. */
 		rxq->devx_rq.rmp = &rxq_ctrl->obj->devx_rmp;
 	/* Create RQ using DevX API. */
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 1ffa1b95b88..4d85f64accd 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -709,6 +709,7 @@ rxq_cq_to_mbuf(struct mlx5_rxq_data *rxq, struct rte_mbuf *pkt,
 {
 	/* Update packet information. */
 	pkt->packet_type = rxq_cq_to_pkt_type(rxq, cqe, mcqe);
+	pkt->port = unlikely(rxq->shared) ? cqe->user_index_low : rxq->port_id;
 
 	if (rxq->rss_hash) {
 		uint32_t rss_hash_res = 0;
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 0f1f4660bc7..6c715c0803e 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -186,7 +186,8 @@ rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
 		mbuf_init->data_off = RTE_PKTMBUF_HEADROOM;
 		rte_mbuf_refcnt_set(mbuf_init, 1);
 		mbuf_init->nb_segs = 1;
-		mbuf_init->port = rxq->port_id;
+		/* For shared queues, the port is provided in the CQE. */
+		mbuf_init->port = rxq->shared ? 0 : rxq->port_id;
 		if (priv->flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
 			mbuf_init->ol_flags = RTE_MBUF_F_EXTERNAL;
 		/*
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
index 1d00c1c43d1..423e229508c 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
@@ -1189,6 +1189,12 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 
 		/* D.5 fill in mbuf - rearm_data and packet_type. */
 		rxq_cq_to_ptype_oflags_v(rxq, cqes, opcode, &pkts[pos]);
+		if (unlikely(rxq->shared)) {
+			pkts[pos]->port = cq[pos].user_index_low;
+			pkts[pos + p1]->port = cq[pos + p1].user_index_low;
+			pkts[pos + p2]->port = cq[pos + p2].user_index_low;
+			pkts[pos + p3]->port = cq[pos + p3].user_index_low;
+		}
 		if (rxq->hw_timestamp) {
 			int offset = rxq->timestamp_offset;
 			if (rxq->rt_timestamp) {
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
index aa36df29a09..b1d16baa619 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -787,7 +787,17 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		/* C.4 fill in mbuf - rearm_data and packet_type. */
 		rxq_cq_to_ptype_oflags_v(rxq, ptype_info, flow_tag,
 					 opcode, &elts[pos]);
-		if (rxq->hw_timestamp) {
+		if (unlikely(rxq->shared)) {
+			elts[pos]->port = container_of(p0, struct mlx5_cqe,
+					      pkt_info)->user_index_low;
+			elts[pos + 1]->port = container_of(p1, struct mlx5_cqe,
+					      pkt_info)->user_index_low;
+			elts[pos + 2]->port = container_of(p2, struct mlx5_cqe,
+					      pkt_info)->user_index_low;
+			elts[pos + 3]->port = container_of(p3, struct mlx5_cqe,
+					      pkt_info)->user_index_low;
+		}
+		if (unlikely(rxq->hw_timestamp)) {
 			int offset = rxq->timestamp_offset;
 			if (rxq->rt_timestamp) {
 				struct mlx5_dev_ctx_shared *sh = rxq->sh;
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
index b0fc29d7b9e..f3d838389e2 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
@@ -736,7 +736,13 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		*err |= _mm_cvtsi128_si64(opcode);
 		/* D.5 fill in mbuf - rearm_data and packet_type. */
 		rxq_cq_to_ptype_oflags_v(rxq, cqes, opcode, &pkts[pos]);
-		if (rxq->hw_timestamp) {
+		if (unlikely(rxq->shared)) {
+			pkts[pos]->port = cq[pos].user_index_low;
+			pkts[pos + p1]->port = cq[pos + p1].user_index_low;
+			pkts[pos + p2]->port = cq[pos + p2].user_index_low;
+			pkts[pos + p3]->port = cq[pos + p3].user_index_low;
+		}
+		if (unlikely(rxq->hw_timestamp)) {
 			int offset = rxq->timestamp_offset;
 			if (rxq->rt_timestamp) {
 				struct mlx5_dev_ctx_shared *sh = rxq->sh;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v3 01/14] common/mlx5: introduce user index field in completion
  2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 01/14] common/mlx5: introduce user index field in completion Xueming Li
@ 2021-11-04  9:14     ` Slava Ovsiienko
  0 siblings, 0 replies; 266+ messages in thread
From: Slava Ovsiienko @ 2021-11-04  9:14 UTC (permalink / raw)
  To: Xueming(Steven) Li, dev; +Cc: Lior Margalit, Matan Azrad, Ori Kam

> -----Original Message-----
> From: Xueming(Steven) Li <xuemingl@nvidia.com>
> Sent: Wednesday, November 3, 2021 9:58
> To: dev@dpdk.org
> Cc: Xueming(Steven) Li <xuemingl@nvidia.com>; Lior Margalit
> <lmargalit@nvidia.com>; Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>
> Subject: [PATCH v3 01/14] common/mlx5: introduce user index field in
> completion
> 
> On ConnectX devices the completion entry provides the dedicated 24-bit
> field, that is filled up with some static value assigned at the Receiving Queue
> creation moment. This patch declares this field. This is a preparation step for
> supporting shared RQs and the field is supposed to provide actual port index
> while handling the shared receiving queue(s).
> 
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
For the entire series:
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue
  2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
                   ` (15 preceding siblings ...)
  2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
@ 2021-11-04 12:33 ` Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 01/14] common/mlx5: introduce user index field in completion Xueming Li
                     ` (14 more replies)
  16 siblings, 15 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit

Implementation of shared Rx queue.

v1:
- initial version
v2:
- rebased on latest dependent series
- fully tested
- support share_qid of RxQ configuration
v3:
- internally reviewed
- removed MPRQ support
- fixed multi-segment support
- fixed configuration not being applied after port restart
v4:
- rebase with latest code

Viacheslav Ovsiienko (1):
  net/mlx5: add shared Rx queue port datapath support

Xueming Li (13):
  common/mlx5: introduce user index field in completion
  net/mlx5: fix field reference for PPC
  common/mlx5: adds basic receive memory pool support
  common/mlx5: support receive memory pool
  net/mlx5: fix Rx queue memory allocation return value
  net/mlx5: clean Rx queue code
  net/mlx5: split Rx queue into shareable and private
  net/mlx5: move Rx queue reference count
  net/mlx5: move Rx queue hairpin info to private data
  net/mlx5: remove port info from shareable Rx queue
  net/mlx5: move Rx queue DevX resource
  net/mlx5: remove Rx queue data list from device
  net/mlx5: support shared Rx queue

 doc/guides/nics/features/mlx5.ini        |   1 +
 doc/guides/nics/mlx5.rst                 |   6 +
 drivers/common/mlx5/mlx5_common_devx.c   | 295 +++++++++--
 drivers/common/mlx5/mlx5_common_devx.h   |  19 +-
 drivers/common/mlx5/mlx5_devx_cmds.c     |  52 ++
 drivers/common/mlx5/mlx5_devx_cmds.h     |  16 +
 drivers/common/mlx5/mlx5_prm.h           |  93 +++-
 drivers/common/mlx5/version.map          |   1 +
 drivers/net/mlx5/linux/mlx5_os.c         |   2 +
 drivers/net/mlx5/linux/mlx5_verbs.c      | 169 +++---
 drivers/net/mlx5/mlx5.c                  |  10 +-
 drivers/net/mlx5/mlx5.h                  |  17 +-
 drivers/net/mlx5/mlx5_devx.c             | 270 +++++-----
 drivers/net/mlx5/mlx5_ethdev.c           |  21 +-
 drivers/net/mlx5/mlx5_flow.c             |  47 +-
 drivers/net/mlx5/mlx5_rss.c              |   6 +-
 drivers/net/mlx5/mlx5_rx.c               |  31 +-
 drivers/net/mlx5/mlx5_rx.h               |  45 +-
 drivers/net/mlx5/mlx5_rxq.c              | 630 +++++++++++++++++------
 drivers/net/mlx5/mlx5_rxtx.c             |   6 +-
 drivers/net/mlx5/mlx5_rxtx_vec.c         |   8 +-
 drivers/net/mlx5/mlx5_rxtx_vec_altivec.h |  14 +-
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h    |  12 +-
 drivers/net/mlx5/mlx5_rxtx_vec_sse.h     |   8 +-
 drivers/net/mlx5/mlx5_stats.c            |   9 +-
 drivers/net/mlx5/mlx5_trigger.c          | 155 +++---
 drivers/net/mlx5/mlx5_vlan.c             |  16 +-
 drivers/regex/mlx5/mlx5_regex_fastpath.c |   2 +-
 28 files changed, 1377 insertions(+), 584 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 01/14] common/mlx5: introduce user index field in completion
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
@ 2021-11-04 12:33   ` Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 02/14] net/mlx5: fix field reference for PPC Xueming Li
                     ` (13 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Slava Ovsiienko, Matan Azrad, Ori Kam

On ConnectX devices the completion entry provides a dedicated 24-bit
field that is filled with a static value assigned at Receive Queue
creation time. This patch declares this field. It is a preparation
step for supporting shared RQs: the field is supposed to carry the
actual port index while handling the shared receive queue(s).
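
Both views of the new field, as a sketch (cqe is a pointer to the
completion entry; byte-order conversion is elided here -- a later
patch in the series programs the value pre-swapped so the Rx path can
read user_index_low directly):

    /* 24-bit view: high byte plus low 16-bit word. */
    uint32_t user_index = ((uint32_t)cqe->user_index_hi << 16) |
                          cqe->user_index_low;
    /* Byte-wise view, kept for existing users -- e.g. the regex PMD
     * takes its queue id from the last byte (see the hunk below). */
    size_t hw_qpid = cqe->user_index_bytes[2];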

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/common/mlx5/mlx5_prm.h           | 8 +++++++-
 drivers/regex/mlx5/mlx5_regex_fastpath.c | 2 +-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 8014ec2f925..c85634c774c 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -393,7 +393,13 @@ struct mlx5_cqe {
 	uint16_t hdr_type_etc;
 	uint16_t vlan_info;
 	uint8_t lro_num_seg;
-	uint8_t rsvd3[3];
+	union {
+		uint8_t user_index_bytes[3];
+		struct {
+			uint8_t user_index_hi;
+			uint16_t user_index_low;
+		} __rte_packed;
+	};
 	uint32_t flow_table_metadata;
 	uint8_t rsvd4[4];
 	uint32_t byte_cnt;
diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c b/drivers/regex/mlx5/mlx5_regex_fastpath.c
index adb5343a46b..6836203ecf2 100644
--- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
+++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
@@ -559,7 +559,7 @@ mlx5_regexdev_dequeue(struct rte_regexdev *dev, uint16_t qp_id,
 		uint16_t wq_counter
 			= (rte_be_to_cpu_16(cqe->wqe_counter) + 1) &
 			  MLX5_REGEX_MAX_WQE_INDEX;
-		size_t hw_qpid = cqe->rsvd3[2];
+		size_t hw_qpid = cqe->user_index_bytes[2];
 		struct mlx5_regex_hw_qp *qp_obj = &queue->qps[hw_qpid];
 
 		/* UMR mode WQE counter move as WQE set(4 WQEBBS).*/
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 02/14] net/mlx5: fix field reference for PPC
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 01/14] common/mlx5: introduce user index field in completion Xueming Li
@ 2021-11-04 12:33   ` Xueming Li
  2021-11-04 17:07     ` Raslan Darawsheh
  2021-11-04 17:49     ` David Christensen
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 03/14] common/mlx5: adds basic receive memory pool support Xueming Li
                     ` (12 subsequent siblings)
  14 siblings, 2 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Lior Margalit, viacheslavo, stable, David Christensen,
	Matan Azrad, Yongseok Koh

This patch fixes a stale CQE field reference in the PPC vectorized Rx
code.

Fixes: a18ac6113331 ("net/mlx5: add metadata support to Rx datapath")
Cc: viacheslavo@nvidia.com
Cc: stable@dpdk.org

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/mlx5_rxtx_vec_altivec.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
index bcf487c34e9..1d00c1c43d1 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
@@ -974,10 +974,10 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 			(vector unsigned short)cqe_tmp1, cqe_sel_mask1);
 		cqe_tmp2 = (vector unsigned char)(vector unsigned long){
 			*(__rte_aligned(8) unsigned long *)
-			&cq[pos + p3].rsvd3[9], 0LL};
+			&cq[pos + p3].rsvd4[2], 0LL};
 		cqe_tmp1 = (vector unsigned char)(vector unsigned long){
 			*(__rte_aligned(8) unsigned long *)
-			&cq[pos + p2].rsvd3[9], 0LL};
+			&cq[pos + p2].rsvd4[2], 0LL};
 		cqes[3] = (vector unsigned char)
 			vec_sel((vector unsigned short)cqes[3],
 			(vector unsigned short)cqe_tmp2,
@@ -1037,10 +1037,10 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 			(vector unsigned short)cqe_tmp1, cqe_sel_mask1);
 		cqe_tmp2 = (vector unsigned char)(vector unsigned long){
 			*(__rte_aligned(8) unsigned long *)
-			&cq[pos + p1].rsvd3[9], 0LL};
+			&cq[pos + p1].rsvd4[2], 0LL};
 		cqe_tmp1 = (vector unsigned char)(vector unsigned long){
 			*(__rte_aligned(8) unsigned long *)
-			&cq[pos].rsvd3[9], 0LL};
+			&cq[pos].rsvd4[2], 0LL};
 		cqes[1] = (vector unsigned char)
 			vec_sel((vector unsigned short)cqes[1],
 			(vector unsigned short)cqe_tmp2, cqe_sel_mask2);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 03/14] common/mlx5: adds basic receive memory pool support
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 01/14] common/mlx5: introduce user index field in completion Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 02/14] net/mlx5: fix field reference for PPC Xueming Li
@ 2021-11-04 12:33   ` Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 04/14] common/mlx5: support receive memory pool Xueming Li
                     ` (11 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Slava Ovsiienko, Matan Azrad, Ray Kinsella

The hardware Receive Memory Pool (RMP) object holds the destination for
incoming packets/messages that are routed to the RMP through RQs. RMP
enables sharing of memory across multiple Receive Queues. Multiple
Receive Queues can be attached to the same RMP and consume memory
from that shared pool. When using RMPs, completions are reported to the
CQ pointed to by the RQ, and this Completion Queue can be shared as
well.

This patch adds DevX support for the PRM RMP object.
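
A minimal usage sketch of the new command (error handling trimmed;
"ctx" is a device context from the mlx5 open_device() glue and
"wq_attr" is assumed to already describe the WQ memory, doorbell
record and page size):

    struct mlx5_devx_create_rmp_attr rmp_attr = {
        .state = MLX5_RMPC_STATE_RDY,
        .basic_cyclic_rcv_wqe = 1,
    };
    struct mlx5_devx_obj *rmp;

    rmp_attr.wq_attr = wq_attr;
    rmp = mlx5_devx_cmd_create_rmp(ctx, &rmp_attr, SOCKET_ID_ANY);
    if (rmp == NULL)
        return -rte_errno; /* set by the command on failure */
    /* RQs may now reference rmp->id; release the object with
     * mlx5_devx_cmd_destroy(rmp). */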

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/common/mlx5/mlx5_devx_cmds.c | 52 +++++++++++++++++
 drivers/common/mlx5/mlx5_devx_cmds.h | 16 ++++++
 drivers/common/mlx5/mlx5_prm.h       | 85 +++++++++++++++++++++++++++-
 drivers/common/mlx5/version.map      |  1 +
 4 files changed, 153 insertions(+), 1 deletion(-)

diff --git a/drivers/common/mlx5/mlx5_devx_cmds.c b/drivers/common/mlx5/mlx5_devx_cmds.c
index 12c114a91b6..4ab3070da0c 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.c
+++ b/drivers/common/mlx5/mlx5_devx_cmds.c
@@ -836,6 +836,8 @@ mlx5_devx_cmd_query_hca_attr(void *ctx,
 			MLX5_GET(cmd_hca_cap, hcattr, flow_counter_bulk_alloc);
 	attr->flow_counters_dump = MLX5_GET(cmd_hca_cap, hcattr,
 					    flow_counters_dump);
+	attr->log_max_rmp = MLX5_GET(cmd_hca_cap, hcattr, log_max_rmp);
+	attr->mem_rq_rmp = MLX5_GET(cmd_hca_cap, hcattr, mem_rq_rmp);
 	attr->log_max_rqt_size = MLX5_GET(cmd_hca_cap, hcattr,
 					  log_max_rqt_size);
 	attr->eswitch_manager = MLX5_GET(cmd_hca_cap, hcattr, eswitch_manager);
@@ -1312,6 +1314,56 @@ mlx5_devx_cmd_modify_rq(struct mlx5_devx_obj *rq,
 }
 
 /**
+ * Create RMP using DevX API.
+ *
+ * @param[in] ctx
+ *   Context returned from mlx5 open_device() glue function.
+ * @param [in] rmp_attr
+ *   Pointer to create RMP attributes structure.
+ * @param [in] socket
+ *   CPU socket ID for allocations.
+ *
+ * @return
+ *   The DevX object created, NULL otherwise and rte_errno is set.
+ */
+struct mlx5_devx_obj *
+mlx5_devx_cmd_create_rmp(void *ctx,
+			 struct mlx5_devx_create_rmp_attr *rmp_attr,
+			 int socket)
+{
+	uint32_t in[MLX5_ST_SZ_DW(create_rmp_in)] = {0};
+	uint32_t out[MLX5_ST_SZ_DW(create_rmp_out)] = {0};
+	void *rmp_ctx, *wq_ctx;
+	struct mlx5_devx_wq_attr *wq_attr;
+	struct mlx5_devx_obj *rmp = NULL;
+
+	rmp = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rmp), 0, socket);
+	if (!rmp) {
+		DRV_LOG(ERR, "Failed to allocate RMP data");
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+	MLX5_SET(create_rmp_in, in, opcode, MLX5_CMD_OP_CREATE_RMP);
+	rmp_ctx = MLX5_ADDR_OF(create_rmp_in, in, ctx);
+	MLX5_SET(rmpc, rmp_ctx, state, rmp_attr->state);
+	MLX5_SET(rmpc, rmp_ctx, basic_cyclic_rcv_wqe,
+		 rmp_attr->basic_cyclic_rcv_wqe);
+	wq_ctx = MLX5_ADDR_OF(rmpc, rmp_ctx, wq);
+	wq_attr = &rmp_attr->wq_attr;
+	devx_cmd_fill_wq_data(wq_ctx, wq_attr);
+	rmp->obj = mlx5_glue->devx_obj_create(ctx, in, sizeof(in), out,
+					      sizeof(out));
+	if (!rmp->obj) {
+		DRV_LOG(ERR, "Failed to create RMP using DevX");
+		rte_errno = errno;
+		mlx5_free(rmp);
+		return NULL;
+	}
+	rmp->id = MLX5_GET(create_rmp_out, out, rmpn);
+	return rmp;
+}
+
+/**
  * Create TIR using DevX API.
  *
  * @param[in] ctx
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.h b/drivers/common/mlx5/mlx5_devx_cmds.h
index 2326f1e9686..86ee4f7b78b 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.h
+++ b/drivers/common/mlx5/mlx5_devx_cmds.h
@@ -152,6 +152,8 @@ mlx5_hca_parse_graph_node_base_hdr_len_mask
 struct mlx5_hca_attr {
 	uint32_t eswitch_manager:1;
 	uint32_t flow_counters_dump:1;
+	uint32_t mem_rq_rmp:1;
+	uint32_t log_max_rmp:5;
 	uint32_t log_max_rqt_size:5;
 	uint32_t parse_graph_flex_node:1;
 	uint8_t flow_counter_bulk_alloc_bitmap;
@@ -319,6 +321,17 @@ struct mlx5_devx_modify_rq_attr {
 	uint32_t lwm:16; /* Contained WQ lwm. */
 };
 
+/* Create RMP attributes structure, used by create RMP operation. */
+struct mlx5_devx_create_rmp_attr {
+	uint32_t rsvd0:8;
+	uint32_t state:4;
+	uint32_t rsvd1:20;
+	uint32_t basic_cyclic_rcv_wqe:1;
+	uint32_t rsvd4:31;
+	uint32_t rsvd8[10];
+	struct mlx5_devx_wq_attr wq_attr;
+};
+
 struct mlx5_rx_hash_field_select {
 	uint32_t l3_prot_type:1;
 	uint32_t l4_prot_type:1;
@@ -596,6 +609,9 @@ __rte_internal
 int mlx5_devx_cmd_modify_rq(struct mlx5_devx_obj *rq,
 			    struct mlx5_devx_modify_rq_attr *rq_attr);
 __rte_internal
+struct mlx5_devx_obj *mlx5_devx_cmd_create_rmp(void *ctx,
+			struct mlx5_devx_create_rmp_attr *rq_attr, int socket);
+__rte_internal
 struct mlx5_devx_obj *mlx5_devx_cmd_create_tir(void *ctx,
 					   struct mlx5_devx_tir_attr *tir_attr);
 __rte_internal
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index c85634c774c..304bcdf55a0 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -1069,6 +1069,10 @@ enum {
 	MLX5_CMD_OP_CREATE_RQ = 0x908,
 	MLX5_CMD_OP_MODIFY_RQ = 0x909,
 	MLX5_CMD_OP_QUERY_RQ = 0x90b,
+	MLX5_CMD_OP_CREATE_RMP = 0x90c,
+	MLX5_CMD_OP_MODIFY_RMP = 0x90d,
+	MLX5_CMD_OP_DESTROY_RMP = 0x90e,
+	MLX5_CMD_OP_QUERY_RMP = 0x90f,
 	MLX5_CMD_OP_CREATE_TIS = 0x912,
 	MLX5_CMD_OP_QUERY_TIS = 0x915,
 	MLX5_CMD_OP_CREATE_RQT = 0x916,
@@ -1569,7 +1573,8 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8 reserved_at_378[0x3];
 	u8 log_max_tis[0x5];
 	u8 basic_cyclic_rcv_wqe[0x1];
-	u8 reserved_at_381[0x2];
+	u8 reserved_at_381[0x1];
+	u8 mem_rq_rmp[0x1];
 	u8 log_max_rmp[0x5];
 	u8 reserved_at_388[0x3];
 	u8 log_max_rqt[0x5];
@@ -2243,6 +2248,84 @@ struct mlx5_ifc_query_rq_in_bits {
 	u8 reserved_at_60[0x20];
 };
 
+enum {
+	MLX5_RMPC_STATE_RDY = 0x1,
+	MLX5_RMPC_STATE_ERR = 0x3,
+};
+
+struct mlx5_ifc_rmpc_bits {
+	u8 reserved_at_0[0x8];
+	u8 state[0x4];
+	u8 reserved_at_c[0x14];
+	u8 basic_cyclic_rcv_wqe[0x1];
+	u8 reserved_at_21[0x1f];
+	u8 reserved_at_40[0x140];
+	struct mlx5_ifc_wq_bits wq;
+};
+
+struct mlx5_ifc_query_rmp_out_bits {
+	u8 status[0x8];
+	u8 reserved_at_8[0x18];
+	u8 syndrome[0x20];
+	u8 reserved_at_40[0xc0];
+	struct mlx5_ifc_rmpc_bits rmp_context;
+};
+
+struct mlx5_ifc_query_rmp_in_bits {
+	u8 opcode[0x10];
+	u8 reserved_at_10[0x10];
+	u8 reserved_at_20[0x10];
+	u8 op_mod[0x10];
+	u8 reserved_at_40[0x8];
+	u8 rmpn[0x18];
+	u8 reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_modify_rmp_out_bits {
+	u8 status[0x8];
+	u8 reserved_at_8[0x18];
+	u8 syndrome[0x20];
+	u8 reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_rmp_bitmask_bits {
+	u8 reserved_at_0[0x20];
+	u8 reserved_at_20[0x1f];
+	u8 lwm[0x1];
+};
+
+struct mlx5_ifc_modify_rmp_in_bits {
+	u8 opcode[0x10];
+	u8 uid[0x10];
+	u8 reserved_at_20[0x10];
+	u8 op_mod[0x10];
+	u8 rmp_state[0x4];
+	u8 reserved_at_44[0x4];
+	u8 rmpn[0x18];
+	u8 reserved_at_60[0x20];
+	struct mlx5_ifc_rmp_bitmask_bits bitmask;
+	u8 reserved_at_c0[0x40];
+	struct mlx5_ifc_rmpc_bits ctx;
+};
+
+struct mlx5_ifc_create_rmp_out_bits {
+	u8 status[0x8];
+	u8 reserved_at_8[0x18];
+	u8 syndrome[0x20];
+	u8 reserved_at_40[0x8];
+	u8 rmpn[0x18];
+	u8 reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_create_rmp_in_bits {
+	u8 opcode[0x10];
+	u8 uid[0x10];
+	u8 reserved_at_20[0x10];
+	u8 op_mod[0x10];
+	u8 reserved_at_40[0xc0];
+	struct mlx5_ifc_rmpc_bits ctx;
+};
+
 struct mlx5_ifc_create_tis_out_bits {
 	u8 status[0x8];
 	u8 reserved_at_8[0x18];
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index 0ea8325f9ac..7265ff8c56f 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -30,6 +30,7 @@ INTERNAL {
 	mlx5_devx_cmd_create_geneve_tlv_option;
 	mlx5_devx_cmd_create_import_kek_obj;
 	mlx5_devx_cmd_create_qp;
+	mlx5_devx_cmd_create_rmp;
 	mlx5_devx_cmd_create_rq;
 	mlx5_devx_cmd_create_rqt;
 	mlx5_devx_cmd_create_sq;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 04/14] common/mlx5: support receive memory pool
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
                     ` (2 preceding siblings ...)
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 03/14] common/mlx5: adds basic receive memory pool support Xueming Li
@ 2021-11-04 12:33   ` Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 05/14] net/mlx5: fix Rx queue memory allocation return value Xueming Li
                     ` (10 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Slava Ovsiienko, Matan Azrad

The hardware Receive Memory Pool (RMP) object holds the destination for
incoming packets/messages that are routed to the RMP through RQs. RMP
enables sharing of memory across multiple Receive Queues. Multiple
Receive Queues can be attached to the same RMP and consume memory
from that shared pool. When using RMPs, completions are reported to the
CQ pointed to by the RQ, and the user index set at RQ creation time is
carried in the completion entry.

This patch enables RMP-based RQs; an RMP is created when
mlx5_devx_rq.rmp is set.
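
How a caller opts in, in sketch form (names from this patch; the RQ
attributes, CQ and WQE geometry are assumed to be prepared as for a
regular RQ):

    struct mlx5_devx_rmp shared_rmp = { 0 }; /* one per shared queue */
    struct mlx5_devx_rq rq_obj = { 0 };

    /* Point the RQ at the RMP *before* creation: the first such RQ
     * creates the RMP, subsequent ones only take a reference. */
    rq_obj.rmp = &shared_rmp;
    if (mlx5_devx_rq_create(ctx, &rq_obj, wqe_size, log_wqbb_n,
                            &rq_attr, SOCKET_ID_ANY) != 0)
        return -rte_errno;
    /* mlx5_devx_rq_destroy() drops the reference and destroys the
     * RMP together with the last owning RQ. */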

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/common/mlx5/mlx5_common_devx.c | 295 +++++++++++++++++++++----
 drivers/common/mlx5/mlx5_common_devx.h |  19 +-
 drivers/net/mlx5/mlx5_devx.c           |   4 +-
 3 files changed, 271 insertions(+), 47 deletions(-)

diff --git a/drivers/common/mlx5/mlx5_common_devx.c b/drivers/common/mlx5/mlx5_common_devx.c
index 825f84b1833..85b5282061a 100644
--- a/drivers/common/mlx5/mlx5_common_devx.c
+++ b/drivers/common/mlx5/mlx5_common_devx.c
@@ -271,6 +271,39 @@ mlx5_devx_sq_create(void *ctx, struct mlx5_devx_sq *sq_obj, uint16_t log_wqbb_n,
 	return -rte_errno;
 }
 
+/**
+ * Destroy DevX Receive Queue resources.
+ *
+ * @param[in] rq_res
+ *   DevX RQ resource to destroy.
+ */
+static void
+mlx5_devx_wq_res_destroy(struct mlx5_devx_wq_res *rq_res)
+{
+	if (rq_res->umem_obj)
+		claim_zero(mlx5_os_umem_dereg(rq_res->umem_obj));
+	if (rq_res->umem_buf)
+		mlx5_free((void *)(uintptr_t)rq_res->umem_buf);
+	memset(rq_res, 0, sizeof(*rq_res));
+}
+
+/**
+ * Destroy DevX Receive Memory Pool.
+ *
+ * @param[in] rmp
+ *   DevX RMP to destroy.
+ */
+static void
+mlx5_devx_rmp_destroy(struct mlx5_devx_rmp *rmp)
+{
+	MLX5_ASSERT(rmp->ref_cnt == 0);
+	if (rmp->rmp) {
+		claim_zero(mlx5_devx_cmd_destroy(rmp->rmp));
+		rmp->rmp = NULL;
+	}
+	mlx5_devx_wq_res_destroy(&rmp->wq);
+}
+
 /**
  * Destroy DevX Queue Pair.
  *
@@ -389,55 +422,48 @@ mlx5_devx_qp_create(void *ctx, struct mlx5_devx_qp *qp_obj, uint16_t log_wqbb_n,
 void
 mlx5_devx_rq_destroy(struct mlx5_devx_rq *rq)
 {
-	if (rq->rq)
+	if (rq->rq) {
 		claim_zero(mlx5_devx_cmd_destroy(rq->rq));
-	if (rq->umem_obj)
-		claim_zero(mlx5_os_umem_dereg(rq->umem_obj));
-	if (rq->umem_buf)
-		mlx5_free((void *)(uintptr_t)rq->umem_buf);
+		rq->rq = NULL;
+		if (rq->rmp)
+			rq->rmp->ref_cnt--;
+	}
+	if (rq->rmp == NULL) {
+		mlx5_devx_wq_res_destroy(&rq->wq);
+	} else {
+		if (rq->rmp->ref_cnt == 0)
+			mlx5_devx_rmp_destroy(rq->rmp);
+	}
 }
 
 /**
- * Create Receive Queue using DevX API.
- *
- * Get a pointer to partially initialized attributes structure, and updates the
- * following fields:
- *   wq_umem_valid
- *   wq_umem_id
- *   wq_umem_offset
- *   dbr_umem_valid
- *   dbr_umem_id
- *   dbr_addr
- *   log_wq_pg_sz
- * All other fields are updated by caller.
+ * Create WQ resources using DevX API.
  *
  * @param[in] ctx
  *   Context returned from mlx5 open_device() glue function.
- * @param[in/out] rq_obj
- *   Pointer to RQ to create.
  * @param[in] wqe_size
  *   Size of WQE structure.
  * @param[in] log_wqbb_n
  *   Log of number of WQBBs in queue.
- * @param[in] attr
- *   Pointer to RQ attributes structure.
  * @param[in] socket
  *   Socket to use for allocation.
+ * @param[out] wq_attr
+ *   Pointer to WQ attributes structure.
+ * @param[out] wq_res
+ *   Pointer to WQ resource to create.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-int
-mlx5_devx_rq_create(void *ctx, struct mlx5_devx_rq *rq_obj, uint32_t wqe_size,
-		    uint16_t log_wqbb_n,
-		    struct mlx5_devx_create_rq_attr *attr, int socket)
+static int
+mlx5_devx_wq_init(void *ctx, uint32_t wqe_size, uint16_t log_wqbb_n, int socket,
+		  struct mlx5_devx_wq_attr *wq_attr,
+		  struct mlx5_devx_wq_res *wq_res)
 {
-	struct mlx5_devx_obj *rq = NULL;
 	struct mlx5dv_devx_umem *umem_obj = NULL;
 	void *umem_buf = NULL;
 	size_t alignment = MLX5_WQE_BUF_ALIGNMENT;
 	uint32_t umem_size, umem_dbrec;
-	uint16_t rq_size = 1 << log_wqbb_n;
 	int ret;
 
 	if (alignment == (size_t)-1) {
@@ -446,7 +472,7 @@ mlx5_devx_rq_create(void *ctx, struct mlx5_devx_rq *rq_obj, uint32_t wqe_size,
 		return -rte_errno;
 	}
 	/* Allocate memory buffer for WQEs and doorbell record. */
-	umem_size = wqe_size * rq_size;
+	umem_size = wqe_size * (1 << log_wqbb_n);
 	umem_dbrec = RTE_ALIGN(umem_size, MLX5_DBR_SIZE);
 	umem_size += MLX5_DBR_SIZE;
 	umem_buf = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO, umem_size,
@@ -464,14 +490,60 @@ mlx5_devx_rq_create(void *ctx, struct mlx5_devx_rq *rq_obj, uint32_t wqe_size,
 		rte_errno = errno;
 		goto error;
 	}
+	/* Fill WQ attributes for RQ/RMP object creation. */
+	wq_attr->wq_umem_valid = 1;
+	wq_attr->wq_umem_id = mlx5_os_get_umem_id(umem_obj);
+	wq_attr->wq_umem_offset = 0;
+	wq_attr->dbr_umem_valid = 1;
+	wq_attr->dbr_umem_id = wq_attr->wq_umem_id;
+	wq_attr->dbr_addr = umem_dbrec;
+	wq_attr->log_wq_pg_sz = MLX5_LOG_PAGE_SIZE;
 	/* Fill attributes for RQ object creation. */
-	attr->wq_attr.wq_umem_valid = 1;
-	attr->wq_attr.wq_umem_id = mlx5_os_get_umem_id(umem_obj);
-	attr->wq_attr.wq_umem_offset = 0;
-	attr->wq_attr.dbr_umem_valid = 1;
-	attr->wq_attr.dbr_umem_id = attr->wq_attr.wq_umem_id;
-	attr->wq_attr.dbr_addr = umem_dbrec;
-	attr->wq_attr.log_wq_pg_sz = MLX5_LOG_PAGE_SIZE;
+	wq_res->umem_buf = umem_buf;
+	wq_res->umem_obj = umem_obj;
+	wq_res->db_rec = RTE_PTR_ADD(umem_buf, umem_dbrec);
+	return 0;
+error:
+	ret = rte_errno;
+	if (umem_obj)
+		claim_zero(mlx5_os_umem_dereg(umem_obj));
+	if (umem_buf)
+		mlx5_free((void *)(uintptr_t)umem_buf);
+	rte_errno = ret;
+	return -rte_errno;
+}
+
+/**
+ * Create standalone Receive Queue using DevX API.
+ *
+ * @param[in] ctx
+ *   Context returned from mlx5 open_device() glue function.
+ * @param[in/out] rq_obj
+ *   Pointer to RQ to create.
+ * @param[in] wqe_size
+ *   Size of WQE structure.
+ * @param[in] log_wqbb_n
+ *   Log of number of WQBBs in queue.
+ * @param[in] attr
+ *   Pointer to RQ attributes structure.
+ * @param[in] socket
+ *   Socket to use for allocation.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+mlx5_devx_rq_std_create(void *ctx, struct mlx5_devx_rq *rq_obj,
+			uint32_t wqe_size, uint16_t log_wqbb_n,
+			struct mlx5_devx_create_rq_attr *attr, int socket)
+{
+	struct mlx5_devx_obj *rq;
+	int ret;
+
+	ret = mlx5_devx_wq_init(ctx, wqe_size, log_wqbb_n, socket,
+				&attr->wq_attr, &rq_obj->wq);
+	if (ret != 0)
+		return ret;
 	/* Create receive queue object with DevX. */
 	rq = mlx5_devx_cmd_create_rq(ctx, attr, socket);
 	if (!rq) {
@@ -479,21 +551,160 @@ mlx5_devx_rq_create(void *ctx, struct mlx5_devx_rq *rq_obj, uint32_t wqe_size,
 		rte_errno = ENOMEM;
 		goto error;
 	}
-	rq_obj->umem_buf = umem_buf;
-	rq_obj->umem_obj = umem_obj;
 	rq_obj->rq = rq;
-	rq_obj->db_rec = RTE_PTR_ADD(rq_obj->umem_buf, umem_dbrec);
 	return 0;
 error:
 	ret = rte_errno;
-	if (umem_obj)
-		claim_zero(mlx5_os_umem_dereg(umem_obj));
-	if (umem_buf)
-		mlx5_free((void *)(uintptr_t)umem_buf);
+	mlx5_devx_wq_res_destroy(&rq_obj->wq);
+	rte_errno = ret;
+	return -rte_errno;
+}
+
+/**
+ * Create Receive Memory Pool using DevX API.
+ *
+ * @param[in] ctx
+ *   Context returned from mlx5 open_device() glue function.
+ * @param[in/out] rq_obj
+ *   Pointer to RQ to create.
+ * @param[in] wqe_size
+ *   Size of WQE structure.
+ * @param[in] log_wqbb_n
+ *   Log of number of WQBBs in queue.
+ * @param[in] attr
+ *   Pointer to RQ attributes structure.
+ * @param[in] socket
+ *   Socket to use for allocation.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+mlx5_devx_rmp_create(void *ctx, struct mlx5_devx_rmp *rmp_obj,
+		     uint32_t wqe_size, uint16_t log_wqbb_n,
+		     struct mlx5_devx_wq_attr *wq_attr, int socket)
+{
+	struct mlx5_devx_create_rmp_attr rmp_attr = { 0 };
+	int ret;
+
+	if (rmp_obj->rmp != NULL)
+		return 0;
+	rmp_attr.wq_attr = *wq_attr;
+	ret = mlx5_devx_wq_init(ctx, wqe_size, log_wqbb_n, socket,
+				&rmp_attr.wq_attr, &rmp_obj->wq);
+	if (ret != 0)
+		return ret;
+	rmp_attr.state = MLX5_RMPC_STATE_RDY;
+	rmp_attr.basic_cyclic_rcv_wqe =
+		wq_attr->wq_type != MLX5_WQ_TYPE_CYCLIC_STRIDING_RQ;
+	/* Create receive memory pool object with DevX. */
+	rmp_obj->rmp = mlx5_devx_cmd_create_rmp(ctx, &rmp_attr, socket);
+	if (rmp_obj->rmp == NULL) {
+		DRV_LOG(ERR, "Can't create DevX RMP object.");
+		rte_errno = ENOMEM;
+		goto error;
+	}
+	return 0;
+error:
+	ret = rte_errno;
+	mlx5_devx_wq_res_destroy(&rmp_obj->wq);
+	rte_errno = ret;
+	return -rte_errno;
+}
+
+/**
+ * Create Shared Receive Queue based on RMP using DevX API.
+ *
+ * @param[in] ctx
+ *   Context returned from mlx5 open_device() glue function.
+ * @param[in/out] rq_obj
+ *   Pointer to RQ to create.
+ * @param[in] wqe_size
+ *   Size of WQE structure.
+ * @param[in] log_wqbb_n
+ *   Log of number of WQBBs in queue.
+ * @param[in] attr
+ *   Pointer to RQ attributes structure.
+ * @param[in] socket
+ *   Socket to use for allocation.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+mlx5_devx_rq_shared_create(void *ctx, struct mlx5_devx_rq *rq_obj,
+			   uint32_t wqe_size, uint16_t log_wqbb_n,
+			   struct mlx5_devx_create_rq_attr *attr, int socket)
+{
+	struct mlx5_devx_obj *rq;
+	int ret;
+
+	ret = mlx5_devx_rmp_create(ctx, rq_obj->rmp, wqe_size, log_wqbb_n,
+				   &attr->wq_attr, socket);
+	if (ret != 0)
+		return ret;
+	attr->mem_rq_type = MLX5_RQC_MEM_RQ_TYPE_MEMORY_RQ_RMP;
+	attr->rmpn = rq_obj->rmp->rmp->id;
+	attr->flush_in_error_en = 0;
+	memset(&attr->wq_attr, 0, sizeof(attr->wq_attr));
+	/* Create receive queue object with DevX. */
+	rq = mlx5_devx_cmd_create_rq(ctx, attr, socket);
+	if (!rq) {
+		DRV_LOG(ERR, "Can't create DevX RMP RQ object.");
+		rte_errno = ENOMEM;
+		goto error;
+	}
+	rq_obj->rq = rq;
+	rq_obj->rmp->ref_cnt++;
+	return 0;
+error:
+	ret = rte_errno;
+	mlx5_devx_rq_destroy(rq_obj);
 	rte_errno = ret;
 	return -rte_errno;
 }
 
+/**
+ * Create Receive Queue using DevX API. Shared RQ is created only if rmp set.
+ *
+ * Get a pointer to partially initialized attributes structure, and updates the
+ * following fields:
+ *   wq_umem_valid
+ *   wq_umem_id
+ *   wq_umem_offset
+ *   dbr_umem_valid
+ *   dbr_umem_id
+ *   dbr_addr
+ *   log_wq_pg_sz
+ * All other fields are updated by caller.
+ *
+ * @param[in] ctx
+ *   Context returned from mlx5 open_device() glue function.
+ * @param[in/out] rq_obj
+ *   Pointer to RQ to create.
+ * @param[in] wqe_size
+ *   Size of WQE structure.
+ * @param[in] log_wqbb_n
+ *   Log of number of WQBBs in queue.
+ * @param[in] attr
+ *   Pointer to RQ attributes structure.
+ * @param[in] socket
+ *   Socket to use for allocation.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_devx_rq_create(void *ctx, struct mlx5_devx_rq *rq_obj,
+		    uint32_t wqe_size, uint16_t log_wqbb_n,
+		    struct mlx5_devx_create_rq_attr *attr, int socket)
+{
+	if (rq_obj->rmp == NULL)
+		return mlx5_devx_rq_std_create(ctx, rq_obj, wqe_size,
+					       log_wqbb_n, attr, socket);
+	return mlx5_devx_rq_shared_create(ctx, rq_obj, wqe_size,
+					  log_wqbb_n, attr, socket);
+}
 
 /**
  * Change QP state to RTS.
diff --git a/drivers/common/mlx5/mlx5_common_devx.h b/drivers/common/mlx5/mlx5_common_devx.h
index f699405f69b..7ceac040f8b 100644
--- a/drivers/common/mlx5/mlx5_common_devx.h
+++ b/drivers/common/mlx5/mlx5_common_devx.h
@@ -45,14 +45,27 @@ struct mlx5_devx_qp {
 	volatile uint32_t *db_rec; /* The QP doorbell record. */
 };
 
-/* DevX Receive Queue structure. */
-struct mlx5_devx_rq {
-	struct mlx5_devx_obj *rq; /* The RQ DevX object. */
+/* DevX Receive Queue resource structure. */
+struct mlx5_devx_wq_res {
 	void *umem_obj; /* The RQ umem object. */
 	volatile void *umem_buf;
 	volatile uint32_t *db_rec; /* The RQ doorbell record. */
 };
 
+/* DevX Receive Memory Pool structure. */
+struct mlx5_devx_rmp {
+	struct mlx5_devx_obj *rmp; /* The RMP DevX object. */
+	uint32_t ref_cnt; /* Reference count. */
+	struct mlx5_devx_wq_res wq;
+};
+
+/* DevX Receive Queue structure. */
+struct mlx5_devx_rq {
+	struct mlx5_devx_obj *rq; /* The RQ DevX object. */
+	struct mlx5_devx_rmp *rmp; /* Shared RQ RMP object. */
+	struct mlx5_devx_wq_res wq; /* WQ resource of standalone RQ. */
+};
+
 /* mlx5_common_devx.c */
 
 __rte_internal
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 424f77be790..443252df05d 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -515,8 +515,8 @@ mlx5_rxq_devx_obj_new(struct rte_eth_dev *dev, uint16_t idx)
 	ret = mlx5_devx_modify_rq(tmpl, MLX5_RXQ_MOD_RST2RDY);
 	if (ret)
 		goto error;
-	rxq_data->wqes = (void *)(uintptr_t)tmpl->rq_obj.umem_buf;
-	rxq_data->rq_db = (uint32_t *)(uintptr_t)tmpl->rq_obj.db_rec;
+	rxq_data->wqes = (void *)(uintptr_t)tmpl->rq_obj.wq.umem_buf;
+	rxq_data->rq_db = (uint32_t *)(uintptr_t)tmpl->rq_obj.wq.db_rec;
 	rxq_data->cq_arm_sn = 0;
 	rxq_data->cq_ci = 0;
 	mlx5_rxq_initialize(rxq_data);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 05/14] net/mlx5: fix Rx queue memory allocation return value
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
                     ` (3 preceding siblings ...)
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 04/14] common/mlx5: support receive memory pool Xueming Li
@ 2021-11-04 12:33   ` Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 06/14] net/mlx5: clean Rx queue code Xueming Li
                     ` (9 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Lior Margalit, akozyrev, stable, Slava Ovsiienko, Matan Azrad

If an error happened during Rx queue mbuf allocation, a boolean value
was returned, while the function description states that an errno
value should be returned.

This patch makes the function return a negative errno value.
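
In sketch form, the broken pattern and the fix:

    /* Before: logical OR collapses any failure to the boolean 1. */
    return (ret || rxq_alloc_elts_sprq(rxq_ctrl));

    /* After: the negative errno from either allocator is preserved. */
    if (ret == 0)
        ret = rxq_alloc_elts_sprq(rxq_ctrl);
    return ret;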

Fixes: 0f20acbf5eda ("net/mlx5: implement vectorized MPRQ burst")
Cc: akozyrev@nvidia.com
Cc: stable@dpdk.org

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/mlx5_rxq.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 9220bb2c15c..4567b43c1b6 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -129,7 +129,7 @@ rxq_alloc_elts_mprq(struct mlx5_rxq_ctrl *rxq_ctrl)
  *   Pointer to RX queue structure.
  *
  * @return
- *   0 on success, errno value on failure.
+ *   0 on success, negative errno value on failure.
  */
 static int
 rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
@@ -220,7 +220,7 @@ rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
  *   Pointer to RX queue structure.
  *
  * @return
- *   0 on success, errno value on failure.
+ *   0 on success, negative errno value on failure.
  */
 int
 rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
@@ -233,7 +233,9 @@ rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl)
 	 */
 	if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq))
 		ret = rxq_alloc_elts_mprq(rxq_ctrl);
-	return (ret || rxq_alloc_elts_sprq(rxq_ctrl));
+	if (ret == 0)
+		ret = rxq_alloc_elts_sprq(rxq_ctrl);
+	return ret;
 }
 
 /**
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 06/14] net/mlx5: clean Rx queue code
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
                     ` (4 preceding siblings ...)
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 05/14] net/mlx5: fix Rx queue memory allocation return value Xueming Li
@ 2021-11-04 12:33   ` Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 07/14] net/mlx5: split Rx queue into shareable and private Xueming Li
                     ` (8 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Slava Ovsiienko, Matan Azrad

This patch removes unused Rx queue code.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/mlx5_rxq.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 4567b43c1b6..b2e4389ad60 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -674,9 +674,7 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		    struct rte_mempool *mp)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_rxq_ctrl *rxq_ctrl;
 	struct rte_eth_rxseg_split *rx_seg =
 				(struct rte_eth_rxseg_split *)conf->rx_seg;
 	struct rte_eth_rxseg_split rx_single = {.mp = mp};
@@ -743,9 +741,7 @@ mlx5_rx_hairpin_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
 			    const struct rte_eth_hairpin_conf *hairpin_conf)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_rxq_ctrl *rxq_ctrl;
 	int res;
 
 	res = mlx5_rx_queue_pre_setup(dev, idx, &desc);
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 07/14] net/mlx5: split Rx queue into shareable and private
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
                     ` (5 preceding siblings ...)
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 06/14] net/mlx5: clean Rx queue code Xueming Li
@ 2021-11-04 12:33   ` Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 08/14] net/mlx5: move Rx queue reference count Xueming Li
                     ` (7 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Slava Ovsiienko, Matan Azrad

To prepare for shared Rx queues, this patch splits the RxQ data into
shareable and private parts: struct mlx5_rxq_priv holds the per-queue
data, while struct mlx5_rxq_ctrl holds the shareable queue resources
and data.
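
For illustration, a minimal standalone sketch of the resulting
ownership model (simplified, hypothetical type names; not the driver
code): several per-port private queue objects point at one shared
control structure, which links them back through an owner list built
with the <sys/queue.h> macros, as in the patch.

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/queue.h>

    struct rxq_priv; /* forward declaration for the list head */

    struct rxq_ctrl {
        LIST_HEAD(owners_head, rxq_priv) owners; /* all owning queues */
    };

    struct rxq_priv {
        uint16_t idx;                     /* queue index in the port */
        struct rxq_ctrl *ctrl;            /* shared part */
        LIST_ENTRY(rxq_priv) owner_entry; /* linkage in ctrl->owners */
    };

    int
    main(void)
    {
        struct rxq_ctrl ctrl;
        struct rxq_priv q0 = { .idx = 0 }, q1 = { .idx = 1 };
        struct rxq_priv *it;

        LIST_INIT(&ctrl.owners);
        /* Each private queue attaches itself to the shared control. */
        q0.ctrl = &ctrl;
        LIST_INSERT_HEAD(&ctrl.owners, &q0, owner_entry);
        q1.ctrl = &ctrl;
        LIST_INSERT_HEAD(&ctrl.owners, &q1, owner_entry);
        LIST_FOREACH(it, &ctrl.owners, owner_entry)
            printf("queue %u shares ctrl %p\n",
                   it->idx, (void *)it->ctrl);
        return 0;
    }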

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/mlx5.c        |  4 +++
 drivers/net/mlx5/mlx5.h        |  5 ++-
 drivers/net/mlx5/mlx5_ethdev.c | 10 ++++++
 drivers/net/mlx5/mlx5_rx.h     | 17 +++++++--
 drivers/net/mlx5/mlx5_rxq.c    | 66 ++++++++++++++++++++++++++++------
 5 files changed, 88 insertions(+), 14 deletions(-)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index dc15688f216..374cc9757aa 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -1700,6 +1700,10 @@ mlx5_dev_close(struct rte_eth_dev *dev)
 		mlx5_free(dev->intr_handle);
 		dev->intr_handle = NULL;
 	}
+	if (priv->rxq_privs != NULL) {
+		mlx5_free(priv->rxq_privs);
+		priv->rxq_privs = NULL;
+	}
 	if (priv->txqs != NULL) {
 		/* XXX race condition if mlx5_tx_burst() is still running. */
 		rte_delay_us_sleep(1000);
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 74af88ec194..4e99fe7d068 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1345,6 +1345,8 @@ enum mlx5_txq_modify_type {
 	MLX5_TXQ_MOD_ERR2RDY, /* modify state from error to ready. */
 };
 
+struct mlx5_rxq_priv;
+
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
 	int (*rxq_obj_modify_vlan_strip)(struct mlx5_rxq_obj *rxq_obj, int on);
@@ -1408,7 +1410,8 @@ struct mlx5_priv {
 	/* RX/TX queues. */
 	unsigned int rxqs_n; /* RX queues array size. */
 	unsigned int txqs_n; /* TX queues array size. */
-	struct mlx5_rxq_data *(*rxqs)[]; /* RX queues. */
+	struct mlx5_rxq_priv *(*rxq_privs)[]; /* RX queue non-shared data. */
+	struct mlx5_rxq_data *(*rxqs)[]; /* (Shared) RX queues. */
 	struct mlx5_txq_data *(*txqs)[]; /* TX queues. */
 	struct rte_mempool *mprq_mp; /* Mempool for Multi-Packet RQ. */
 	struct rte_eth_rss_conf rss_conf; /* RSS configuration. */
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index 81fa8845bb5..cde505955df 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -104,6 +104,16 @@ mlx5_dev_configure(struct rte_eth_dev *dev)
 	       MLX5_RSS_HASH_KEY_LEN);
 	priv->rss_conf.rss_key_len = MLX5_RSS_HASH_KEY_LEN;
 	priv->rss_conf.rss_hf = dev->data->dev_conf.rx_adv_conf.rss_conf.rss_hf;
+	priv->rxq_privs = mlx5_realloc(priv->rxq_privs,
+				       MLX5_MEM_RTE | MLX5_MEM_ZERO,
+				       sizeof(void *) * rxqs_n, 0,
+				       SOCKET_ID_ANY);
+	if (priv->rxq_privs == NULL) {
+		DRV_LOG(ERR, "port %u cannot allocate rxq private data",
+			dev->data->port_id);
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
 	priv->rxqs = (void *)dev->data->rx_queues;
 	priv->txqs = (void *)dev->data->tx_queues;
 	if (txqs_n != priv->txqs_n) {
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 69b1263339e..fa24f5cdf3a 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -150,10 +150,14 @@ struct mlx5_rxq_ctrl {
 	struct mlx5_rxq_data rxq; /* Data path structure. */
 	LIST_ENTRY(mlx5_rxq_ctrl) next; /* Pointer to the next element. */
 	uint32_t refcnt; /* Reference counter. */
+	LIST_HEAD(priv, mlx5_rxq_priv) owners; /* Owner rxq list. */
 	struct mlx5_rxq_obj *obj; /* Verbs/DevX elements. */
+	struct mlx5_dev_ctx_shared *sh; /* Shared context. */
 	struct mlx5_priv *priv; /* Back pointer to private data. */
 	enum mlx5_rxq_type type; /* Rxq type. */
 	unsigned int socket; /* CPU socket ID for allocations. */
+	uint32_t share_group; /* Group ID of shared RXQ. */
+	uint16_t share_qid; /* Shared RxQ ID in group. */
 	unsigned int irq:1; /* Whether IRQ is enabled. */
 	uint32_t flow_mark_n; /* Number of Mark/Flag flows using this Queue. */
 	uint32_t flow_tunnels_n[MLX5_FLOW_TUNNEL]; /* Tunnels counters. */
@@ -163,6 +167,14 @@ struct mlx5_rxq_ctrl {
 	uint32_t hairpin_status; /* Hairpin binding status. */
 };
 
+/* RX queue private data. */
+struct mlx5_rxq_priv {
+	uint16_t idx; /* Queue index. */
+	struct mlx5_rxq_ctrl *ctrl; /* Shared Rx Queue. */
+	LIST_ENTRY(mlx5_rxq_priv) owner_entry; /* Entry in shared rxq_ctrl. */
+	struct mlx5_priv *priv; /* Back pointer to private data. */
+};
+
 /* mlx5_rxq.c */
 
 extern uint8_t rss_hash_default_key[];
@@ -186,13 +198,14 @@ void mlx5_rx_intr_vec_disable(struct rte_eth_dev *dev);
 int mlx5_rx_intr_enable(struct rte_eth_dev *dev, uint16_t rx_queue_id);
 int mlx5_rx_intr_disable(struct rte_eth_dev *dev, uint16_t rx_queue_id);
 int mlx5_rxq_obj_verify(struct rte_eth_dev *dev);
-struct mlx5_rxq_ctrl *mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx,
+struct mlx5_rxq_ctrl *mlx5_rxq_new(struct rte_eth_dev *dev,
+				   struct mlx5_rxq_priv *rxq,
 				   uint16_t desc, unsigned int socket,
 				   const struct rte_eth_rxconf *conf,
 				   const struct rte_eth_rxseg_split *rx_seg,
 				   uint16_t n_seg);
 struct mlx5_rxq_ctrl *mlx5_rxq_hairpin_new
-	(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
+	(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq, uint16_t desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
 struct mlx5_rxq_ctrl *mlx5_rxq_get(struct rte_eth_dev *dev, uint16_t idx);
 int mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx);
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index b2e4389ad60..00df245a5c6 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -674,6 +674,7 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		    struct rte_mempool *mp)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_rxq_priv *rxq;
 	struct mlx5_rxq_ctrl *rxq_ctrl;
 	struct rte_eth_rxseg_split *rx_seg =
 				(struct rte_eth_rxseg_split *)conf->rx_seg;
@@ -708,10 +709,23 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 	res = mlx5_rx_queue_pre_setup(dev, idx, &desc);
 	if (res)
 		return res;
-	rxq_ctrl = mlx5_rxq_new(dev, idx, desc, socket, conf, rx_seg, n_seg);
+	rxq = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO, sizeof(*rxq), 0,
+			  SOCKET_ID_ANY);
+	if (!rxq) {
+		DRV_LOG(ERR, "port %u unable to allocate rx queue index %u private data",
+			dev->data->port_id, idx);
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
+	rxq->priv = priv;
+	rxq->idx = idx;
+	(*priv->rxq_privs)[idx] = rxq;
+	rxq_ctrl = mlx5_rxq_new(dev, rxq, desc, socket, conf, rx_seg, n_seg);
 	if (!rxq_ctrl) {
-		DRV_LOG(ERR, "port %u unable to allocate queue index %u",
+		DRV_LOG(ERR, "port %u unable to allocate rx queue index %u",
 			dev->data->port_id, idx);
+		mlx5_free(rxq);
+		(*priv->rxq_privs)[idx] = NULL;
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
@@ -741,6 +755,7 @@ mlx5_rx_hairpin_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
 			    const struct rte_eth_hairpin_conf *hairpin_conf)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_rxq_priv *rxq;
 	struct mlx5_rxq_ctrl *rxq_ctrl;
 	int res;
 
@@ -776,14 +791,27 @@ mlx5_rx_hairpin_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
 			return -rte_errno;
 		}
 	}
-	rxq_ctrl = mlx5_rxq_hairpin_new(dev, idx, desc, hairpin_conf);
+	rxq = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO, sizeof(*rxq), 0,
+			  SOCKET_ID_ANY);
+	if (!rxq) {
+		DRV_LOG(ERR, "port %u unable to allocate hairpin rx queue index %u private data",
+			dev->data->port_id, idx);
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
+	rxq->priv = priv;
+	rxq->idx = idx;
+	(*priv->rxq_privs)[idx] = rxq;
+	rxq_ctrl = mlx5_rxq_hairpin_new(dev, rxq, desc, hairpin_conf);
 	if (!rxq_ctrl) {
-		DRV_LOG(ERR, "port %u unable to allocate queue index %u",
+		DRV_LOG(ERR, "port %u unable to allocate hairpin queue index %u",
 			dev->data->port_id, idx);
+		mlx5_free(rxq);
+		(*priv->rxq_privs)[idx] = NULL;
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
-	DRV_LOG(DEBUG, "port %u adding Rx queue %u to list",
+	DRV_LOG(DEBUG, "port %u adding hairpin Rx queue %u to list",
 		dev->data->port_id, idx);
 	(*priv->rxqs)[idx] = &rxq_ctrl->rxq;
 	return 0;
@@ -1319,8 +1347,8 @@ mlx5_max_lro_msg_size_adjust(struct rte_eth_dev *dev, uint16_t idx,
  *
  * @param dev
  *   Pointer to Ethernet device.
- * @param idx
- *   RX queue index.
+ * @param rxq
+ *   RX queue private data.
  * @param desc
  *   Number of descriptors to configure in queue.
  * @param socket
@@ -1330,10 +1358,12 @@ mlx5_max_lro_msg_size_adjust(struct rte_eth_dev *dev, uint16_t idx,
  *   A DPDK queue object on success, NULL otherwise and rte_errno is set.
  */
 struct mlx5_rxq_ctrl *
-mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
+mlx5_rxq_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
+	     uint16_t desc,
 	     unsigned int socket, const struct rte_eth_rxconf *conf,
 	     const struct rte_eth_rxseg_split *rx_seg, uint16_t n_seg)
 {
+	uint16_t idx = rxq->idx;
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_rxq_ctrl *tmpl;
 	unsigned int mb_len = rte_pktmbuf_data_room_size(rx_seg[0].mp);
@@ -1377,6 +1407,9 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		rte_errno = ENOMEM;
 		return NULL;
 	}
+	LIST_INIT(&tmpl->owners);
+	rxq->ctrl = tmpl;
+	LIST_INSERT_HEAD(&tmpl->owners, rxq, owner_entry);
 	MLX5_ASSERT(n_seg && n_seg <= MLX5_MAX_RXQ_NSEG);
 	/*
 	 * Build the array of actual buffer offsets and lengths.
@@ -1610,6 +1643,7 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 	tmpl->rxq.rss_hash = !!priv->rss_conf.rss_hf &&
 		(!!(dev->data->dev_conf.rxmode.mq_mode & RTE_ETH_MQ_RX_RSS));
 	tmpl->rxq.port_id = dev->data->port_id;
+	tmpl->sh = priv->sh;
 	tmpl->priv = priv;
 	tmpl->rxq.mp = rx_seg[0].mp;
 	tmpl->rxq.elts_n = log2above(desc);
@@ -1637,8 +1671,8 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
  *
  * @param dev
  *   Pointer to Ethernet device.
- * @param idx
- *   RX queue index.
+ * @param rxq
+ *   RX queue.
  * @param desc
  *   Number of descriptors to configure in queue.
  * @param hairpin_conf
@@ -1648,9 +1682,11 @@ mlx5_rxq_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
  *   A DPDK queue object on success, NULL otherwise and rte_errno is set.
  */
 struct mlx5_rxq_ctrl *
-mlx5_rxq_hairpin_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
+mlx5_rxq_hairpin_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
+		     uint16_t desc,
 		     const struct rte_eth_hairpin_conf *hairpin_conf)
 {
+	uint16_t idx = rxq->idx;
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_rxq_ctrl *tmpl;
 
@@ -1660,10 +1696,14 @@ mlx5_rxq_hairpin_new(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		rte_errno = ENOMEM;
 		return NULL;
 	}
+	LIST_INIT(&tmpl->owners);
+	rxq->ctrl = tmpl;
+	LIST_INSERT_HEAD(&tmpl->owners, rxq, owner_entry);
 	tmpl->type = MLX5_RXQ_TYPE_HAIRPIN;
 	tmpl->socket = SOCKET_ID_ANY;
 	tmpl->rxq.rss_hash = 0;
 	tmpl->rxq.port_id = dev->data->port_id;
+	tmpl->sh = priv->sh;
 	tmpl->priv = priv;
 	tmpl->rxq.mp = NULL;
 	tmpl->rxq.elts_n = log2above(desc);
@@ -1717,6 +1757,7 @@ mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct mlx5_rxq_priv *rxq = (*priv->rxq_privs)[idx];
 
 	if (priv->rxqs == NULL || (*priv->rxqs)[idx] == NULL)
 		return 0;
@@ -1736,9 +1777,12 @@ mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 	if (!__atomic_load_n(&rxq_ctrl->refcnt, __ATOMIC_RELAXED)) {
 		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD)
 			mlx5_mr_btree_free(&rxq_ctrl->rxq.mr_ctrl.cache_bh);
+		LIST_REMOVE(rxq, owner_entry);
 		LIST_REMOVE(rxq_ctrl, next);
 		mlx5_free(rxq_ctrl);
 		(*priv->rxqs)[idx] = NULL;
+		mlx5_free(rxq);
+		(*priv->rxq_privs)[idx] = NULL;
 	}
 	return 0;
 }
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 08/14] net/mlx5: move Rx queue reference count
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
                     ` (6 preceding siblings ...)
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 07/14] net/mlx5: split Rx queue into shareable and private Xueming Li
@ 2021-11-04 12:33   ` Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 09/14] net/mlx5: move Rx queue hairpin info to private data Xueming Li
                     ` (6 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Slava Ovsiienko, Matan Azrad

The Rx queue reference count tracks references to the RQ object. To
prepare for shared Rx queues, this patch moves it from rxq_ctrl to the
Rx queue private data.
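
As a rough sketch (simplified stand-ins, not the driver code), the
counting scheme can be pictured as a relaxed atomic counter kept in
the private data, with a ref helper returning the object and a deref
helper returning the updated count so the caller knows when to free:

    #include <stdint.h>
    #include <stdio.h>

    struct rxq_priv {
        uint32_t refcnt;
    };

    static struct rxq_priv *
    rxq_ref(struct rxq_priv *rxq)
    {
        if (rxq != NULL)
            __atomic_fetch_add(&rxq->refcnt, 1, __ATOMIC_RELAXED);
        return rxq;
    }

    static uint32_t
    rxq_deref(struct rxq_priv *rxq)
    {
        if (rxq == NULL)
            return 0;
        return __atomic_sub_fetch(&rxq->refcnt, 1, __ATOMIC_RELAXED);
    }

    int
    main(void)
    {
        struct rxq_priv q = { .refcnt = 0 };

        rxq_ref(&q);            /* e.g. taken by an indirection table */
        rxq_ref(&q);            /* e.g. taken by the interrupt vector */
        if (rxq_deref(&q) == 0) /* still held once, so not released */
            printf("would release here\n");
        printf("remaining references: %u\n",
               __atomic_load_n(&q.refcnt, __ATOMIC_RELAXED));
        return 0;
    }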

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/mlx5_rx.h      |   8 +-
 drivers/net/mlx5/mlx5_rxq.c     | 169 +++++++++++++++++++++-----------
 drivers/net/mlx5/mlx5_trigger.c |  57 +++++------
 3 files changed, 142 insertions(+), 92 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index fa24f5cdf3a..eccfbf1108d 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -149,7 +149,6 @@ enum mlx5_rxq_type {
 struct mlx5_rxq_ctrl {
 	struct mlx5_rxq_data rxq; /* Data path structure. */
 	LIST_ENTRY(mlx5_rxq_ctrl) next; /* Pointer to the next element. */
-	uint32_t refcnt; /* Reference counter. */
 	LIST_HEAD(priv, mlx5_rxq_priv) owners; /* Owner rxq list. */
 	struct mlx5_rxq_obj *obj; /* Verbs/DevX elements. */
 	struct mlx5_dev_ctx_shared *sh; /* Shared context. */
@@ -170,6 +169,7 @@ struct mlx5_rxq_ctrl {
 /* RX queue private data. */
 struct mlx5_rxq_priv {
 	uint16_t idx; /* Queue index. */
+	uint32_t refcnt; /* Reference counter. */
 	struct mlx5_rxq_ctrl *ctrl; /* Shared Rx Queue. */
 	LIST_ENTRY(mlx5_rxq_priv) owner_entry; /* Entry in shared rxq_ctrl. */
 	struct mlx5_priv *priv; /* Back pointer to private data. */
@@ -207,7 +207,11 @@ struct mlx5_rxq_ctrl *mlx5_rxq_new(struct rte_eth_dev *dev,
 struct mlx5_rxq_ctrl *mlx5_rxq_hairpin_new
 	(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq, uint16_t desc,
 	 const struct rte_eth_hairpin_conf *hairpin_conf);
-struct mlx5_rxq_ctrl *mlx5_rxq_get(struct rte_eth_dev *dev, uint16_t idx);
+struct mlx5_rxq_priv *mlx5_rxq_ref(struct rte_eth_dev *dev, uint16_t idx);
+uint32_t mlx5_rxq_deref(struct rte_eth_dev *dev, uint16_t idx);
+struct mlx5_rxq_priv *mlx5_rxq_get(struct rte_eth_dev *dev, uint16_t idx);
+struct mlx5_rxq_ctrl *mlx5_rxq_ctrl_get(struct rte_eth_dev *dev, uint16_t idx);
+struct mlx5_rxq_data *mlx5_rxq_data_get(struct rte_eth_dev *dev, uint16_t idx);
 int mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx);
 int mlx5_rxq_verify(struct rte_eth_dev *dev);
 int rxq_alloc_elts(struct mlx5_rxq_ctrl *rxq_ctrl);
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 00df245a5c6..8071ddbd61c 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -386,15 +386,13 @@ mlx5_get_rx_port_offloads(void)
 static int
 mlx5_rxq_releasable(struct rte_eth_dev *dev, uint16_t idx)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
 
-	if (!(*priv->rxqs)[idx]) {
+	if (rxq == NULL) {
 		rte_errno = EINVAL;
 		return -rte_errno;
 	}
-	rxq_ctrl = container_of((*priv->rxqs)[idx], struct mlx5_rxq_ctrl, rxq);
-	return (__atomic_load_n(&rxq_ctrl->refcnt, __ATOMIC_RELAXED) == 1);
+	return (__atomic_load_n(&rxq->refcnt, __ATOMIC_RELAXED) == 1);
 }
 
 /* Fetches and drops all SW-owned and error CQEs to synchronize CQ. */
@@ -874,8 +872,8 @@ mlx5_rx_intr_vec_enable(struct rte_eth_dev *dev)
 
 	for (i = 0; i != n; ++i) {
 		/* This rxq obj must not be released in this function. */
-		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_get(dev, i);
-		struct mlx5_rxq_obj *rxq_obj = rxq_ctrl ? rxq_ctrl->obj : NULL;
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, i);
+		struct mlx5_rxq_obj *rxq_obj = rxq ? rxq->ctrl->obj : NULL;
 		int rc;
 
 		/* Skip queues that cannot request interrupts. */
@@ -885,11 +883,9 @@ mlx5_rx_intr_vec_enable(struct rte_eth_dev *dev)
 			if (rte_intr_vec_list_index_set(intr_handle, i,
 			   RTE_INTR_VEC_RXTX_OFFSET + RTE_MAX_RXTX_INTR_VEC_ID))
 				return -rte_errno;
-			/* Decrease the rxq_ctrl's refcnt */
-			if (rxq_ctrl)
-				mlx5_rxq_release(dev, i);
 			continue;
 		}
+		mlx5_rxq_ref(dev, i);
 		if (count >= RTE_MAX_RXTX_INTR_VEC_ID) {
 			DRV_LOG(ERR,
 				"port %u too many Rx queues for interrupt"
@@ -954,7 +950,7 @@ mlx5_rx_intr_vec_disable(struct rte_eth_dev *dev)
 		 * Need to access directly the queue to release the reference
 		 * kept in mlx5_rx_intr_vec_enable().
 		 */
-		mlx5_rxq_release(dev, i);
+		mlx5_rxq_deref(dev, i);
 	}
 free:
 	rte_intr_free_epoll_fd(intr_handle);
@@ -1003,19 +999,14 @@ mlx5_arm_cq(struct mlx5_rxq_data *rxq, int sq_n_rxq)
 int
 mlx5_rx_intr_enable(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 {
-	struct mlx5_rxq_ctrl *rxq_ctrl;
-
-	rxq_ctrl = mlx5_rxq_get(dev, rx_queue_id);
-	if (!rxq_ctrl)
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, rx_queue_id);
+	if (!rxq)
 		goto error;
-	if (rxq_ctrl->irq) {
-		if (!rxq_ctrl->obj) {
-			mlx5_rxq_release(dev, rx_queue_id);
+	if (rxq->ctrl->irq) {
+		if (!rxq->ctrl->obj)
 			goto error;
-		}
-		mlx5_arm_cq(&rxq_ctrl->rxq, rxq_ctrl->rxq.cq_arm_sn);
+		mlx5_arm_cq(&rxq->ctrl->rxq, rxq->ctrl->rxq.cq_arm_sn);
 	}
-	mlx5_rxq_release(dev, rx_queue_id);
 	return 0;
 error:
 	rte_errno = EINVAL;
@@ -1037,23 +1028,21 @@ int
 mlx5_rx_intr_disable(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, rx_queue_id);
 	int ret = 0;
 
-	rxq_ctrl = mlx5_rxq_get(dev, rx_queue_id);
-	if (!rxq_ctrl) {
+	if (!rxq) {
 		rte_errno = EINVAL;
 		return -rte_errno;
 	}
-	if (!rxq_ctrl->obj)
+	if (!rxq->ctrl->obj)
 		goto error;
-	if (rxq_ctrl->irq) {
-		ret = priv->obj_ops.rxq_event_get(rxq_ctrl->obj);
+	if (rxq->ctrl->irq) {
+		ret = priv->obj_ops.rxq_event_get(rxq->ctrl->obj);
 		if (ret < 0)
 			goto error;
-		rxq_ctrl->rxq.cq_arm_sn++;
+		rxq->ctrl->rxq.cq_arm_sn++;
 	}
-	mlx5_rxq_release(dev, rx_queue_id);
 	return 0;
 error:
 	/**
@@ -1064,12 +1053,9 @@ mlx5_rx_intr_disable(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 		rte_errno = errno;
 	else
 		rte_errno = EINVAL;
-	ret = rte_errno; /* Save rte_errno before cleanup. */
-	mlx5_rxq_release(dev, rx_queue_id);
-	if (ret != EAGAIN)
+	if (rte_errno != EAGAIN)
 		DRV_LOG(WARNING, "port %u unable to disable interrupt on Rx queue %d",
 			dev->data->port_id, rx_queue_id);
-	rte_errno = ret; /* Restore rte_errno. */
 	return -rte_errno;
 }
 
@@ -1657,7 +1643,7 @@ mlx5_rxq_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
 	tmpl->rxq.uar_lock_cq = &priv->sh->uar_lock_cq;
 #endif
 	tmpl->rxq.idx = idx;
-	__atomic_fetch_add(&tmpl->refcnt, 1, __ATOMIC_RELAXED);
+	mlx5_rxq_ref(dev, idx);
 	LIST_INSERT_HEAD(&priv->rxqsctrl, tmpl, next);
 	return tmpl;
 error:
@@ -1711,11 +1697,53 @@ mlx5_rxq_hairpin_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
 	tmpl->rxq.mr_ctrl.cache_bh = (struct mlx5_mr_btree) { 0 };
 	tmpl->hairpin_conf = *hairpin_conf;
 	tmpl->rxq.idx = idx;
-	__atomic_fetch_add(&tmpl->refcnt, 1, __ATOMIC_RELAXED);
+	mlx5_rxq_ref(dev, idx);
 	LIST_INSERT_HEAD(&priv->rxqsctrl, tmpl, next);
 	return tmpl;
 }
 
+/**
+ * Increase Rx queue reference count.
+ *
+ * @param dev
+ *   Pointer to Ethernet device.
+ * @param idx
+ *   RX queue index.
+ *
+ * @return
+ *   A pointer to the queue if it exists, NULL otherwise.
+ */
+struct mlx5_rxq_priv *
+mlx5_rxq_ref(struct rte_eth_dev *dev, uint16_t idx)
+{
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
+
+	if (rxq != NULL)
+		__atomic_fetch_add(&rxq->refcnt, 1, __ATOMIC_RELAXED);
+	return rxq;
+}
+
+/**
+ * Dereference a Rx queue.
+ *
+ * @param dev
+ *   Pointer to Ethernet device.
+ * @param idx
+ *   RX queue index.
+ *
+ * @return
+ *   Updated reference count.
+ */
+uint32_t
+mlx5_rxq_deref(struct rte_eth_dev *dev, uint16_t idx)
+{
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
+
+	if (rxq == NULL)
+		return 0;
+	return __atomic_sub_fetch(&rxq->refcnt, 1, __ATOMIC_RELAXED);
+}
+
 /**
  * Get a Rx queue.
  *
@@ -1727,18 +1755,52 @@ mlx5_rxq_hairpin_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
  * @return
  *   A pointer to the queue if it exists, NULL otherwise.
  */
-struct mlx5_rxq_ctrl *
+struct mlx5_rxq_priv *
 mlx5_rxq_get(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq_data = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl = NULL;
 
-	if (rxq_data) {
-		rxq_ctrl = container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
-		__atomic_fetch_add(&rxq_ctrl->refcnt, 1, __ATOMIC_RELAXED);
-	}
-	return rxq_ctrl;
+	if (priv->rxq_privs == NULL)
+		return NULL;
+	return (*priv->rxq_privs)[idx];
+}
+
+/**
+ * Get Rx queue shareable control.
+ *
+ * @param dev
+ *   Pointer to Ethernet device.
+ * @param idx
+ *   RX queue index.
+ *
+ * @return
+ *   A pointer to the queue control if it exists, NULL otherwise.
+ */
+struct mlx5_rxq_ctrl *
+mlx5_rxq_ctrl_get(struct rte_eth_dev *dev, uint16_t idx)
+{
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
+
+	return rxq == NULL ? NULL : rxq->ctrl;
+}
+
+/**
+ * Get Rx queue shareable data.
+ *
+ * @param dev
+ *   Pointer to Ethernet device.
+ * @param idx
+ *   RX queue index.
+ *
+ * @return
+ *   A pointer to the queue data if it exists, NULL otherwise.
+ */
+struct mlx5_rxq_data *
+mlx5_rxq_data_get(struct rte_eth_dev *dev, uint16_t idx)
+{
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
+
+	return rxq == NULL ? NULL : &rxq->ctrl->rxq;
 }
 
 /**
@@ -1756,13 +1818,12 @@ int
 mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_ctrl *rxq_ctrl;
-	struct mlx5_rxq_priv *rxq = (*priv->rxq_privs)[idx];
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
 
 	if (priv->rxqs == NULL || (*priv->rxqs)[idx] == NULL)
 		return 0;
-	rxq_ctrl = container_of((*priv->rxqs)[idx], struct mlx5_rxq_ctrl, rxq);
-	if (__atomic_sub_fetch(&rxq_ctrl->refcnt, 1, __ATOMIC_RELAXED) > 1)
+	if (mlx5_rxq_deref(dev, idx) > 1)
 		return 1;
 	if (rxq_ctrl->obj) {
 		priv->obj_ops.rxq_obj_release(rxq_ctrl->obj);
@@ -1774,7 +1835,7 @@ mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 		rxq_free_elts(rxq_ctrl);
 		dev->data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_STOPPED;
 	}
-	if (!__atomic_load_n(&rxq_ctrl->refcnt, __ATOMIC_RELAXED)) {
+	if (!__atomic_load_n(&rxq->refcnt, __ATOMIC_RELAXED)) {
 		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD)
 			mlx5_mr_btree_free(&rxq_ctrl->rxq.mr_ctrl.cache_bh);
 		LIST_REMOVE(rxq, owner_entry);
@@ -1952,7 +2013,7 @@ mlx5_ind_table_obj_release(struct rte_eth_dev *dev,
 		return 1;
 	priv->obj_ops.ind_table_destroy(ind_tbl);
 	for (i = 0; i != ind_tbl->queues_n; ++i)
-		claim_nonzero(mlx5_rxq_release(dev, ind_tbl->queues[i]));
+		claim_nonzero(mlx5_rxq_deref(dev, ind_tbl->queues[i]));
 	mlx5_free(ind_tbl);
 	return 0;
 }
@@ -2009,7 +2070,7 @@ mlx5_ind_table_obj_setup(struct rte_eth_dev *dev,
 			       log2above(priv->config.ind_table_max_size);
 
 	for (i = 0; i != queues_n; ++i) {
-		if (!mlx5_rxq_get(dev, queues[i])) {
+		if (mlx5_rxq_ref(dev, queues[i]) == NULL) {
 			ret = -rte_errno;
 			goto error;
 		}
@@ -2022,7 +2083,7 @@ mlx5_ind_table_obj_setup(struct rte_eth_dev *dev,
 error:
 	err = rte_errno;
 	for (j = 0; j < i; j++)
-		mlx5_rxq_release(dev, ind_tbl->queues[j]);
+		mlx5_rxq_deref(dev, ind_tbl->queues[j]);
 	rte_errno = err;
 	DRV_LOG(DEBUG, "Port %u cannot setup indirection table.",
 		dev->data->port_id);
@@ -2118,7 +2179,7 @@ mlx5_ind_table_obj_modify(struct rte_eth_dev *dev,
 			  bool standalone)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	unsigned int i, j;
+	unsigned int i;
 	int ret = 0, err;
 	const unsigned int n = rte_is_power_of_2(queues_n) ?
 			       log2above(queues_n) :
@@ -2138,15 +2199,11 @@ mlx5_ind_table_obj_modify(struct rte_eth_dev *dev,
 	ret = priv->obj_ops.ind_table_modify(dev, n, queues, queues_n, ind_tbl);
 	if (ret)
 		goto error;
-	for (j = 0; j < ind_tbl->queues_n; j++)
-		mlx5_rxq_release(dev, ind_tbl->queues[j]);
 	ind_tbl->queues_n = queues_n;
 	ind_tbl->queues = queues;
 	return 0;
 error:
 	err = rte_errno;
-	for (j = 0; j < i; j++)
-		mlx5_rxq_release(dev, queues[j]);
 	rte_errno = err;
 	DRV_LOG(DEBUG, "Port %u cannot setup indirection table.",
 		dev->data->port_id);
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index ebeeae279e2..e5d74d275f8 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -201,10 +201,12 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 	DRV_LOG(DEBUG, "Port %u device_attr.max_sge is %d.",
 		dev->data->port_id, priv->sh->device_attr.max_sge);
 	for (i = 0; i != priv->rxqs_n; ++i) {
-		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_get(dev, i);
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_ref(dev, i);
+		struct mlx5_rxq_ctrl *rxq_ctrl;
 
-		if (!rxq_ctrl)
+		if (rxq == NULL)
 			continue;
+		rxq_ctrl = rxq->ctrl;
 		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
 			/*
 			 * Pre-register the mempools. Regardless of whether
@@ -266,6 +268,7 @@ mlx5_hairpin_auto_bind(struct rte_eth_dev *dev)
 	struct mlx5_devx_modify_sq_attr sq_attr = { 0 };
 	struct mlx5_devx_modify_rq_attr rq_attr = { 0 };
 	struct mlx5_txq_ctrl *txq_ctrl;
+	struct mlx5_rxq_priv *rxq;
 	struct mlx5_rxq_ctrl *rxq_ctrl;
 	struct mlx5_devx_obj *sq;
 	struct mlx5_devx_obj *rq;
@@ -310,9 +313,8 @@ mlx5_hairpin_auto_bind(struct rte_eth_dev *dev)
 			return -rte_errno;
 		}
 		sq = txq_ctrl->obj->sq;
-		rxq_ctrl = mlx5_rxq_get(dev,
-					txq_ctrl->hairpin_conf.peers[0].queue);
-		if (!rxq_ctrl) {
+		rxq = mlx5_rxq_get(dev, txq_ctrl->hairpin_conf.peers[0].queue);
+		if (rxq == NULL) {
 			mlx5_txq_release(dev, i);
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u no rxq object found: %d",
@@ -320,6 +322,7 @@ mlx5_hairpin_auto_bind(struct rte_eth_dev *dev)
 				txq_ctrl->hairpin_conf.peers[0].queue);
 			return -rte_errno;
 		}
+		rxq_ctrl = rxq->ctrl;
 		if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN ||
 		    rxq_ctrl->hairpin_conf.peers[0].queue != i) {
 			rte_errno = ENOMEM;
@@ -354,12 +357,10 @@ mlx5_hairpin_auto_bind(struct rte_eth_dev *dev)
 		rxq_ctrl->hairpin_status = 1;
 		txq_ctrl->hairpin_status = 1;
 		mlx5_txq_release(dev, i);
-		mlx5_rxq_release(dev, txq_ctrl->hairpin_conf.peers[0].queue);
 	}
 	return 0;
 error:
 	mlx5_txq_release(dev, i);
-	mlx5_rxq_release(dev, txq_ctrl->hairpin_conf.peers[0].queue);
 	return -rte_errno;
 }
 
@@ -432,27 +433,26 @@ mlx5_hairpin_queue_peer_update(struct rte_eth_dev *dev, uint16_t peer_queue,
 		peer_info->manual_bind = txq_ctrl->hairpin_conf.manual_bind;
 		mlx5_txq_release(dev, peer_queue);
 	} else { /* Peer port used as ingress. */
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, peer_queue);
 		struct mlx5_rxq_ctrl *rxq_ctrl;
 
-		rxq_ctrl = mlx5_rxq_get(dev, peer_queue);
-		if (rxq_ctrl == NULL) {
+		if (rxq == NULL) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "Failed to get port %u Rx queue %d",
 				dev->data->port_id, peer_queue);
 			return -rte_errno;
 		}
+		rxq_ctrl = rxq->ctrl;
 		if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u queue %d is not a hairpin Rxq",
 				dev->data->port_id, peer_queue);
-			mlx5_rxq_release(dev, peer_queue);
 			return -rte_errno;
 		}
 		if (rxq_ctrl->obj == NULL || rxq_ctrl->obj->rq == NULL) {
 			rte_errno = ENOMEM;
 			DRV_LOG(ERR, "port %u no Rxq object found: %d",
 				dev->data->port_id, peer_queue);
-			mlx5_rxq_release(dev, peer_queue);
 			return -rte_errno;
 		}
 		peer_info->qp_id = rxq_ctrl->obj->rq->id;
@@ -460,7 +460,6 @@ mlx5_hairpin_queue_peer_update(struct rte_eth_dev *dev, uint16_t peer_queue,
 		peer_info->peer_q = rxq_ctrl->hairpin_conf.peers[0].queue;
 		peer_info->tx_explicit = rxq_ctrl->hairpin_conf.tx_explicit;
 		peer_info->manual_bind = rxq_ctrl->hairpin_conf.manual_bind;
-		mlx5_rxq_release(dev, peer_queue);
 	}
 	return 0;
 }
@@ -559,34 +558,32 @@ mlx5_hairpin_queue_peer_bind(struct rte_eth_dev *dev, uint16_t cur_queue,
 			txq_ctrl->hairpin_status = 1;
 		mlx5_txq_release(dev, cur_queue);
 	} else {
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, cur_queue);
 		struct mlx5_rxq_ctrl *rxq_ctrl;
 		struct mlx5_devx_modify_rq_attr rq_attr = { 0 };
 
-		rxq_ctrl = mlx5_rxq_get(dev, cur_queue);
-		if (rxq_ctrl == NULL) {
+		if (rxq == NULL) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "Failed to get port %u Rx queue %d",
 				dev->data->port_id, cur_queue);
 			return -rte_errno;
 		}
+		rxq_ctrl = rxq->ctrl;
 		if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u queue %d not a hairpin Rxq",
 				dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return -rte_errno;
 		}
 		if (rxq_ctrl->obj == NULL || rxq_ctrl->obj->rq == NULL) {
 			rte_errno = ENOMEM;
 			DRV_LOG(ERR, "port %u no Rxq object found: %d",
 				dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return -rte_errno;
 		}
 		if (rxq_ctrl->hairpin_status != 0) {
 			DRV_LOG(DEBUG, "port %u Rx queue %d is already bound",
 				dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return 0;
 		}
 		if (peer_info->tx_explicit !=
@@ -594,7 +591,6 @@ mlx5_hairpin_queue_peer_bind(struct rte_eth_dev *dev, uint16_t cur_queue,
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u Rx queue %d and peer Tx rule mode"
 				" mismatch", dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return -rte_errno;
 		}
 		if (peer_info->manual_bind !=
@@ -602,7 +598,6 @@ mlx5_hairpin_queue_peer_bind(struct rte_eth_dev *dev, uint16_t cur_queue,
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u Rx queue %d and peer binding mode"
 				" mismatch", dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return -rte_errno;
 		}
 		rq_attr.state = MLX5_SQC_STATE_RDY;
@@ -612,7 +607,6 @@ mlx5_hairpin_queue_peer_bind(struct rte_eth_dev *dev, uint16_t cur_queue,
 		ret = mlx5_devx_cmd_modify_rq(rxq_ctrl->obj->rq, &rq_attr);
 		if (ret == 0)
 			rxq_ctrl->hairpin_status = 1;
-		mlx5_rxq_release(dev, cur_queue);
 	}
 	return ret;
 }
@@ -677,34 +671,32 @@ mlx5_hairpin_queue_peer_unbind(struct rte_eth_dev *dev, uint16_t cur_queue,
 			txq_ctrl->hairpin_status = 0;
 		mlx5_txq_release(dev, cur_queue);
 	} else {
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, cur_queue);
 		struct mlx5_rxq_ctrl *rxq_ctrl;
 		struct mlx5_devx_modify_rq_attr rq_attr = { 0 };
 
-		rxq_ctrl = mlx5_rxq_get(dev, cur_queue);
-		if (rxq_ctrl == NULL) {
+		if (rxq == NULL) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "Failed to get port %u Rx queue %d",
 				dev->data->port_id, cur_queue);
 			return -rte_errno;
 		}
+		rxq_ctrl = rxq->ctrl;
 		if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u queue %d not a hairpin Rxq",
 				dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return -rte_errno;
 		}
 		if (rxq_ctrl->hairpin_status == 0) {
 			DRV_LOG(DEBUG, "port %u Rx queue %d is already unbound",
 				dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return 0;
 		}
 		if (rxq_ctrl->obj == NULL || rxq_ctrl->obj->rq == NULL) {
 			rte_errno = ENOMEM;
 			DRV_LOG(ERR, "port %u no Rxq object found: %d",
 				dev->data->port_id, cur_queue);
-			mlx5_rxq_release(dev, cur_queue);
 			return -rte_errno;
 		}
 		rq_attr.state = MLX5_SQC_STATE_RST;
@@ -712,7 +704,6 @@ mlx5_hairpin_queue_peer_unbind(struct rte_eth_dev *dev, uint16_t cur_queue,
 		ret = mlx5_devx_cmd_modify_rq(rxq_ctrl->obj->rq, &rq_attr);
 		if (ret == 0)
 			rxq_ctrl->hairpin_status = 0;
-		mlx5_rxq_release(dev, cur_queue);
 	}
 	return ret;
 }
@@ -1014,7 +1005,6 @@ mlx5_hairpin_get_peer_ports(struct rte_eth_dev *dev, uint16_t *peer_ports,
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_txq_ctrl *txq_ctrl;
-	struct mlx5_rxq_ctrl *rxq_ctrl;
 	uint32_t i;
 	uint16_t pp;
 	uint32_t bits[(RTE_MAX_ETHPORTS + 31) / 32] = {0};
@@ -1043,24 +1033,23 @@ mlx5_hairpin_get_peer_ports(struct rte_eth_dev *dev, uint16_t *peer_ports,
 		}
 	} else {
 		for (i = 0; i < priv->rxqs_n; i++) {
-			rxq_ctrl = mlx5_rxq_get(dev, i);
-			if (!rxq_ctrl)
+			struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, i);
+			struct mlx5_rxq_ctrl *rxq_ctrl;
+
+			if (rxq == NULL)
 				continue;
-			if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN) {
-				mlx5_rxq_release(dev, i);
+			rxq_ctrl = rxq->ctrl;
+			if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN)
 				continue;
-			}
 			pp = rxq_ctrl->hairpin_conf.peers[0].port;
 			if (pp >= RTE_MAX_ETHPORTS) {
 				rte_errno = ERANGE;
-				mlx5_rxq_release(dev, i);
 				DRV_LOG(ERR, "port %hu queue %u peer port "
 					"out of range %hu",
 					priv->dev_data->port_id, i, pp);
 				return -rte_errno;
 			}
 			bits[pp / 32] |= 1 << (pp % 32);
-			mlx5_rxq_release(dev, i);
 		}
 	}
 	for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 09/14] net/mlx5: move Rx queue hairpin info to private data
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
                     ` (7 preceding siblings ...)
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 08/14] net/mlx5: move Rx queue reference count Xueming Li
@ 2021-11-04 12:33   ` Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 10/14] net/mlx5: remove port info from shareable Rx queue Xueming Li
                     ` (5 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Slava Ovsiienko, Matan Azrad

The hairpin info of an Rx queue cannot be shared, so this patch moves
it to the private queue data.
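
A minimal sketch of the lookup pattern after the move (hypothetical,
simplified types): the queue type still comes from the shared ctrl,
while the hairpin configuration is read from the per-queue private
object.

    #include <stddef.h>
    #include <stdio.h>

    enum rxq_type { RXQ_TYPE_STANDARD, RXQ_TYPE_HAIRPIN };

    struct hairpin_conf { unsigned int tx_explicit:1, manual_bind:1; };

    struct rxq_ctrl { enum rxq_type type; }; /* shareable part */

    struct rxq_priv {                        /* private part */
        struct rxq_ctrl *ctrl;
        struct hairpin_conf hairpin_conf;    /* cannot be shared */
    };

    static const struct hairpin_conf *
    rxq_get_hairpin_conf(const struct rxq_priv *rxq)
    {
        /* The type lives in the shared ctrl; the conf does not. */
        if (rxq != NULL && rxq->ctrl->type == RXQ_TYPE_HAIRPIN)
            return &rxq->hairpin_conf;
        return NULL;
    }

    int
    main(void)
    {
        struct rxq_ctrl ctrl = { .type = RXQ_TYPE_HAIRPIN };
        struct rxq_priv rxq = { .ctrl = &ctrl,
                                .hairpin_conf = { .manual_bind = 1 } };
        const struct hairpin_conf *conf = rxq_get_hairpin_conf(&rxq);

        printf("manual_bind = %u\n",
               conf != NULL ? conf->manual_bind : 0);
        return 0;
    }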

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/mlx5_rx.h      |  4 ++--
 drivers/net/mlx5/mlx5_rxq.c     | 13 +++++--------
 drivers/net/mlx5/mlx5_trigger.c | 24 ++++++++++++------------
 3 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index eccfbf1108d..b21918223b8 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -162,8 +162,6 @@ struct mlx5_rxq_ctrl {
 	uint32_t flow_tunnels_n[MLX5_FLOW_TUNNEL]; /* Tunnels counters. */
 	uint32_t wqn; /* WQ number. */
 	uint16_t dump_file_n; /* Number of dump files. */
-	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
-	uint32_t hairpin_status; /* Hairpin binding status. */
 };
 
 /* RX queue private data. */
@@ -173,6 +171,8 @@ struct mlx5_rxq_priv {
 	struct mlx5_rxq_ctrl *ctrl; /* Shared Rx Queue. */
 	LIST_ENTRY(mlx5_rxq_priv) owner_entry; /* Entry in shared rxq_ctrl. */
 	struct mlx5_priv *priv; /* Back pointer to private data. */
+	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
+	uint32_t hairpin_status; /* Hairpin binding status. */
 };
 
 /* mlx5_rxq.c */
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 8071ddbd61c..7b637fda643 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1695,8 +1695,8 @@ mlx5_rxq_hairpin_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
 	tmpl->rxq.elts_n = log2above(desc);
 	tmpl->rxq.elts = NULL;
 	tmpl->rxq.mr_ctrl.cache_bh = (struct mlx5_mr_btree) { 0 };
-	tmpl->hairpin_conf = *hairpin_conf;
 	tmpl->rxq.idx = idx;
+	rxq->hairpin_conf = *hairpin_conf;
 	mlx5_rxq_ref(dev, idx);
 	LIST_INSERT_HEAD(&priv->rxqsctrl, tmpl, next);
 	return tmpl;
@@ -1913,14 +1913,11 @@ const struct rte_eth_hairpin_conf *
 mlx5_rxq_get_hairpin_conf(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_ctrl *rxq_ctrl = NULL;
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
 
-	if (idx < priv->rxqs_n && (*priv->rxqs)[idx]) {
-		rxq_ctrl = container_of((*priv->rxqs)[idx],
-					struct mlx5_rxq_ctrl,
-					rxq);
-		if (rxq_ctrl->type == MLX5_RXQ_TYPE_HAIRPIN)
-			return &rxq_ctrl->hairpin_conf;
+	if (idx < priv->rxqs_n && rxq != NULL) {
+		if (rxq->ctrl->type == MLX5_RXQ_TYPE_HAIRPIN)
+			return &rxq->hairpin_conf;
 	}
 	return NULL;
 }
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index e5d74d275f8..a124f74fcda 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -324,7 +324,7 @@ mlx5_hairpin_auto_bind(struct rte_eth_dev *dev)
 		}
 		rxq_ctrl = rxq->ctrl;
 		if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN ||
-		    rxq_ctrl->hairpin_conf.peers[0].queue != i) {
+		    rxq->hairpin_conf.peers[0].queue != i) {
 			rte_errno = ENOMEM;
 			DRV_LOG(ERR, "port %u Tx queue %d can't be binded to "
 				"Rx queue %d", dev->data->port_id,
@@ -354,7 +354,7 @@ mlx5_hairpin_auto_bind(struct rte_eth_dev *dev)
 		if (ret)
 			goto error;
 		/* Qs with auto-bind will be destroyed directly. */
-		rxq_ctrl->hairpin_status = 1;
+		rxq->hairpin_status = 1;
 		txq_ctrl->hairpin_status = 1;
 		mlx5_txq_release(dev, i);
 	}
@@ -457,9 +457,9 @@ mlx5_hairpin_queue_peer_update(struct rte_eth_dev *dev, uint16_t peer_queue,
 		}
 		peer_info->qp_id = rxq_ctrl->obj->rq->id;
 		peer_info->vhca_id = priv->config.hca_attr.vhca_id;
-		peer_info->peer_q = rxq_ctrl->hairpin_conf.peers[0].queue;
-		peer_info->tx_explicit = rxq_ctrl->hairpin_conf.tx_explicit;
-		peer_info->manual_bind = rxq_ctrl->hairpin_conf.manual_bind;
+		peer_info->peer_q = rxq->hairpin_conf.peers[0].queue;
+		peer_info->tx_explicit = rxq->hairpin_conf.tx_explicit;
+		peer_info->manual_bind = rxq->hairpin_conf.manual_bind;
 	}
 	return 0;
 }
@@ -581,20 +581,20 @@ mlx5_hairpin_queue_peer_bind(struct rte_eth_dev *dev, uint16_t cur_queue,
 				dev->data->port_id, cur_queue);
 			return -rte_errno;
 		}
-		if (rxq_ctrl->hairpin_status != 0) {
+		if (rxq->hairpin_status != 0) {
 			DRV_LOG(DEBUG, "port %u Rx queue %d is already bound",
 				dev->data->port_id, cur_queue);
 			return 0;
 		}
 		if (peer_info->tx_explicit !=
-		    rxq_ctrl->hairpin_conf.tx_explicit) {
+		    rxq->hairpin_conf.tx_explicit) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u Rx queue %d and peer Tx rule mode"
 				" mismatch", dev->data->port_id, cur_queue);
 			return -rte_errno;
 		}
 		if (peer_info->manual_bind !=
-		    rxq_ctrl->hairpin_conf.manual_bind) {
+		    rxq->hairpin_conf.manual_bind) {
 			rte_errno = EINVAL;
 			DRV_LOG(ERR, "port %u Rx queue %d and peer binding mode"
 				" mismatch", dev->data->port_id, cur_queue);
@@ -606,7 +606,7 @@ mlx5_hairpin_queue_peer_bind(struct rte_eth_dev *dev, uint16_t cur_queue,
 		rq_attr.hairpin_peer_vhca = peer_info->vhca_id;
 		ret = mlx5_devx_cmd_modify_rq(rxq_ctrl->obj->rq, &rq_attr);
 		if (ret == 0)
-			rxq_ctrl->hairpin_status = 1;
+			rxq->hairpin_status = 1;
 	}
 	return ret;
 }
@@ -688,7 +688,7 @@ mlx5_hairpin_queue_peer_unbind(struct rte_eth_dev *dev, uint16_t cur_queue,
 				dev->data->port_id, cur_queue);
 			return -rte_errno;
 		}
-		if (rxq_ctrl->hairpin_status == 0) {
+		if (rxq->hairpin_status == 0) {
 			DRV_LOG(DEBUG, "port %u Rx queue %d is already unbound",
 				dev->data->port_id, cur_queue);
 			return 0;
@@ -703,7 +703,7 @@ mlx5_hairpin_queue_peer_unbind(struct rte_eth_dev *dev, uint16_t cur_queue,
 		rq_attr.rq_state = MLX5_SQC_STATE_RST;
 		ret = mlx5_devx_cmd_modify_rq(rxq_ctrl->obj->rq, &rq_attr);
 		if (ret == 0)
-			rxq_ctrl->hairpin_status = 0;
+			rxq->hairpin_status = 0;
 	}
 	return ret;
 }
@@ -1041,7 +1041,7 @@ mlx5_hairpin_get_peer_ports(struct rte_eth_dev *dev, uint16_t *peer_ports,
 			rxq_ctrl = rxq->ctrl;
 			if (rxq_ctrl->type != MLX5_RXQ_TYPE_HAIRPIN)
 				continue;
-			pp = rxq_ctrl->hairpin_conf.peers[0].port;
+			pp = rxq->hairpin_conf.peers[0].port;
 			if (pp >= RTE_MAX_ETHPORTS) {
 				rte_errno = ERANGE;
 				DRV_LOG(ERR, "port %hu queue %u peer port "
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 10/14] net/mlx5: remove port info from shareable Rx queue
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
                     ` (8 preceding siblings ...)
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 09/14] net/mlx5: move Rx queue hairpin info to private data Xueming Li
@ 2021-11-04 12:33   ` Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 11/14] net/mlx5: move Rx queue DevX resource Xueming Li
                     ` (4 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Slava Ovsiienko, Matan Azrad

To prepare for shared Rx queues, this patch removes the port info from
the shareable Rx queue control structure.
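
A standalone sketch (an assumed simplification of the patch's
RXQ_PORT()/RXQ_PORT_ID() macros, with hypothetical type names): with
the back pointer gone, per-port data is reached through the first
entry of the shared control's owner list.

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/queue.h>

    struct rxq_priv;

    struct port_priv { uint16_t port_id; };

    struct rxq_ctrl {
        LIST_HEAD(owners_head, rxq_priv) owners; /* owning queues */
    };

    struct rxq_priv {
        struct port_priv *priv;          /* per-port data stays here */
        LIST_ENTRY(rxq_priv) owner_entry;
    };

    /* Any owner can stand in for logging; the first is cheapest. */
    #define RXQ_PORT(ctrl)    (LIST_FIRST(&(ctrl)->owners)->priv)
    #define RXQ_PORT_ID(ctrl) (RXQ_PORT(ctrl)->port_id)

    int
    main(void)
    {
        struct port_priv port = { .port_id = 3 };
        struct rxq_ctrl ctrl;
        struct rxq_priv rxq = { .priv = &port };

        LIST_INIT(&ctrl.owners);
        LIST_INSERT_HEAD(&ctrl.owners, &rxq, owner_entry);
        printf("ctrl belongs to port %u\n", RXQ_PORT_ID(&ctrl));
        return 0;
    }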

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/mlx5_devx.c     |  2 +-
 drivers/net/mlx5/mlx5_rx.c       | 15 +++--------
 drivers/net/mlx5/mlx5_rx.h       |  7 ++++--
 drivers/net/mlx5/mlx5_rxq.c      | 43 ++++++++++++++++++++++----------
 drivers/net/mlx5/mlx5_rxtx_vec.c |  2 +-
 drivers/net/mlx5/mlx5_trigger.c  | 13 +++++-----
 6 files changed, 47 insertions(+), 35 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 443252df05d..8b3651f5034 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -918,7 +918,7 @@ mlx5_rxq_devx_obj_drop_create(struct rte_eth_dev *dev)
 	}
 	rxq->rxq_ctrl = rxq_ctrl;
 	rxq_ctrl->type = MLX5_RXQ_TYPE_STANDARD;
-	rxq_ctrl->priv = priv;
+	rxq_ctrl->sh = priv->sh;
 	rxq_ctrl->obj = rxq;
 	rxq_data = &rxq_ctrl->rxq;
 	/* Create CQ using DevX API. */
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 258a6453144..d41905a2a04 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -118,15 +118,7 @@ int
 mlx5_rx_descriptor_status(void *rx_queue, uint16_t offset)
 {
 	struct mlx5_rxq_data *rxq = rx_queue;
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-			container_of(rxq, struct mlx5_rxq_ctrl, rxq);
-	struct rte_eth_dev *dev = ETH_DEV(rxq_ctrl->priv);
 
-	if (dev->rx_pkt_burst == NULL ||
-	    dev->rx_pkt_burst == removed_rx_burst) {
-		rte_errno = ENOTSUP;
-		return -rte_errno;
-	}
 	if (offset >= (1 << rxq->cqe_n)) {
 		rte_errno = EINVAL;
 		return -rte_errno;
@@ -438,10 +430,10 @@ mlx5_rx_err_handle(struct mlx5_rxq_data *rxq, uint8_t vec)
 		sm.is_wq = 1;
 		sm.queue_id = rxq->idx;
 		sm.state = IBV_WQS_RESET;
-		if (mlx5_queue_state_modify(ETH_DEV(rxq_ctrl->priv), &sm))
+		if (mlx5_queue_state_modify(RXQ_DEV(rxq_ctrl), &sm))
 			return -1;
 		if (rxq_ctrl->dump_file_n <
-		    rxq_ctrl->priv->config.max_dump_files_num) {
+		    RXQ_PORT(rxq_ctrl)->config.max_dump_files_num) {
 			MKSTR(err_str, "Unexpected CQE error syndrome "
 			      "0x%02x CQN = %u RQN = %u wqe_counter = %u"
 			      " rq_ci = %u cq_ci = %u", u.err_cqe->syndrome,
@@ -478,8 +470,7 @@ mlx5_rx_err_handle(struct mlx5_rxq_data *rxq, uint8_t vec)
 			sm.is_wq = 1;
 			sm.queue_id = rxq->idx;
 			sm.state = IBV_WQS_RDY;
-			if (mlx5_queue_state_modify(ETH_DEV(rxq_ctrl->priv),
-						    &sm))
+			if (mlx5_queue_state_modify(RXQ_DEV(rxq_ctrl), &sm))
 				return -1;
 			if (vec) {
 				const uint32_t elts_n =
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index b21918223b8..c04c0c73349 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -22,6 +22,10 @@
 /* Support tunnel matching. */
 #define MLX5_FLOW_TUNNEL 10
 
+#define RXQ_PORT(rxq_ctrl) LIST_FIRST(&(rxq_ctrl)->owners)->priv
+#define RXQ_DEV(rxq_ctrl) ETH_DEV(RXQ_PORT(rxq_ctrl))
+#define RXQ_PORT_ID(rxq_ctrl) PORT_ID(RXQ_PORT(rxq_ctrl))
+
 /* First entry must be NULL for comparison. */
 #define mlx5_mr_btree_len(bt) ((bt)->len - 1)
 
@@ -152,7 +156,6 @@ struct mlx5_rxq_ctrl {
 	LIST_HEAD(priv, mlx5_rxq_priv) owners; /* Owner rxq list. */
 	struct mlx5_rxq_obj *obj; /* Verbs/DevX elements. */
 	struct mlx5_dev_ctx_shared *sh; /* Shared context. */
-	struct mlx5_priv *priv; /* Back pointer to private data. */
 	enum mlx5_rxq_type type; /* Rxq type. */
 	unsigned int socket; /* CPU socket ID for allocations. */
 	uint32_t share_group; /* Group ID of shared RXQ. */
@@ -318,7 +321,7 @@ mlx5_rx_addr2mr(struct mlx5_rxq_data *rxq, uintptr_t addr)
 	 */
 	rxq_ctrl = container_of(rxq, struct mlx5_rxq_ctrl, rxq);
 	mp = mlx5_rxq_mprq_enabled(rxq) ? rxq->mprq_mp : rxq->mp;
-	return mlx5_mr_mempool2mr_bh(&rxq_ctrl->priv->sh->cdev->mr_scache,
+	return mlx5_mr_mempool2mr_bh(&rxq_ctrl->sh->cdev->mr_scache,
 				     mr_ctrl, mp, addr);
 }
 
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 7b637fda643..5a20966e2ca 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -148,8 +148,14 @@ rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
 
 		buf = rte_pktmbuf_alloc(seg->mp);
 		if (buf == NULL) {
-			DRV_LOG(ERR, "port %u empty mbuf pool",
-				PORT_ID(rxq_ctrl->priv));
+			if (rxq_ctrl->share_group == 0)
+				DRV_LOG(ERR, "port %u queue %u empty mbuf pool",
+					RXQ_PORT_ID(rxq_ctrl),
+					rxq_ctrl->rxq.idx);
+			else
+				DRV_LOG(ERR, "share group %u queue %u empty mbuf pool",
+					rxq_ctrl->share_group,
+					rxq_ctrl->share_qid);
 			rte_errno = ENOMEM;
 			goto error;
 		}
@@ -193,11 +199,16 @@ rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
 		for (j = 0; j < MLX5_VPMD_DESCS_PER_LOOP; ++j)
 			(*rxq->elts)[elts_n + j] = &rxq->fake_mbuf;
 	}
-	DRV_LOG(DEBUG,
-		"port %u SPRQ queue %u allocated and configured %u segments"
-		" (max %u packets)",
-		PORT_ID(rxq_ctrl->priv), rxq_ctrl->rxq.idx, elts_n,
-		elts_n / (1 << rxq_ctrl->rxq.sges_n));
+	if (rxq_ctrl->share_group == 0)
+		DRV_LOG(DEBUG,
+			"port %u SPRQ queue %u allocated and configured %u segments (max %u packets)",
+			RXQ_PORT_ID(rxq_ctrl), rxq_ctrl->rxq.idx, elts_n,
+			elts_n / (1 << rxq_ctrl->rxq.sges_n));
+	else
+		DRV_LOG(DEBUG,
+			"share group %u SPRQ queue %u allocated and configured %u segments (max %u packets)",
+			rxq_ctrl->share_group, rxq_ctrl->share_qid, elts_n,
+			elts_n / (1 << rxq_ctrl->rxq.sges_n));
 	return 0;
 error:
 	err = rte_errno; /* Save rte_errno before cleanup. */
@@ -207,8 +218,12 @@ rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
 			rte_pktmbuf_free_seg((*rxq_ctrl->rxq.elts)[i]);
 		(*rxq_ctrl->rxq.elts)[i] = NULL;
 	}
-	DRV_LOG(DEBUG, "port %u SPRQ queue %u failed, freed everything",
-		PORT_ID(rxq_ctrl->priv), rxq_ctrl->rxq.idx);
+	if (rxq_ctrl->share_group == 0)
+		DRV_LOG(DEBUG, "port %u SPRQ queue %u failed, freed everything",
+			RXQ_PORT_ID(rxq_ctrl), rxq_ctrl->rxq.idx);
+	else
+		DRV_LOG(DEBUG, "share group %u SPRQ queue %u failed, freed everything",
+			rxq_ctrl->share_group, rxq_ctrl->share_qid);
 	rte_errno = err; /* Restore rte_errno. */
 	return -rte_errno;
 }
@@ -284,8 +299,12 @@ rxq_free_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
 	uint16_t used = q_n - (elts_ci - rxq->rq_pi);
 	uint16_t i;
 
-	DRV_LOG(DEBUG, "port %u Rx queue %u freeing %d WRs",
-		PORT_ID(rxq_ctrl->priv), rxq->idx, q_n);
+	if (rxq_ctrl->share_group == 0)
+		DRV_LOG(DEBUG, "port %u Rx queue %u freeing %d WRs",
+			RXQ_PORT_ID(rxq_ctrl), rxq->idx, q_n);
+	else
+		DRV_LOG(DEBUG, "share group %u Rx queue %u freeing %d WRs",
+			rxq_ctrl->share_group, rxq_ctrl->share_qid, q_n);
 	if (rxq->elts == NULL)
 		return;
 	/**
@@ -1630,7 +1649,6 @@ mlx5_rxq_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
 		(!!(dev->data->dev_conf.rxmode.mq_mode & RTE_ETH_MQ_RX_RSS));
 	tmpl->rxq.port_id = dev->data->port_id;
 	tmpl->sh = priv->sh;
-	tmpl->priv = priv;
 	tmpl->rxq.mp = rx_seg[0].mp;
 	tmpl->rxq.elts_n = log2above(desc);
 	tmpl->rxq.rq_repl_thresh =
@@ -1690,7 +1708,6 @@ mlx5_rxq_hairpin_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
 	tmpl->rxq.rss_hash = 0;
 	tmpl->rxq.port_id = dev->data->port_id;
 	tmpl->sh = priv->sh;
-	tmpl->priv = priv;
 	tmpl->rxq.mp = NULL;
 	tmpl->rxq.elts_n = log2above(desc);
 	tmpl->rxq.elts = NULL;
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
index ecd273e00a8..511681841ca 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.c
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
@@ -550,7 +550,7 @@ mlx5_rxq_check_vec_support(struct mlx5_rxq_data *rxq)
 	struct mlx5_rxq_ctrl *ctrl =
 		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
 
-	if (!ctrl->priv->config.rx_vec_en || rxq->sges_n != 0)
+	if (!RXQ_PORT(ctrl)->config.rx_vec_en || rxq->sges_n != 0)
 		return -ENOTSUP;
 	if (rxq->lro)
 		return -ENOTSUP;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index a124f74fcda..caafdf27e8f 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -131,9 +131,11 @@ mlx5_rxq_mempool_register_cb(struct rte_mempool *mp, void *opaque,
  *   0 on success, (-1) on failure and rte_errno is set.
  */
 static int
-mlx5_rxq_mempool_register(struct mlx5_rxq_ctrl *rxq_ctrl)
+mlx5_rxq_mempool_register(struct rte_eth_dev *dev,
+			  struct mlx5_rxq_ctrl *rxq_ctrl)
 {
-	struct mlx5_priv *priv = rxq_ctrl->priv;
+	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_dev_ctx_shared *sh = rxq_ctrl->sh;
 	struct rte_mempool *mp;
 	uint32_t s;
 	int ret = 0;
@@ -148,9 +150,8 @@ mlx5_rxq_mempool_register(struct mlx5_rxq_ctrl *rxq_ctrl)
 	}
 	for (s = 0; s < rxq_ctrl->rxq.rxseg_n; s++) {
 		mp = rxq_ctrl->rxq.rxseg[s].mp;
-		ret = mlx5_mr_mempool_register(&priv->sh->cdev->mr_scache,
-					       priv->sh->cdev->pd, mp,
-					       &priv->mp_id);
+		ret = mlx5_mr_mempool_register(&sh->cdev->mr_scache,
+					       sh->cdev->pd, mp, &priv->mp_id);
 		if (ret < 0 && rte_errno != EEXIST)
 			return ret;
 		rte_mempool_mem_iter(mp, mlx5_rxq_mempool_register_cb,
@@ -213,7 +214,7 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 			 * the implicit registration is enabled or not,
 			 * Rx mempool destruction is tracked to free MRs.
 			 */
-			if (mlx5_rxq_mempool_register(rxq_ctrl) < 0)
+			if (mlx5_rxq_mempool_register(dev, rxq_ctrl) < 0)
 				goto error;
 			ret = rxq_alloc_elts(rxq_ctrl);
 			if (ret)
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 11/14] net/mlx5: move Rx queue DevX resource
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
                     ` (9 preceding siblings ...)
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 10/14] net/mlx5: remove port info from shareable Rx queue Xueming Li
@ 2021-11-04 12:33   ` Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 12/14] net/mlx5: remove Rx queue data list from device Xueming Li
                     ` (3 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev
  Cc: xuemingl, Lior Margalit, Slava Ovsiienko, Matan Azrad, Anatoly Burakov

To support shared Rx queues, this patch moves the DevX RQ, which is a
per-queue resource, to the Rx queue private data.
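
A rough sketch of the intended resource layout (hypothetical,
simplified types; the real handles are DevX objects): shareable
resources such as the completion queue can stay in the control
structure, while each owning queue keeps its own RQ handle in the
private data.

    #include <stdint.h>
    #include <stdio.h>

    struct devx_obj { uint32_t id; };  /* stand-in for a DevX handle */

    struct rxq_ctrl {
        struct devx_obj *cq; /* shareable: one CQ for the group */
    };

    struct rxq_priv {
        struct rxq_ctrl *ctrl;
        struct devx_obj devx_rq; /* per-queue: each port's own RQ */
    };

    int
    main(void)
    {
        struct devx_obj cq = { .id = 100 };
        struct rxq_ctrl ctrl = { .cq = &cq };
        struct rxq_priv q0 = { .ctrl = &ctrl, .devx_rq = { .id = 1 } };
        struct rxq_priv q1 = { .ctrl = &ctrl, .devx_rq = { .id = 2 } };

        printf("q0: rq %u on cq %u\n", q0.devx_rq.id, q0.ctrl->cq->id);
        printf("q1: rq %u on cq %u\n", q1.devx_rq.id, q1.ctrl->cq->id);
        return 0;
    }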

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/linux/mlx5_verbs.c | 154 +++++++++++--------
 drivers/net/mlx5/mlx5.h             |  11 +-
 drivers/net/mlx5/mlx5_devx.c        | 227 +++++++++++++---------------
 drivers/net/mlx5/mlx5_rx.h          |   1 +
 drivers/net/mlx5/mlx5_rxq.c         |  44 +++---
 drivers/net/mlx5/mlx5_rxtx.c        |   6 +-
 drivers/net/mlx5/mlx5_trigger.c     |   2 +-
 drivers/net/mlx5/mlx5_vlan.c        |  16 +-
 8 files changed, 240 insertions(+), 221 deletions(-)

diff --git a/drivers/net/mlx5/linux/mlx5_verbs.c b/drivers/net/mlx5/linux/mlx5_verbs.c
index 4779b37aa65..5d4ae3ea752 100644
--- a/drivers/net/mlx5/linux/mlx5_verbs.c
+++ b/drivers/net/mlx5/linux/mlx5_verbs.c
@@ -29,13 +29,13 @@
 /**
  * Modify Rx WQ vlan stripping offload
  *
- * @param rxq_obj
- *   Rx queue object.
+ * @param rxq
+ *   Rx queue.
  *
  * @return 0 on success, non-0 otherwise
  */
 static int
-mlx5_rxq_obj_modify_wq_vlan_strip(struct mlx5_rxq_obj *rxq_obj, int on)
+mlx5_rxq_obj_modify_wq_vlan_strip(struct mlx5_rxq_priv *rxq, int on)
 {
 	uint16_t vlan_offloads =
 		(on ? IBV_WQ_FLAGS_CVLAN_STRIPPING : 0) |
@@ -47,14 +47,14 @@ mlx5_rxq_obj_modify_wq_vlan_strip(struct mlx5_rxq_obj *rxq_obj, int on)
 		.flags = vlan_offloads,
 	};
 
-	return mlx5_glue->modify_wq(rxq_obj->wq, &mod);
+	return mlx5_glue->modify_wq(rxq->ctrl->obj->wq, &mod);
 }
 
 /**
  * Modifies the attributes for the specified WQ.
  *
- * @param rxq_obj
- *   Verbs Rx queue object.
+ * @param rxq
+ *   Verbs Rx queue.
  * @param type
  *   Type of change queue state.
  *
@@ -62,14 +62,14 @@ mlx5_rxq_obj_modify_wq_vlan_strip(struct mlx5_rxq_obj *rxq_obj, int on)
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_ibv_modify_wq(struct mlx5_rxq_obj *rxq_obj, uint8_t type)
+mlx5_ibv_modify_wq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
 	struct ibv_wq_attr mod = {
 		.attr_mask = IBV_WQ_ATTR_STATE,
 		.wq_state = (enum ibv_wq_state)type,
 	};
 
-	return mlx5_glue->modify_wq(rxq_obj->wq, &mod);
+	return mlx5_glue->modify_wq(rxq->ctrl->obj->wq, &mod);
 }
 
 /**
@@ -139,21 +139,18 @@ mlx5_ibv_modify_qp(struct mlx5_txq_obj *obj, enum mlx5_txq_modify_type type,
 /**
  * Create a CQ Verbs object.
  *
- * @param dev
- *   Pointer to Ethernet device.
- * @param idx
- *   Queue index in DPDK Rx queue array.
+ * @param rxq
+ *   Pointer to Rx queue.
  *
  * @return
  *   The Verbs CQ object initialized, NULL otherwise and rte_errno is set.
  */
 static struct ibv_cq *
-mlx5_rxq_ibv_cq_create(struct rte_eth_dev *dev, uint16_t idx)
+mlx5_rxq_ibv_cq_create(struct mlx5_rxq_priv *rxq)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq_data = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_priv *priv = rxq->priv;
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_data *rxq_data = &rxq_ctrl->rxq;
 	struct mlx5_rxq_obj *rxq_obj = rxq_ctrl->obj;
 	unsigned int cqe_n = mlx5_rxq_cqe_num(rxq_data);
 	struct {
@@ -199,7 +196,7 @@ mlx5_rxq_ibv_cq_create(struct rte_eth_dev *dev, uint16_t idx)
 		DRV_LOG(DEBUG,
 			"Port %u Rx CQE compression is disabled for HW"
 			" timestamp.",
-			dev->data->port_id);
+			priv->dev_data->port_id);
 	}
 #ifdef HAVE_IBV_MLX5_MOD_CQE_128B_PAD
 	if (RTE_CACHE_LINE_SIZE == 128) {
@@ -216,21 +213,18 @@ mlx5_rxq_ibv_cq_create(struct rte_eth_dev *dev, uint16_t idx)
 /**
  * Create a WQ Verbs object.
  *
- * @param dev
- *   Pointer to Ethernet device.
- * @param idx
- *   Queue index in DPDK Rx queue array.
+ * @param rxq
+ *   Pointer to Rx queue.
  *
  * @return
  *   The Verbs WQ object initialized, NULL otherwise and rte_errno is set.
  */
 static struct ibv_wq *
-mlx5_rxq_ibv_wq_create(struct rte_eth_dev *dev, uint16_t idx)
+mlx5_rxq_ibv_wq_create(struct mlx5_rxq_priv *rxq)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq_data = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_priv *priv = rxq->priv;
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_data *rxq_data = &rxq_ctrl->rxq;
 	struct mlx5_rxq_obj *rxq_obj = rxq_ctrl->obj;
 	unsigned int wqe_n = 1 << rxq_data->elts_n;
 	struct {
@@ -297,7 +291,7 @@ mlx5_rxq_ibv_wq_create(struct rte_eth_dev *dev, uint16_t idx)
 			DRV_LOG(ERR,
 				"Port %u Rx queue %u requested %u*%u but got"
 				" %u*%u WRs*SGEs.",
-				dev->data->port_id, idx,
+				priv->dev_data->port_id, rxq->idx,
 				wqe_n >> rxq_data->sges_n,
 				(1 << rxq_data->sges_n),
 				wq_attr.ibv.max_wr, wq_attr.ibv.max_sge);
@@ -312,21 +306,20 @@ mlx5_rxq_ibv_wq_create(struct rte_eth_dev *dev, uint16_t idx)
 /**
  * Create the Rx queue Verbs object.
  *
- * @param dev
- *   Pointer to Ethernet device.
- * @param idx
- *   Queue index in DPDK Rx queue array.
+ * @param rxq
+ *   Pointer to Rx queue.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_rxq_ibv_obj_new(struct rte_eth_dev *dev, uint16_t idx)
+mlx5_rxq_ibv_obj_new(struct mlx5_rxq_priv *rxq)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq_data = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
+	uint16_t idx = rxq->idx;
+	struct mlx5_priv *priv = rxq->priv;
+	uint16_t port_id = priv->dev_data->port_id;
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_data *rxq_data = &rxq_ctrl->rxq;
 	struct mlx5_rxq_obj *tmpl = rxq_ctrl->obj;
 	struct mlx5dv_cq cq_info;
 	struct mlx5dv_rwq rwq;
@@ -341,17 +334,17 @@ mlx5_rxq_ibv_obj_new(struct rte_eth_dev *dev, uint16_t idx)
 			mlx5_glue->create_comp_channel(priv->sh->cdev->ctx);
 		if (!tmpl->ibv_channel) {
 			DRV_LOG(ERR, "Port %u: comp channel creation failure.",
-				dev->data->port_id);
+				port_id);
 			rte_errno = ENOMEM;
 			goto error;
 		}
 		tmpl->fd = ((struct ibv_comp_channel *)(tmpl->ibv_channel))->fd;
 	}
 	/* Create CQ using Verbs API. */
-	tmpl->ibv_cq = mlx5_rxq_ibv_cq_create(dev, idx);
+	tmpl->ibv_cq = mlx5_rxq_ibv_cq_create(rxq);
 	if (!tmpl->ibv_cq) {
 		DRV_LOG(ERR, "Port %u Rx queue %u CQ creation failure.",
-			dev->data->port_id, idx);
+			port_id, idx);
 		rte_errno = ENOMEM;
 		goto error;
 	}
@@ -366,7 +359,7 @@ mlx5_rxq_ibv_obj_new(struct rte_eth_dev *dev, uint16_t idx)
 		DRV_LOG(ERR,
 			"Port %u wrong MLX5_CQE_SIZE environment "
 			"variable value: it should be set to %u.",
-			dev->data->port_id, RTE_CACHE_LINE_SIZE);
+			port_id, RTE_CACHE_LINE_SIZE);
 		rte_errno = EINVAL;
 		goto error;
 	}
@@ -377,19 +370,19 @@ mlx5_rxq_ibv_obj_new(struct rte_eth_dev *dev, uint16_t idx)
 	rxq_data->cq_uar = cq_info.cq_uar;
 	rxq_data->cqn = cq_info.cqn;
 	/* Create WQ (RQ) using Verbs API. */
-	tmpl->wq = mlx5_rxq_ibv_wq_create(dev, idx);
+	tmpl->wq = mlx5_rxq_ibv_wq_create(rxq);
 	if (!tmpl->wq) {
 		DRV_LOG(ERR, "Port %u Rx queue %u WQ creation failure.",
-			dev->data->port_id, idx);
+			port_id, idx);
 		rte_errno = ENOMEM;
 		goto error;
 	}
 	/* Change queue state to ready. */
-	ret = mlx5_ibv_modify_wq(tmpl, IBV_WQS_RDY);
+	ret = mlx5_ibv_modify_wq(rxq, IBV_WQS_RDY);
 	if (ret) {
 		DRV_LOG(ERR,
 			"Port %u Rx queue %u WQ state to IBV_WQS_RDY failed.",
-			dev->data->port_id, idx);
+			port_id, idx);
 		rte_errno = ret;
 		goto error;
 	}
@@ -405,7 +398,7 @@ mlx5_rxq_ibv_obj_new(struct rte_eth_dev *dev, uint16_t idx)
 	rxq_data->cq_arm_sn = 0;
 	mlx5_rxq_initialize(rxq_data);
 	rxq_data->cq_ci = 0;
-	dev->data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_STARTED;
+	priv->dev_data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_STARTED;
 	rxq_ctrl->wqn = ((struct ibv_wq *)(tmpl->wq))->wq_num;
 	return 0;
 error:
@@ -423,12 +416,14 @@ mlx5_rxq_ibv_obj_new(struct rte_eth_dev *dev, uint16_t idx)
 /**
  * Release an Rx verbs queue object.
  *
- * @param rxq_obj
- *   Verbs Rx queue object.
+ * @param rxq
+ *   Pointer to Rx queue.
  */
 static void
-mlx5_rxq_ibv_obj_release(struct mlx5_rxq_obj *rxq_obj)
+mlx5_rxq_ibv_obj_release(struct mlx5_rxq_priv *rxq)
 {
+	struct mlx5_rxq_obj *rxq_obj = rxq->ctrl->obj;
+
 	MLX5_ASSERT(rxq_obj);
 	MLX5_ASSERT(rxq_obj->wq);
 	MLX5_ASSERT(rxq_obj->ibv_cq);
@@ -652,12 +647,24 @@ static void
 mlx5_rxq_ibv_obj_drop_release(struct rte_eth_dev *dev)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_obj *rxq = priv->drop_queue.rxq;
+	struct mlx5_rxq_priv *rxq = priv->drop_queue.rxq;
+	struct mlx5_rxq_obj *rxq_obj;
 
-	if (rxq->wq)
-		claim_zero(mlx5_glue->destroy_wq(rxq->wq));
-	if (rxq->ibv_cq)
-		claim_zero(mlx5_glue->destroy_cq(rxq->ibv_cq));
+	if (rxq == NULL)
+		return;
+	if (rxq->ctrl == NULL)
+		goto free_priv;
+	rxq_obj = rxq->ctrl->obj;
+	if (rxq_obj == NULL)
+		goto free_ctrl;
+	if (rxq_obj->wq)
+		claim_zero(mlx5_glue->destroy_wq(rxq_obj->wq));
+	if (rxq_obj->ibv_cq)
+		claim_zero(mlx5_glue->destroy_cq(rxq_obj->ibv_cq));
+	mlx5_free(rxq_obj);
+free_ctrl:
+	mlx5_free(rxq->ctrl);
+free_priv:
 	mlx5_free(rxq);
 	priv->drop_queue.rxq = NULL;
 }
@@ -676,39 +683,58 @@ mlx5_rxq_ibv_obj_drop_create(struct rte_eth_dev *dev)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct ibv_context *ctx = priv->sh->cdev->ctx;
-	struct mlx5_rxq_obj *rxq = priv->drop_queue.rxq;
+	struct mlx5_rxq_priv *rxq = priv->drop_queue.rxq;
+	struct mlx5_rxq_ctrl *rxq_ctrl = NULL;
+	struct mlx5_rxq_obj *rxq_obj = NULL;
 
-	if (rxq)
+	if (rxq != NULL)
 		return 0;
 	rxq = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rxq), 0, SOCKET_ID_ANY);
-	if (!rxq) {
+	if (rxq == NULL) {
 		DRV_LOG(DEBUG, "Port %u cannot allocate drop Rx queue memory.",
 		      dev->data->port_id);
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
 	priv->drop_queue.rxq = rxq;
-	rxq->ibv_cq = mlx5_glue->create_cq(ctx, 1, NULL, NULL, 0);
-	if (!rxq->ibv_cq) {
+	rxq_ctrl = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rxq_ctrl), 0,
+			       SOCKET_ID_ANY);
+	if (rxq_ctrl == NULL) {
+		DRV_LOG(DEBUG, "Port %u cannot allocate drop Rx queue control memory.",
+		      dev->data->port_id);
+		rte_errno = ENOMEM;
+		goto error;
+	}
+	rxq->ctrl = rxq_ctrl;
+	rxq_obj = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rxq_obj), 0,
+			      SOCKET_ID_ANY);
+	if (rxq_obj == NULL) {
+		DRV_LOG(DEBUG, "Port %u cannot allocate drop Rx queue memory.",
+		      dev->data->port_id);
+		rte_errno = ENOMEM;
+		goto error;
+	}
+	rxq_ctrl->obj = rxq_obj;
+	rxq_obj->ibv_cq = mlx5_glue->create_cq(ctx, 1, NULL, NULL, 0);
+	if (!rxq_obj->ibv_cq) {
 		DRV_LOG(DEBUG, "Port %u cannot allocate CQ for drop queue.",
 		      dev->data->port_id);
 		rte_errno = errno;
 		goto error;
 	}
-	rxq->wq = mlx5_glue->create_wq(ctx, &(struct ibv_wq_init_attr){
+	rxq_obj->wq = mlx5_glue->create_wq(ctx, &(struct ibv_wq_init_attr){
 						    .wq_type = IBV_WQT_RQ,
 						    .max_wr = 1,
 						    .max_sge = 1,
 						    .pd = priv->sh->cdev->pd,
-						    .cq = rxq->ibv_cq,
+						    .cq = rxq_obj->ibv_cq,
 					      });
-	if (!rxq->wq) {
+	if (!rxq_obj->wq) {
 		DRV_LOG(DEBUG, "Port %u cannot allocate WQ for drop queue.",
 		      dev->data->port_id);
 		rte_errno = errno;
 		goto error;
 	}
-	priv->drop_queue.rxq = rxq;
 	return 0;
 error:
 	mlx5_rxq_ibv_obj_drop_release(dev);
@@ -737,7 +763,7 @@ mlx5_ibv_drop_action_create(struct rte_eth_dev *dev)
 	ret = mlx5_rxq_ibv_obj_drop_create(dev);
 	if (ret < 0)
 		goto error;
-	rxq = priv->drop_queue.rxq;
+	rxq = priv->drop_queue.rxq->ctrl->obj;
 	ind_tbl = mlx5_glue->create_rwq_ind_table
 				(priv->sh->cdev->ctx,
 				 &(struct ibv_rwq_ind_table_init_attr){
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 4e99fe7d068..967d92b4ad6 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -300,7 +300,7 @@ struct mlx5_vf_vlan {
 /* Flow drop context necessary due to Verbs API. */
 struct mlx5_drop {
 	struct mlx5_hrxq *hrxq; /* Hash Rx queue queue. */
-	struct mlx5_rxq_obj *rxq; /* Rx queue object. */
+	struct mlx5_rxq_priv *rxq; /* Rx queue. */
 };
 
 /* Loopback dummy queue resources required due to Verbs API. */
@@ -1267,7 +1267,6 @@ struct mlx5_rxq_obj {
 		};
 		struct mlx5_devx_obj *rq; /* DevX RQ object for hairpin. */
 		struct {
-			struct mlx5_devx_rq rq_obj; /* DevX RQ object. */
 			struct mlx5_devx_cq cq_obj; /* DevX CQ object. */
 			void *devx_channel;
 		};
@@ -1349,11 +1348,11 @@ struct mlx5_rxq_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
-	int (*rxq_obj_modify_vlan_strip)(struct mlx5_rxq_obj *rxq_obj, int on);
-	int (*rxq_obj_new)(struct rte_eth_dev *dev, uint16_t idx);
+	int (*rxq_obj_modify_vlan_strip)(struct mlx5_rxq_priv *rxq, int on);
+	int (*rxq_obj_new)(struct mlx5_rxq_priv *rxq);
 	int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
-	int (*rxq_obj_modify)(struct mlx5_rxq_obj *rxq_obj, uint8_t type);
-	void (*rxq_obj_release)(struct mlx5_rxq_obj *rxq_obj);
+	int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
+	void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
 	int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 			     struct mlx5_ind_table_obj *ind_tbl);
 	int (*ind_table_modify)(struct rte_eth_dev *dev,
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 8b3651f5034..b90a5d82458 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -30,14 +30,16 @@
 /**
  * Modify RQ vlan stripping offload
  *
- * @param rxq_obj
- *   Rx queue object.
+ * @param rxq
+ *   Rx queue.
+ * @param on
+ *   Enable/disable VLAN stripping.
  *
  * @return
  *   0 on success, non-0 otherwise
  */
 static int
-mlx5_rxq_obj_modify_rq_vlan_strip(struct mlx5_rxq_obj *rxq_obj, int on)
+mlx5_rxq_obj_modify_rq_vlan_strip(struct mlx5_rxq_priv *rxq, int on)
 {
 	struct mlx5_devx_modify_rq_attr rq_attr;
 
@@ -46,14 +48,14 @@ mlx5_rxq_obj_modify_rq_vlan_strip(struct mlx5_rxq_obj *rxq_obj, int on)
 	rq_attr.state = MLX5_RQC_STATE_RDY;
 	rq_attr.vsd = (on ? 0 : 1);
 	rq_attr.modify_bitmask = MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_VSD;
-	return mlx5_devx_cmd_modify_rq(rxq_obj->rq_obj.rq, &rq_attr);
+	return mlx5_devx_cmd_modify_rq(rxq->devx_rq.rq, &rq_attr);
 }
 
 /**
  * Modify RQ using DevX API.
  *
- * @param rxq_obj
- *   DevX Rx queue object.
+ * @param rxq
+ *   DevX rx queue.
  * @param type
  *   Type of change queue state.
  *
@@ -61,7 +63,7 @@ mlx5_rxq_obj_modify_rq_vlan_strip(struct mlx5_rxq_obj *rxq_obj, int on)
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_devx_modify_rq(struct mlx5_rxq_obj *rxq_obj, uint8_t type)
+mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
 	struct mlx5_devx_modify_rq_attr rq_attr;
 
@@ -86,7 +88,7 @@ mlx5_devx_modify_rq(struct mlx5_rxq_obj *rxq_obj, uint8_t type)
 	default:
 		break;
 	}
-	return mlx5_devx_cmd_modify_rq(rxq_obj->rq_obj.rq, &rq_attr);
+	return mlx5_devx_cmd_modify_rq(rxq->devx_rq.rq, &rq_attr);
 }
 
 /**
@@ -145,42 +147,34 @@ mlx5_txq_devx_modify(struct mlx5_txq_obj *obj, enum mlx5_txq_modify_type type,
 	return 0;
 }
 
-/**
- * Destroy the Rx queue DevX object.
- *
- * @param rxq_obj
- *   Rxq object to destroy.
- */
-static void
-mlx5_rxq_release_devx_resources(struct mlx5_rxq_obj *rxq_obj)
-{
-	mlx5_devx_rq_destroy(&rxq_obj->rq_obj);
-	memset(&rxq_obj->rq_obj, 0, sizeof(rxq_obj->rq_obj));
-	mlx5_devx_cq_destroy(&rxq_obj->cq_obj);
-	memset(&rxq_obj->cq_obj, 0, sizeof(rxq_obj->cq_obj));
-}
-
 /**
  * Release an Rx DevX queue object.
  *
- * @param rxq_obj
- *   DevX Rx queue object.
+ * @param rxq
+ *   DevX Rx queue.
  */
 static void
-mlx5_rxq_devx_obj_release(struct mlx5_rxq_obj *rxq_obj)
+mlx5_rxq_devx_obj_release(struct mlx5_rxq_priv *rxq)
 {
-	MLX5_ASSERT(rxq_obj);
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_obj *rxq_obj = rxq_ctrl->obj;
+
+	MLX5_ASSERT(rxq != NULL);
+	MLX5_ASSERT(rxq_ctrl != NULL);
 	if (rxq_obj->rxq_ctrl->type == MLX5_RXQ_TYPE_HAIRPIN) {
 		MLX5_ASSERT(rxq_obj->rq);
-		mlx5_devx_modify_rq(rxq_obj, MLX5_RXQ_MOD_RDY2RST);
+		mlx5_devx_modify_rq(rxq, MLX5_RXQ_MOD_RDY2RST);
 		claim_zero(mlx5_devx_cmd_destroy(rxq_obj->rq));
 	} else {
-		MLX5_ASSERT(rxq_obj->cq_obj.cq);
-		MLX5_ASSERT(rxq_obj->rq_obj.rq);
-		mlx5_rxq_release_devx_resources(rxq_obj);
-		if (rxq_obj->devx_channel)
+		mlx5_devx_rq_destroy(&rxq->devx_rq);
+		memset(&rxq->devx_rq, 0, sizeof(rxq->devx_rq));
+		mlx5_devx_cq_destroy(&rxq_obj->cq_obj);
+		memset(&rxq_obj->cq_obj, 0, sizeof(rxq_obj->cq_obj));
+		if (rxq_obj->devx_channel) {
 			mlx5_os_devx_destroy_event_channel
 							(rxq_obj->devx_channel);
+			rxq_obj->devx_channel = NULL;
+		}
 	}
 }
 
@@ -224,22 +218,19 @@ mlx5_rx_devx_get_event(struct mlx5_rxq_obj *rxq_obj)
 /**
  * Create a RQ object using DevX.
  *
- * @param dev
- *   Pointer to Ethernet device.
- * @param rxq_data
- *   RX queue data.
+ * @param rxq
+ *   Pointer to Rx queue.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_rxq_create_devx_rq_resources(struct rte_eth_dev *dev,
-				  struct mlx5_rxq_data *rxq_data)
+mlx5_rxq_create_devx_rq_resources(struct mlx5_rxq_priv *rxq)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_priv *priv = rxq->priv;
 	struct mlx5_common_device *cdev = priv->sh->cdev;
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
 	struct mlx5_devx_create_rq_attr rq_attr = { 0 };
 	uint16_t log_desc_n = rxq_data->elts_n - rxq_data->sges_n;
 	uint32_t wqe_size, log_wqe_size;
@@ -281,31 +272,29 @@ mlx5_rxq_create_devx_rq_resources(struct rte_eth_dev *dev,
 	rq_attr.wq_attr.pd = cdev->pdn;
 	rq_attr.counter_set_id = priv->counter_set_id;
 	/* Create RQ using DevX API. */
-	return mlx5_devx_rq_create(cdev->ctx, &rxq_ctrl->obj->rq_obj, wqe_size,
+	return mlx5_devx_rq_create(cdev->ctx, &rxq->devx_rq, wqe_size,
 				   log_desc_n, &rq_attr, rxq_ctrl->socket);
 }
 
 /**
  * Create a DevX CQ object for an Rx queue.
  *
- * @param dev
- *   Pointer to Ethernet device.
- * @param rxq_data
- *   RX queue data.
+ * @param rxq
+ *   Pointer to Rx queue.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_rxq_create_devx_cq_resources(struct rte_eth_dev *dev,
-				  struct mlx5_rxq_data *rxq_data)
+mlx5_rxq_create_devx_cq_resources(struct mlx5_rxq_priv *rxq)
 {
 	struct mlx5_devx_cq *cq_obj = 0;
 	struct mlx5_devx_cq_attr cq_attr = { 0 };
-	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_priv *priv = rxq->priv;
 	struct mlx5_dev_ctx_shared *sh = priv->sh;
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
+	uint16_t port_id = priv->dev_data->port_id;
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_data *rxq_data = &rxq_ctrl->rxq;
 	unsigned int cqe_n = mlx5_rxq_cqe_num(rxq_data);
 	uint32_t log_cqe_n;
 	uint16_t event_nums[1] = { 0 };
@@ -346,7 +335,7 @@ mlx5_rxq_create_devx_cq_resources(struct rte_eth_dev *dev,
 		}
 		DRV_LOG(DEBUG,
 			"Port %u Rx CQE compression is enabled, format %d.",
-			dev->data->port_id, priv->config.cqe_comp_fmt);
+			port_id, priv->config.cqe_comp_fmt);
 		/*
 		 * For vectorized Rx, it must not be doubled in order to
 		 * make cq_ci and rq_ci aligned.
@@ -355,13 +344,12 @@ mlx5_rxq_create_devx_cq_resources(struct rte_eth_dev *dev,
 			cqe_n *= 2;
 	} else if (priv->config.cqe_comp && rxq_data->hw_timestamp) {
 		DRV_LOG(DEBUG,
-			"Port %u Rx CQE compression is disabled for HW"
-			" timestamp.",
-			dev->data->port_id);
+			"Port %u Rx CQE compression is disabled for HW timestamp.",
+			port_id);
 	} else if (priv->config.cqe_comp && rxq_data->lro) {
 		DRV_LOG(DEBUG,
 			"Port %u Rx CQE compression is disabled for LRO.",
-			dev->data->port_id);
+			port_id);
 	}
 	cq_attr.uar_page_id = mlx5_os_get_devx_uar_page_id(sh->devx_rx_uar);
 	log_cqe_n = log2above(cqe_n);
@@ -399,27 +387,23 @@ mlx5_rxq_create_devx_cq_resources(struct rte_eth_dev *dev,
 /**
  * Create the Rx hairpin queue object.
  *
- * @param dev
- *   Pointer to Ethernet device.
- * @param idx
- *   Queue index in DPDK Rx queue array.
+ * @param rxq
+ *   Pointer to Rx queue.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_rxq_obj_hairpin_new(struct rte_eth_dev *dev, uint16_t idx)
+mlx5_rxq_obj_hairpin_new(struct mlx5_rxq_priv *rxq)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq_data = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
+	uint16_t idx = rxq->idx;
+	struct mlx5_priv *priv = rxq->priv;
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
 	struct mlx5_devx_create_rq_attr attr = { 0 };
 	struct mlx5_rxq_obj *tmpl = rxq_ctrl->obj;
 	uint32_t max_wq_data;
 
-	MLX5_ASSERT(rxq_data);
-	MLX5_ASSERT(tmpl);
+	MLX5_ASSERT(rxq != NULL && rxq->ctrl != NULL && tmpl != NULL);
 	tmpl->rxq_ctrl = rxq_ctrl;
 	attr.hairpin = 1;
 	max_wq_data = priv->config.hca_attr.log_max_hairpin_wq_data_sz;
@@ -448,39 +432,36 @@ mlx5_rxq_obj_hairpin_new(struct rte_eth_dev *dev, uint16_t idx)
 	if (!tmpl->rq) {
 		DRV_LOG(ERR,
 			"Port %u Rx hairpin queue %u can't create rq object.",
-			dev->data->port_id, idx);
+			priv->dev_data->port_id, idx);
 		rte_errno = errno;
 		return -rte_errno;
 	}
-	dev->data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_HAIRPIN;
+	priv->dev_data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_HAIRPIN;
 	return 0;
 }
 
 /**
  * Create the Rx queue DevX object.
  *
- * @param dev
- *   Pointer to Ethernet device.
- * @param idx
- *   Queue index in DPDK Rx queue array.
+ * @param rxq
+ *   Pointer to Rx queue.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_rxq_devx_obj_new(struct rte_eth_dev *dev, uint16_t idx)
+mlx5_rxq_devx_obj_new(struct mlx5_rxq_priv *rxq)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq_data = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_priv *priv = rxq->priv;
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_data *rxq_data = &rxq_ctrl->rxq;
 	struct mlx5_rxq_obj *tmpl = rxq_ctrl->obj;
 	int ret = 0;
 
 	MLX5_ASSERT(rxq_data);
 	MLX5_ASSERT(tmpl);
 	if (rxq_ctrl->type == MLX5_RXQ_TYPE_HAIRPIN)
-		return mlx5_rxq_obj_hairpin_new(dev, idx);
+		return mlx5_rxq_obj_hairpin_new(rxq);
 	tmpl->rxq_ctrl = rxq_ctrl;
 	if (rxq_ctrl->irq) {
 		int devx_ev_flag =
@@ -498,34 +479,32 @@ mlx5_rxq_devx_obj_new(struct rte_eth_dev *dev, uint16_t idx)
 		tmpl->fd = mlx5_os_get_devx_channel_fd(tmpl->devx_channel);
 	}
 	/* Create CQ using DevX API. */
-	ret = mlx5_rxq_create_devx_cq_resources(dev, rxq_data);
+	ret = mlx5_rxq_create_devx_cq_resources(rxq);
 	if (ret) {
 		DRV_LOG(ERR, "Failed to create CQ.");
 		goto error;
 	}
 	/* Create RQ using DevX API. */
-	ret = mlx5_rxq_create_devx_rq_resources(dev, rxq_data);
+	ret = mlx5_rxq_create_devx_rq_resources(rxq);
 	if (ret) {
 		DRV_LOG(ERR, "Port %u Rx queue %u RQ creation failure.",
-			dev->data->port_id, idx);
+			priv->dev_data->port_id, rxq->idx);
 		rte_errno = ENOMEM;
 		goto error;
 	}
 	/* Change queue state to ready. */
-	ret = mlx5_devx_modify_rq(tmpl, MLX5_RXQ_MOD_RST2RDY);
+	ret = mlx5_devx_modify_rq(rxq, MLX5_RXQ_MOD_RST2RDY);
 	if (ret)
 		goto error;
-	rxq_data->wqes = (void *)(uintptr_t)tmpl->rq_obj.wq.umem_buf;
-	rxq_data->rq_db = (uint32_t *)(uintptr_t)tmpl->rq_obj.wq.db_rec;
-	rxq_data->cq_arm_sn = 0;
-	rxq_data->cq_ci = 0;
+	rxq_data->wqes = (void *)(uintptr_t)rxq->devx_rq.wq.umem_buf;
+	rxq_data->rq_db = (uint32_t *)(uintptr_t)rxq->devx_rq.wq.db_rec;
 	mlx5_rxq_initialize(rxq_data);
-	dev->data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_STARTED;
-	rxq_ctrl->wqn = tmpl->rq_obj.rq->id;
+	priv->dev_data->rx_queue_state[rxq->idx] = RTE_ETH_QUEUE_STATE_STARTED;
+	rxq_ctrl->wqn = rxq->devx_rq.rq->id;
 	return 0;
 error:
 	ret = rte_errno; /* Save rte_errno before cleanup. */
-	mlx5_rxq_devx_obj_release(tmpl);
+	mlx5_rxq_devx_obj_release(rxq);
 	rte_errno = ret; /* Restore rte_errno. */
 	return -rte_errno;
 }
@@ -571,15 +550,15 @@ mlx5_devx_ind_table_create_rqt_attr(struct rte_eth_dev *dev,
 	rqt_attr->rqt_actual_size = rqt_n;
 	if (queues == NULL) {
 		for (i = 0; i < rqt_n; i++)
-			rqt_attr->rq_list[i] = priv->drop_queue.rxq->rq->id;
+			rqt_attr->rq_list[i] =
+					priv->drop_queue.rxq->devx_rq.rq->id;
 		return rqt_attr;
 	}
 	for (i = 0; i != queues_n; ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[queues[i]];
-		struct mlx5_rxq_ctrl *rxq_ctrl =
-				container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, queues[i]);
 
-		rqt_attr->rq_list[i] = rxq_ctrl->obj->rq_obj.rq->id;
+		MLX5_ASSERT(rxq != NULL);
+		rqt_attr->rq_list[i] = rxq->devx_rq.rq->id;
 	}
 	MLX5_ASSERT(i > 0);
 	for (j = 0; i != rqt_n; ++j, ++i)
@@ -719,7 +698,7 @@ mlx5_devx_tir_attr_set(struct rte_eth_dev *dev, const uint8_t *rss_key,
 			}
 		}
 	} else {
-		rxq_obj_type = priv->drop_queue.rxq->rxq_ctrl->type;
+		rxq_obj_type = priv->drop_queue.rxq->ctrl->type;
 	}
 	memset(tir_attr, 0, sizeof(*tir_attr));
 	tir_attr->disp_type = MLX5_TIRC_DISP_TYPE_INDIRECT;
@@ -891,9 +870,9 @@ mlx5_rxq_devx_obj_drop_create(struct rte_eth_dev *dev)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	int socket_id = dev->device->numa_node;
-	struct mlx5_rxq_ctrl *rxq_ctrl;
-	struct mlx5_rxq_data *rxq_data;
-	struct mlx5_rxq_obj *rxq = NULL;
+	struct mlx5_rxq_priv *rxq;
+	struct mlx5_rxq_ctrl *rxq_ctrl = NULL;
+	struct mlx5_rxq_obj *rxq_obj = NULL;
 	int ret;
 
 	/*
@@ -901,6 +880,13 @@ mlx5_rxq_devx_obj_drop_create(struct rte_eth_dev *dev)
 	 * They are required to hold pointers for cleanup
 	 * and are only accessible via drop queue DevX objects.
 	 */
+	rxq = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rxq), 0, socket_id);
+	if (rxq == NULL) {
+		DRV_LOG(ERR, "Port %u could not allocate drop queue private",
+			dev->data->port_id);
+		rte_errno = ENOMEM;
+		goto error;
+	}
 	rxq_ctrl = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rxq_ctrl),
 			       0, socket_id);
 	if (rxq_ctrl == NULL) {
@@ -909,27 +895,29 @@ mlx5_rxq_devx_obj_drop_create(struct rte_eth_dev *dev)
 		rte_errno = ENOMEM;
 		goto error;
 	}
-	rxq = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rxq), 0, socket_id);
-	if (rxq == NULL) {
+	rxq_obj = mlx5_malloc(MLX5_MEM_ZERO, sizeof(*rxq_obj), 0, socket_id);
+	if (rxq_obj == NULL) {
 		DRV_LOG(ERR, "Port %u could not allocate drop queue object",
 			dev->data->port_id);
 		rte_errno = ENOMEM;
 		goto error;
 	}
-	rxq->rxq_ctrl = rxq_ctrl;
+	rxq_obj->rxq_ctrl = rxq_ctrl;
 	rxq_ctrl->type = MLX5_RXQ_TYPE_STANDARD;
 	rxq_ctrl->sh = priv->sh;
-	rxq_ctrl->obj = rxq;
-	rxq_data = &rxq_ctrl->rxq;
+	rxq_ctrl->obj = rxq_obj;
+	rxq->ctrl = rxq_ctrl;
+	rxq->priv = priv;
+	LIST_INSERT_HEAD(&rxq_ctrl->owners, rxq, owner_entry);
 	/* Create CQ using DevX API. */
-	ret = mlx5_rxq_create_devx_cq_resources(dev, rxq_data);
+	ret = mlx5_rxq_create_devx_cq_resources(rxq);
 	if (ret != 0) {
 		DRV_LOG(ERR, "Port %u drop queue CQ creation failed.",
 			dev->data->port_id);
 		goto error;
 	}
 	/* Create RQ using DevX API. */
-	ret = mlx5_rxq_create_devx_rq_resources(dev, rxq_data);
+	ret = mlx5_rxq_create_devx_rq_resources(rxq);
 	if (ret != 0) {
 		DRV_LOG(ERR, "Port %u drop queue RQ creation failed.",
 			dev->data->port_id);
@@ -945,18 +933,20 @@ mlx5_rxq_devx_obj_drop_create(struct rte_eth_dev *dev)
 	return 0;
 error:
 	ret = rte_errno; /* Save rte_errno before cleanup. */
-	if (rxq != NULL) {
-		if (rxq->rq_obj.rq != NULL)
-			mlx5_devx_rq_destroy(&rxq->rq_obj);
-		if (rxq->cq_obj.cq != NULL)
-			mlx5_devx_cq_destroy(&rxq->cq_obj);
-		if (rxq->devx_channel)
+	if (rxq != NULL && rxq->devx_rq.rq != NULL)
+		mlx5_devx_rq_destroy(&rxq->devx_rq);
+	if (rxq_obj != NULL) {
+		if (rxq_obj->cq_obj.cq != NULL)
+			mlx5_devx_cq_destroy(&rxq_obj->cq_obj);
+		if (rxq_obj->devx_channel)
 			mlx5_os_devx_destroy_event_channel
-							(rxq->devx_channel);
-		mlx5_free(rxq);
+							(rxq_obj->devx_channel);
+		mlx5_free(rxq_obj);
 	}
 	if (rxq_ctrl != NULL)
 		mlx5_free(rxq_ctrl);
+	if (rxq != NULL)
+		mlx5_free(rxq);
 	rte_errno = ret; /* Restore rte_errno. */
 	return -rte_errno;
 }
@@ -971,12 +961,13 @@ static void
 mlx5_rxq_devx_obj_drop_release(struct rte_eth_dev *dev)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_obj *rxq = priv->drop_queue.rxq;
-	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->rxq_ctrl;
+	struct mlx5_rxq_priv *rxq = priv->drop_queue.rxq;
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
 
 	mlx5_rxq_devx_obj_release(rxq);
-	mlx5_free(rxq);
+	mlx5_free(rxq_ctrl->obj);
 	mlx5_free(rxq_ctrl);
+	mlx5_free(rxq);
 	priv->drop_queue.rxq = NULL;
 }
 
@@ -996,7 +987,7 @@ mlx5_devx_drop_action_destroy(struct rte_eth_dev *dev)
 		mlx5_devx_tir_destroy(hrxq);
 	if (hrxq->ind_table->ind_table != NULL)
 		mlx5_devx_ind_table_destroy(hrxq->ind_table);
-	if (priv->drop_queue.rxq->rq != NULL)
+	if (priv->drop_queue.rxq->devx_rq.rq != NULL)
 		mlx5_rxq_devx_obj_drop_release(dev);
 }
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index c04c0c73349..337dcca59fb 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -174,6 +174,7 @@ struct mlx5_rxq_priv {
 	struct mlx5_rxq_ctrl *ctrl; /* Shared Rx Queue. */
 	LIST_ENTRY(mlx5_rxq_priv) owner_entry; /* Entry in shared rxq_ctrl. */
 	struct mlx5_priv *priv; /* Back pointer to private data. */
+	struct mlx5_devx_rq devx_rq;
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
 };
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 5a20966e2ca..2850a220399 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -471,13 +471,13 @@ int
 mlx5_rx_queue_stop_primary(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-			container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
+	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
 	int ret;
 
+	MLX5_ASSERT(rxq != NULL && rxq_ctrl != NULL);
 	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
-	ret = priv->obj_ops.rxq_obj_modify(rxq_ctrl->obj, MLX5_RXQ_MOD_RDY2RST);
+	ret = priv->obj_ops.rxq_obj_modify(rxq, MLX5_RXQ_MOD_RDY2RST);
 	if (ret) {
 		DRV_LOG(ERR, "Cannot change Rx WQ state to RESET:  %s",
 			strerror(errno));
@@ -485,7 +485,7 @@ mlx5_rx_queue_stop_primary(struct rte_eth_dev *dev, uint16_t idx)
 		return ret;
 	}
 	/* Remove all processes CQEs. */
-	rxq_sync_cq(rxq);
+	rxq_sync_cq(&rxq_ctrl->rxq);
 	/* Free all involved mbufs. */
 	rxq_free_elts(rxq_ctrl);
 	/* Set the actual queue state. */
@@ -557,26 +557,26 @@ int
 mlx5_rx_queue_start_primary(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq = (*priv->rxqs)[idx];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-			container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
+	struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
 	int ret;
 
-	MLX5_ASSERT(rte_eal_process_type() ==  RTE_PROC_PRIMARY);
+	MLX5_ASSERT(rxq != NULL && rxq->ctrl != NULL);
+	MLX5_ASSERT(rte_eal_process_type() == RTE_PROC_PRIMARY);
 	/* Allocate needed buffers. */
-	ret = rxq_alloc_elts(rxq_ctrl);
+	ret = rxq_alloc_elts(rxq->ctrl);
 	if (ret) {
 		DRV_LOG(ERR, "Cannot reallocate buffers for Rx WQ");
 		rte_errno = errno;
 		return ret;
 	}
 	rte_io_wmb();
-	*rxq->cq_db = rte_cpu_to_be_32(rxq->cq_ci);
+	*rxq_data->cq_db = rte_cpu_to_be_32(rxq_data->cq_ci);
 	rte_io_wmb();
 	/* Reset RQ consumer before moving queue to READY state. */
-	*rxq->rq_db = rte_cpu_to_be_32(0);
+	*rxq_data->rq_db = rte_cpu_to_be_32(0);
 	rte_io_wmb();
-	ret = priv->obj_ops.rxq_obj_modify(rxq_ctrl->obj, MLX5_RXQ_MOD_RST2RDY);
+	ret = priv->obj_ops.rxq_obj_modify(rxq, MLX5_RXQ_MOD_RST2RDY);
 	if (ret) {
 		DRV_LOG(ERR, "Cannot change Rx WQ state to READY:  %s",
 			strerror(errno));
@@ -584,8 +584,8 @@ mlx5_rx_queue_start_primary(struct rte_eth_dev *dev, uint16_t idx)
 		return ret;
 	}
 	/* Reinitialize RQ - set WQEs. */
-	mlx5_rxq_initialize(rxq);
-	rxq->err_state = MLX5_RXQ_ERR_STATE_NO_ERROR;
+	mlx5_rxq_initialize(rxq_data);
+	rxq_data->err_state = MLX5_RXQ_ERR_STATE_NO_ERROR;
 	/* Set actual queue state. */
 	dev->data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_STARTED;
 	return 0;
@@ -1835,15 +1835,19 @@ int
 mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);
-	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
+	struct mlx5_rxq_priv *rxq;
+	struct mlx5_rxq_ctrl *rxq_ctrl;
 
-	if (priv->rxqs == NULL || (*priv->rxqs)[idx] == NULL)
+	if (priv->rxq_privs == NULL)
+		return 0;
+	rxq = mlx5_rxq_get(dev, idx);
+	if (rxq == NULL)
 		return 0;
 	if (mlx5_rxq_deref(dev, idx) > 1)
 		return 1;
-	if (rxq_ctrl->obj) {
-		priv->obj_ops.rxq_obj_release(rxq_ctrl->obj);
+	rxq_ctrl = rxq->ctrl;
+	if (rxq_ctrl->obj != NULL) {
+		priv->obj_ops.rxq_obj_release(rxq);
 		LIST_REMOVE(rxq_ctrl->obj, next);
 		mlx5_free(rxq_ctrl->obj);
 		rxq_ctrl->obj = NULL;
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 0bcdff1b116..54d410b513b 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -373,11 +373,9 @@ mlx5_queue_state_modify_primary(struct rte_eth_dev *dev,
 	struct mlx5_priv *priv = dev->data->dev_private;
 
 	if (sm->is_wq) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[sm->queue_id];
-		struct mlx5_rxq_ctrl *rxq_ctrl =
-			container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, sm->queue_id);
 
-		ret = priv->obj_ops.rxq_obj_modify(rxq_ctrl->obj, sm->state);
+		ret = priv->obj_ops.rxq_obj_modify(rxq, sm->state);
 		if (ret) {
 			DRV_LOG(ERR, "Cannot change Rx WQ state to %u  - %s",
 					sm->state, strerror(errno));
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index caafdf27e8f..2cf62a9780d 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -231,7 +231,7 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 			rte_errno = ENOMEM;
 			goto error;
 		}
-		ret = priv->obj_ops.rxq_obj_new(dev, i);
+		ret = priv->obj_ops.rxq_obj_new(rxq);
 		if (ret) {
 			mlx5_free(rxq_ctrl->obj);
 			rxq_ctrl->obj = NULL;
diff --git a/drivers/net/mlx5/mlx5_vlan.c b/drivers/net/mlx5/mlx5_vlan.c
index 07792fc5d94..ea841bb32fb 100644
--- a/drivers/net/mlx5/mlx5_vlan.c
+++ b/drivers/net/mlx5/mlx5_vlan.c
@@ -91,11 +91,11 @@ void
 mlx5_vlan_strip_queue_set(struct rte_eth_dev *dev, uint16_t queue, int on)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq = (*priv->rxqs)[queue];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, queue);
+	struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
 	int ret = 0;
 
+	MLX5_ASSERT(rxq != NULL && rxq->ctrl != NULL);
 	/* Validate hw support */
 	if (!priv->config.hw_vlan_strip) {
 		DRV_LOG(ERR, "port %u VLAN stripping is not supported",
@@ -109,20 +109,20 @@ mlx5_vlan_strip_queue_set(struct rte_eth_dev *dev, uint16_t queue, int on)
 		return;
 	}
 	DRV_LOG(DEBUG, "port %u set VLAN stripping offloads %d for port %uqueue %d",
-		dev->data->port_id, on, rxq->port_id, queue);
-	if (!rxq_ctrl->obj) {
+		dev->data->port_id, on, rxq_data->port_id, queue);
+	if (rxq->ctrl->obj == NULL) {
 		/* Update related bits in RX queue. */
-		rxq->vlan_strip = !!on;
+		rxq_data->vlan_strip = !!on;
 		return;
 	}
-	ret = priv->obj_ops.rxq_obj_modify_vlan_strip(rxq_ctrl->obj, on);
+	ret = priv->obj_ops.rxq_obj_modify_vlan_strip(rxq, on);
 	if (ret) {
 		DRV_LOG(ERR, "Port %u failed to modify object stripping mode:"
 			" %s", dev->data->port_id, strerror(rte_errno));
 		return;
 	}
 	/* Update related bits in RX queue. */
-	rxq->vlan_strip = !!on;
+	rxq_data->vlan_strip = !!on;
 }
 
 /**
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 12/14] net/mlx5: remove Rx queue data list from device
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
                     ` (10 preceding siblings ...)
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 11/14] net/mlx5: move Rx queue DevX resource Xueming Li
@ 2021-11-04 12:33   ` Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 13/14] net/mlx5: support shared Rx queue Xueming Li
                     ` (2 subsequent siblings)
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Slava Ovsiienko, Matan Azrad

The Rx queue data list (priv->rxqs) can be replaced by the Rx queue
list (priv->rxq_privs): remove it and replace all accesses with the
universal wrapper API.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
---
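A minimal sketch of the universal wrapper accessors the call sites
below are switched to. The actual helpers are defined in
drivers/net/mlx5/mlx5_rxq.c together with mlx5_rxq_get(); the bodies
here are assumptions reconstructed from the usage in this patch, not
the committed implementation, and rely on the struct definitions in
drivers/net/mlx5/mlx5_rx.h:

/*
 * Sketch: resolve the queue control and data structures through the
 * per-queue private structure instead of the removed priv->rxqs array.
 */
static inline struct mlx5_rxq_ctrl *
mlx5_rxq_ctrl_get(struct rte_eth_dev *dev, uint16_t idx)
{
	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, idx);

	/* NULL when the queue is not configured. */
	return rxq == NULL ? NULL : rxq->ctrl;
}

static inline struct mlx5_rxq_data *
mlx5_rxq_data_get(struct rte_eth_dev *dev, uint16_t idx)
{
	struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, idx);

	return rxq_ctrl == NULL ? NULL : &rxq_ctrl->rxq;
}
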
 drivers/net/mlx5/linux/mlx5_verbs.c |  7 ++---
 drivers/net/mlx5/mlx5.c             | 10 +-----
 drivers/net/mlx5/mlx5.h             |  1 -
 drivers/net/mlx5/mlx5_devx.c        | 12 +++++---
 drivers/net/mlx5/mlx5_ethdev.c      |  6 +---
 drivers/net/mlx5/mlx5_flow.c        | 47 +++++++++++++++--------------
 drivers/net/mlx5/mlx5_rss.c         |  6 ++--
 drivers/net/mlx5/mlx5_rx.c          | 15 +++++----
 drivers/net/mlx5/mlx5_rx.h          |  9 +++---
 drivers/net/mlx5/mlx5_rxq.c         | 43 ++++++++++++--------------
 drivers/net/mlx5/mlx5_rxtx_vec.c    |  6 ++--
 drivers/net/mlx5/mlx5_stats.c       |  9 +++---
 drivers/net/mlx5/mlx5_trigger.c     |  2 +-
 13 files changed, 79 insertions(+), 94 deletions(-)

diff --git a/drivers/net/mlx5/linux/mlx5_verbs.c b/drivers/net/mlx5/linux/mlx5_verbs.c
index 5d4ae3ea752..f78916c868f 100644
--- a/drivers/net/mlx5/linux/mlx5_verbs.c
+++ b/drivers/net/mlx5/linux/mlx5_verbs.c
@@ -486,11 +486,10 @@ mlx5_ibv_ind_table_new(struct rte_eth_dev *dev, const unsigned int log_n,
 
 	MLX5_ASSERT(ind_tbl);
 	for (i = 0; i != ind_tbl->queues_n; ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[ind_tbl->queues[i]];
-		struct mlx5_rxq_ctrl *rxq_ctrl =
-				container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev,
+							 ind_tbl->queues[i]);
 
-		wq[i] = rxq_ctrl->obj->wq;
+		wq[i] = rxq->ctrl->obj->wq;
 	}
 	MLX5_ASSERT(i > 0);
 	/* Finalise indirection table. */
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 374cc9757aa..8614b8ffddd 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -1687,20 +1687,12 @@ mlx5_dev_close(struct rte_eth_dev *dev)
 	/* Free the eCPRI flex parser resource. */
 	mlx5_flex_parser_ecpri_release(dev);
 	mlx5_flex_item_port_cleanup(dev);
-	if (priv->rxqs != NULL) {
+	if (priv->rxq_privs != NULL) {
 		/* XXX race condition if mlx5_rx_burst() is still running. */
 		rte_delay_us_sleep(1000);
 		for (i = 0; (i != priv->rxqs_n); ++i)
 			mlx5_rxq_release(dev, i);
 		priv->rxqs_n = 0;
-		priv->rxqs = NULL;
-	}
-	if (priv->representor) {
-		/* Each representor has a dedicated interrupts handler */
-		mlx5_free(dev->intr_handle);
-		dev->intr_handle = NULL;
-	}
-	if (priv->rxq_privs != NULL) {
 		mlx5_free(priv->rxq_privs);
 		priv->rxq_privs = NULL;
 	}
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 967d92b4ad6..a037a33debf 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1410,7 +1410,6 @@ struct mlx5_priv {
 	unsigned int rxqs_n; /* RX queues array size. */
 	unsigned int txqs_n; /* TX queues array size. */
 	struct mlx5_rxq_priv *(*rxq_privs)[]; /* RX queue non-shared data. */
-	struct mlx5_rxq_data *(*rxqs)[]; /* (Shared) RX queues. */
 	struct mlx5_txq_data *(*txqs)[]; /* TX queues. */
 	struct rte_mempool *mprq_mp; /* Mempool for Multi-Packet RQ. */
 	struct rte_eth_rss_conf rss_conf; /* RSS configuration. */
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index b90a5d82458..668d47025e8 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -684,15 +684,17 @@ mlx5_devx_tir_attr_set(struct rte_eth_dev *dev, const uint8_t *rss_key,
 
 	/* NULL queues designate drop queue. */
 	if (ind_tbl->queues != NULL) {
-		struct mlx5_rxq_data *rxq_data =
-					(*priv->rxqs)[ind_tbl->queues[0]];
 		struct mlx5_rxq_ctrl *rxq_ctrl =
-			container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
-		rxq_obj_type = rxq_ctrl->type;
+				mlx5_rxq_ctrl_get(dev, ind_tbl->queues[0]);
+		rxq_obj_type = rxq_ctrl != NULL ? rxq_ctrl->type :
+						  MLX5_RXQ_TYPE_STANDARD;
 
 		/* Enable TIR LRO only if all the queues were configured for. */
 		for (i = 0; i < ind_tbl->queues_n; ++i) {
-			if (!(*priv->rxqs)[ind_tbl->queues[i]]->lro) {
+			struct mlx5_rxq_data *rxq_i =
+				mlx5_rxq_data_get(dev, ind_tbl->queues[i]);
+
+			if (rxq_i != NULL && !rxq_i->lro) {
 				lro = false;
 				break;
 			}
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index cde505955df..bb38d5d2ade 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -114,7 +114,6 @@ mlx5_dev_configure(struct rte_eth_dev *dev)
 		rte_errno = ENOMEM;
 		return -rte_errno;
 	}
-	priv->rxqs = (void *)dev->data->rx_queues;
 	priv->txqs = (void *)dev->data->tx_queues;
 	if (txqs_n != priv->txqs_n) {
 		DRV_LOG(INFO, "port %u Tx queues number update: %u -> %u",
@@ -171,11 +170,8 @@ mlx5_dev_configure_rss_reta(struct rte_eth_dev *dev)
 		return -rte_errno;
 	}
 	for (i = 0, j = 0; i < rxqs_n; i++) {
-		struct mlx5_rxq_data *rxq_data;
-		struct mlx5_rxq_ctrl *rxq_ctrl;
+		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, i);
 
-		rxq_data = (*priv->rxqs)[i];
-		rxq_ctrl = container_of(rxq_data, struct mlx5_rxq_ctrl, rxq);
 		if (rxq_ctrl && rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD)
 			rss_queue_arr[j++] = i;
 	}
diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index 5435660a2dd..2f30a355258 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -1210,10 +1210,11 @@ flow_drv_rxq_flags_set(struct rte_eth_dev *dev,
 		return;
 	for (i = 0; i != ind_tbl->queues_n; ++i) {
 		int idx = ind_tbl->queues[i];
-		struct mlx5_rxq_ctrl *rxq_ctrl =
-			container_of((*priv->rxqs)[idx],
-				     struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, idx);
 
+		MLX5_ASSERT(rxq_ctrl != NULL);
+		if (rxq_ctrl == NULL)
+			continue;
 		/*
 		 * To support metadata register copy on Tx loopback,
 		 * this must be always enabled (metadata may arive
@@ -1305,10 +1306,11 @@ flow_drv_rxq_flags_trim(struct rte_eth_dev *dev,
 	MLX5_ASSERT(dev->data->dev_started);
 	for (i = 0; i != ind_tbl->queues_n; ++i) {
 		int idx = ind_tbl->queues[i];
-		struct mlx5_rxq_ctrl *rxq_ctrl =
-			container_of((*priv->rxqs)[idx],
-				     struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, idx);
 
+		MLX5_ASSERT(rxq_ctrl != NULL);
+		if (rxq_ctrl == NULL)
+			continue;
 		if (priv->config.dv_flow_en &&
 		    priv->config.dv_xmeta_en != MLX5_XMETA_MODE_LEGACY &&
 		    mlx5_flow_ext_mreg_supported(dev)) {
@@ -1369,18 +1371,16 @@ flow_rxq_flags_clear(struct rte_eth_dev *dev)
 	unsigned int i;
 
 	for (i = 0; i != priv->rxqs_n; ++i) {
-		struct mlx5_rxq_ctrl *rxq_ctrl;
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, i);
 		unsigned int j;
 
-		if (!(*priv->rxqs)[i])
+		if (rxq == NULL || rxq->ctrl == NULL)
 			continue;
-		rxq_ctrl = container_of((*priv->rxqs)[i],
-					struct mlx5_rxq_ctrl, rxq);
-		rxq_ctrl->flow_mark_n = 0;
-		rxq_ctrl->rxq.mark = 0;
+		rxq->ctrl->flow_mark_n = 0;
+		rxq->ctrl->rxq.mark = 0;
 		for (j = 0; j != MLX5_FLOW_TUNNEL; ++j)
-			rxq_ctrl->flow_tunnels_n[j] = 0;
-		rxq_ctrl->rxq.tunnel = 0;
+			rxq->ctrl->flow_tunnels_n[j] = 0;
+		rxq->ctrl->rxq.tunnel = 0;
 	}
 }
 
@@ -1394,13 +1394,15 @@ void
 mlx5_flow_rxq_dynf_metadata_set(struct rte_eth_dev *dev)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *data;
 	unsigned int i;
 
 	for (i = 0; i != priv->rxqs_n; ++i) {
-		if (!(*priv->rxqs)[i])
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, i);
+		struct mlx5_rxq_data *data;
+
+		if (rxq == NULL || rxq->ctrl == NULL)
 			continue;
-		data = (*priv->rxqs)[i];
+		data = &rxq->ctrl->rxq;
 		if (!rte_flow_dynf_metadata_avail()) {
 			data->dynf_meta = 0;
 			data->flow_meta_mask = 0;
@@ -1591,7 +1593,7 @@ mlx5_flow_validate_action_queue(const struct rte_flow_action *action,
 					  RTE_FLOW_ERROR_TYPE_ACTION_CONF,
 					  &queue->index,
 					  "queue index out of range");
-	if (!(*priv->rxqs)[queue->index])
+	if (mlx5_rxq_get(dev, queue->index) == NULL)
 		return rte_flow_error_set(error, EINVAL,
 					  RTE_FLOW_ERROR_TYPE_ACTION_CONF,
 					  &queue->index,
@@ -1622,7 +1624,7 @@ mlx5_flow_validate_action_queue(const struct rte_flow_action *action,
  *   0 on success, a negative errno code on error.
  */
 static int
-mlx5_validate_rss_queues(const struct rte_eth_dev *dev,
+mlx5_validate_rss_queues(struct rte_eth_dev *dev,
 			 const uint16_t *queues, uint32_t queues_n,
 			 const char **error, uint32_t *queue_idx)
 {
@@ -1631,20 +1633,19 @@ mlx5_validate_rss_queues(const struct rte_eth_dev *dev,
 	uint32_t i;
 
 	for (i = 0; i != queues_n; ++i) {
-		struct mlx5_rxq_ctrl *rxq_ctrl;
+		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev,
+								   queues[i]);
 
 		if (queues[i] >= priv->rxqs_n) {
 			*error = "queue index out of range";
 			*queue_idx = i;
 			return -EINVAL;
 		}
-		if (!(*priv->rxqs)[queues[i]]) {
+		if (rxq_ctrl == NULL) {
 			*error =  "queue is not configured";
 			*queue_idx = i;
 			return -EINVAL;
 		}
-		rxq_ctrl = container_of((*priv->rxqs)[queues[i]],
-					struct mlx5_rxq_ctrl, rxq);
 		if (i == 0)
 			rxq_type = rxq_ctrl->type;
 		if (rxq_type != rxq_ctrl->type) {
diff --git a/drivers/net/mlx5/mlx5_rss.c b/drivers/net/mlx5/mlx5_rss.c
index a04e22398db..75af05b7b02 100644
--- a/drivers/net/mlx5/mlx5_rss.c
+++ b/drivers/net/mlx5/mlx5_rss.c
@@ -65,9 +65,11 @@ mlx5_rss_hash_update(struct rte_eth_dev *dev,
 	priv->rss_conf.rss_hf = rss_conf->rss_hf;
 	/* Enable the RSS hash in all Rx queues. */
 	for (i = 0, idx = 0; idx != priv->rxqs_n; ++i) {
-		if (!(*priv->rxqs)[i])
+		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, i);
+
+		if (rxq == NULL || rxq->ctrl == NULL)
 			continue;
-		(*priv->rxqs)[i]->rss_hash = !!rss_conf->rss_hf &&
+		rxq->ctrl->rxq.rss_hash = !!rss_conf->rss_hf &&
 			!!(dev->data->dev_conf.rxmode.mq_mode & RTE_ETH_MQ_RX_RSS);
 		++idx;
 	}
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index d41905a2a04..1ffa1b95b88 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -148,10 +148,8 @@ void
 mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 		  struct rte_eth_rxq_info *qinfo)
 {
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq = (*priv->rxqs)[rx_queue_id];
-	struct mlx5_rxq_ctrl *rxq_ctrl =
-		container_of(rxq, struct mlx5_rxq_ctrl, rxq);
+	struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
+	struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
 
 	if (!rxq)
 		return;
@@ -162,7 +160,10 @@ mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 	qinfo->conf.rx_thresh.wthresh = 0;
 	qinfo->conf.rx_free_thresh = rxq->rq_repl_thresh;
 	qinfo->conf.rx_drop_en = 1;
-	qinfo->conf.rx_deferred_start = rxq_ctrl ? 0 : 1;
+	if (rxq_ctrl == NULL || rxq_ctrl->obj == NULL)
+		qinfo->conf.rx_deferred_start = 0;
+	else
+		qinfo->conf.rx_deferred_start = 1;
 	qinfo->conf.offloads = dev->data->dev_conf.rxmode.offloads;
 	qinfo->scattered_rx = dev->data->scattered_rx;
 	qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
@@ -191,10 +192,8 @@ mlx5_rx_burst_mode_get(struct rte_eth_dev *dev,
 		       struct rte_eth_burst_mode *mode)
 {
 	eth_rx_burst_t pkt_burst = dev->rx_pkt_burst;
-	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_data *rxq;
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, rx_queue_id);
 
-	rxq = (*priv->rxqs)[rx_queue_id];
 	if (!rxq) {
 		rte_errno = EINVAL;
 		return -rte_errno;
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 337dcca59fb..413e36f6d8d 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -603,14 +603,13 @@ mlx5_mprq_enabled(struct rte_eth_dev *dev)
 		return 0;
 	/* All the configured queues should be enabled. */
 	for (i = 0; i < priv->rxqs_n; ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
-		struct mlx5_rxq_ctrl *rxq_ctrl = container_of
-			(rxq, struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, i);
 
-		if (rxq == NULL || rxq_ctrl->type != MLX5_RXQ_TYPE_STANDARD)
+		if (rxq_ctrl == NULL ||
+		    rxq_ctrl->type != MLX5_RXQ_TYPE_STANDARD)
 			continue;
 		n_ibv++;
-		if (mlx5_rxq_mprq_enabled(rxq))
+		if (mlx5_rxq_mprq_enabled(&rxq_ctrl->rxq))
 			++n;
 	}
 	/* Multi-Packet RQ can't be partially configured. */
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 2850a220399..f3fc618ed2c 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -748,7 +748,7 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 	}
 	DRV_LOG(DEBUG, "port %u adding Rx queue %u to list",
 		dev->data->port_id, idx);
-	(*priv->rxqs)[idx] = &rxq_ctrl->rxq;
+	dev->data->rx_queues[idx] = &rxq_ctrl->rxq;
 	return 0;
 }
 
@@ -830,7 +830,7 @@ mlx5_rx_hairpin_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
 	}
 	DRV_LOG(DEBUG, "port %u adding hairpin Rx queue %u to list",
 		dev->data->port_id, idx);
-	(*priv->rxqs)[idx] = &rxq_ctrl->rxq;
+	dev->data->rx_queues[idx] = &rxq_ctrl->rxq;
 	return 0;
 }
 
@@ -1163,7 +1163,7 @@ mlx5_mprq_free_mp(struct rte_eth_dev *dev)
 	rte_mempool_free(mp);
 	/* Unset mempool for each Rx queue. */
 	for (i = 0; i != priv->rxqs_n; ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
+		struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, i);
 
 		if (rxq == NULL)
 			continue;
@@ -1204,12 +1204,13 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 		return 0;
 	/* Count the total number of descriptors configured. */
 	for (i = 0; i != priv->rxqs_n; ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
-		struct mlx5_rxq_ctrl *rxq_ctrl = container_of
-			(rxq, struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, i);
+		struct mlx5_rxq_data *rxq;
 
-		if (rxq == NULL || rxq_ctrl->type != MLX5_RXQ_TYPE_STANDARD)
+		if (rxq_ctrl == NULL ||
+		    rxq_ctrl->type != MLX5_RXQ_TYPE_STANDARD)
 			continue;
+		rxq = &rxq_ctrl->rxq;
 		n_ibv++;
 		desc += 1 << rxq->elts_n;
 		/* Get the max number of strides. */
@@ -1292,13 +1293,12 @@ mlx5_mprq_alloc_mp(struct rte_eth_dev *dev)
 exit:
 	/* Set mempool for each Rx queue. */
 	for (i = 0; i != priv->rxqs_n; ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
-		struct mlx5_rxq_ctrl *rxq_ctrl = container_of
-			(rxq, struct mlx5_rxq_ctrl, rxq);
+		struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, i);
 
-		if (rxq == NULL || rxq_ctrl->type != MLX5_RXQ_TYPE_STANDARD)
+		if (rxq_ctrl == NULL ||
+		    rxq_ctrl->type != MLX5_RXQ_TYPE_STANDARD)
 			continue;
-		rxq->mprq_mp = mp;
+		rxq_ctrl->rxq.mprq_mp = mp;
 	}
 	DRV_LOG(INFO, "port %u Multi-Packet RQ is configured",
 		dev->data->port_id);
@@ -1777,8 +1777,7 @@ mlx5_rxq_get(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 
-	if (priv->rxq_privs == NULL)
-		return NULL;
+	MLX5_ASSERT(priv->rxq_privs != NULL);
 	return (*priv->rxq_privs)[idx];
 }
 
@@ -1862,7 +1861,7 @@ mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 		LIST_REMOVE(rxq, owner_entry);
 		LIST_REMOVE(rxq_ctrl, next);
 		mlx5_free(rxq_ctrl);
-		(*priv->rxqs)[idx] = NULL;
+		dev->data->rx_queues[idx] = NULL;
 		mlx5_free(rxq);
 		(*priv->rxq_privs)[idx] = NULL;
 	}
@@ -1908,14 +1907,10 @@ enum mlx5_rxq_type
 mlx5_rxq_get_type(struct rte_eth_dev *dev, uint16_t idx)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	struct mlx5_rxq_ctrl *rxq_ctrl = NULL;
+	struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, idx);
 
-	if (idx < priv->rxqs_n && (*priv->rxqs)[idx]) {
-		rxq_ctrl = container_of((*priv->rxqs)[idx],
-					struct mlx5_rxq_ctrl,
-					rxq);
+	if (idx < priv->rxqs_n && rxq_ctrl != NULL)
 		return rxq_ctrl->type;
-	}
 	return MLX5_RXQ_TYPE_UNDEFINED;
 }
 
@@ -2682,13 +2677,13 @@ mlx5_rxq_timestamp_set(struct rte_eth_dev *dev)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_dev_ctx_shared *sh = priv->sh;
-	struct mlx5_rxq_data *data;
 	unsigned int i;
 
 	for (i = 0; i != priv->rxqs_n; ++i) {
-		if (!(*priv->rxqs)[i])
+		struct mlx5_rxq_data *data = mlx5_rxq_data_get(dev, i);
+
+		if (data == NULL)
 			continue;
-		data = (*priv->rxqs)[i];
 		data->sh = sh;
 		data->rt_timestamp = priv->config.rt_timestamp;
 	}
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
index 511681841ca..6212ce8247d 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.c
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
@@ -578,11 +578,11 @@ mlx5_check_vec_rx_support(struct rte_eth_dev *dev)
 		return -ENOTSUP;
 	/* All the configured queues should support. */
 	for (i = 0; i < priv->rxqs_n; ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
+		struct mlx5_rxq_data *rxq_data = mlx5_rxq_data_get(dev, i);
 
-		if (!rxq)
+		if (!rxq_data)
 			continue;
-		if (mlx5_rxq_check_vec_support(rxq) < 0)
+		if (mlx5_rxq_check_vec_support(rxq_data) < 0)
 			break;
 	}
 	if (i != priv->rxqs_n)
diff --git a/drivers/net/mlx5/mlx5_stats.c b/drivers/net/mlx5/mlx5_stats.c
index ae2f5668a74..732775954ad 100644
--- a/drivers/net/mlx5/mlx5_stats.c
+++ b/drivers/net/mlx5/mlx5_stats.c
@@ -107,7 +107,7 @@ mlx5_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
 	memset(&tmp, 0, sizeof(tmp));
 	/* Add software counters. */
 	for (i = 0; (i != priv->rxqs_n); ++i) {
-		struct mlx5_rxq_data *rxq = (*priv->rxqs)[i];
+		struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, i);
 
 		if (rxq == NULL)
 			continue;
@@ -181,10 +181,11 @@ mlx5_stats_reset(struct rte_eth_dev *dev)
 	unsigned int i;
 
 	for (i = 0; (i != priv->rxqs_n); ++i) {
-		if ((*priv->rxqs)[i] == NULL)
+		struct mlx5_rxq_data *rxq_data = mlx5_rxq_data_get(dev, i);
+
+		if (rxq_data == NULL)
 			continue;
-		memset(&(*priv->rxqs)[i]->stats, 0,
-		       sizeof(struct mlx5_rxq_stats));
+		memset(&rxq_data->stats, 0, sizeof(struct mlx5_rxq_stats));
 	}
 	for (i = 0; (i != priv->txqs_n); ++i) {
 		if ((*priv->txqs)[i] == NULL)
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 2cf62a9780d..72475e4b5b5 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -227,7 +227,7 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 		if (!rxq_ctrl->obj) {
 			DRV_LOG(ERR,
 				"Port %u Rx queue %u can't allocate resources.",
-				dev->data->port_id, (*priv->rxqs)[i]->idx);
+				dev->data->port_id, i);
 			rte_errno = ENOMEM;
 			goto error;
 		}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 13/14] net/mlx5: support shared Rx queue
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
                     ` (11 preceding siblings ...)
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 12/14] net/mlx5: remove Rx queue data list from device Xueming Li
@ 2021-11-04 12:33   ` Xueming Li
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 14/14] net/mlx5: add shared Rx queue port datapath support Xueming Li
  2021-11-04 20:06   ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Raslan Darawsheh
  14 siblings, 0 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev; +Cc: xuemingl, Lior Margalit, Slava Ovsiienko, Matan Azrad

This patch introduces shared RxQ. All shared Rx queues with the same
group and queue ID share the same rxq_ctrl. Since rxq_ctrl and rxq_data
are shared, all queues from different member ports share the same WQ
and CQ, essentially one Rx WQ, and mbufs are filled into this
singleton WQ.

The shared rxq_data is set into the device Rx queues of all member
ports as the RxQ object used for receiving packets. Polling the queue
of any member port returns packets of any member; mbuf->port is used
to identify the source port.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
---
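A minimal application-side sketch of the polling model described
above, assuming the polled port is a member of a shared Rx queue
group. member_port and queue_id come from the usual device setup, and
handle_pkt() is a hypothetical per-packet handler, not part of this
patch:

#include <rte_common.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

extern void handle_pkt(uint16_t src_port, struct rte_mbuf *m); /* hypothetical */

static void
poll_shared_rxq(uint16_t member_port, uint16_t queue_id)
{
	struct rte_mbuf *pkts[32];
	uint16_t i, n;

	/* A burst on any member queue may carry packets of any member. */
	n = rte_eth_rx_burst(member_port, queue_id, pkts, RTE_DIM(pkts));
	for (i = 0; i < n; ++i) {
		/*
		 * mbuf->port holds the real source port, which is not
		 * necessarily the polled port.
		 */
		handle_pkt(pkts[i]->port, pkts[i]);
	}
}
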
 doc/guides/nics/features/mlx5.ini   |   1 +
 doc/guides/nics/mlx5.rst            |   6 +
 drivers/net/mlx5/linux/mlx5_os.c    |   2 +
 drivers/net/mlx5/linux/mlx5_verbs.c |   8 +-
 drivers/net/mlx5/mlx5.h             |   2 +
 drivers/net/mlx5/mlx5_devx.c        |  46 +++--
 drivers/net/mlx5/mlx5_ethdev.c      |   5 +
 drivers/net/mlx5/mlx5_rx.h          |   3 +
 drivers/net/mlx5/mlx5_rxq.c         | 273 ++++++++++++++++++++++++----
 drivers/net/mlx5/mlx5_trigger.c     |  61 ++++---
 10 files changed, 329 insertions(+), 78 deletions(-)

diff --git a/doc/guides/nics/features/mlx5.ini b/doc/guides/nics/features/mlx5.ini
index 403f58cd7e2..7cbd11bb160 100644
--- a/doc/guides/nics/features/mlx5.ini
+++ b/doc/guides/nics/features/mlx5.ini
@@ -11,6 +11,7 @@ Removal event        = Y
 Rx interrupt         = Y
 Fast mbuf free       = Y
 Queue start/stop     = Y
+Shared Rx queue      = Y
 Burst mode info      = Y
 Power mgmt address monitor = Y
 MTU update           = Y
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index bb92520dff4..824971d89ae 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -113,6 +113,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Shared Rx queue.
 
 
 Limitations
@@ -465,6 +466,11 @@ Limitations
   - In order to achieve best insertion rate, application should manage the flows per lcore.
   - Better to disable memory reclaim by setting ``reclaim_mem_mode`` to 0 to accelerate the flow object allocation and release with cache.
 
+- Shared Rx queue:
+
+  - Counters of received packets and bytes of member devices in the same share group are identical.
+  - Counters of received packets and bytes of queues with the same group and queue ID are identical.
+
 - HW hashed bonding
 
   - TXQ affinity subjects to HW hash once enabled.
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index f51da8c3a38..e0304b685e5 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -420,6 +420,7 @@ mlx5_alloc_shared_dr(struct mlx5_priv *priv)
 			mlx5_glue->dr_create_flow_action_default_miss();
 	if (!sh->default_miss_action)
 		DRV_LOG(WARNING, "Default miss action is not supported.");
+	LIST_INIT(&sh->shared_rxqs);
 	return 0;
 error:
 	/* Rollback the created objects. */
@@ -494,6 +495,7 @@ mlx5_os_free_shared_dr(struct mlx5_priv *priv)
 	MLX5_ASSERT(sh && sh->refcnt);
 	if (sh->refcnt > 1)
 		return;
+	MLX5_ASSERT(LIST_EMPTY(&sh->shared_rxqs));
 #ifdef HAVE_MLX5DV_DR
 	if (sh->rx_domain) {
 		mlx5_glue->dr_destroy_domain(sh->rx_domain);
diff --git a/drivers/net/mlx5/linux/mlx5_verbs.c b/drivers/net/mlx5/linux/mlx5_verbs.c
index f78916c868f..9d299542614 100644
--- a/drivers/net/mlx5/linux/mlx5_verbs.c
+++ b/drivers/net/mlx5/linux/mlx5_verbs.c
@@ -424,14 +424,16 @@ mlx5_rxq_ibv_obj_release(struct mlx5_rxq_priv *rxq)
 {
 	struct mlx5_rxq_obj *rxq_obj = rxq->ctrl->obj;
 
-	MLX5_ASSERT(rxq_obj);
-	MLX5_ASSERT(rxq_obj->wq);
-	MLX5_ASSERT(rxq_obj->ibv_cq);
+	if (rxq_obj == NULL || rxq_obj->wq == NULL)
+		return;
 	claim_zero(mlx5_glue->destroy_wq(rxq_obj->wq));
+	rxq_obj->wq = NULL;
+	MLX5_ASSERT(rxq_obj->ibv_cq);
 	claim_zero(mlx5_glue->destroy_cq(rxq_obj->ibv_cq));
 	if (rxq_obj->ibv_channel)
 		claim_zero(mlx5_glue->destroy_comp_channel
 							(rxq_obj->ibv_channel));
+	rxq->ctrl->started = false;
 }
 
 /**
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index a037a33debf..51f45788381 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1200,6 +1200,7 @@ struct mlx5_dev_ctx_shared {
 	struct mlx5_ecpri_parser_profile ecpri_parser;
 	/* Flex parser profiles information. */
 	void *devx_rx_uar; /* DevX UAR for Rx. */
+	LIST_HEAD(shared_rxqs, mlx5_rxq_ctrl) shared_rxqs; /* Shared RXQs. */
 	struct mlx5_aso_age_mng *aso_age_mng;
 	/* Management data for aging mechanism using ASO Flow Hit. */
 	struct mlx5_geneve_tlv_option_resource *geneve_tlv_option_resource;
@@ -1267,6 +1268,7 @@ struct mlx5_rxq_obj {
 		};
 		struct mlx5_devx_obj *rq; /* DevX RQ object for hairpin. */
 		struct {
+			struct mlx5_devx_rmp devx_rmp; /* RMP for shared RQ. */
 			struct mlx5_devx_cq cq_obj; /* DevX CQ object. */
 			void *devx_channel;
 		};
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 668d47025e8..d3d189ab7f2 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -88,6 +88,8 @@ mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 	default:
 		break;
 	}
+	if (rxq->ctrl->type == MLX5_RXQ_TYPE_HAIRPIN)
+		return mlx5_devx_cmd_modify_rq(rxq->ctrl->obj->rq, &rq_attr);
 	return mlx5_devx_cmd_modify_rq(rxq->devx_rq.rq, &rq_attr);
 }
 
@@ -156,18 +158,21 @@ mlx5_txq_devx_modify(struct mlx5_txq_obj *obj, enum mlx5_txq_modify_type type,
 static void
 mlx5_rxq_devx_obj_release(struct mlx5_rxq_priv *rxq)
 {
-	struct mlx5_rxq_ctrl *rxq_ctrl = rxq->ctrl;
-	struct mlx5_rxq_obj *rxq_obj = rxq_ctrl->obj;
+	struct mlx5_rxq_obj *rxq_obj = rxq->ctrl->obj;
 
-	MLX5_ASSERT(rxq != NULL);
-	MLX5_ASSERT(rxq_ctrl != NULL);
+	if (rxq_obj == NULL)
+		return;
 	if (rxq_obj->rxq_ctrl->type == MLX5_RXQ_TYPE_HAIRPIN) {
-		MLX5_ASSERT(rxq_obj->rq);
+		if (rxq_obj->rq == NULL)
+			return;
 		mlx5_devx_modify_rq(rxq, MLX5_RXQ_MOD_RDY2RST);
 		claim_zero(mlx5_devx_cmd_destroy(rxq_obj->rq));
 	} else {
+		if (rxq->devx_rq.rq == NULL)
+			return;
 		mlx5_devx_rq_destroy(&rxq->devx_rq);
-		memset(&rxq->devx_rq, 0, sizeof(rxq->devx_rq));
+		if (rxq->devx_rq.rmp != NULL && rxq->devx_rq.rmp->ref_cnt > 0)
+			return;
 		mlx5_devx_cq_destroy(&rxq_obj->cq_obj);
 		memset(&rxq_obj->cq_obj, 0, sizeof(rxq_obj->cq_obj));
 		if (rxq_obj->devx_channel) {
@@ -176,6 +181,7 @@ mlx5_rxq_devx_obj_release(struct mlx5_rxq_priv *rxq)
 			rxq_obj->devx_channel = NULL;
 		}
 	}
+	rxq->ctrl->started = false;
 }
 
 /**
@@ -271,6 +277,8 @@ mlx5_rxq_create_devx_rq_resources(struct mlx5_rxq_priv *rxq)
 						MLX5_WQ_END_PAD_MODE_NONE;
 	rq_attr.wq_attr.pd = cdev->pdn;
 	rq_attr.counter_set_id = priv->counter_set_id;
+	if (rxq_data->shared) /* Create RMP based RQ. */
+		rxq->devx_rq.rmp = &rxq_ctrl->obj->devx_rmp;
 	/* Create RQ using DevX API. */
 	return mlx5_devx_rq_create(cdev->ctx, &rxq->devx_rq, wqe_size,
 				   log_desc_n, &rq_attr, rxq_ctrl->socket);
@@ -300,6 +308,8 @@ mlx5_rxq_create_devx_cq_resources(struct mlx5_rxq_priv *rxq)
 	uint16_t event_nums[1] = { 0 };
 	int ret = 0;
 
+	if (rxq_ctrl->started)
+		return 0;
 	if (priv->config.cqe_comp && !rxq_data->hw_timestamp &&
 	    !rxq_data->lro) {
 		cq_attr.cqe_comp_en = 1u;
@@ -365,6 +375,7 @@ mlx5_rxq_create_devx_cq_resources(struct mlx5_rxq_priv *rxq)
 	rxq_data->cq_uar = mlx5_os_get_devx_uar_base_addr(sh->devx_rx_uar);
 	rxq_data->cqe_n = log_cqe_n;
 	rxq_data->cqn = cq_obj->cq->id;
+	rxq_data->cq_ci = 0;
 	if (rxq_ctrl->obj->devx_channel) {
 		ret = mlx5_os_devx_subscribe_devx_event
 					      (rxq_ctrl->obj->devx_channel,
@@ -463,7 +474,7 @@ mlx5_rxq_devx_obj_new(struct mlx5_rxq_priv *rxq)
 	if (rxq_ctrl->type == MLX5_RXQ_TYPE_HAIRPIN)
 		return mlx5_rxq_obj_hairpin_new(rxq);
 	tmpl->rxq_ctrl = rxq_ctrl;
-	if (rxq_ctrl->irq) {
+	if (rxq_ctrl->irq && !rxq_ctrl->started) {
 		int devx_ev_flag =
 			  MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA;
 
@@ -496,11 +507,19 @@ mlx5_rxq_devx_obj_new(struct mlx5_rxq_priv *rxq)
 	ret = mlx5_devx_modify_rq(rxq, MLX5_RXQ_MOD_RST2RDY);
 	if (ret)
 		goto error;
-	rxq_data->wqes = (void *)(uintptr_t)rxq->devx_rq.wq.umem_buf;
-	rxq_data->rq_db = (uint32_t *)(uintptr_t)rxq->devx_rq.wq.db_rec;
-	mlx5_rxq_initialize(rxq_data);
+	if (!rxq_data->shared) {
+		rxq_data->wqes = (void *)(uintptr_t)rxq->devx_rq.wq.umem_buf;
+		rxq_data->rq_db = (uint32_t *)(uintptr_t)rxq->devx_rq.wq.db_rec;
+	} else if (!rxq_ctrl->started) {
+		rxq_data->wqes = (void *)(uintptr_t)tmpl->devx_rmp.wq.umem_buf;
+		rxq_data->rq_db =
+				(uint32_t *)(uintptr_t)tmpl->devx_rmp.wq.db_rec;
+	}
+	if (!rxq_ctrl->started) {
+		mlx5_rxq_initialize(rxq_data);
+		rxq_ctrl->wqn = rxq->devx_rq.rq->id;
+	}
 	priv->dev_data->rx_queue_state[rxq->idx] = RTE_ETH_QUEUE_STATE_STARTED;
-	rxq_ctrl->wqn = rxq->devx_rq.rq->id;
 	return 0;
 error:
 	ret = rte_errno; /* Save rte_errno before cleanup. */
@@ -558,7 +577,10 @@ mlx5_devx_ind_table_create_rqt_attr(struct rte_eth_dev *dev,
 		struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, queues[i]);
 
 		MLX5_ASSERT(rxq != NULL);
-		rqt_attr->rq_list[i] = rxq->devx_rq.rq->id;
+		if (rxq->ctrl->type == MLX5_RXQ_TYPE_HAIRPIN)
+			rqt_attr->rq_list[i] = rxq->ctrl->obj->rq->id;
+		else
+			rqt_attr->rq_list[i] = rxq->devx_rq.rq->id;
 	}
 	MLX5_ASSERT(i > 0);
 	for (j = 0; i != rqt_n; ++j, ++i)
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index bb38d5d2ade..dc647d5580c 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -26,6 +26,7 @@
 #include "mlx5_rx.h"
 #include "mlx5_tx.h"
 #include "mlx5_autoconf.h"
+#include "mlx5_devx.h"
 
 /**
  * Get the interface index from device name.
@@ -336,9 +337,13 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
 	info->flow_type_rss_offloads = ~MLX5_RSS_HF_MASK;
 	mlx5_set_default_params(dev, info);
 	mlx5_set_txlimit_params(dev, info);
+	if (priv->config.hca_attr.mem_rq_rmp &&
+	    priv->obj_ops.rxq_obj_new == devx_obj_ops.rxq_obj_new)
+		info->dev_capa |= RTE_ETH_DEV_CAPA_RXQ_SHARE;
 	info->switch_info.name = dev->data->name;
 	info->switch_info.domain_id = priv->domain_id;
 	info->switch_info.port_id = priv->representor_id;
+	info->switch_info.rx_domain = 0; /* No sub Rx domains. */
 	if (priv->representor) {
 		uint16_t port_id;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 413e36f6d8d..eda6eca8dea 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -96,6 +96,7 @@ struct mlx5_rxq_data {
 	unsigned int lro:1; /* Enable LRO. */
 	unsigned int dynf_meta:1; /* Dynamic metadata is configured. */
 	unsigned int mcqe_format:3; /* CQE compression format. */
+	unsigned int shared:1; /* Shared RXQ. */
 	volatile uint32_t *rq_db;
 	volatile uint32_t *cq_db;
 	uint16_t port_id;
@@ -158,8 +159,10 @@ struct mlx5_rxq_ctrl {
 	struct mlx5_dev_ctx_shared *sh; /* Shared context. */
 	enum mlx5_rxq_type type; /* Rxq type. */
 	unsigned int socket; /* CPU socket ID for allocations. */
+	LIST_ENTRY(mlx5_rxq_ctrl) share_entry; /* Entry in shared RXQ list. */
 	uint32_t share_group; /* Group ID of shared RXQ. */
 	uint16_t share_qid; /* Shared RxQ ID in group. */
+	unsigned int started:1; /* Whether (shared) RXQ has been started. */
 	unsigned int irq:1; /* Whether IRQ is enabled. */
 	uint32_t flow_mark_n; /* Number of Mark/Flag flows using this Queue. */
 	uint32_t flow_tunnels_n[MLX5_FLOW_TUNNEL]; /* Tunnels counters. */
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index f3fc618ed2c..8feb3e2c0fb 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -29,6 +29,7 @@
 #include "mlx5_rx.h"
 #include "mlx5_utils.h"
 #include "mlx5_autoconf.h"
+#include "mlx5_devx.h"
 
 
 /* Default RSS hash key also used for ConnectX-3. */
@@ -633,14 +634,19 @@ mlx5_rx_queue_start(struct rte_eth_dev *dev, uint16_t idx)
  *   RX queue index.
  * @param desc
  *   Number of descriptors to configure in queue.
+ * @param[out] rxq_ctrl
+ *   Address of pointer to shared Rx queue control.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-mlx5_rx_queue_pre_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t *desc)
+mlx5_rx_queue_pre_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t *desc,
+			struct mlx5_rxq_ctrl **rxq_ctrl)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_rxq_priv *rxq;
+	bool empty;
 
 	if (!rte_is_power_of_2(*desc)) {
 		*desc = 1 << log2above(*desc);
@@ -657,16 +663,143 @@ mlx5_rx_queue_pre_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t *desc)
 		rte_errno = EOVERFLOW;
 		return -rte_errno;
 	}
-	if (!mlx5_rxq_releasable(dev, idx)) {
-		DRV_LOG(ERR, "port %u unable to release queue index %u",
-			dev->data->port_id, idx);
-		rte_errno = EBUSY;
-		return -rte_errno;
+	if (rxq_ctrl == NULL || *rxq_ctrl == NULL)
+		return 0;
+	if (!(*rxq_ctrl)->rxq.shared) {
+		if (!mlx5_rxq_releasable(dev, idx)) {
+			DRV_LOG(ERR, "port %u unable to release queue index %u",
+				dev->data->port_id, idx);
+			rte_errno = EBUSY;
+			return -rte_errno;
+		}
+		mlx5_rxq_release(dev, idx);
+	} else {
+		if ((*rxq_ctrl)->obj != NULL)
+			/* Some port using shared Rx queue has been started. */
+			return 0;
+		/* Release all owner RxQ to reconfigure Shared RxQ. */
+		do {
+			rxq = LIST_FIRST(&(*rxq_ctrl)->owners);
+			LIST_REMOVE(rxq, owner_entry);
+			empty = LIST_EMPTY(&(*rxq_ctrl)->owners);
+			mlx5_rxq_release(ETH_DEV(rxq->priv), rxq->idx);
+		} while (!empty);
+		*rxq_ctrl = NULL;
 	}
-	mlx5_rxq_release(dev, idx);
 	return 0;
 }
 
+/**
+ * Get the shared Rx queue object that matches group and queue index.
+ *
+ * @param dev
+ *   Pointer to Ethernet device structure.
+ * @param group
+ *   Shared RXQ group.
+ * @param share_qid
+ *   Shared RX queue index.
+ *
+ * @return
+ *   Shared RXQ object that matches, or NULL if not found.
+ */
+static struct mlx5_rxq_ctrl *
+mlx5_shared_rxq_get(struct rte_eth_dev *dev, uint32_t group, uint16_t share_qid)
+{
+	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct mlx5_priv *priv = dev->data->dev_private;
+
+	LIST_FOREACH(rxq_ctrl, &priv->sh->shared_rxqs, share_entry) {
+		if (rxq_ctrl->share_group == group &&
+		    rxq_ctrl->share_qid == share_qid)
+			return rxq_ctrl;
+	}
+	return NULL;
+}
+
+/**
+ * Check whether requested Rx queue configuration matches shared RXQ.
+ *
+ * @param rxq_ctrl
+ *   Pointer to shared RXQ.
+ * @param dev
+ *   Pointer to Ethernet device structure.
+ * @param idx
+ *   Queue index.
+ * @param desc
+ *   Number of descriptors to configure in queue.
+ * @param socket
+ *   NUMA socket on which memory must be allocated.
+ * @param[in] conf
+ *   Thresholds parameters.
+ * @param mp
+ *   Memory pool for buffer allocations.
+ *
+ * @return
+ *   true if the requested configuration matches the shared RXQ, false otherwise.
+ */
+static bool
+mlx5_shared_rxq_match(struct mlx5_rxq_ctrl *rxq_ctrl, struct rte_eth_dev *dev,
+		      uint16_t idx, uint16_t desc, unsigned int socket,
+		      const struct rte_eth_rxconf *conf,
+		      struct rte_mempool *mp)
+{
+	struct mlx5_priv *spriv = LIST_FIRST(&rxq_ctrl->owners)->priv;
+	struct mlx5_priv *priv = dev->data->dev_private;
+	unsigned int i;
+
+	RTE_SET_USED(conf);
+	if (rxq_ctrl->socket != socket) {
+		DRV_LOG(ERR, "port %u queue index %u failed to join shared group: socket mismatch",
+			dev->data->port_id, idx);
+		return false;
+	}
+	if (rxq_ctrl->rxq.elts_n != log2above(desc)) {
+		DRV_LOG(ERR, "port %u queue index %u failed to join shared group: descriptor number mismatch",
+			dev->data->port_id, idx);
+		return false;
+	}
+	if (priv->mtu != spriv->mtu) {
+		DRV_LOG(ERR, "port %u queue index %u failed to join shared group: mtu mismatch",
+			dev->data->port_id, idx);
+		return false;
+	}
+	if (priv->dev_data->dev_conf.intr_conf.rxq !=
+	    spriv->dev_data->dev_conf.intr_conf.rxq) {
+		DRV_LOG(ERR, "port %u queue index %u failed to join shared group: interrupt mismatch",
+			dev->data->port_id, idx);
+		return false;
+	}
+	if (mp != NULL && rxq_ctrl->rxq.mp != mp) {
+		DRV_LOG(ERR, "port %u queue index %u failed to join shared group: mempool mismatch",
+			dev->data->port_id, idx);
+		return false;
+	} else if (mp == NULL) {
+		for (i = 0; i < conf->rx_nseg; i++) {
+			if (conf->rx_seg[i].split.mp !=
+			    rxq_ctrl->rxq.rxseg[i].mp ||
+			    conf->rx_seg[i].split.length !=
+			    rxq_ctrl->rxq.rxseg[i].length) {
+				DRV_LOG(ERR, "port %u queue index %u failed to join shared group: segment %u configuration mismatch",
+					dev->data->port_id, idx, i);
+				return false;
+			}
+		}
+	}
+	if (priv->config.hw_padding != spriv->config.hw_padding) {
+		DRV_LOG(ERR, "port %u queue index %u failed to join shared group: padding mismatch",
+			dev->data->port_id, idx);
+		return false;
+	}
+	if (priv->config.cqe_comp != spriv->config.cqe_comp ||
+	    (priv->config.cqe_comp &&
+	     priv->config.cqe_comp_fmt != spriv->config.cqe_comp_fmt)) {
+		DRV_LOG(ERR, "port %u queue index %u failed to join shared group: CQE compression mismatch",
+			dev->data->port_id, idx);
+		return false;
+	}
+	return true;
+}
+
 /**
  *
  * @param dev
@@ -692,12 +825,14 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_rxq_priv *rxq;
-	struct mlx5_rxq_ctrl *rxq_ctrl;
+	struct mlx5_rxq_ctrl *rxq_ctrl = NULL;
 	struct rte_eth_rxseg_split *rx_seg =
 				(struct rte_eth_rxseg_split *)conf->rx_seg;
 	struct rte_eth_rxseg_split rx_single = {.mp = mp};
 	uint16_t n_seg = conf->rx_nseg;
 	int res;
+	uint64_t offloads = conf->offloads |
+			    dev->data->dev_conf.rxmode.offloads;
 
 	if (mp) {
 		/*
@@ -709,9 +844,6 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		n_seg = 1;
 	}
 	if (n_seg > 1) {
-		uint64_t offloads = conf->offloads |
-				    dev->data->dev_conf.rxmode.offloads;
-
 		/* The offloads should be checked on rte_eth_dev layer. */
 		MLX5_ASSERT(offloads & RTE_ETH_RX_OFFLOAD_SCATTER);
 		if (!(offloads & RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT)) {
@@ -723,9 +855,46 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 		}
 		MLX5_ASSERT(n_seg < MLX5_MAX_RXQ_NSEG);
 	}
-	res = mlx5_rx_queue_pre_setup(dev, idx, &desc);
+	if (conf->share_group > 0) {
+		if (!priv->config.hca_attr.mem_rq_rmp) {
+			DRV_LOG(ERR, "port %u queue index %u shared Rx queue not supported by fw",
+				     dev->data->port_id, idx);
+			rte_errno = EINVAL;
+			return -rte_errno;
+		}
+		if (priv->obj_ops.rxq_obj_new != devx_obj_ops.rxq_obj_new) {
+			DRV_LOG(ERR, "port %u queue index %u shared Rx queue needs DevX api",
+				     dev->data->port_id, idx);
+			rte_errno = EINVAL;
+			return -rte_errno;
+		}
+		if (conf->share_qid >= priv->rxqs_n) {
+			DRV_LOG(ERR, "port %u shared Rx queue index %u > number of Rx queues %u",
+				dev->data->port_id, conf->share_qid,
+				priv->rxqs_n);
+			rte_errno = EINVAL;
+			return -rte_errno;
+		}
+		if (priv->config.mprq.enabled) {
+			DRV_LOG(ERR, "port %u shared Rx queue index %u: not supported when MPRQ enabled",
+				dev->data->port_id, conf->share_qid);
+			rte_errno = EINVAL;
+			return -rte_errno;
+		}
+		/* Try to reuse shared RXQ. */
+		rxq_ctrl = mlx5_shared_rxq_get(dev, conf->share_group,
+					       conf->share_qid);
+		if (rxq_ctrl != NULL &&
+		    !mlx5_shared_rxq_match(rxq_ctrl, dev, idx, desc, socket,
+					   conf, mp)) {
+			rte_errno = EINVAL;
+			return -rte_errno;
+		}
+	}
+	res = mlx5_rx_queue_pre_setup(dev, idx, &desc, &rxq_ctrl);
 	if (res)
 		return res;
+	/* Allocate RXQ. */
 	rxq = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO, sizeof(*rxq), 0,
 			  SOCKET_ID_ANY);
 	if (!rxq) {
@@ -737,15 +906,23 @@ mlx5_rx_queue_setup(struct rte_eth_dev *dev, uint16_t idx, uint16_t desc,
 	rxq->priv = priv;
 	rxq->idx = idx;
 	(*priv->rxq_privs)[idx] = rxq;
-	rxq_ctrl = mlx5_rxq_new(dev, rxq, desc, socket, conf, rx_seg, n_seg);
-	if (!rxq_ctrl) {
-		DRV_LOG(ERR, "port %u unable to allocate rx queue index %u",
-			dev->data->port_id, idx);
-		mlx5_free(rxq);
-		(*priv->rxq_privs)[idx] = NULL;
-		rte_errno = ENOMEM;
-		return -rte_errno;
+	if (rxq_ctrl != NULL) {
+		/* Join owner list. */
+		LIST_INSERT_HEAD(&rxq_ctrl->owners, rxq, owner_entry);
+		rxq->ctrl = rxq_ctrl;
+	} else {
+		rxq_ctrl = mlx5_rxq_new(dev, rxq, desc, socket, conf, rx_seg,
+					n_seg);
+		if (rxq_ctrl == NULL) {
+			DRV_LOG(ERR, "port %u unable to allocate rx queue index %u",
+				dev->data->port_id, idx);
+			mlx5_free(rxq);
+			(*priv->rxq_privs)[idx] = NULL;
+			rte_errno = ENOMEM;
+			return -rte_errno;
+		}
 	}
+	mlx5_rxq_ref(dev, idx);
 	DRV_LOG(DEBUG, "port %u adding Rx queue %u to list",
 		dev->data->port_id, idx);
 	dev->data->rx_queues[idx] = &rxq_ctrl->rxq;
@@ -776,7 +953,7 @@ mlx5_rx_hairpin_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
 	struct mlx5_rxq_ctrl *rxq_ctrl;
 	int res;
 
-	res = mlx5_rx_queue_pre_setup(dev, idx, &desc);
+	res = mlx5_rx_queue_pre_setup(dev, idx, &desc, NULL);
 	if (res)
 		return res;
 	if (hairpin_conf->peer_count != 1) {
@@ -1095,6 +1272,9 @@ mlx5_rxq_obj_verify(struct rte_eth_dev *dev)
 	struct mlx5_rxq_obj *rxq_obj;
 
 	LIST_FOREACH(rxq_obj, &priv->rxqsobj, next) {
+		if (rxq_obj->rxq_ctrl->rxq.shared &&
+		    !LIST_EMPTY(&rxq_obj->rxq_ctrl->owners))
+			continue;
 		DRV_LOG(DEBUG, "port %u Rx queue %u still referenced",
 			dev->data->port_id, rxq_obj->rxq_ctrl->rxq.idx);
 		++ret;
@@ -1413,6 +1593,12 @@ mlx5_rxq_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
 		return NULL;
 	}
 	LIST_INIT(&tmpl->owners);
+	if (conf->share_group > 0) {
+		tmpl->rxq.shared = 1;
+		tmpl->share_group = conf->share_group;
+		tmpl->share_qid = conf->share_qid;
+		LIST_INSERT_HEAD(&priv->sh->shared_rxqs, tmpl, share_entry);
+	}
 	rxq->ctrl = tmpl;
 	LIST_INSERT_HEAD(&tmpl->owners, rxq, owner_entry);
 	MLX5_ASSERT(n_seg && n_seg <= MLX5_MAX_RXQ_NSEG);
@@ -1661,7 +1847,6 @@ mlx5_rxq_new(struct rte_eth_dev *dev, struct mlx5_rxq_priv *rxq,
 	tmpl->rxq.uar_lock_cq = &priv->sh->uar_lock_cq;
 #endif
 	tmpl->rxq.idx = idx;
-	mlx5_rxq_ref(dev, idx);
 	LIST_INSERT_HEAD(&priv->rxqsctrl, tmpl, next);
 	return tmpl;
 error:
@@ -1836,31 +2021,41 @@ mlx5_rxq_release(struct rte_eth_dev *dev, uint16_t idx)
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_rxq_priv *rxq;
 	struct mlx5_rxq_ctrl *rxq_ctrl;
+	uint32_t refcnt;
 
 	if (priv->rxq_privs == NULL)
 		return 0;
 	rxq = mlx5_rxq_get(dev, idx);
-	if (rxq == NULL)
+	if (rxq == NULL || rxq->refcnt == 0)
 		return 0;
-	if (mlx5_rxq_deref(dev, idx) > 1)
-		return 1;
 	rxq_ctrl = rxq->ctrl;
-	if (rxq_ctrl->obj != NULL) {
+	refcnt = mlx5_rxq_deref(dev, idx);
+	if (refcnt > 1) {
+		return 1;
+	} else if (refcnt == 1) { /* RxQ stopped. */
 		priv->obj_ops.rxq_obj_release(rxq);
-		LIST_REMOVE(rxq_ctrl->obj, next);
-		mlx5_free(rxq_ctrl->obj);
-		rxq_ctrl->obj = NULL;
-	}
-	if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
-		rxq_free_elts(rxq_ctrl);
-		dev->data->rx_queue_state[idx] = RTE_ETH_QUEUE_STATE_STOPPED;
-	}
-	if (!__atomic_load_n(&rxq->refcnt, __ATOMIC_RELAXED)) {
-		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD)
-			mlx5_mr_btree_free(&rxq_ctrl->rxq.mr_ctrl.cache_bh);
+		if (!rxq_ctrl->started && rxq_ctrl->obj != NULL) {
+			LIST_REMOVE(rxq_ctrl->obj, next);
+			mlx5_free(rxq_ctrl->obj);
+			rxq_ctrl->obj = NULL;
+		}
+		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
+			if (!rxq_ctrl->started)
+				rxq_free_elts(rxq_ctrl);
+			dev->data->rx_queue_state[idx] =
+					RTE_ETH_QUEUE_STATE_STOPPED;
+		}
+	} else { /* Refcnt zero, closing device. */
 		LIST_REMOVE(rxq, owner_entry);
-		LIST_REMOVE(rxq_ctrl, next);
-		mlx5_free(rxq_ctrl);
+		if (LIST_EMPTY(&rxq_ctrl->owners)) {
+			if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD)
+				mlx5_mr_btree_free
+					(&rxq_ctrl->rxq.mr_ctrl.cache_bh);
+			if (rxq_ctrl->rxq.shared)
+				LIST_REMOVE(rxq_ctrl, share_entry);
+			LIST_REMOVE(rxq_ctrl, next);
+			mlx5_free(rxq_ctrl);
+		}
 		dev->data->rx_queues[idx] = NULL;
 		mlx5_free(rxq);
 		(*priv->rxq_privs)[idx] = NULL;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 72475e4b5b5..a3e62e95335 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -176,6 +176,39 @@ mlx5_rxq_stop(struct rte_eth_dev *dev)
 		mlx5_rxq_release(dev, i);
 }
 
+static int
+mlx5_rxq_ctrl_prepare(struct rte_eth_dev *dev, struct mlx5_rxq_ctrl *rxq_ctrl,
+		      unsigned int idx)
+{
+	int ret = 0;
+
+	if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
+		/*
+		 * Pre-register the mempools. Regardless of whether
+		 * the implicit registration is enabled or not,
+		 * Rx mempool destruction is tracked to free MRs.
+		 */
+		if (mlx5_rxq_mempool_register(dev, rxq_ctrl) < 0)
+			return -rte_errno;
+		ret = rxq_alloc_elts(rxq_ctrl);
+		if (ret)
+			return ret;
+	}
+	MLX5_ASSERT(!rxq_ctrl->obj);
+	rxq_ctrl->obj = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
+				    sizeof(*rxq_ctrl->obj), 0,
+				    rxq_ctrl->socket);
+	if (!rxq_ctrl->obj) {
+		DRV_LOG(ERR, "Port %u Rx queue %u can't allocate resources.",
+			dev->data->port_id, idx);
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
+	DRV_LOG(DEBUG, "Port %u rxq %u updated with %p.", dev->data->port_id,
+		idx, (void *)&rxq_ctrl->obj);
+	return 0;
+}
+
 /**
  * Start traffic on Rx queues.
  *
@@ -208,28 +241,10 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 		if (rxq == NULL)
 			continue;
 		rxq_ctrl = rxq->ctrl;
-		if (rxq_ctrl->type == MLX5_RXQ_TYPE_STANDARD) {
-			/*
-			 * Pre-register the mempools. Regardless of whether
-			 * the implicit registration is enabled or not,
-			 * Rx mempool destruction is tracked to free MRs.
-			 */
-			if (mlx5_rxq_mempool_register(dev, rxq_ctrl) < 0)
-				goto error;
-			ret = rxq_alloc_elts(rxq_ctrl);
-			if (ret)
+		if (!rxq_ctrl->started) {
+			if (mlx5_rxq_ctrl_prepare(dev, rxq_ctrl, i) < 0)
 				goto error;
-		}
-		MLX5_ASSERT(!rxq_ctrl->obj);
-		rxq_ctrl->obj = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
-					    sizeof(*rxq_ctrl->obj), 0,
-					    rxq_ctrl->socket);
-		if (!rxq_ctrl->obj) {
-			DRV_LOG(ERR,
-				"Port %u Rx queue %u can't allocate resources.",
-				dev->data->port_id, i);
-			rte_errno = ENOMEM;
-			goto error;
+			LIST_INSERT_HEAD(&priv->rxqsobj, rxq_ctrl->obj, next);
 		}
 		ret = priv->obj_ops.rxq_obj_new(rxq);
 		if (ret) {
@@ -237,9 +252,7 @@ mlx5_rxq_start(struct rte_eth_dev *dev)
 			rxq_ctrl->obj = NULL;
 			goto error;
 		}
-		DRV_LOG(DEBUG, "Port %u rxq %u updated with %p.",
-			dev->data->port_id, i, (void *)&rxq_ctrl->obj);
-		LIST_INSERT_HEAD(&priv->rxqsobj, rxq_ctrl->obj, next);
+		rxq_ctrl->started = true;
 	}
 	return 0;
 error:
-- 
2.33.0
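
As a rough illustration of the counters note in the mlx5.rst hunk above
(a minimal sketch; port IDs are hypothetical): querying any member
device of a share group is expected to return the same group-level Rx
counters.

#include <rte_ethdev.h>

/* port_a and port_b are assumed to have joined the same share group. */
static void
check_shared_group_stats(uint16_t port_a, uint16_t port_b)
{
	struct rte_eth_stats sa, sb;

	if (rte_eth_stats_get(port_a, &sa) != 0 ||
	    rte_eth_stats_get(port_b, &sb) != 0)
		return;
	/* Per the note above, sa.ipackets == sb.ipackets and
	 * sa.ibytes == sb.ibytes are expected here.
	 */
}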


^ permalink raw reply	[flat|nested] 266+ messages in thread

* [dpdk-dev] [PATCH v4 14/14] net/mlx5: add shared Rx queue port datapath support
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
                     ` (12 preceding siblings ...)
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 13/14] net/mlx5: support shared Rx queue Xueming Li
@ 2021-11-04 12:33   ` Xueming Li
  2021-11-04 17:50     ` David Christensen
  2021-11-05  6:40     ` Ruifeng Wang
  2021-11-04 20:06   ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Raslan Darawsheh
  14 siblings, 2 replies; 266+ messages in thread
From: Xueming Li @ 2021-11-04 12:33 UTC (permalink / raw)
  To: dev
  Cc: Viacheslav Ovsiienko, xuemingl, Lior Margalit, Matan Azrad,
	David Christensen, Ruifeng Wang, Bruce Richardson,
	Konstantin Ananyev

From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

When receiving a packet, the mlx5 PMD takes the mbuf port number from
the RxQ data.

To support shared RxQ, save the port number into the RQ context as a
user index. A received packet then resolves its port number from the
CQE user index, which is derived from the RQ context.

The legacy Verbs API doesn't support setting the RQ user index, so the
port number is still read from the RxQ.
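
For illustration (a minimal sketch, not part of this patch): on the
application side the resolved port shows up in mbuf->port, so a burst
polled on any one member port of a share group can be demultiplexed as
below (port/queue numbers and the dispatch step are hypothetical):

#include <rte_common.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SZ 32

static void
poll_shared_rxq(uint16_t member_port, uint16_t queue_id)
{
	struct rte_mbuf *pkts[BURST_SZ];
	uint16_t nb, i;

	/* Packets of all member ports arrive on this one queue. */
	nb = rte_eth_rx_burst(member_port, queue_id, pkts, BURST_SZ);
	for (i = 0; i < nb; i++) {
		/* Real source port, resolved from the CQE user index. */
		uint16_t src_port = pkts[i]->port;

		RTE_SET_USED(src_port); /* dispatch on src_port here */
		rte_pktmbuf_free(pkts[i]);
	}
}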

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/mlx5_devx.c             |  1 +
 drivers/net/mlx5/mlx5_rx.c               |  1 +
 drivers/net/mlx5/mlx5_rxq.c              |  3 ++-
 drivers/net/mlx5/mlx5_rxtx_vec_altivec.h |  6 ++++++
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h    | 12 +++++++++++-
 drivers/net/mlx5/mlx5_rxtx_vec_sse.h     |  8 +++++++-
 6 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index d3d189ab7f2..a9f9f4af700 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -277,6 +277,7 @@ mlx5_rxq_create_devx_rq_resources(struct mlx5_rxq_priv *rxq)
 						MLX5_WQ_END_PAD_MODE_NONE;
 	rq_attr.wq_attr.pd = cdev->pdn;
 	rq_attr.counter_set_id = priv->counter_set_id;
+	rq_attr.user_index = rte_cpu_to_be_16(priv->dev_data->port_id);
 	if (rxq_data->shared) /* Create RMP based RQ. */
 		rxq->devx_rq.rmp = &rxq_ctrl->obj->devx_rmp;
 	/* Create RQ using DevX API. */
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 1ffa1b95b88..4d85f64accd 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -709,6 +709,7 @@ rxq_cq_to_mbuf(struct mlx5_rxq_data *rxq, struct rte_mbuf *pkt,
 {
 	/* Update packet information. */
 	pkt->packet_type = rxq_cq_to_pkt_type(rxq, cqe, mcqe);
+	pkt->port = unlikely(rxq->shared) ? cqe->user_index_low : rxq->port_id;
 
 	if (rxq->rss_hash) {
 		uint32_t rss_hash_res = 0;
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 8feb3e2c0fb..4515d531835 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -186,7 +186,8 @@ rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
 		mbuf_init->data_off = RTE_PKTMBUF_HEADROOM;
 		rte_mbuf_refcnt_set(mbuf_init, 1);
 		mbuf_init->nb_segs = 1;
-		mbuf_init->port = rxq->port_id;
+		/* For shared queues port is provided in CQE */
+		mbuf_init->port = rxq->shared ? 0 : rxq->port_id;
 		if (priv->flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
 			mbuf_init->ol_flags = RTE_MBUF_F_EXTERNAL;
 		/*
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
index 1d00c1c43d1..423e229508c 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
@@ -1189,6 +1189,12 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 
 		/* D.5 fill in mbuf - rearm_data and packet_type. */
 		rxq_cq_to_ptype_oflags_v(rxq, cqes, opcode, &pkts[pos]);
+		if (unlikely(rxq->shared)) {
+			pkts[pos]->port = cq[pos].user_index_low;
+			pkts[pos + p1]->port = cq[pos + p1].user_index_low;
+			pkts[pos + p2]->port = cq[pos + p2].user_index_low;
+			pkts[pos + p3]->port = cq[pos + p3].user_index_low;
+		}
 		if (rxq->hw_timestamp) {
 			int offset = rxq->timestamp_offset;
 			if (rxq->rt_timestamp) {
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
index aa36df29a09..b1d16baa619 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -787,7 +787,17 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		/* C.4 fill in mbuf - rearm_data and packet_type. */
 		rxq_cq_to_ptype_oflags_v(rxq, ptype_info, flow_tag,
 					 opcode, &elts[pos]);
-		if (rxq->hw_timestamp) {
+		if (unlikely(rxq->shared)) {
+			elts[pos]->port = container_of(p0, struct mlx5_cqe,
+					      pkt_info)->user_index_low;
+			elts[pos + 1]->port = container_of(p1, struct mlx5_cqe,
+					      pkt_info)->user_index_low;
+			elts[pos + 2]->port = container_of(p2, struct mlx5_cqe,
+					      pkt_info)->user_index_low;
+			elts[pos + 3]->port = container_of(p3, struct mlx5_cqe,
+					      pkt_info)->user_index_low;
+		}
+		if (unlikely(rxq->hw_timestamp)) {
 			int offset = rxq->timestamp_offset;
 			if (rxq->rt_timestamp) {
 				struct mlx5_dev_ctx_shared *sh = rxq->sh;
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
index b0fc29d7b9e..f3d838389e2 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
@@ -736,7 +736,13 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		*err |= _mm_cvtsi128_si64(opcode);
 		/* D.5 fill in mbuf - rearm_data and packet_type. */
 		rxq_cq_to_ptype_oflags_v(rxq, cqes, opcode, &pkts[pos]);
-		if (rxq->hw_timestamp) {
+		if (unlikely(rxq->shared)) {
+			pkts[pos]->port = cq[pos].user_index_low;
+			pkts[pos + p1]->port = cq[pos + p1].user_index_low;
+			pkts[pos + p2]->port = cq[pos + p2].user_index_low;
+			pkts[pos + p3]->port = cq[pos + p3].user_index_low;
+		}
+		if (unlikely(rxq->hw_timestamp)) {
 			int offset = rxq->timestamp_offset;
 			if (rxq->rt_timestamp) {
 				struct mlx5_dev_ctx_shared *sh = rxq->sh;
-- 
2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v13 0/7] ethdev: introduce shared Rx queue
  2021-10-21 10:41 ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Xueming Li
                     ` (7 preceding siblings ...)
  2021-10-21 23:41   ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Ferruh Yigit
@ 2021-11-04 15:52   ` Tom Barbette
  8 siblings, 0 replies; 266+ messages in thread
From: Tom Barbette @ 2021-11-04 15:52 UTC (permalink / raw)
  To: Xueming Li, dev, Zhang Yuying, Li Xiaoyun
  Cc: Jerin Jacob, Ferruh Yigit, Andrew Rybchenko,
	Viacheslav Ovsiienko, Thomas Monjalon, Lior Margalit,
	Ananyev Konstantin, Ajit Khaparde

Le 21-10-21 à 12:41, Xueming Li a écrit :
> In the current DPDK framework, all Rx queues are pre-loaded with mbufs
> for incoming packets. When the number of representors scales out in a
> switch domain, the memory consumption becomes significant. Furthermore,
> polling all ports leads to high cache miss rates, high latency and low
> throughput.
>
> This patch introduces shared Rx queue. A PF and representors in the same
> Rx domain and switch domain can share an Rx queue set by specifying a
> non-zero share group value in the Rx queue configuration.
>
> All ports that share an Rx queue actually share the hardware descriptor
> queue and feed all Rx queues from one descriptor supply, so memory is saved.
>
> Polling any queue using the same shared Rx queue receives packets from all
> member ports. The source port is identified by mbuf->port.
>
> Multiple groups are supported by group ID. The queue number of each port
> in a shared group should be identical. Queue indexes are 1:1 mapped
> within a shared group.
> An example of two share groups:
>   Group1, 4 shared Rx queues per member port: PF, repr0, repr1
>   Group2, 2 shared Rx queues per member port: repr2, repr3, ... repr127
>   Poll first port for each group:
>    core	port	queue
>    0	0	0
>    1	0	1
>    2	0	2
>    3	0	3
>    4	2	0
>    5	2	1
>
> A shared Rx queue must be polled on a single thread or core. If both PF0
> and representor0 join the same share group, pf0rxq0 can't be polled on
> core1 and rep0rxq0 on core2. Actually, polling one port within a share
> group is sufficient, since polling any port in the group returns packets
> for all ports in the group.
>
> There was some discussion about aggregating member ports in the same
> group into a dummy port, with several ways to achieve it. Since it is
> optional, we need to collect more feedback and requirements from users
> and make a better decision later.
>
> v1:
>    - initial version
> v2:
>    - add testpmd patches
> v3:
>    - change common forwarding api to macro for performance, thanks Jerin.
>    - save global variable accessed in forwarding to flowstream to minimize
>      cache miss
>    - combined patches for each forwarding engine
>    - support multiple groups in testpmd "--share-rxq" parameter
>    - new api to aggregate shared rxq group
> v4:
>    - spelling fixes
>    - remove shared-rxq support for all forwarding engines
>    - add dedicate shared-rxq forwarding engine
> v5:
>   - fix grammars
>   - remove aggregate api and leave it for later discussion
>   - add release notes
>   - add deployment example
> v6:
>   - replace RxQ offload flag with device offload capability flag
>   - add Rx domain
>   - RxQ is shared when share group > 0
>   - update testpmd accordingly
> v7:
>   - fix testpmd share group id allocation
>   - change rx_domain to 16bits
> v8:
>   - add new patch for testpmd to show device Rx domain ID and capability
>   - new share_qid in RxQ configuration
> v9:
>   - fix some spelling
> v10:
>   - add device capability name api
> v11:
>   - remove macro from device capability name list
> v12:
>   - rephrase
>   - in forwarding core check, add  global flag and RxQ enabled check
> v13:
>   - update imports of new forwarding engine
>   - rephrase
>
> Xueming Li (7):
>    ethdev: introduce shared Rx queue
>    ethdev: get device capability name as string
>    app/testpmd: dump device capability and Rx domain info
>    app/testpmd: new parameter to enable shared Rx queue
>    app/testpmd: dump port info for shared Rx queue
>    app/testpmd: force shared Rx queue polled on same core
>    app/testpmd: add forwarding engine for shared Rx queue
>
>   app/test-pmd/config.c                         | 141 +++++++++++++++++-
>   app/test-pmd/meson.build                      |   1 +
>   app/test-pmd/parameters.c                     |  13 ++
>   app/test-pmd/shared_rxq_fwd.c                 | 115 ++++++++++++++
>   app/test-pmd/testpmd.c                        |  26 +++-
>   app/test-pmd/testpmd.h                        |   5 +
>   app/test-pmd/util.c                           |   3 +
>   doc/guides/nics/features.rst                  |  13 ++
>   doc/guides/nics/features/default.ini          |   1 +
>   .../prog_guide/switch_representation.rst      |  11 ++
>   doc/guides/rel_notes/release_21_11.rst        |   6 +
>   doc/guides/testpmd_app_ug/run_app.rst         |   9 ++
>   doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
>   lib/ethdev/rte_ethdev.c                       |  33 ++++
>   lib/ethdev/rte_ethdev.h                       |  38 +++++
>   lib/ethdev/version.map                        |   1 +
>   16 files changed, 415 insertions(+), 6 deletions(-)
>   create mode 100644 app/test-pmd/shared_rxq_fwd.c

Hi all,

Sorry to jump in this late, but I think this solves only a consequence of 
another "problem": the fact that the mbuf descriptor is coupled with the 
buffer. You might want to consider another approach that does not 
require an API change.

The problem (partially solved by this patch) is that you'll "touch" many 
descriptors (the rte_mbuf itself) if you have many queues, or even a few 
queues with quite large rings. Those descriptors will all likely be out 
of cache when you access them. However, as we demonstrated with mlx5 
(see https://packetmill.io/), you can build a descriptor from scratch out 
of the NIC hw ring that points to the underlying buffer in an indirect 
way. This descriptor can be taken out of a thread-local buffer pool. 
You'll actually keep only as many mbuf descriptors in flight as your 
burst size, which probably even defeats what this patch can do, as you 
can actually use only 32 descriptors per thread for any number of queues 
of any size.
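
To make that idea concrete, a minimal sketch of such an indirect
descriptor, assuming a small thread-local mempool and DPDK's external
buffer attachment (names are illustrative; this is not the actual
PacketMill implementation):

#include <rte_mbuf.h>

static struct rte_mbuf *
build_desc(struct rte_mempool *local_pool, void *pkt_buf,
	   rte_iova_t pkt_iova, uint16_t buf_len, uint16_t pkt_len,
	   struct rte_mbuf_ext_shared_info *shinfo)
{
	/* The local pool holds only ~burst-size descriptors, so the
	 * rte_mbuf itself stays hot in cache regardless of ring sizes.
	 */
	struct rte_mbuf *m = rte_pktmbuf_alloc(local_pool);

	if (m == NULL)
		return NULL;
	/* Point the descriptor at the buffer already filled by the NIC. */
	rte_pktmbuf_attach_extbuf(m, pkt_buf, pkt_iova, buf_len, shinfo);
	m->pkt_len = pkt_len;
	m->data_len = pkt_len;
	return m;
}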

What that solution does not solve is the need to poll many different 
queues. I think that is orthogonal: with NICs getting smarter, we're 
going to have many rules sending traffic to per-application, 
per-priority queues anyway, maybe even per-microflow. To solve this we 
would need a kind of queue bitmask set in hw to indicate which queues to 
poll instead of trying all of them. Maybe this can be done through a FW 
update? It's a feature we'll want in the future in any case.

The shared RX queue is surely an easy fix for the polling itself, but 
one problem of the shared RX queue is that it will lead to scattered 
batches. We'll get batches of packets from all ports that will surely 
take different code paths for anything above forwarding, breaking the 
benefit of batching (this can also lead to up to a 50% performance 
penalty due to interleaved bursts, see 
https://people.kth.se/~dejanko/documents/publications/ordermatters-nsdi22.pdf). 
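
To illustrate one possible mitigation, a rough sketch (names and sizes
hypothetical) of regrouping an interleaved shared-RxQ burst into
per-port sub-batches before further processing:

#include <string.h>
#include <rte_mbuf.h>

#define MAX_PORTS 8
#define BURST_SZ  32

static void
regroup_by_port(struct rte_mbuf **pkts, uint16_t nb,
		struct rte_mbuf *out[MAX_PORTS][BURST_SZ],
		uint16_t cnt[MAX_PORTS])
{
	uint16_t i;

	memset(cnt, 0, MAX_PORTS * sizeof(cnt[0]));
	for (i = 0; i < nb; i++) {
		uint16_t p = pkts[i]->port;

		/* Later stages then see homogeneous per-port runs. */
		if (p < MAX_PORTS)
			out[p][cnt[p]++] = pkts[i];
	}
}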


Cheers,

Tom



^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/14] net/mlx5: fix field reference for PPC
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 02/14] net/mlx5: fix field reference for PPC Xueming Li
@ 2021-11-04 17:07     ` Raslan Darawsheh
  2021-11-04 17:49     ` David Christensen
  1 sibling, 0 replies; 266+ messages in thread
From: Raslan Darawsheh @ 2021-11-04 17:07 UTC (permalink / raw)
  To: Xueming(Steven) Li, dev
  Cc: Xueming(Steven) Li, Lior Margalit, Slava Ovsiienko, stable,
	David Christensen, Matan Azrad, Yongseok Koh

Hi,

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Xueming Li
> Sent: Thursday, November 4, 2021 2:33 PM
> To: dev@dpdk.org
> Cc: Xueming(Steven) Li <xuemingl@nvidia.com>; Lior Margalit
> <lmargalit@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>;
> stable@dpdk.org; David Christensen <drc@linux.vnet.ibm.com>; Matan
> Azrad <matan@nvidia.com>; Yongseok Koh <yskoh@mellanox.com>
> Subject: [dpdk-dev] [PATCH v4 02/14] net/mlx5: fix field reference for PPC
> 
> This patch fixes stale field reference.
> 
> Fixes: a18ac6113331 ("net/mlx5: add metadata support to Rx datapath")
> Cc: viacheslavo@nvidia.com
> Cc: stable@dpdk.org
> 

This should be the first patch in the series, since the current first patch removes rsvd3 from the structure.
I'll change the order during integration.

Kindest regards,
Raslan Darawsheh

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/14] net/mlx5: fix field reference for PPC
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 02/14] net/mlx5: fix field reference for PPC Xueming Li
  2021-11-04 17:07     ` Raslan Darawsheh
@ 2021-11-04 17:49     ` David Christensen
  1 sibling, 0 replies; 266+ messages in thread
From: David Christensen @ 2021-11-04 17:49 UTC (permalink / raw)
  To: Xueming Li, dev
  Cc: Lior Margalit, viacheslavo, stable, Matan Azrad, Yongseok Koh



On 11/4/21 5:33 AM, Xueming Li wrote:
> This patch fixes stale field reference.
> 
> Fixes: a18ac6113331 ("net/mlx5: add metadata support to Rx datapath")
> Cc: viacheslavo@nvidia.com
> Cc: stable@dpdk.org
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
> ---
>   drivers/net/mlx5/mlx5_rxtx_vec_altivec.h | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
> index bcf487c34e9..1d00c1c43d1 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
> @@ -974,10 +974,10 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
>   			(vector unsigned short)cqe_tmp1, cqe_sel_mask1);
>   		cqe_tmp2 = (vector unsigned char)(vector unsigned long){
>   			*(__rte_aligned(8) unsigned long *)
> -			&cq[pos + p3].rsvd3[9], 0LL};
> +			&cq[pos + p3].rsvd4[2], 0LL};
>   		cqe_tmp1 = (vector unsigned char)(vector unsigned long){
>   			*(__rte_aligned(8) unsigned long *)
> -			&cq[pos + p2].rsvd3[9], 0LL};
> +			&cq[pos + p2].rsvd4[2], 0LL};
>   		cqes[3] = (vector unsigned char)
>   			vec_sel((vector unsigned short)cqes[3],
>   			(vector unsigned short)cqe_tmp2,
> @@ -1037,10 +1037,10 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
>   			(vector unsigned short)cqe_tmp1, cqe_sel_mask1);
>   		cqe_tmp2 = (vector unsigned char)(vector unsigned long){
>   			*(__rte_aligned(8) unsigned long *)
> -			&cq[pos + p1].rsvd3[9], 0LL};
> +			&cq[pos + p1].rsvd4[2], 0LL};
>   		cqe_tmp1 = (vector unsigned char)(vector unsigned long){
>   			*(__rte_aligned(8) unsigned long *)
> -			&cq[pos].rsvd3[9], 0LL};
> +			&cq[pos].rsvd4[2], 0LL};
>   		cqes[1] = (vector unsigned char)
>   			vec_sel((vector unsigned short)cqes[1],
>   			(vector unsigned short)cqe_tmp2, cqe_sel_mask2);
> 

Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v4 14/14] net/mlx5: add shared Rx queue port datapath support
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 14/14] net/mlx5: add shared Rx queue port datapath support Xueming Li
@ 2021-11-04 17:50     ` David Christensen
  2021-11-05  6:40     ` Ruifeng Wang
  1 sibling, 0 replies; 266+ messages in thread
From: David Christensen @ 2021-11-04 17:50 UTC (permalink / raw)
  To: Xueming Li, dev
  Cc: Viacheslav Ovsiienko, Lior Margalit, Matan Azrad, Ruifeng Wang,
	Bruce Richardson, Konstantin Ananyev



On 11/4/21 5:33 AM, Xueming Li wrote:
> From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> 
> When receive packet, mlx5 PMD saves mbuf port number from
> RxQ data.
> 
> To support shared RxQ, save port number into RQ context as user index.
> Received packet resolve port number from CQE user index which derived
> from RQ context.
> 
> Legacy Verbs API doesn't support RQ user index setting, still read from
> RxQ port number.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
> ---
>   drivers/net/mlx5/mlx5_devx.c             |  1 +
>   drivers/net/mlx5/mlx5_rx.c               |  1 +
>   drivers/net/mlx5/mlx5_rxq.c              |  3 ++-
>   drivers/net/mlx5/mlx5_rxtx_vec_altivec.h |  6 ++++++
>   drivers/net/mlx5/mlx5_rxtx_vec_neon.h    | 12 +++++++++++-
>   drivers/net/mlx5/mlx5_rxtx_vec_sse.h     |  8 +++++++-
>   6 files changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
> index d3d189ab7f2..a9f9f4af700 100644
> --- a/drivers/net/mlx5/mlx5_devx.c
> +++ b/drivers/net/mlx5/mlx5_devx.c
> @@ -277,6 +277,7 @@ mlx5_rxq_create_devx_rq_resources(struct mlx5_rxq_priv *rxq)
>   						MLX5_WQ_END_PAD_MODE_NONE;
>   	rq_attr.wq_attr.pd = cdev->pdn;
>   	rq_attr.counter_set_id = priv->counter_set_id;
> +	rq_attr.user_index = rte_cpu_to_be_16(priv->dev_data->port_id);
>   	if (rxq_data->shared) /* Create RMP based RQ. */
>   		rxq->devx_rq.rmp = &rxq_ctrl->obj->devx_rmp;
>   	/* Create RQ using DevX API. */
> diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
> index 1ffa1b95b88..4d85f64accd 100644
> --- a/drivers/net/mlx5/mlx5_rx.c
> +++ b/drivers/net/mlx5/mlx5_rx.c
> @@ -709,6 +709,7 @@ rxq_cq_to_mbuf(struct mlx5_rxq_data *rxq, struct rte_mbuf *pkt,
>   {
>   	/* Update packet information. */
>   	pkt->packet_type = rxq_cq_to_pkt_type(rxq, cqe, mcqe);
> +	pkt->port = unlikely(rxq->shared) ? cqe->user_index_low : rxq->port_id;
> 
>   	if (rxq->rss_hash) {
>   		uint32_t rss_hash_res = 0;
> diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
> index 8feb3e2c0fb..4515d531835 100644
> --- a/drivers/net/mlx5/mlx5_rxq.c
> +++ b/drivers/net/mlx5/mlx5_rxq.c
> @@ -186,7 +186,8 @@ rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
>   		mbuf_init->data_off = RTE_PKTMBUF_HEADROOM;
>   		rte_mbuf_refcnt_set(mbuf_init, 1);
>   		mbuf_init->nb_segs = 1;
> -		mbuf_init->port = rxq->port_id;
> +		/* For shared queues port is provided in CQE */
> +		mbuf_init->port = rxq->shared ? 0 : rxq->port_id;
>   		if (priv->flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
>   			mbuf_init->ol_flags = RTE_MBUF_F_EXTERNAL;
>   		/*
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
> index 1d00c1c43d1..423e229508c 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
> @@ -1189,6 +1189,12 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
> 
>   		/* D.5 fill in mbuf - rearm_data and packet_type. */
>   		rxq_cq_to_ptype_oflags_v(rxq, cqes, opcode, &pkts[pos]);
> +		if (unlikely(rxq->shared)) {
> +			pkts[pos]->port = cq[pos].user_index_low;
> +			pkts[pos + p1]->port = cq[pos + p1].user_index_low;
> +			pkts[pos + p2]->port = cq[pos + p2].user_index_low;
> +			pkts[pos + p3]->port = cq[pos + p3].user_index_low;
> +		}
>   		if (rxq->hw_timestamp) {
>   			int offset = rxq->timestamp_offset;
>   			if (rxq->rt_timestamp) {
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> index aa36df29a09..b1d16baa619 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> @@ -787,7 +787,17 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
>   		/* C.4 fill in mbuf - rearm_data and packet_type. */
>   		rxq_cq_to_ptype_oflags_v(rxq, ptype_info, flow_tag,
>   					 opcode, &elts[pos]);
> -		if (rxq->hw_timestamp) {
> +		if (unlikely(rxq->shared)) {
> +			elts[pos]->port = container_of(p0, struct mlx5_cqe,
> +					      pkt_info)->user_index_low;
> +			elts[pos + 1]->port = container_of(p1, struct mlx5_cqe,
> +					      pkt_info)->user_index_low;
> +			elts[pos + 2]->port = container_of(p2, struct mlx5_cqe,
> +					      pkt_info)->user_index_low;
> +			elts[pos + 3]->port = container_of(p3, struct mlx5_cqe,
> +					      pkt_info)->user_index_low;
> +		}
> +		if (unlikely(rxq->hw_timestamp)) {
>   			int offset = rxq->timestamp_offset;
>   			if (rxq->rt_timestamp) {
>   				struct mlx5_dev_ctx_shared *sh = rxq->sh;
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
> index b0fc29d7b9e..f3d838389e2 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
> @@ -736,7 +736,13 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
>   		*err |= _mm_cvtsi128_si64(opcode);
>   		/* D.5 fill in mbuf - rearm_data and packet_type. */
>   		rxq_cq_to_ptype_oflags_v(rxq, cqes, opcode, &pkts[pos]);
> -		if (rxq->hw_timestamp) {
> +		if (unlikely(rxq->shared)) {
> +			pkts[pos]->port = cq[pos].user_index_low;
> +			pkts[pos + p1]->port = cq[pos + p1].user_index_low;
> +			pkts[pos + p2]->port = cq[pos + p2].user_index_low;
> +			pkts[pos + p3]->port = cq[pos + p3].user_index_low;
> +		}
> +		if (unlikely(rxq->hw_timestamp)) {
>   			int offset = rxq->timestamp_offset;
>   			if (rxq->rt_timestamp) {
>   				struct mlx5_dev_ctx_shared *sh = rxq->sh;
> 

Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue
  2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
                     ` (13 preceding siblings ...)
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 14/14] net/mlx5: add shared Rx queue port datapath support Xueming Li
@ 2021-11-04 20:06   ` Raslan Darawsheh
  14 siblings, 0 replies; 266+ messages in thread
From: Raslan Darawsheh @ 2021-11-04 20:06 UTC (permalink / raw)
  To: Xueming(Steven) Li, dev; +Cc: Xueming(Steven) Li, Lior Margalit

Hi,

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Xueming Li
> Sent: Thursday, November 4, 2021 2:33 PM
> To: dev@dpdk.org
> Cc: Xueming(Steven) Li <xuemingl@nvidia.com>; Lior Margalit
> <lmargalit@nvidia.com>
> Subject: [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue
> 
> Implementation of Shared Rx queue.
> 
> v1:
> - initial version
> v2:
> - rebased on latest dependent series
> - fully tested
> - support share_qid of RxQ configuration
> v3:
> - internally reviewed
> - removed MPRQ support
> - fixed multi-segment support
> - fixed configure not applied after port restart
> v4:
> - rebase with latest code
> 
> Viacheslav Ovsiienko (1):
>   net/mlx5: add shared Rx queue port datapath support
> 
> Xueming Li (13):
>   common/mlx5: introduce user index field in completion
>   net/mlx5: fix field reference for PPC
>   common/mlx5: adds basic receive memory pool support
>   common/mlx5: support receive memory pool
>   net/mlx5: fix Rx queue memory allocation return value
>   net/mlx5: clean Rx queue code
>   net/mlx5: split Rx queue into shareable and private
>   net/mlx5: move Rx queue reference count
>   net/mlx5: move Rx queue hairpin info to private data
>   net/mlx5: remove port info from shareable Rx queue
>   net/mlx5: move Rx queue DevX resource
>   net/mlx5: remove Rx queue data list from device
>   net/mlx5: support shared Rx queue
> 
>  doc/guides/nics/features/mlx5.ini        |   1 +
>  doc/guides/nics/mlx5.rst                 |   6 +
>  drivers/common/mlx5/mlx5_common_devx.c   | 295 +++++++++--
>  drivers/common/mlx5/mlx5_common_devx.h   |  19 +-
>  drivers/common/mlx5/mlx5_devx_cmds.c     |  52 ++
>  drivers/common/mlx5/mlx5_devx_cmds.h     |  16 +
>  drivers/common/mlx5/mlx5_prm.h           |  93 +++-
>  drivers/common/mlx5/version.map          |   1 +
>  drivers/net/mlx5/linux/mlx5_os.c         |   2 +
>  drivers/net/mlx5/linux/mlx5_verbs.c      | 169 +++---
>  drivers/net/mlx5/mlx5.c                  |  10 +-
>  drivers/net/mlx5/mlx5.h                  |  17 +-
>  drivers/net/mlx5/mlx5_devx.c             | 270 +++++-----
>  drivers/net/mlx5/mlx5_ethdev.c           |  21 +-
>  drivers/net/mlx5/mlx5_flow.c             |  47 +-
>  drivers/net/mlx5/mlx5_rss.c              |   6 +-
>  drivers/net/mlx5/mlx5_rx.c               |  31 +-
>  drivers/net/mlx5/mlx5_rx.h               |  45 +-
>  drivers/net/mlx5/mlx5_rxq.c              | 630 +++++++++++++++++------
>  drivers/net/mlx5/mlx5_rxtx.c             |   6 +-
>  drivers/net/mlx5/mlx5_rxtx_vec.c         |   8 +-
>  drivers/net/mlx5/mlx5_rxtx_vec_altivec.h |  14 +-
>  drivers/net/mlx5/mlx5_rxtx_vec_neon.h    |  12 +-
>  drivers/net/mlx5/mlx5_rxtx_vec_sse.h     |   8 +-
>  drivers/net/mlx5/mlx5_stats.c            |   9 +-
>  drivers/net/mlx5/mlx5_trigger.c          | 155 +++---
>  drivers/net/mlx5/mlx5_vlan.c             |  16 +-
>  drivers/regex/mlx5/mlx5_regex_fastpath.c |   2 +-
>  28 files changed, 1377 insertions(+), 584 deletions(-)
> 
> --
> 2.33.0

Series applied to next-net-mlx,

Kindest regards,
Raslan Darawsheh

^ permalink raw reply	[flat|nested] 266+ messages in thread

* Re: [dpdk-dev] [PATCH v4 14/14] net/mlx5: add shared Rx queue port datapath support
  2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 14/14] net/mlx5: add shared Rx queue port datapath support Xueming Li
  2021-11-04 17:50     ` David Christensen
@ 2021-11-05  6:40     ` Ruifeng Wang
  1 sibling, 0 replies; 266+ messages in thread
From: Ruifeng Wang @ 2021-11-05  6:40 UTC (permalink / raw)
  To: Xueming Li, dev
  Cc: Viacheslav Ovsiienko, Lior Margalit, Matan Azrad,
	David Christensen, Bruce Richardson, Konstantin Ananyev, nd

> -----Original Message-----
> From: Xueming Li <xuemingl@nvidia.com>
> Sent: Thursday, November 4, 2021 8:33 PM
> To: dev@dpdk.org
> Cc: Viacheslav Ovsiienko <viacheslavo@nvidia.com>; xuemingl@nvidia.com;
> Lior Margalit <lmargalit@nvidia.com>; Matan Azrad <matan@nvidia.com>;
> David Christensen <drc@linux.vnet.ibm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Bruce Richardson
> <bruce.richardson@intel.com>; Konstantin Ananyev
> <konstantin.ananyev@intel.com>
> Subject: [PATCH v4 14/14] net/mlx5: add shared Rx queue port datapath
> support
> 
> From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> 
> When receive packet, mlx5 PMD saves mbuf port number from RxQ data.
> 
> To support shared RxQ, save port number into RQ context as user index.
> Received packet resolve port number from CQE user index which derived
> from RQ context.
> 
> Legacy Verbs API doesn't support RQ user index setting, still read from RxQ
> port number.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> Acked-by: Slava Ovsiienko <viacheslavo@nvidia.com>
> ---
>  drivers/net/mlx5/mlx5_devx.c             |  1 +
>  drivers/net/mlx5/mlx5_rx.c               |  1 +
>  drivers/net/mlx5/mlx5_rxq.c              |  3 ++-
>  drivers/net/mlx5/mlx5_rxtx_vec_altivec.h |  6 ++++++
>  drivers/net/mlx5/mlx5_rxtx_vec_neon.h    | 12 +++++++++++-
>  drivers/net/mlx5/mlx5_rxtx_vec_sse.h     |  8 +++++++-
>  6 files changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
> index d3d189ab7f2..a9f9f4af700 100644
> --- a/drivers/net/mlx5/mlx5_devx.c
> +++ b/drivers/net/mlx5/mlx5_devx.c
> @@ -277,6 +277,7 @@ mlx5_rxq_create_devx_rq_resources(struct
> mlx5_rxq_priv *rxq)
> 
> 	MLX5_WQ_END_PAD_MODE_NONE;
>  	rq_attr.wq_attr.pd = cdev->pdn;
>  	rq_attr.counter_set_id = priv->counter_set_id;
> +	rq_attr.user_index = rte_cpu_to_be_16(priv->dev_data->port_id);
>  	if (rxq_data->shared) /* Create RMP based RQ. */
>  		rxq->devx_rq.rmp = &rxq_ctrl->obj->devx_rmp;
>  	/* Create RQ using DevX API. */
> diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c index
> 1ffa1b95b88..4d85f64accd 100644
> --- a/drivers/net/mlx5/mlx5_rx.c
> +++ b/drivers/net/mlx5/mlx5_rx.c
> @@ -709,6 +709,7 @@ rxq_cq_to_mbuf(struct mlx5_rxq_data *rxq, struct
> rte_mbuf *pkt,  {
>  	/* Update packet information. */
>  	pkt->packet_type = rxq_cq_to_pkt_type(rxq, cqe, mcqe);
> +	pkt->port = unlikely(rxq->shared) ? cqe->user_index_low :
> +rxq->port_id;
> 
>  	if (rxq->rss_hash) {
>  		uint32_t rss_hash_res = 0;
> diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
> index 8feb3e2c0fb..4515d531835 100644
> --- a/drivers/net/mlx5/mlx5_rxq.c
> +++ b/drivers/net/mlx5/mlx5_rxq.c
> @@ -186,7 +186,8 @@ rxq_alloc_elts_sprq(struct mlx5_rxq_ctrl *rxq_ctrl)
>  		mbuf_init->data_off = RTE_PKTMBUF_HEADROOM;
>  		rte_mbuf_refcnt_set(mbuf_init, 1);
>  		mbuf_init->nb_segs = 1;
> -		mbuf_init->port = rxq->port_id;
> +		/* For shared queues port is provided in CQE */
> +		mbuf_init->port = rxq->shared ? 0 : rxq->port_id;
>  		if (priv->flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
>  			mbuf_init->ol_flags = RTE_MBUF_F_EXTERNAL;
>  		/*
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
> b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
> index 1d00c1c43d1..423e229508c 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
> @@ -1189,6 +1189,12 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> volatile struct mlx5_cqe *cq,
> 
>  		/* D.5 fill in mbuf - rearm_data and packet_type. */
>  		rxq_cq_to_ptype_oflags_v(rxq, cqes, opcode, &pkts[pos]);
> +		if (unlikely(rxq->shared)) {
> +			pkts[pos]->port = cq[pos].user_index_low;
> +			pkts[pos + p1]->port = cq[pos + p1].user_index_low;
> +			pkts[pos + p2]->port = cq[pos + p2].user_index_low;
> +			pkts[pos + p3]->port = cq[pos + p3].user_index_low;
> +		}
>  		if (rxq->hw_timestamp) {
>  			int offset = rxq->timestamp_offset;
>  			if (rxq->rt_timestamp) {
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> index aa36df29a09..b1d16baa619 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> @@ -787,7 +787,17 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> volatile struct mlx5_cqe *cq,
>  		/* C.4 fill in mbuf - rearm_data and packet_type. */
>  		rxq_cq_to_ptype_oflags_v(rxq, ptype_info, flow_tag,
>  					 opcode, &elts[pos]);
> -		if (rxq->hw_timestamp) {
> +		if (unlikely(rxq->shared)) {
> +			elts[pos]->port = container_of(p0, struct mlx5_cqe,
> +					      pkt_info)->user_index_low;

I don't know the hardware details of the CQE; I'm just comparing with other parts.
1. Is there a need to convert 'user_index_low' with rte_be_to_cpu_16?
2. 'cq' is available as an input parameter. Can we use cq[pos] directly instead of container_of(p0), as is done on other architectures?

Thanks.
> +			elts[pos + 1]->port = container_of(p1, struct
> mlx5_cqe,
> +					      pkt_info)->user_index_low;
> +			elts[pos + 2]->port = container_of(p2, struct
> mlx5_cqe,
> +					      pkt_info)->user_index_low;
> +			elts[pos + 3]->port = container_of(p3, struct
> mlx5_cqe,
> +					      pkt_info)->user_index_low;
> +		}
> +		if (unlikely(rxq->hw_timestamp)) {
>  			int offset = rxq->timestamp_offset;
>  			if (rxq->rt_timestamp) {
>  				struct mlx5_dev_ctx_shared *sh = rxq->sh;
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
> b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
> index b0fc29d7b9e..f3d838389e2 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
> @@ -736,7 +736,13 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> volatile struct mlx5_cqe *cq,
>  		*err |= _mm_cvtsi128_si64(opcode);
>  		/* D.5 fill in mbuf - rearm_data and packet_type. */
>  		rxq_cq_to_ptype_oflags_v(rxq, cqes, opcode, &pkts[pos]);
> -		if (rxq->hw_timestamp) {
> +		if (unlikely(rxq->shared)) {
> +			pkts[pos]->port = cq[pos].user_index_low;
> +			pkts[pos + p1]->port = cq[pos + p1].user_index_low;
> +			pkts[pos + p2]->port = cq[pos + p2].user_index_low;
> +			pkts[pos + p3]->port = cq[pos + p3].user_index_low;
> +		}
> +		if (unlikely(rxq->hw_timestamp)) {
>  			int offset = rxq->timestamp_offset;
>  			if (rxq->rt_timestamp) {
>  				struct mlx5_dev_ctx_shared *sh = rxq->sh;
> --
> 2.33.0


^ permalink raw reply	[flat|nested] 266+ messages in thread

end of thread, other threads:[~2021-11-05  6:41 UTC | newest]

Thread overview: 266+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-27  3:42 [dpdk-dev] [RFC] ethdev: introduce shared Rx queue Xueming Li
2021-07-28  7:56 ` Andrew Rybchenko
2021-07-28  8:20   ` Xueming(Steven) Li
2021-08-09 11:47 ` [dpdk-dev] [PATCH v1] " Xueming Li
2021-08-09 13:50   ` Jerin Jacob
2021-08-09 14:16     ` Xueming(Steven) Li
2021-08-11  8:02       ` Jerin Jacob
2021-08-11  8:28         ` Xueming(Steven) Li
2021-08-11 12:04           ` Ferruh Yigit
2021-08-11 12:59             ` Xueming(Steven) Li
2021-08-12 14:35               ` Xueming(Steven) Li
2021-09-15 15:34               ` Xueming(Steven) Li
2021-09-26  5:35             ` Xueming(Steven) Li
2021-09-28  9:35               ` Jerin Jacob
2021-09-28 11:36                 ` Xueming(Steven) Li
2021-09-28 11:37                 ` Xueming(Steven) Li
2021-09-28 11:37                 ` Xueming(Steven) Li
2021-09-28 12:58                   ` Jerin Jacob
2021-09-28 13:25                     ` Xueming(Steven) Li
2021-09-28 13:38                       ` Jerin Jacob
2021-09-28 13:59                         ` Ananyev, Konstantin
2021-09-28 14:40                           ` Xueming(Steven) Li
2021-09-28 14:59                             ` Jerin Jacob
2021-09-29  7:41                               ` Xueming(Steven) Li
2021-09-29  8:05                                 ` Jerin Jacob
2021-10-08  8:26                                   ` Xueming(Steven) Li
2021-10-10  9:46                                     ` Jerin Jacob
2021-10-10 13:40                                       ` Xueming(Steven) Li
2021-10-11  4:10                                         ` Jerin Jacob
2021-09-29  0:26                             ` Ananyev, Konstantin
2021-09-29  8:40                               ` Xueming(Steven) Li
2021-09-29 10:20                                 ` Ananyev, Konstantin
2021-09-29 13:25                                   ` Xueming(Steven) Li
2021-09-30  9:59                                     ` Ananyev, Konstantin
2021-10-06  7:54                                       ` Xueming(Steven) Li
2021-09-29  9:12                               ` Xueming(Steven) Li
2021-09-29  9:52                                 ` Ananyev, Konstantin
2021-09-29 11:07                                   ` Bruce Richardson
2021-09-29 11:46                                     ` Ananyev, Konstantin
2021-09-29 12:17                                       ` Bruce Richardson
2021-09-29 12:08                                   ` Xueming(Steven) Li
2021-09-29 12:35                                     ` Ananyev, Konstantin
2021-09-29 14:54                                       ` Xueming(Steven) Li
2021-09-28 14:51                         ` Xueming(Steven) Li
2021-09-28 12:59                 ` Xueming(Steven) Li
2021-08-11 14:04 ` [dpdk-dev] [PATCH v2 01/15] " Xueming Li
2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 02/15] app/testpmd: dump port and queue info for each packet Xueming Li
2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 03/15] app/testpmd: new parameter to enable shared Rx queue Xueming Li
2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 04/15] app/testpmd: make sure shared Rx queue polled on same core Xueming Li
2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 05/15] app/testpmd: adds common forwarding for shared Rx queue Xueming Li
2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 06/15] app/testpmd: add common fwd wrapper function Xueming Li
2021-08-17  9:37     ` Jerin Jacob
2021-08-18 11:27       ` Xueming(Steven) Li
2021-08-18 11:47         ` Jerin Jacob
2021-08-18 14:08           ` Xueming(Steven) Li
2021-08-26 11:28             ` Jerin Jacob
2021-08-29  7:07               ` Xueming(Steven) Li
2021-09-01 14:44                 ` Xueming(Steven) Li
2021-09-28  5:54                   ` Xueming(Steven) Li
2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 07/15] app/testpmd: support shared Rx queues for IO forwarding Xueming Li
2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 08/15] app/testpmd: support shared Rx queue for rxonly forwarding Xueming Li
2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 09/15] app/testpmd: support shared Rx queue for icmpecho fwd Xueming Li
2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 10/15] app/testpmd: support shared Rx queue for csum fwd Xueming Li
2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 11/15] app/testpmd: support shared Rx queue for flowgen Xueming Li
2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 12/15] app/testpmd: support shared Rx queue for MAC fwd Xueming Li
2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 13/15] app/testpmd: support shared Rx queue for macswap fwd Xueming Li
2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 14/15] app/testpmd: support shared Rx queue for 5tuple fwd Xueming Li
2021-08-11 14:04   ` [dpdk-dev] [PATCH v2 15/15] app/testpmd: support shared Rx queue for ieee1588 fwd Xueming Li
2021-08-17  9:33   ` [dpdk-dev] [PATCH v2 01/15] ethdev: introduce shared Rx queue Jerin Jacob
2021-08-17 11:31     ` Xueming(Steven) Li
2021-08-17 15:11       ` Jerin Jacob
2021-08-18 11:14         ` Xueming(Steven) Li
2021-08-19  5:26           ` Jerin Jacob
2021-08-19 12:09             ` Xueming(Steven) Li
2021-08-26 11:58               ` Jerin Jacob
2021-08-28 14:16                 ` Xueming(Steven) Li
2021-08-30  9:31                   ` Jerin Jacob
2021-08-30 10:13                     ` Xueming(Steven) Li
2021-09-15 14:45                     ` Xueming(Steven) Li
2021-09-16  4:16                       ` Jerin Jacob
2021-09-28  5:50                         ` Xueming(Steven) Li
2021-09-17  8:01 ` [dpdk-dev] [PATCH v3 0/8] " Xueming Li
2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 1/8] " Xueming Li
2021-09-27 23:53     ` Ajit Khaparde
2021-09-28 14:24       ` Xueming(Steven) Li
2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 2/8] ethdev: new API to aggregate shared Rx queue group Xueming Li
2021-09-26 17:54     ` Ajit Khaparde
2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 3/8] app/testpmd: dump port and queue info for each packet Xueming Li
2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 4/8] app/testpmd: new parameter to enable shared Rx queue Xueming Li
2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 5/8] app/testpmd: force shared Rx queue polled on same core Xueming Li
2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 6/8] app/testpmd: add common fwd wrapper Xueming Li
2021-09-17 11:24     ` Jerin Jacob
2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 7/8] app/testpmd: improve forwarding cache miss Xueming Li
2021-09-17  8:01   ` [dpdk-dev] [PATCH v3 8/8] app/testpmd: support shared Rx queue forwarding Xueming Li
2021-09-30 14:55 ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce shared Rx queue Xueming Li
2021-09-30 14:55   ` [dpdk-dev] [PATCH v4 1/6] " Xueming Li
2021-10-11 10:47     ` Andrew Rybchenko
2021-10-11 13:12       ` Xueming(Steven) Li
2021-09-30 14:55   ` [dpdk-dev] [PATCH v4 2/6] ethdev: new API to aggregate shared Rx queue group Xueming Li
2021-09-30 14:55   ` [dpdk-dev] [PATCH v4 3/6] app/testpmd: new parameter to enable shared Rx queue Xueming Li
2021-09-30 14:56   ` [dpdk-dev] [PATCH v4 4/6] app/testpmd: dump port info for " Xueming Li
2021-09-30 14:56   ` [dpdk-dev] [PATCH v4 5/6] app/testpmd: force shared Rx queue polled on same core Xueming Li
2021-09-30 14:56   ` [dpdk-dev] [PATCH v4 6/6] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
2021-10-11 11:49   ` [dpdk-dev] [PATCH v4 0/6] ethdev: introduce " Andrew Rybchenko
2021-10-11 15:11     ` Xueming(Steven) Li
2021-10-12  6:37       ` Xueming(Steven) Li
2021-10-12  8:48         ` Andrew Rybchenko
2021-10-12 10:55           ` Xueming(Steven) Li
2021-10-12 11:28             ` Andrew Rybchenko
2021-10-12 11:33               ` Xueming(Steven) Li
2021-10-13  7:53               ` Xueming(Steven) Li
2021-10-11 12:37 ` [dpdk-dev] [PATCH v5 0/5] " Xueming Li
2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 1/5] " Xueming Li
2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 2/5] app/testpmd: new parameter to enable " Xueming Li
2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 3/5] app/testpmd: dump port info for " Xueming Li
2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 4/5] app/testpmd: force shared Rx queue polled on same core Xueming Li
2021-10-11 12:37   ` [dpdk-dev] [PATCH v5 5/5] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
2021-10-12 14:39 ` [dpdk-dev] [PATCH v6 0/5] ethdev: introduce " Xueming Li
2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 1/5] " Xueming Li
2021-10-15  9:28     ` Andrew Rybchenko
2021-10-15 10:54       ` Xueming(Steven) Li
2021-10-18  6:46         ` Andrew Rybchenko
2021-10-18  6:57           ` Xueming(Steven) Li
2021-10-15 17:20     ` Ferruh Yigit
2021-10-16  9:14       ` Xueming(Steven) Li
2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 2/5] app/testpmd: new parameter to enable " Xueming Li
2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 3/5] app/testpmd: dump port info for " Xueming Li
2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 4/5] app/testpmd: force shared Rx queue polled on same core Xueming Li
2021-10-12 14:39   ` [dpdk-dev] [PATCH v6 5/5] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
2021-10-16  8:42 ` [dpdk-dev] [PATCH v7 0/5] ethdev: introduce " Xueming Li
2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 1/5] " Xueming Li
2021-10-17  5:33     ` Ajit Khaparde
2021-10-17  7:29       ` Xueming(Steven) Li
2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 2/5] app/testpmd: new parameter to enable " Xueming Li
2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 3/5] app/testpmd: dump port info for " Xueming Li
2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 4/5] app/testpmd: force shared Rx queue polled on same core Xueming Li
2021-10-16  8:42   ` [dpdk-dev] [PATCH v7 5/5] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
2021-10-18 12:59 ` [dpdk-dev] [PATCH v8 0/6] ethdev: introduce " Xueming Li
2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 1/6] " Xueming Li
2021-10-19  0:21     ` Ajit Khaparde
2021-10-19  5:54       ` Xueming(Steven) Li
2021-10-19  6:28     ` Andrew Rybchenko
2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 2/6] app/testpmd: dump device capability and Rx domain info Xueming Li
2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 3/6] app/testpmd: new parameter to enable shared Rx queue Xueming Li
2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 4/6] app/testpmd: dump port info for " Xueming Li
2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 5/6] app/testpmd: force shared Rx queue polled on same core Xueming Li
2021-10-18 12:59   ` [dpdk-dev] [PATCH v8 6/6] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
2021-10-19  8:17 ` [dpdk-dev] [PATCH v9 0/6] ethdev: introduce " Xueming Li
2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 1/6] " Xueming Li
2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 2/6] app/testpmd: dump device capability and Rx domain info Xueming Li
2021-10-19  8:33     ` Andrew Rybchenko
2021-10-19  9:10       ` Xueming(Steven) Li
2021-10-19  9:39         ` Andrew Rybchenko
2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 3/6] app/testpmd: new parameter to enable shared Rx queue Xueming Li
2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 4/6] app/testpmd: dump port info for " Xueming Li
2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 5/6] app/testpmd: force shared Rx queue polled on same core Xueming Li
2021-10-19  8:17   ` [dpdk-dev] [PATCH v9 6/6] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
2021-10-19 15:20 ` [dpdk-dev] [PATCH v10 0/6] ethdev: introduce " Xueming Li
2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 1/6] ethdev: new API to resolve device capability name Xueming Li
2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 2/6] app/testpmd: dump device capability and Rx domain info Xueming Li
2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 3/6] app/testpmd: new parameter to enable shared Rx queue Xueming Li
2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 4/6] app/testpmd: dump port info for " Xueming Li
2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 5/6] app/testpmd: force shared Rx queue polled on same core Xueming Li
2021-10-19 15:20   ` [dpdk-dev] [PATCH v10 6/6] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
2021-10-19 15:28 ` [dpdk-dev] [PATCH v10 0/7] ethdev: introduce " Xueming Li
2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 1/7] " Xueming Li
2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 2/7] ethdev: new API to resolve device capability name Xueming Li
2021-10-19 17:57     ` Andrew Rybchenko
2021-10-20  7:47       ` Xueming(Steven) Li
2021-10-20  7:48         ` Andrew Rybchenko
2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 3/7] app/testpmd: dump device capability and Rx domain info Xueming Li
2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 5/7] app/testpmd: dump port info for " Xueming Li
2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 6/7] app/testpmd: force shared Rx queue polled on same core Xueming Li
2021-10-19 15:28   ` [dpdk-dev] [PATCH v10 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
2021-10-20  7:53 ` [dpdk-dev] [PATCH v11 0/7] ethdev: introduce " Xueming Li
2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 1/7] " Xueming Li
2021-10-20 17:14     ` Ajit Khaparde
2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 2/7] ethdev: new API to resolve device capability name Xueming Li
2021-10-20 10:52     ` Andrew Rybchenko
2021-10-20 17:16       ` Ajit Khaparde
2021-10-20 18:42     ` Thomas Monjalon
2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 3/7] app/testpmd: dump device capability and Rx domain info Xueming Li
2021-10-21  3:24     ` Li, Xiaoyun
2021-10-21  3:28       ` Ajit Khaparde
2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
2021-10-20 17:29     ` Ajit Khaparde
2021-10-20 19:14       ` Thomas Monjalon
2021-10-21  4:09         ` Xueming(Steven) Li
2021-10-21  3:49       ` Xueming(Steven) Li
2021-10-21  3:24     ` Li, Xiaoyun
2021-10-21  3:58       ` Xueming(Steven) Li
2021-10-21  5:15         ` Li, Xiaoyun
2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 5/7] app/testpmd: dump port info for " Xueming Li
2021-10-21  3:24     ` Li, Xiaoyun
2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 6/7] app/testpmd: force shared Rx queue polled on same core Xueming Li
2021-10-21  3:24     ` Li, Xiaoyun
2021-10-21  4:21       ` Xueming(Steven) Li
2021-10-20  7:53   ` [dpdk-dev] [PATCH v11 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
2021-10-20 19:20     ` Thomas Monjalon
2021-10-21  3:26       ` Li, Xiaoyun
2021-10-21  4:39       ` Xueming(Steven) Li
2021-10-21  5:08 ` [dpdk-dev] [PATCH v12 0/7] ethdev: introduce " Xueming Li
2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 1/7] " Xueming Li
2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 2/7] ethdev: get device capability name as string Xueming Li
2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 3/7] app/testpmd: dump device capability and Rx domain info Xueming Li
2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
2021-10-21  9:20     ` Thomas Monjalon
2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 5/7] app/testpmd: dump port info for " Xueming Li
2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 6/7] app/testpmd: force shared Rx queue polled on same core Xueming Li
2021-10-21  6:35     ` Li, Xiaoyun
2021-10-21  5:08   ` [dpdk-dev] [PATCH v12 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
2021-10-21  6:33     ` Li, Xiaoyun
2021-10-21  7:58       ` Xueming(Steven) Li
2021-10-21  8:01         ` Li, Xiaoyun
2021-10-21  8:22           ` Xueming(Steven) Li
2021-10-21  9:28     ` Thomas Monjalon
2021-10-21 10:41 ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Xueming Li
2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 1/7] " Xueming Li
2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 2/7] ethdev: get device capability name as string Xueming Li
2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 3/7] app/testpmd: dump device capability and Rx domain info Xueming Li
2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 4/7] app/testpmd: new parameter to enable shared Rx queue Xueming Li
2021-10-21 19:45     ` Ajit Khaparde
2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 5/7] app/testpmd: dump port info for " Xueming Li
2021-10-21 19:48     ` Ajit Khaparde
2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 6/7] app/testpmd: force shared Rx queue polled on same core Xueming Li
2021-10-21 10:41   ` [dpdk-dev] [PATCH v13 7/7] app/testpmd: add forwarding engine for shared Rx queue Xueming Li
2021-10-21 23:41   ` [dpdk-dev] [PATCH v13 0/7] ethdev: introduce " Ferruh Yigit
2021-10-22  6:31     ` Xueming(Steven) Li
2021-11-04 15:52   ` Tom Barbette
2021-11-03  7:58 ` [dpdk-dev] [PATCH v3 00/14] net/mlx5: support " Xueming Li
2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 01/14] common/mlx5: introduce user index field in completion Xueming Li
2021-11-04  9:14     ` Slava Ovsiienko
2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 02/14] net/mlx5: fix field reference for PPC Xueming Li
2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 03/14] common/mlx5: adds basic receive memory pool support Xueming Li
2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 04/14] common/mlx5: support receive memory pool Xueming Li
2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 05/14] net/mlx5: fix Rx queue memory allocation return value Xueming Li
2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 06/14] net/mlx5: clean Rx queue code Xueming Li
2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 07/14] net/mlx5: split Rx queue into shareable and private Xueming Li
2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 08/14] net/mlx5: move Rx queue reference count Xueming Li
2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 09/14] net/mlx5: move Rx queue hairpin info to private data Xueming Li
2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 10/14] net/mlx5: remove port info from shareable Rx queue Xueming Li
2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 11/14] net/mlx5: move Rx queue DevX resource Xueming Li
2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 12/14] net/mlx5: remove Rx queue data list from device Xueming Li
2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 13/14] net/mlx5: support shared Rx queue Xueming Li
2021-11-03  7:58   ` [dpdk-dev] [PATCH v3 14/14] net/mlx5: add shared Rx queue port datapath support Xueming Li
2021-11-04 12:33 ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Xueming Li
2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 01/14] common/mlx5: introduce user index field in completion Xueming Li
2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 02/14] net/mlx5: fix field reference for PPC Xueming Li
2021-11-04 17:07     ` Raslan Darawsheh
2021-11-04 17:49     ` David Christensen
2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 03/14] common/mlx5: adds basic receive memory pool support Xueming Li
2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 04/14] common/mlx5: support receive memory pool Xueming Li
2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 05/14] net/mlx5: fix Rx queue memory allocation return value Xueming Li
2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 06/14] net/mlx5: clean Rx queue code Xueming Li
2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 07/14] net/mlx5: split Rx queue into shareable and private Xueming Li
2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 08/14] net/mlx5: move Rx queue reference count Xueming Li
2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 09/14] net/mlx5: move Rx queue hairpin info to private data Xueming Li
2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 10/14] net/mlx5: remove port info from shareable Rx queue Xueming Li
2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 11/14] net/mlx5: move Rx queue DevX resource Xueming Li
2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 12/14] net/mlx5: remove Rx queue data list from device Xueming Li
2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 13/14] net/mlx5: support shared Rx queue Xueming Li
2021-11-04 12:33   ` [dpdk-dev] [PATCH v4 14/14] net/mlx5: add shared Rx queue port datapath support Xueming Li
2021-11-04 17:50     ` David Christensen
2021-11-05  6:40     ` Ruifeng Wang
2021-11-04 20:06   ` [dpdk-dev] [PATCH v4 00/14] net/mlx5: support shared Rx queue Raslan Darawsheh
